firefox-translations-training/pipeline/clean/fixes
Ben Hearsum (he/him) 3f023cf1c9
Add initial work on Taskcluster training pipeline (#115)
* Adjust ci config for staging repo

* Use bhearsum's taskgraph repo for now

It contains a couple of changes that have yet to be upstreamed to taskgraph. This also requires that we disable pip hash verification for the moment.

* Get rid of hello kind now that we know that Taskcluster works

* Add worker type for b-linux-large, for more CPU intensive tasks; reformat yaml in ci/config.yml

* Add yamllint config for taskcluster files

* Add toolchain tasks for things that we depend on to train language models.

Most of these are straightforward downloads and compiles, but there are a few callouts:
- The CLI tools (marian, fast-align, etc.) already have build scripts used by the existing pipeline. For the most part, I'm just replacing them with my own version because they're just unpack/make/cmake. The exception is Marian, which has a little bit more going on with cmake definitions. Maybe I should just copy those in here though?
- Some Python modules don't have binary wheels available; we build them here to avoid needing to compile them at the start of training tasks.
- CUDA (the NVIDIA toolkit) is a huge pain. There's no real advertised way to just dump the files you want into a directory (they want you to run an installer). I _think_ I managed to get this to work, but it's possible this will need a tweak if a future task has trouble with the current toolchain.

This also necessitated switching the Docker images to Ubuntu, because some tools couldn't reasonably be made to work on Alpine.

* Bump decision task image

* Add tasks to fetch a few dataset types

I initially tried to implement these as `fetch` tasks. This failed because the way that we get so many of these is just not compatible with the idea of a static url or having one file per task. Eg: many of these datasets are fetched by running a python script. (In theory this could be reverse engineered, but I just don't think it's worth it...especially if URLs or metadata ends up changing in the future.) Instead, we're making use of the existing pipeline scripts that know how to fetch these.

As you can see, the kind generates tasks named after the provider, dataset, and locale pair. I'm not certain this is what we want to do long term (there's going to be an absurd number of tasks once we finish adding all of the datasets and language pairs), but I think it's OK for now. We probably ought to revisit this before we start running full training pipelines - if we change it after that, we'll end up rebuilding tasks because there will be no cached tasks for the new names.

This revision also builds out a couple of transforms that are used here, and will be used elsewhere:
* One that can substitute provider name, dataset (in a few forms), and locale pairs into tasks (see the sketch after this list). This is necessary to avoid needing to repeat things such as commands, treeherder symbols, etc.
* Another one that configures caching, using attributes defined in the kind. Eventually we're going to be using all sorts of action task parameters as part of the cache digest -- so it's important that we can specify these things per-task.
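
For illustration, here's a rough sketch of what the substitution transform might look like. The attribute and field names (`provider`, `dataset`, `src_locale`, `trg_locale`, and the sanitized dataset form) are assumptions for the example, not the exact schema used by the kinds:

```python
# Illustrative sketch only; attribute/field names are assumptions, not the real kind schema.
from taskgraph.transforms.base import TransformSequence

transforms = TransformSequence()


@transforms.add
def substitute_dataset_fields(config, tasks):
    for task in tasks:
        # Assumed to be set on each generated task by the kind.
        attrs = task["attributes"]
        dataset = attrs["dataset"]
        subs = {
            "provider": attrs["provider"],
            "dataset": dataset,
            # A sanitized form is handy for task labels and treeherder symbols.
            "dataset_sanitized": dataset.replace("/", "_").replace(".", "_"),
            "src_locale": attrs["src_locale"],
            "trg_locale": attrs["trg_locale"],
        }

        # Fill the placeholders in fields that would otherwise have to be
        # repeated per provider/dataset/locale combination.
        task["label"] = task["label"].format(**subs)
        task["treeherder"]["symbol"] = task["treeherder"]["symbol"].format(**subs)
        task["worker"]["command"] = [
            arg.format(**subs) for arg in task["worker"]["command"]
        ]

        yield task
```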

* Add configuration for black and ruff for python formatting

* Add `clean` stage of the training pipeline

This is largely built on the earlier work done on the `dataset` kind.

* Update pipeline scripts to work with Taskcluster

This covers a few things:
1) Mark all the shell scripts as +x
2) Switch out pigz/gz for zstdmt/zst (in progress)
3) Add support for `auto` where `threads` is an argument, which uses `nproc` to decide how many threads to use (in progress; see the sketch after this list).
4) Use `curl` instead of `wget` (in progress)
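
The scripts themselves are shell, so the real change reads `nproc` directly, but the idea behind `auto` is simple enough to sketch in Python (the helper name here is made up):

```python
import os


def resolve_threads(threads):
    """Turn a `threads` argument into a concrete count.

    "auto" means "use every available core" - the shell scripts get this
    from `nproc`; os.cpu_count() is the rough Python equivalent.
    """
    if threads == "auto":
        return os.cpu_count() or 1
    return int(threads)
```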

* Add treeherder symbol for decision task

We need this for action tasks to be triggerable through Treeherder, and it's also generally nice to have.

* Add a `train` action task to support kicking off the training pipeline

This is very rough for now, but it enables us to kick off certain parts of the pipeline. I intend to look into using the existing config format (eg: https://github.com/mozilla/firefox-translations-training/blob/main/configs/config.test.yml) as the schema here later; there's also various input checking and other enhancements that still need to be implemented.
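
To give a sense of the shape of it, an action in taskgraph is registered roughly like the sketch below; the input schema and the parameter handling here are placeholders rather than the final design:

```python
# Rough sketch; the schema and parameter handling are placeholders.
from taskgraph.actions.registry import register_callback_action
from taskgraph.decision import taskgraph_decision
from taskgraph.parameters import Parameters


@register_callback_action(
    name="train",
    title="Train",
    symbol="train",
    description="Kick off (part of) the training pipeline.",
    order=500,
    context=[],
    schema={
        "type": "object",
        "properties": {
            # Placeholder inputs; ideally this becomes the existing training
            # config format (configs/config.test.yml) instead.
            "stage": {"type": "string"},
            "src": {"type": "string"},
            "trg": {"type": "string"},
        },
    },
)
def train_action(parameters, graph_config, input, task_group_id, task_id):
    # Merge the action's input into the decision parameters, then re-run the
    # decision logic so the requested pipeline tasks get generated. This
    # assumes the parameters schema has been extended to allow the extra key.
    parameters = dict(parameters)
    parameters["training_config"] = input
    parameters = Parameters(**parameters)
    taskgraph_decision({"root": graph_config.root_dir}, parameters=parameters)
```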

* Add bicleaner pack fetches

These are an additional dependency for the `bicleaner` stage of the pipeline.

* Implement `bicleaner` pipeline stage

Very similar to the `clean` and `dataset` stages that have already been implemented. The notable differences are:
- The `bicleaner` tool that eventually gets called has a bunch of Python dependencies. Most of these are handled by the requirements file I'm adding, but there are two extra ones that don't have binary wheels available -- so we're grabbing them from our toolchain builds and using those. (In fact, `kenlm` isn't even declared as a dependency by `bicleaner`...so we'd have to install it by hand one way or another...)
- At the moment, this is using a new `split_by_provider` transform that avoids us needing to list out each provider in the kind. This probably needs to go away, because I recently learned that many pipeline steps (such as this one) don't run for all providers.
- An enhancement to the cache transform to allow specifying `parameters` that should contribute to the cache digest (see the sketch after this list)
- A similar enhancement to the substitution transform to allow substituting parameters
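
Conceptually, the digest computation with the `parameters` enhancement looks something like the sketch below; the `cache` attribute layout shown is an approximation, not the exact schema used by the kinds:

```python
# Conceptual sketch; the `cache` attribute layout here is an approximation.
import hashlib
import json

from taskgraph.transforms.base import TransformSequence

transforms = TransformSequence()


@transforms.add
def add_cache(config, tasks):
    for task in tasks:
        cache = task["attributes"]["cache"]

        digest_data = []
        # Files (pipeline scripts, requirements files, etc.) whose contents
        # should invalidate the cache when they change.
        for path in cache.get("resources", []):
            with open(path, "rb") as f:
                digest_data.append(hashlib.sha256(f.read()).hexdigest())
        # The new bit: selected parameters (e.g. ones set by the train action)
        # also feed the digest, so different inputs produce different caches.
        for name in cache.get("parameters", []):
            digest_data.append(json.dumps(config.params.get(name), sort_keys=True))

        task["attributes"]["cached_task"] = {
            "type": cache["type"],
            "name": task["name"],
            "digest": hashlib.sha256("\n".join(digest_data).encode()).hexdigest(),
        }
        yield task
```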

* Raise taskgraph level for pushes, cron, and actions to level 3

Most of this diff is just indentation changes.

* Re-adjust ci-config.yml for production repository

* Don't set treeherder routes for pull requests

* Use standard cache prefixes

* Add CODEOWNERS file to suggest RelEng as a reviewer for taskcluster changes

* Bump taskgraph version; re-enable pip hash checking

* Switch cache attributes to be nested, instead of multiple top level attributes.

* Override compression scheme in pipeline steps.

* Use setdefault in parameters.py

* Update clean-mono.sh to use abstracted compression commands and artifact extensions
2023-05-10 19:16:12 -04:00
detok.sh Quality improvements (#29) 2021-12-06 15:03:35 -08:00
mtdata_JW300.mt.sh Quality improvements (#29) 2021-12-06 15:03:35 -08:00
mtdata_JW300.sh Quality improvements (#29) 2021-12-06 15:03:35 -08:00
mtdata_OPUS_DOGC_v2.ca.sh Quality improvements (#29) 2021-12-06 15:03:35 -08:00
mtdata_OPUS_DOGC_v2.es.sh Quality improvements (#29) 2021-12-06 15:03:35 -08:00
mtdata_OPUS_DOGC_v2.sh Add initial work on Taskcluster training pipeline (#115) 2023-05-10 19:16:12 -04:00
mtdata_OPUS_ECB_v1.sh Quality improvements (#29) 2021-12-06 15:03:35 -08:00
mtdata_OPUS_SETIMES_v2.sh Quality improvements (#29) 2021-12-06 15:03:35 -08:00
mtdata_OPUS_UNPC_v1_0.en.sh Quality improvements (#29) 2021-12-06 15:03:35 -08:00
mtdata_OPUS_UNPC_v1_0.fr.sh Quality improvements (#29) 2021-12-06 15:03:35 -08:00
mtdata_neulab_tedtalksv1_train.ro.sh Quality improvements (#29) 2021-12-06 15:03:35 -08:00
mtdata_neulab_tedtalksv1_train.sh Quality improvements (#29) 2021-12-06 15:03:35 -08:00