* Add a run_pipeline utility
* Add more tests for training
* Rewrite train.sh into train.py
* Add the pipeline to the PYTHONPATH
* Ensure that the W&B tracker throws errors in CI
* Add the Taskcluster environment variables so test-fast works on the train test
* Address review comments
* Add hplt test data
* Add a way to hook into when read_lines switches locations
* Provide a way to create a test fixture file with a list of strings
* Add a min-fluency-threshold to monolingual datasets
* Add an HPLT monolingual data importer
* Add support for HPLT and OPUS monolingual data in the config generator
* Unify the hash_line uses with a WeakStringSet implementation
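A minimal sketch of what a `WeakStringSet` could look like, assuming it stores a fixed-size hash of each line rather than the line itself. The class body, the choice of hash, and the method names are all assumptions for illustration, not the implementation from this change.

```python
import hashlib


class WeakStringSet:
    """Hypothetical sketch: remembers 64-bit hashes of strings, not the strings.

    "Weak" here means membership is approximate: two different strings that
    hash to the same value would be treated as the same entry.
    """

    def __init__(self, strings=()):
        self._hashes = set()
        for string in strings:
            self.add(string)

    @staticmethod
    def _hash(string: str) -> int:
        # 8-byte BLAKE2b digest, converted to an int to keep the set compact.
        digest = hashlib.blake2b(string.encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, string: str) -> None:
        self._hashes.add(self._hash(string))

    def __contains__(self, string: str) -> bool:
        return self._hash(string) in self._hashes

    def __len__(self) -> int:
        return len(self._hashes)
```

Storing only hashes keeps memory use roughly constant per line when deduplicating large corpora, at the cost of a very small chance of false positives.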
This is accomplished with a new transform used by the `evaluate` tasks, which avoids yielding any tasks whose `stage` matches a `pretrained-models` stage.
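The filtering described above can be sketched as a plain generator. This is an illustration only: the function name, the shape of the task dicts, and the `pretrained_stages` argument are assumptions, and in Taskgraph such a function would normally be registered on a `TransformSequence` rather than called directly.

```python
def skip_pretrained_stages(tasks, pretrained_stages):
    """Yield only the tasks whose "stage" is not covered by a pretrained model.

    Hypothetical sketch: `tasks` is an iterable of dicts with an optional
    "stage" key, and `pretrained_stages` is the set of stage names that
    have a `pretrained-models` entry.
    """
    for task in tasks:
        if task.get("stage") in pretrained_stages:
            # A pretrained model replaces this stage, so skip the task.
            continue
        yield task
```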
* Add an action to rebuild pipeline toolchains
* Rename action to rebuild-docker-images-and-toolchains and include docker-image and fetch tasks
* Add documentation on how to manually rebuild cached tasks
---------
Co-authored-by: Ben Hearsum <ben@mozilla.com>
* Add fast Moses tokenizer
* Tokenize corpus and remap alignments
* Use Moses tokenizer in Taskcluster
* Add tests for index mapping
* Add packages to build fast Moses tokenizer
* Fix an issue with LD_LIBRARY_PATH for fast Moses tokenizer
* Rename tokenization function
* Rename chunking parameter
* Relock poetry
* Rerun linter
* Publish experiments from the CI
* Disable cache for CI runs
* Revert "Disable cache for CI runs"
This reverts commit ca4593a39846a1a5cddf5ebf41a02fc698e23bea.
This was an early hack I put in to avoid, e.g., retraining entire models because a comment changed in a Dockerfile. We still want to avoid this sort of thing, but these days we do so explicitly with `previous_group_ids` and other techniques in the training config.
Getting rid of this will improve our ability to catch issues in PRs. E.g., https://github.com/mozilla/firefox-translations-training/pull/738 had to be tricked into running things on a second iteration that only changed a Dockerfile.
* chore: bump taskgraph to 10.0.1
This picks up some fixes that are expected to fix #680.
I'm picking up other dependency updates as well, most notably to redo (2.x -> 3.x). That major bump is just because it's dropping Python 2.x support, which doesn't affect us.
* fix: ensure mkdir /builds always succeeds in base docker image
This may be implicitly done because it is referenced in a `VOLUME`. See https://taskcluster-taskgraph.readthedocs.io/en/latest/reference/migrations.html#x-10-x.
* fix: don't try to decompress fetched python wheels or npz files
* fix: shorten non-URL dataset names that are more than 50 characters
* fix: add a lengthy dataset to default parameters
This ensures that we hit the dataset shortening code in CI. Updating to a newer mtdata is required to have a dataset that is lengthy enough.
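One way the shortening of long non-URL dataset names could work is sketched below. The 50-character threshold comes from the commit message above; the function name, the truncate-plus-hash scheme, and the suffix length are assumptions made for illustration.

```python
import hashlib

MAX_LENGTH = 50  # threshold named in the commit message


def shorten_dataset_name(name: str) -> str:
    """Hypothetical sketch: truncate long names and append a short hash."""
    # URLs are left alone here; this sketch assumes they are shortened elsewhere.
    if name.startswith(("http://", "https://")) or len(name) <= MAX_LENGTH:
        return name
    # A short hash of the full name keeps truncated names distinct.
    suffix = hashlib.sha256(name.encode("utf-8")).hexdigest()[:6]
    keep = MAX_LENGTH - len(suffix) - 1
    return f"{name[:keep]}_{suffix}"
```

The result is deterministic, so the same dataset always maps to the same shortened name across tasks.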
* Override the existing_tasks explicitly provided in the action's input
* Log the labels and task ids of any existing tasks that are overridden
---------
Co-authored-by: Evgeny Pavlov <epavlov@mozilla.com>
* fix: remove unused 'reasons_created' in train taskcluster tests
* fix: Don't abort a training task entirely if a 4xx error is encountered when fetching an artifact from a previous run
* chore: Upgrade to taskgraph 8
* Adjust all-pr kind to use all-pr in task names explicitly
Prior to taskgraph 8, taskgraph was doing this itself based on the `kind`.
* Use optimization to implement skip_unless_pipeline_changed
This is in line with what upstream Taskgraph does, and has the added benefit of still generating task definitions in earlier phases even if they will be removed during optimization (makes for easier diffing/testing).
* Create a util to automatically generate configs
* Add the generated configs
* Update the config generation script
* Update the configs
* Update the configs
* Address review comments for the config generator
* Fix find_corpus test
* Configure evaluation tasks
* Extract W&B code into module
* Do not check Taskcluster when publication is disabled
* Publish evaluation metrics to W&B
* Fix running eval tracking on CI
* Use args.wandb_run_name instead of default teacher
* Remove duplicated arguments
* Retrieve dataset from Taskcluster directly
* Add missing calls to publisher and logging
* Allow publishing metrics as a table on existing runs (i.e. previous trainings)
* Update regex to parse labels ending with '-1'
* Add generic support for differing train/eval naming
* Update tests
* Support disabled publication
---------
Co-authored-by: Bastien Abadie <bastien@nextcairn.com>
Co-authored-by: Bastien Abadie <abadie@teklia.com>
Co-authored-by: Evgeny Pavlov <epavlov@mozilla.com>
* Update bicleaner
* Lock requirements
* Use larger multilingual models
* Fix requirements
* Fix typo
* Add libs for hunspell
* Fix model downloading
* Fix model downloading
* Use a toolchain for cyhunspell
* Remove a hard github dependency
* Fix test
* Add COMET to the evaluation (#587)
* Custom cleaning (#547)
* Update default config
* Pre-download fastText model
* Add custom filters
* Add unit tests for config generation
* Make using custom filtering configs configurable
* Fix substitution
* Parse tasks with label finetune-student (#609)
* Add test case
* Update regex
* Add support for automatically continuing training from earlier runs of a Task (fixes #270) (#580)
* Add support for automatically continuing training from earlier runs of a Task.
* Automatically retry training tasks on exit code 17
This is the exit code when train_taskcluster.py fails to download an artifact when attempting to resume.
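The script side of this retry behaviour might look roughly like the following. Only the exit code 17 is taken from the commit message; the exception class, the function name, and the callables passed in are hypothetical, and the actual retry is assumed to be handled by the task configuration rather than by this code.

```python
import sys

DOWNLOAD_ERROR_EXIT_CODE = 17  # exit code named in the commit message


class ArtifactDownloadError(Exception):
    """Hypothetical error raised when an artifact from a previous run cannot be fetched."""


def run_with_resume(download_artifacts, train) -> int:
    """Return the process exit code for one training attempt.

    `download_artifacts` and `train` are hypothetical callables standing in
    for the real resume and training steps.
    """
    try:
        download_artifacts()
    except ArtifactDownloadError as error:
        print(f"Failed to download an artifact: {error}", file=sys.stderr)
        # A dedicated exit code lets the worker retry the task automatically.
        return DOWNLOAD_ERROR_EXIT_CODE
    train()
    return 0
```

A caller would finish with `sys.exit(run_with_resume(...))` so that the worker sees the exit code.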
* Support news-crawl importer (#608)
* Ensure all of the task labels can be parsed in the task graph (#612)
* Show the stack trace when there is an error in tracking
* Rename parse_tag to parse_task_label
* Document the regexes
* Turn the parsed label into a NamedTuple for better typing hints
* Get the tag parsing working with the full task graph
* Allow for 3 letter language codes
* Temporarily disable an evaluate task that is failing
* Update code docs a bit
* Fix tests for finetune-student
* Use shorter names for URLs (#611)
* Change the caching strategy for teachers (#606)
* Fix comment
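The label parsing mentioned above (`parse_task_label` returning a NamedTuple, 3-letter language codes, labels ending in a number such as `-1`) could be sketched as follows. The regex, the field names, and the example labels are assumptions for illustration. The real patterns cover many more label shapes.

```python
import re
from typing import NamedTuple, Optional


class ParsedTaskLabel(NamedTuple):
    # Hypothetical fields; the real NamedTuple may differ.
    model: str
    src: str
    trg: str
    suffix: Optional[str]


# Hypothetical pattern: <model>-<src>-<trg>[-<number>], where language
# codes are 2 or 3 lowercase letters.
LABEL_RE = re.compile(
    r"^(?P<model>[a-z][a-z-]*?)"
    r"-(?P<src>[a-z]{2,3})"
    r"-(?P<trg>[a-z]{2,3})"
    r"(?:-(?P<suffix>\d+))?$"
)


def parse_task_label(label: str) -> ParsedTaskLabel:
    match = LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"Could not parse task label: {label!r}")
    return ParsedTaskLabel(
        model=match["model"],
        src=match["src"],
        trg=match["trg"],
        suffix=match["suffix"],
    )
```

Returning a NamedTuple instead of a plain tuple gives callers named, typed fields such as `parsed.src`.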
---------
Co-authored-by: Greg Tatum <gregtatum@users.noreply.github.com>
Co-authored-by: Valentin Rigal <rigal@teklia.com>
Co-authored-by: Ben Hearsum (he/him) <ben@mozilla.com>