* Add a run_pipeline utility
* Add more tests for training
* Rewrite train.sh into train.py
* Add the pipeline to the PYTHONPATH
* Ensure that the W&B tracker throws errors in CI
* Add the Taskcluster environment variables so test-fast works on the train test
* Address review comments
* Restrict github-push Taskcluster events to `main`, `release*`, and `dev*`
In https://bugzilla.mozilla.org/show_bug.cgi?id=1907217 we're becoming more explicit about the scopes we grant to branches on GitHub, which means that branches that do not appear in the explicit list in fxci-config (https://github.com/mozilla-releng/fxci-config/blob/main/projects.yml) will not be able to start tasks.
* Update documentation and training helper script to take into account supported branches for Taskcluster
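As a rough sketch of what this looks like, a project entry in fxci-config's `projects.yml` enumerates the branches that may start tasks. The field names and values below are illustrative assumptions, not a copy of the real entry:

```yaml
# Hypothetical projects.yml fragment: only the listed branch patterns
# are granted scopes to start tasks on github-push events.
translations:
  repo: https://github.com/mozilla/translations
  repo_type: git
  branches:
    - main
    - release*
    - dev*
```

Pushes to any branch not matching these patterns would simply schedule nothing, rather than failing with a scopes error.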
* Add an action to rebuild pipeline toolchains
* Rename action to rebuild-docker-images-and-toolchains and include docker-image and fetch tasks
* add documentation on how to manually rebuild cached tasks
---------
Co-authored-by: Ben Hearsum <ben@mozilla.com>
* Update default config
* Pre-download fastText model
* Add custom filters
* Add unit tests for config generation
* Make using custom filtering configs configurable
* Fix substitution
* Add the ability to run starting from a specific task (fixes #227)
A couple of example runs with this:
* https://firefox-ci-tc.services.mozilla.com/tasks/groups/YHAr0HzwSSe4pe5Yh9dIlg uses https://firefox-ci-tc.services.mozilla.com/tasks/groups/JjNp3KcyTUObUtOA9BgK5g as its `previous-group-id` with `start-stage: train-backwards` and `target-stage: train-teacher` - and ends up running `train-backwards`, `translate-mono-trg`, `collect-mono-trg`, and `train-teacher`.
* https://firefox-ci-tc.services.mozilla.com/tasks/groups/Sm0YV_8LQP-EOE8Nz6G5Lw uses the above group as its `previous-group-id` with `start-stage: train-teacher` and `target-stage: all`. Note that it ended up depending on tasks from both the above group and the one that it was based on, and ended up scheduling `train-teacher` and everything after it (I didn't bother letting them all run - I think the scheduling is enough to verify this).
Big thanks to @gabrielBusta for suggesting this implementation!
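Putting the pieces from the example runs together, the action input is presumably along these lines (the three key names come from the examples above; the surrounding structure is an assumption):

```yaml
# Hypothetical action input for resuming a pipeline run:
# reuse artifacts from an earlier task group and only schedule
# the stages from start-stage through target-stage.
previous-group-id: JjNp3KcyTUObUtOA9BgK5g
start-stage: train-backwards
target-stage: train-teacher
```

With `target-stage: all`, everything downstream of `start-stage` gets scheduled, with upstream dependencies resolved against the previous group(s) instead of being re-run.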
* Update poetry dependencies to pull in newer taskgraph version
* Adjust opus trainer settings
* Fix optimizer delay
* Use default learning rate
* Enable back translations
* Report learning rate for teacher
* Remove old link
* Match validation and save frequency
* Roll back learning rate
* Disable snakemake dry run
* Add a note about optimizer delay
* Add a link to opus trainer paper
* Integrated OpusTrainer in train.sh
* Added a dataset importer that can augment datasets for evaluation
* Removed the teacher fine-tuning step; pre-training and fine-tuning are now done in one step
* Removed the merge-augmented step
* Adjusted pipeline settings to work with a larger amount of data
* Modified the Snakemake pipeline accordingly, but didn't test it
* Updated browsermt Marian
* Added docs
* Added unit tests
* Initial integration of opus cleaner
* Support custom filters
* Use opus cleaner in pipeline
* Fix env
* Fix filter generation
* Add more rules
* Fix elrc filter
* Fix env
* Fix frequent patterns filter
* Switch to reading from stdin
* Add a feature flag for opus cleaner
* Fix condition
* Add extra test for non empty files
* Integrate with TC
* Run linter
* Fix step config
* Fix step config
* Fix step config
* Fix step config
* Fix command
* Fix path
* Update OpusCleaner
* Remove warning
* Log filtered length
* Add opuscleaner logs
* Add comments
* Fix using custom filters
* Extract function
* Change the CI target back
* Fix file path
* Replace conda with poetry
* Add doc
* Add more comments
* Rename example filter
* Test corpus
* Fix filter name
* Use OPUS dataset instead of mtdata
* Make CI faster
* Add sections to makefile
* Fix custom filter search
* Redirect stderr to stdout
* Fix usage of custom config
* Fix config name
* Change back to all