* chore: bump taskgraph to 10.0.1
This picks up some fixes that are expected to fix #680.
I'm picking up other dependency updates as well, most notably to redo (2.x -> 3.x). That major bump is just because it's dropping Python 2.x support, which doesn't affect us.
* fix: ensure mkdir /builds always succeeds in base docker image
The directory may already be created implicitly because it is referenced in a `VOLUME`. See https://taskcluster-taskgraph.readthedocs.io/en/latest/reference/migrations.html#x-10-x.
* fix: don't try to decompress fetched python wheels or npz files
* Create a util to automatically generate configs
* Add the generated configs
* Update the config generation script
* Update the configs
* Update the configs
* Address review comments for the config generator
* Fix find_corpus test
* Add missing mtdata dependency
* Remove pipeline python dependencies from pyproject.toml
* Use the requirements.txt file for run_task
* Add venv support to the CI Dockerfile for the testing image
* Add timing information to the taskgraph generation and a flag to disable the generation
* Add eflomal based aligner
* Use new aligner for shortlist
* Remove old aligner
* Add Taskcluster steps for whitespace tokenized alignments
* Move file to a renamed directory
* Use Tags modifier in training
* Update tests for alignments and shortlist
* Add support of inline noise augmentation in data importer
* Do not use slow inline noise augmentation in devset on CI
* Remove the old alignments task
* Add a test for student alignments
* Fix alignments in training tests
* Return matplotlib module after merge
* Rename functions
* Add more comments in the code
* Remove compression env
* Relock poetry
* Compile marian server
* Add Marian server for testing
* Reformat
* Update utils/marian_client.py
Co-authored-by: Greg Tatum <gregtatum@users.noreply.github.com>
* Make port configurable
* Relock poetry
---------
Co-authored-by: Greg Tatum <gregtatum@users.noreply.github.com>
* Base taskcluster task group publication
* Move tag parser to utils module
* Support metrics
* Support multiple teacher training
* Fix parsing for evaluation folder
* Generic group logs parser
* Parse extra evaluation tasks and publish group_logs fake run
* Publish Marian config on runs
* Publish marian config on runs instead of experiment config
* Rebase vrigal:publish-experiment-config
* Publish experiment config on group_logs
* Fix linting pythonpath for tracking
* Add pythonpath to the rest of the commands
* Remove pythonpath
* Update lockfile
* Fix wandb directory in tests
* Add the ability to run starting from a specific task (fixes #227)
A couple of example runs with this:
* https://firefox-ci-tc.services.mozilla.com/tasks/groups/YHAr0HzwSSe4pe5Yh9dIlg uses https://firefox-ci-tc.services.mozilla.com/tasks/groups/JjNp3KcyTUObUtOA9BgK5g as its `previous-group-id` with `start-stage: train-backwards` and `target-stage: train-teacher` - and ends up running `train-backwards`, `translate-mono-trg`, `collect-mono-trg`, and `train-teacher`.
* https://firefox-ci-tc.services.mozilla.com/tasks/groups/Sm0YV_8LQP-EOE8Nz6G5Lw uses the above group as its `previous-group-id` with `start-stage: train-teacher` and `target-stage: all`. Note that it ended up depending on tasks from both the above group and the one that it was based on, and ended up scheduling `train-teacher` and everything after it (I didn't bother letting them all run - I think the scheduling is enough to verify this).
Big thanks to @gabrielBusta for suggesting this implementation!
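A minimal sketch of the stage selection described in the example runs above. It is not the actual implementation: the dependency graph, the `clean-corpus` and `evaluate-teacher` stage names, and the helper functions are hypothetical. The idea is that a stage is rerun only if the target stage needs it and it depends on the start stage. Everything else the target needs would be reused from the `previous-group-id` group.

```python
from typing import Dict, List, Set

# Hypothetical, simplified graph: stage -> stages it depends on.
# The real pipeline graph is larger and is not reproduced here.
DEPENDENCIES: Dict[str, List[str]] = {
    "clean-corpus": [],
    "train-backwards": ["clean-corpus"],
    "translate-mono-trg": ["train-backwards"],
    "collect-mono-trg": ["translate-mono-trg"],
    "train-teacher": ["clean-corpus", "collect-mono-trg"],
    "evaluate-teacher": ["train-teacher"],
}


def ancestors(stage: str, deps: Dict[str, List[str]]) -> Set[str]:
    """Return `stage` plus everything it transitively depends on."""
    seen: Set[str] = set()
    stack = [stage]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(deps[current])
    return seen


def stages_to_run(
    start_stage: str, target_stage: str, deps: Dict[str, List[str]]
) -> List[str]:
    """Stages needed by the target that also depend on the start stage."""
    needed = ancestors(target_stage, deps)
    return [
        stage
        for stage in deps  # dict order keeps the output readable
        if stage in needed and start_stage in ancestors(stage, deps)
    ]


selected = stages_to_run("train-backwards", "train-teacher", DEPENDENCIES)
print(selected)
```

With this toy graph the selection matches the first example run: `clean-corpus` is needed by `train-teacher` but does not depend on `train-backwards`, so it is not rescheduled.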
* Update poetry dependencies to pull in newer taskgraph version
* Always split corpus to a fixed number of parts
* Fix splitting
* Rewrite corpus splitting in Python
* Replace in taskcluster
* Add tests
* Unify compression tool with Taskcluster
* Move zstd installation to docker image
* Disable opuscleaner in CI
* Compress chunks
* Fix file names
* Remove zeros from file index
* Start file index with 1
* Fix corpus splitting
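A rough sketch of what splitting a corpus into a fixed number of parts can look like, assuming an in-memory list of lines. The `split_corpus` function and the `file.N` naming are hypothetical illustrations, not the script from this change, and compression is left out. The file index starts at 1 and is not zero-padded.

```python
import math
from typing import Dict, List


def split_corpus(
    lines: List[str], parts: int, prefix: str = "file"
) -> Dict[str, List[str]]:
    """Split `lines` into exactly `parts` chunks named prefix.1 .. prefix.N.

    Trailing chunks may be shorter (or empty) when the corpus is small,
    so the number of parts is always the same.
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")
    chunk_size = math.ceil(len(lines) / parts)
    chunks: Dict[str, List[str]] = {}
    for index in range(1, parts + 1):  # index starts at 1, no zero padding
        start = (index - 1) * chunk_size
        chunks[f"{prefix}.{index}"] = lines[start : start + chunk_size]
    return chunks


example = split_corpus([f"sentence {i}" for i in range(10)], parts=3)
print({name: len(chunk) for name, chunk in example.items()})
```

Producing a fixed number of parts, even when some are empty, keeps the set of output file names predictable for downstream tasks.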
* Add a link to an issue
* Generate script description from doc
* Use new test dir
* Use new test dir
* Test command line args
* Clarify expected files
* Add logging
* Add pytest-clarity for better text diffs in tests
* Add requests_mock for tests
* Add the test_data artifact to the .gitignore
* Use an underscore with find_corpus.py
* Update the find corpus tool to provide more information
* Add humanize to the dependency list
integrated OpusTrainer in train.sh
added dataset importer that can augment datasets for evaluation
removed teacher fine-tuning step. The pre-training and fine-tuning are now done in one step
removed merge-augmented step
adjusted pipeline settings to work with a higher amount of data
modified the Snakemake pipeline accordingly, but this is untested
updated browsermt marian
added docs
added unit tests
* Initial integration of opus cleaner
* Support custom filters
* Use opus cleaner in pipeline
* Fix env
* Fix filter generation
* Add more rules
* Fix elrc filter
* Fix env
* Fix frequent patterns filter
* Switch to reading from stdin
* Add a feature flag for opus cleaner
* Fix condition
* Add extra test for non empty files
* Integrate with TC
* Run linter
* Fix step config
* Fix step config
* Fix step config
* Fix step config
* Fix command
* Fix path
* Update OpusCleaner
* Remove warning
* Log filtered length
* Add opuscleaner logs
* Add comments
* Fix using custom filters
* Extract function
* Change the CI target back
* Fix file path
* Replace conda with poetry
* Add doc
* Add more comments
* Rename example filter
* Test corpus
* Fix filter name
* Use opus dataset instead of mtdata
* Make CI faster
* Add sections to makefile
* Fix custom filter search
* Redirect stderr to stdout
* Fix usage of custom config
* Fix config name
* Change back to all
* Add snakemake test run to CI
* Add toolchain
* Add docker image
* Reduce datasets
* Move ci to a separate config
* Add utils to poetry
* Fix config
* Fix config
* Disable docker
* Use test docker image
* Fix artifacts dir
* Fix tests
* Fix profile setting
* Fix root dir
* Faster translation
* Expose artifacts
* Change default TC config
* Fix default TC config
* Disable snakemake run
* Enable running on PR
* Fix ci config
* Add vocab size argument
* Retrigger CI
* Add a comment on snakemake run
* Use a smaller teacher model for CI
* Try to retrigger downloading
* Use the same year for mono src and trg
* Revert changes [skip ci]
* Revert test config [skip ci]
* Fix comment [skip ci]
* Add Python's black formatter
* Apply black formatting
* Add the ruff linter
* Run make lint-fix
* Suppress or fix lint issues
* Add a fix-all make command