* Configure Taskcluster secret for W&B through taskgraph transform
* Direct secret usage
* Support recursive join on tags
* Do not use taskcluster filtering with --from-stream
* Enable verbose mode
* More debugging lines
* Redirect opustrainer stderr to stdout
* Log marian lines in verbose output
* Fix test
* Log raw lines & add prefix to our own logging
* Correct project name
* Convert Taskcluster trigger from transform to kind
* Fix lint
* Setup tracking on all training tasks
* Set WandB group & run name
* Skip publication on unit tests
* Make perplexity optional
* Update test fixture
* Add training parameter to control publication
* Move trigger control to python
* Use task config to get wandb names
* Bashism
* Use taskcluster group logic to build task name
* Run on WANDB_PUBLICATION=false, but do not publish
* Expose Weights & Biases tags
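
As an illustration of the `WANDB_PUBLICATION=false` behaviour listed above (the task still runs, but nothing is published), here is a minimal sketch. The helper name `wandb_publication_enabled`, the accepted false-like values, and the default of publishing when the variable is unset are all assumptions for illustration, not the code from this change.

```python
import os


def wandb_publication_enabled(environ=None):
    """Hypothetical helper: decide whether results should be published to W&B.

    Assumption: publication is on unless WANDB_PUBLICATION is explicitly
    set to a false-like value. The real default may differ.
    """
    env = os.environ if environ is None else environ
    value = env.get("WANDB_PUBLICATION", "true")
    return value.strip().lower() not in ("false", "0", "no")


def run_tracking(metrics, environ=None):
    """Always process the metrics, but only publish when enabled."""
    parsed = dict(metrics)  # the parsing step runs in both modes
    published = wandb_publication_enabled(environ)
    # A real implementation would call the W&B client here when published is True.
    return parsed, published
```

With this shape, parsing still runs in CI when publication is disabled, so parser regressions would be caught without writing to a W&B project.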
We're hitting some odd issues with caches that need to be worked out. For example:
error: cache /builds/worker/checkouts is not empty and is missing a .cacherequires file; the cache names for this task are likely mis-configured or TASKCLUSTER_CACHES is not set properly
(from https://firefox-ci-tc.services.mozilla.com/tasks/IvbeCQBuRuKIOaeOIGEfHg/runs/7)
I made this in #533 to ensure I got a full test run in that PR. However, there was no need for this to make it to main. Let's back this out to avoid changing cache digests there.
* Switch CPU tasks to generic-worker/d2g images (fixes #473)
This switches us from the deprecated docker-worker to generic-worker. generic-worker provides a translation layer for docker-worker tasks that avoids the need to change any payloads. (It will download specified images and run payload commands in them, rather than on the host machine.)
Upgrading to this new image will give us memory monitoring capabilities on the CPU workers because the new image has the GCP Ops Agent installed on it.
* Invalidate docker image cache to force rebuilds of all docker tasks on d2g workers
* Add eflomal based aligner
* Use new aligner for shortlist
* Remove old aligner
* Add Taskcluster steps for whitespace tokenized alignments
* Move file to a renamed directory
* Use Tags modifier in training
* Update tests for alignments and shortlist
* Add support of inline noise augmentation in data importer
* Do not use slow inline noise augmentation in devset on CI
* Remove the old alignments task
* Add a test for student alignments
* Fix alignments in training tests
* Return matplotlib module after merge
* Rename functions
* Add more comments in the code
* Remove compression env
* Relock poetry
* Compile marian server
* Add Marian server for testing
* Reformat
* Update utils/marian_client.py
Co-authored-by: Greg Tatum <gregtatum@users.noreply.github.com>
* Make port configurable
* Relock poetry
---------
Co-authored-by: Greg Tatum <gregtatum@users.noreply.github.com>
* Consistent evaluation tags parsing
* Add test
* Support backwards training task label
* Support evaluation task with suffix
* Support suffixes with form -1/2
---------
Co-authored-by: Evgeny Pavlov <epavlov@mozilla.com>
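
The suffix handling mentioned above (labels ending in `-1` or `-1/2`) can be sketched as follows. The regular expression, the function name `split_suffix`, and the example labels are assumptions for illustration; the actual tag parser may accept a different grammar.

```python
import re

# Assumption: a suffix is a trailing "-N" or "-N/M" where N and M are integers.
SUFFIX_RE = re.compile(r"-(\d+)(?:/(\d+))?$")


def split_suffix(label):
    """Split a task label into (base_label, suffix); suffix is None if absent."""
    match = SUFFIX_RE.search(label)
    if match is None:
        return label, None
    # Drop the leading "-" from the suffix itself.
    return label[: match.start()], label[match.start() + 1 :]
```

For example, under these assumptions `split_suffix("train-teacher-1/2")` yields the base label `train-teacher` and the suffix `1/2`, while a label with no numeric tail is returned unchanged.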
* Skip checking existing runs for new projects
Follow-up to #484.
Otherwise, a ValueError was raised by the wandb client.
* Suggestion
---------
Co-authored-by: Evgeny Pavlov <epavlov@mozilla.com>
* Install OpenBLAS to run Marian
* Fix Marian's version
* Support running training with run_task
* Add extra args for Marian
* Make train.sh compatible with CPU
* Remove redundant export
* Add tests for training
* Fix formatting
* Fetch cuda libs
* Document regex
* Compile marian to use on CPU for tests
* Fix formatting
* Fix comment
* Make the file names consistent
* Add tests for alignments
* Use toolchain in CI tests
* Compile toolchain locally under Docker
* Add commands to build and run docker locally
* Trigger CI
* Install missing libs
* Clarify exporting variables for ARM processors
* Add a workaround for poetry not seeing python
* Clarify the reason for not installing packages in docker image
* Drive-by: Fix the logging of parameters in the downloads
* Update utility to provide a default output path, and list all download modes
* Add a taskcluster downloader for models
* Allow group traversal from training tasks dependencies during Taskcluster task group publication
* Fix lint + Apply Valentin's suggestions
* Comment to clarify the usage of a shared set variable
* Do not publish empty groups
* Base taskcluster task group publication
* Move tag parser to utils module
* Support metrics
* Support multiple teacher training
* Fix parsing for evaluation folder
* Generic group logs parser
* Parse extra evaluation tasks and publish group_logs fake run
* Publish Marian config on runs
* Publish Marian config on runs instead of experiment config
* Rebase vrigal:publish-experiment-config
* Publish experiment config on group_logs
* Fix linting pythonpath for tracking
* Add pythonpath to the rest of the commands
* Remove pythonpath
* Update lockfile
* Fix wandb directory in tests
* Add the ability to run starting from a specific task (fixes #227)
A couple of example runs with this:
* https://firefox-ci-tc.services.mozilla.com/tasks/groups/YHAr0HzwSSe4pe5Yh9dIlg uses https://firefox-ci-tc.services.mozilla.com/tasks/groups/JjNp3KcyTUObUtOA9BgK5g as its `previous-group-id` with `start-stage: train-backwards` and `target-stage: train-teacher`, and ends up running `train-backwards`, `translate-mono-trg`, `collect-mono-trg`, and `train-teacher`.
* https://firefox-ci-tc.services.mozilla.com/tasks/groups/Sm0YV_8LQP-EOE8Nz6G5Lw uses the above group as its `previous-group-id` with `start-stage: train-teacher` and `target-stage: all`. Note that it ended up depending on tasks from both the above group and the one that it was based on, and ended up scheduling `train-teacher` and everything after it (I didn't bother letting them all run - I think the scheduling is enough to verify this).
Big thanks to @gabrielBusta for suggesting this implementation!
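
To make the selection in the first example run concrete, here is a rough sketch of one way to pick the tasks to schedule: keep every task that the target stage needs and that depends, directly or indirectly, on the start stage; everything else needed is reused from the previous group. The dependency graph below, including the `merge-corpus` stage, and the function name `tasks_to_run` are hypothetical, and this is not necessarily how the taskgraph change implements it.

```python
def tasks_to_run(dependencies, start_stage, target_stage):
    """Return the stages to schedule, given a map of stage -> stages it depends on."""

    def ancestors(stage):
        # The stage itself plus everything it transitively depends on.
        seen = set()
        stack = [stage]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            stack.extend(dependencies.get(current, []))
        return seen

    needed = ancestors(target_stage)
    # Keep only the stages that are at or downstream of the start stage.
    return {stage for stage in needed if start_stage in ancestors(stage)}


# Hypothetical, simplified dependency graph for illustration only.
EXAMPLE_DEPENDENCIES = {
    "merge-corpus": [],
    "train-backwards": ["merge-corpus"],
    "translate-mono-trg": ["train-backwards"],
    "collect-mono-trg": ["translate-mono-trg"],
    "train-teacher": ["collect-mono-trg", "merge-corpus"],
}
```

With this example graph, starting at `train-backwards` and targeting `train-teacher` selects the same four stages as the first run above, while `merge-corpus` would be taken from the group named by `previous-group-id`.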
* Update poetry dependencies to pull in newer taskgraph version
At the moment we end up with files that sit beside the output directory rather than in it. In CI, this means that we don't get the diffs uploaded as artifacts.