Commit Graph

171 Commits

Author SHA1 Message Date
Evgeny Pavlov 36e56b7bdb
Fix pretraining (#485) 2024-03-20 14:45:46 -07:00
Evgeny Pavlov a359723e41
Fix compatibility (#480) 2024-03-20 14:28:02 -07:00
Evgeny Pavlov 1f12387866
Fix downloading evals (#479) 2024-03-20 14:16:10 -07:00
Evgeny Pavlov a4d25cb760
Run tests with toolchain binaries under Docker (#478)
* Add tests for alignments

* Use toolchain in CI tests

* Compile toolchain locally under Docker

* Add commands to build and run docker locally

* Trigger CI

* Install missing libs

* Clarify exporting variables for ARM processors

* Add a workaround for poetry not seeing python

* Clarify the reason for not installing packages in docker image
2024-03-20 12:57:42 -07:00
Valentin Rigal fb9531f0b5
Avoid duplicated runs during W&B publication (#484)
Co-authored-by: Evgeny Pavlov <epavlov@mozilla.com>
2024-03-19 12:37:42 -07:00
Valentin Rigal 8646fc6b02
Avoid duplicated backward runs publication (#483) 2024-03-19 12:32:31 -07:00
Valentin Rigal 97f7d00fbd
Support failures retrieving task artifacts (#482) 2024-03-18 14:27:20 -07:00
Greg Tatum 89c24d892f
Add a taskcluster downloader for models (#475)
* Drive-by: Fix the logging of parameters in the downloads

* Update utility to provide a default output path, and list all download modes

* Add a taskcluster downloader for models
2024-03-11 17:06:54 -05:00
Greg Tatum 65ca580a16
Add support for custom corpora through remote URLs (#420) 2024-03-06 13:03:40 -06:00
Ben Hearsum (he/him) a17ed9db12
Add non-spot 1tb CPU worker option (#471) 2024-03-05 13:28:48 -05:00
Ben Hearsum (he/him) e227bf8292
add 1tb cpu only workers (#470) 2024-03-04 18:44:06 -05:00
Valentin Rigal 8012cd30cf
Update parser documentation (#462)
Co-authored-by: Evgeny Pavlov <epavlov@mozilla.com>
2024-02-29 14:57:22 -08:00
EvaBardou 62a74a9241
Allow group traversal from training tasks dependencies during Taskcluster task group publication (#461)
* Allow group traversal from training tasks dependencies during Taskcluster task group publication

* Fix lint + Apply Valentin's suggestions

* Comment to clarify the usage of a shared set variable

* Do not publish empty groups
2024-02-29 09:04:17 -08:00
Ben Hearsum (he/him) ed41ca2cc1
Set TERM for docker worker images (#464) 2024-02-27 21:30:09 -05:00
Greg Tatum 19e46e5120
Add shufflers as utilities (#467) 2024-02-27 16:11:27 -06:00
Valentin Rigal 3706913c88
Link metrics from labels in addition to TC dependencies (#465)
Co-authored-by: Evgeny Pavlov <epavlov@mozilla.com>
2024-02-26 14:58:53 -08:00
Ben Hearsum (he/him) 9ae9d9ad5f
Use spot instances for PRs (#457) 2024-02-22 19:20:01 -05:00
Valentin Rigal 3f135aa115
Taskcluster task group publication (#406)
* Base taskcluster task group publication

* Move tag parser to utils module

* Support metrics

* Support multiple teacher training

* Fix parsing for evaluation folder

* Generic group logs parser

* Parse extra evaluation tasks and publish group_logs fake run

* Publish Marian config on runs

* Publish marian config on runs instead of experiment config

* Rebase vrigal:publish-experiment-config

* Publish experiment config on group_logs
2024-02-16 09:05:01 -08:00
Evgeny Pavlov 58cce071ef
Support typos and noise modifiers (#428)
* Update opustrainer

* Adjust configs

* Add evaluation modifiers

* Reduce noise

* Add tests for typos and noise

* Fix typos augmenter

* Fix linting issues

* Update docs

* Update opustrainer

* Adjust configs

* Add evaluation modifiers

* Reduce noise

* Add tests for typos and noise

* Fix typos augmenter

* Fix linting issues

* Update docs

* Fix test

* Update opus trainer

* Remove noise parameters from config

* Update opustrainer with fixes

* Run linter

* Fix tests after merge

* Disable noise for student

* Update lockfile

* Fix formatting

* Disable typos for student

* Rename assert functions

* Switch back to faster validation

* Document decision on using augmentations

* Fix typo
2024-02-15 15:33:24 -08:00
Evgeny Pavlov 092fd98deb
Fix random seed (#445)
* Use different random seeds for the teachers

* Fix substitution

* Pass random seed to Marian
2024-02-15 12:49:33 -08:00
Evgeny Pavlov e8ac81d9b8
Work around fast text model downloading failures (#435)
* Decrease max run time

* Add retry

* Remove todo

* Fix indentation
2024-02-15 10:37:08 -08:00
Evgeny Pavlov 190358a923
Fix linting for tracking (#441)
* Fix linting pythonpath for tracking

* Add pythonpath to the rest of the commands

* Remove pythonpath

* Update lockfile

* Fix wandb directory in tests
2024-02-15 09:21:33 -08:00
Ben Hearsum (he/him) 34c4e01bd6
Enable 'train' action for PRs against a supported repository (#447)
* Enable 'train' action for PRs against a supported repository

* Fix scope repo url for actions
2024-02-15 11:13:22 -05:00
Ben Hearsum (he/him) e6ec0d5474
Add support for triggering actions from PR decision tasks. (#442) 2024-02-14 19:09:54 -05:00
Ben Hearsum (he/him) 70fede467f
Add the ability to run starting from a specific task (fixes #227) (#377)
* Add the ability to run starting from a specific task (fixes #227)

A couple of example runs with this:
* https://firefox-ci-tc.services.mozilla.com/tasks/groups/YHAr0HzwSSe4pe5Yh9dIlg uses https://firefox-ci-tc.services.mozilla.com/tasks/groups/JjNp3KcyTUObUtOA9BgK5g as its `previous-group-id` with `start-stage: train-backwards` and `target-stage: train-teacher` - and ends up running `train-backwards`, `translate-mono-trg`, `collect-mono-trg`, and `train-teacher`.
* https://firefox-ci-tc.services.mozilla.com/tasks/groups/Sm0YV_8LQP-EOE8Nz6G5Lw uses the above group as its `previous-group-id` with `start-stage: train-teacher` and `target-stage: all`. Note that it ended up depending on tasks from both the above group and the one that it was based on, and ended up scheduling `train-teacher` and everything after it (I didn't bother letting them all run - I think the scheduling is enough to verify this).

Big thanks to @gabrielBusta for suggesting this implementation!

* Update poetry dependencies to pull in newer taskgraph version
2024-02-14 09:07:07 -05:00
Ben Hearsum (he/him) 4a5fc1f8c7
fix: set --output-file correctly in taskgraph-diff (#413)
At the moment we end up with files that sit beside the output directory rather than in it. In CI, this means that we don't get the diffs uploaded as artifacts.
2024-02-13 13:19:51 -05:00
Greg Tatum ffa6d77902
Add a structured logging script (#437)
* Add a structured logger to bicleaner

* Adjust the pythonpath for the download_pack.py script
2024-02-12 16:08:02 -06:00
Valentin Rigal 9ebfd13903
Publish YAML configuration to group_logs run (#386)
Co-authored-by: Evgeny Pavlov <epavlov@mozilla.com>
2024-02-12 09:16:36 -08:00
Greg Tatum 90d3279714
Check for the file existence rather than status code (#433) 2024-02-12 11:05:28 -06:00
Greg Tatum e0e580b116
Add newscrawl tests, and update some testing infrastructure (#434)
* Use zst in the data importer test

* Add a test for news crawl importer

* Add a tree printing method

* Drive-by: Output stderr in run_task

* Temporarily disable newscrawl test

* Merge the two tests
2024-02-12 11:04:43 -06:00
Evgeny Pavlov 31311927ef
Move snakemake to a separate folder (#431)
* Move snakemake code to a separate folder

* Small fixes

* Run linter

* Revert formatting

* Fix readme
2024-02-09 09:46:52 -08:00
Evgeny Pavlov afad4f4cad
Tune sentencepiece alphas (#421)
* Increase sp alpha and move to configs

* Add docs

* Update docs/training-guide.md

Co-authored-by: Greg Tatum <gregtatum@users.noreply.github.com>

* Update docs/training-guide.md

Co-authored-by: Greg Tatum <gregtatum@users.noreply.github.com>

* Update docs/training-guide.md

Co-authored-by: Greg Tatum <gregtatum@users.noreply.github.com>

---------

Co-authored-by: Greg Tatum <gregtatum@users.noreply.github.com>
2024-02-06 12:23:18 -08:00
Ben Hearsum (he/him) 74a5a1751c
fix: adjust dependencies, fetches, and cache digests when using the `use` training continuation mode (#418)
Dependencies and fetches are unused, so they should be removed.

Cache digests should _only_ be influenced by the pretrained teacher parameters (file resources and other parameters are unused - so they do not influence the outcome of the task).

Co-authored-by: Evgeny Pavlov <epavlov@mozilla.com>
2024-02-05 15:45:59 -08:00
Greg Tatum 4a33069423
Rewrite eval.sh to python and add a json output for evaluation metrics (#401)
* Convert the evaluation script to python

* Remove the old eval scripts

* Simplify evaluate-teacher-ensemble

* Simplify evaluate-quantized

* Simplify taskcluster/kinds/evaluate/kind.yml

* Add assertions for the json evaluation metric

* Use the sacrebleu python api
2024-02-05 15:17:58 -06:00
Evgeny Pavlov d14881ea71
Download taskcluster live logs for training (#416)
* Download taskcluster live logs for training

* Refactor code
2024-02-01 15:28:45 -08:00
Greg Tatum 3d81ca5d11
Add wget to mock out downloads for importers (#410)
* Use run_task for dataset importing

* Add a wget mock for downloads
2024-01-30 16:22:13 -06:00
Greg Tatum cb4231e650
Use a run_task abstraction for eval-tests.py (#393)
* Change ca to ru in the test assertions

* Use a test defined config file

This fixes an issue where if you modify the tc.prod.yml it will break
the taskcluster tests. It will also allow for tests to share the same
config between runs and not have a dirty artifacts folder.

* Add a run_task test utility that exercises the full taskgraph

* Parameterize the eval test
2024-01-30 09:52:32 -06:00
Amit Moryossef 225b30df9a
fix(snakefile): correctly locate files to translate (#402) 2024-01-29 11:41:35 -06:00
Ben Hearsum (he/him) 4db5fd73c0
Make cleaning_type in find_upstreams only required when the cleaning-type attribute is set (#385) 2024-01-29 12:28:46 -05:00
Ben Hearsum (he/him) 437ceac078
Add documentation on how to monitor CPU, GPU, etc. on training instances (#398) 2024-01-29 11:35:22 -05:00
Amit Moryossef 6df9a73d61
fix(snakemake): remove unused packs (#404)
* fix(snakemake): remove unused packs

* Update Snakefile
2024-01-29 10:30:15 -06:00
Greg Tatum f4ded7d07f
Add huggingface to the find_corpus (#397) 2024-01-26 15:01:19 -06:00
Gabriel Bustamante 26752d0e28
Use standard compute engine instances for training (#376)
* Disable spot instances for training
* Add worker-configuration for standard VMs
2024-01-26 14:35:56 -06:00
Gabriel Bustamante 4218387361
[skip ci] add docs on `pretrained-models` configuration parameter (#349) 2024-01-25 13:52:07 -08:00
Evgeny Pavlov 96b695d4c4
Fix Bicleaner-ai GPU usage (#392)
* Fix bicleaner processes

* Use cuda 11 for bicleaner
2024-01-25 13:15:17 -08:00
Greg Tatum bbe7f6ec3f
Add tests for evaluation (#364) 2024-01-25 07:36:58 -06:00
Greg Tatum 7f43bd0c7d
Point to the docs for marian args (#381) 2024-01-25 07:31:58 -06:00
Greg Tatum 7ccb5eba7d
Add some docs to what dedupe is (#379) 2024-01-25 07:21:10 -06:00
Evgeny Pavlov 99f2397ebf
Always use Bicleaner AI (#367)
* Use only bicleaner ai

* Remove test command

* Disable hard rules for multilingual model

* Change taskcluster kinds

* Remove bicleaner

* Fix bicleaner model step

* remove bicleaner

* Fix find upstream

* Add toolchain

* Fix arg type

* Don't delete tmp dir

* Fix artefacts

* Fix artifacts

* Fix linter issue

* Fix path

* Rename pack dir

* Add tests

* Fix typo

* Replace rename to move

* Bump max run time

* Remove expiration

* Fix docs and clarify caching strategy

* Fix doc

* Revert order

* Small fixes

* Fix typo

* Use data dir fixture

* Fix comment

* Remove unused item
2024-01-24 11:46:44 -08:00
Valentin Rigal d5f4291f16
Parse missing evaluation results (#374)
* Publish metrics at the end

* Add missing steps

* Publish all metrics to a group table

* Support more metrics formats

* Update tests

* Publish runs for extra metrics

* Prefix metrics with group

* Improve metrics publication

* Remove unused Metric.model_name

* Update tests
2024-01-23 11:19:09 -08:00