Commit Graph

151 Commits

Author SHA1 Message Date
Greg Tatum f1668c1a1c
Merge corpus rewrite to python (#851) 2024-10-17 13:54:31 -05:00
Greg Tatum a8883aaedc
Add environment variables to all artifacts and fetches (#852) 2024-09-24 11:14:06 -05:00
Ben Hearsum (he/him) be8bef73cd
feat: bump to taskgraph 11.1.0 (#858)
Most notably, this includes a change that allows us to have up to 10,000 dependencies per task, which will fix #653.
2024-09-24 09:06:39 -04:00
Greg Tatum 78b92326b2
Use the config.ci.yml for the training defaults (#856) 2024-09-23 13:59:01 -05:00
Greg Tatum 9d355d82fe
Rewrite train.sh to train.py (#842)
* Add a run_pipeline utility

* Add more tests for training

* Rewrite train.sh into train.py

* Add the pipeline to the PYTHONPATH

* Ensure that the W&B tracker throws errors in CI

* Add the Taskcluster environment variables so test-fast works on the train test

* Address review comments
2024-09-18 09:04:48 -05:00
Greg Tatum 34b9c1d76c
Add an HPLT data importer (#837)
* Add hplt test data

* Add a way to hook into when read_lines switches locations

* Provide a way to create a test fixture file with a list of strings

* Add a min-fluency-threshold to monolingual datasets

* Add an HPLT monolingual data importer

* Add support for HPLT and OPUS monolingual data in the config generator

* Unify the hash_line uses with a WeakStringSet implementation
2024-09-16 11:43:18 -05:00
Greg Tatum 98f8f1cff2
Splitter compression (#836)
* Split run_task commands on ampersand

* Remove the compression commands from splitter.py

* Add split corpus test
2024-09-09 13:31:02 -05:00
Ben Hearsum (he/him) 603bf18349
chore: update to taskgraph 11 (#834)
This includes https://github.com/taskcluster/taskgraph/pull/569, which is the upstreamed version of the fix for https://github.com/taskcluster/taskgraph/pull/569.
2024-09-06 09:56:06 -04:00
Ben Hearsum (he/him) fe815e126f
fix: don't run evaluate tasks on pretrained models (#781)
This is accomplished with a new transform used by the `evaluate` tasks, which avoids yielding any tasks whose `stage` matches a `pretrained-models` stage.
2024-09-04 19:08:20 -04:00
Greg Tatum 92c2b45263
Remove the compression command configuration, and only use zstandard (#824)
* Remove compression options - Update all kind.yml

* Remove compression options - Update .sh files

* Remove compression options - Update .py files

* Rename test_read_lines to test_common_downloads

* Add unit tests for compress_file and decompress_file

* Add requirements files for python code that uses requests
2024-09-04 14:40:57 -05:00
Gabriel Bustamante 68aa0a7377
Add an action to rebuild pipeline toolchains and docker images (#798)
* Add an action to rebuild pipeline toolchains

* Rename action to rebuild-docker-images-and-toolchains and include docker-image and fetch tasks

* add documentation on how to manually rebuild cached tasks

---------

Co-authored-by: Ben Hearsum <ben@mozilla.com>
2024-09-04 14:35:29 -04:00
Greg Tatum cbd844db4b
Rename all commands to use $TASK_WORKDIR/artifacts rather than an absolute path (#825) 2024-09-04 08:35:51 -05:00
Greg Tatum 3c54f4d6d8
Ensure we always export the VCS_PATH into the PYTHONPATH (#826) 2024-09-04 08:35:01 -05:00
Greg Tatum 06001d6f8c
Rewrite merge mono and add support for an OPUS monolingual importer (#787) 2024-08-30 10:35:02 -05:00
Evgeny Pavlov 15152efaaa Bump disk for cefilter (#807) 2024-08-29 13:03:01 -07:00
Gabriel Bustamante 87859b3a45 Change base image to trigger toolchain rebuilds (#797) 2024-08-29 13:03:01 -07:00
Gabriel Bustamante 9e82e005b9 Change base image to trigger toolchain rebuilds (#794) 2024-08-29 13:03:01 -07:00
Gabriel Bustamante 5effecd4bf Change base image to trigger toolchain rebuilds (#791) 2024-08-29 13:03:01 -07:00
Evgeny Pavlov 012a45c54b Bump disk for student (#786)
* Bump disk for student

* Fix alias
2024-08-29 13:03:01 -07:00
Ben Hearsum (he/him) 82efecebd4 add 2tb gpu workers (#782)
We're seeing out-of-disk issues in https://firefox-ci-tc.services.mozilla.com/tasks/DQeRyr1_TjmXhC0Z-5KWWw/runs/1, which is suspected to be an OpusTrainer bug. In the short term, we're going to work around this by adding 2tb workers. In the medium term we'll fix the root cause and remove these.
2024-08-29 13:03:01 -07:00
Evgeny Pavlov 3021955fcf Process alignments in chunks (#763)
* Process alignments in chunks

* Use chunking in tests

* Switch to a smaller machine

* Document chunk naming
2024-08-29 13:03:01 -07:00
Evgeny Pavlov 7f8e545a64 Use larger machine (#762) 2024-08-29 13:03:01 -07:00
evgeny pavlov 78408b8837 Optimize alignments (#703)
* Add fast Moses tokenizer

* Tokenize corpus and remap alignments

* Use moses tokenizer in taskcluster

* Add tests for index mapping

* Add packages to build fast moses tokenizer

* Fix an issue with LD_LIBRARY_PATH for fast moses tokenizer

* Rename tokenization function

* Rename chunking parameter

* Relock poetry

* Rerun linter
2024-08-29 13:03:01 -07:00
Valentin Rigal 9db196cc30
Publish experiments to W&B from the CI (#817)
* Publish experiments from the CI

* Disable cache for CI runs

* Revert "Disable cache for CI runs"

This reverts commit ca4593a39846a1a5cddf5ebf41a02fc698e23bea.
2024-08-29 09:13:43 -07:00
Ben Hearsum (he/him) f66a7b67fa
feat: add scaffolding and basic tests for taskgraph generation (#776)
This is prep work for https://github.com/mozilla/firefox-translations-training/issues/628, where I'd like to add some tests to avoid regressing that again in the future.

The fixtures here are based on similar tests from Gecko: https://searchfox.org/mozilla-central/source/taskcluster/test. There's a bit of a terrible hack to make optimized task graphs testable, described more in the comments.
2024-08-07 13:13:07 -04:00
Greg Tatum 30adda4c10
Simplify translate mono kind files (#770)
* Simplify translate-mono-src

* Simplify translate-mono-trg
2024-08-01 13:57:23 -05:00
Greg Tatum 8ba3952136
Add a check that there are visible GPUs (#722) 2024-07-29 17:00:12 -05:00
Ben Hearsum (he/him) 27f95c9849
fix: invalidate caches when fetch, docker, or toolchain tasks change (#761)
This was an early hack I put in to avoid, eg: retraining entire models because a comment changed in a Dockerfile. We still want to avoid this sort of thing, but these days we do so explicitly with `previous_group_ids` and other techniques in the training config.

Getting rid of this will improve our ability to catch issues in PRs. Eg: https://github.com/mozilla/firefox-translations-training/pull/738 had to be tricked into running things on a second iteration that only changed a Dockerfile.
2024-07-25 09:39:42 -04:00
Ben Hearsum (he/him) c01034ee78
chore: bump taskgraph to 9.2.0 (#738)
* chore: bump taskgraph to 10.0.1

This picks up some fixes that are expected to fix #680.

I'm picking up other dependency updates as well, most notably to redo (2.x -> 3.x). That major bump is just because it's dropping Python 2.x support, which doesn't affect us.

* fix: ensure mkdir /builds always succeeds in base docker image

This may be implicitly done because it is referenced in a `VOLUME`. See https://taskcluster-taskgraph.readthedocs.io/en/latest/reference/migrations.html#x-10-x.

* fix: don't try to decompress fetched python wheels or npz files
2024-07-24 19:25:02 -04:00
Evgeny Pavlov 6b6b64999e
Update ci config (#709) 2024-07-08 10:54:42 -07:00
Ben Hearsum (he/him) 8f1a068cdc
fix: shorten non-URL dataset names that are more than 50 characters (fixes #654) (#712)
* fix: shorten non-URL dataset names that are more than 50 characters

* fix: add a lengthy dataset to default parameters

This ensures that we hit the dataset shortening code in CI. Updating to a newer mtdata is required to have a dataset that is lengthy enough.
2024-07-08 13:18:58 -04:00
Bastien Abadie 5d35e4a30c
Expose Taskcluster task owner as author for Weights & Biases publication (#704)
* Expose Taskcluster task owner as author for Weights & Biases publication

* Publish author tag
2024-07-01 10:07:26 -07:00
Gabriel Bustamante 44d81c1c07
Configure generic-worker (d2g) pools for translation alignment tasks (#702) 2024-06-26 10:46:59 -05:00
Evgeny Pavlov ce17c62d84 Fix task expiration issue (#692) 2024-06-25 15:38:23 -05:00
Gabriel Bustamante a64a30cad8 Override the existing_tasks explicitly provided in the action's input (#683)
* Override the existing_tasks explicitly provided in the action's input

* Log the labels and task ids of any existing tasks that are overridden

---------

Co-authored-by: Evgeny Pavlov <epavlov@mozilla.com>
2024-06-25 15:38:23 -05:00
Evgeny Pavlov 524378b80e Bump memory for shortlist (#685) 2024-06-25 15:38:23 -05:00
Evgeny Pavlov 9a467171d6 Make mono shuffling deterministic to fix caching issues (#677)
* Make shuffling deterministic

* Retrigger CI
2024-06-25 15:38:23 -05:00
Ben Hearsum (he/him) 78341eb398 fix: Don't abort a training task entirely if a 4xx error is encountered when fetching an artifact from a previous run (#673)
* fix: remove unused 'reasons_created' in train taskcluster tests

* fix: Don't abort a training task entirely if a 4xx error is encountered when fetching an artifact from a previous run
2024-06-25 15:38:23 -05:00
Evgeny Pavlov 0da64ba7cf Bump disk for scoring (#668) 2024-06-25 15:38:23 -05:00
Ben Hearsum (he/him) 2b9e53e0c5
chore: upgrade to Taskgraph 9 (#665)
This is primarily to pick up https://github.com/taskcluster/taskgraph/pull/514, which will be needed for #466.
2024-06-19 10:15:13 -04:00
Ben Hearsum (he/him) 9ad9abaacf
Refactor b-cpu-xlargedisk worker pools to allow for experimentation with different configurations (#686)
Same as #674, except for main. At the moment the pipeline doesn't run there because there's no `b-linux-large-gcp-1tb` workers anymore.
2024-06-19 09:40:16 -04:00
Ben Hearsum (he/him) a5dd406ff4
Bump taskcluster-taskgraph to 8.2.0 in poetry (#672)
* Bump taskcluster-taskgraph to 8.2.0 in poetry

* fix: run tests when taskcluster configs changes

Some of these files influence test outcomes
2024-06-19 09:11:46 -04:00
Ben Hearsum (he/him) cb6c76f02c
Upgrade to Taskgraph 8 (#664)
* chore: Upgrade to taskgraph 8

* Adjust all-pr kind to use all-pr in task names explicitly

Prior to taskgraph 8, taskgraph was doing this itself based on the `kind`.

* Use optimization to implement skip_unless_pipeline_changed

This is in line with what upstream Taskgraph does, and has the added benefit of still generating task definitions in earlier phases even if they will be removed during optimization (makes for easier diffing/testing).
2024-06-11 12:43:12 -04:00
evgeny pavlov fac7511f0f
Merge branch 'release' 2024-05-27 11:09:54 -07:00
Evgeny Pavlov d41f07a561
Revert CPU generic workers (#629)
* Revert CPU generic workers

* Trigger CI
2024-05-27 09:44:15 -07:00
Greg Tatum 56040c94b9
Automatically generate training config files with the `task config-generator` (#620)
* Create a util to automatically generate configs

* Add the generated configs

* Update the config generation script

* Update the configs

* Update the configs

* Address review comments for the config generator

* Fix find_corpus test
2024-05-24 16:09:05 -05:00
Ben Hearsum (he/him) 207a77379b
Retry tasks on exit code 128 to work around intermittent issues with DNS failures (#624)
This should help make #549 less painful. I suggest we back it out once we get to the bottom of that.
2024-05-23 16:08:06 -04:00
Valentin Rigal 8a1d8ef2c3
Publish evaluation metrics (#598)
* Configure evaluation tasks

* Extract w&b code into module

* Do not check taskcluster when publication is disabled

* Publish evaluation metrics to W&B

* Fix running eval tracking on CI

* Use args.wandb_run_name instead of default teacher

* Remove duplicated arguments

* Retrieve dataset from Taskcluster directly

* Add missing calls to publisher and logging

* Allow publishing metrics as a table on existing runs (i.e. previous trainings)

* Update regex to parse labels ending with '-1'

* Generic support for train/eval different naming

* Update tests

* Support disabled publication

---------

Co-authored-by: Bastien Abadie <bastien@nextcairn.com>
Co-authored-by: Bastien Abadie <abadie@teklia.com>
Co-authored-by: Evgeny Pavlov <epavlov@mozilla.com>
2024-05-22 11:35:29 -07:00
Evgeny Pavlov 6411267b3c
Upgrade to Bicleaner 3 (#604)
* Update bicleaner

* Lock requirements

* Use larger multilingual models

* Fix requirements

* Fix typo

* Add libs for hunspell

* Fix model downloading

* Fix model downloading

* Use a toolchain for cyhunspell

* Remove a hard github dependency

* Fix test

* Add COMET to the evaluation (#587)

* Custom cleaning (#547)

* Update default config

* Pre-download fast text model

* Add custom filters

* Add unit tests for config generation

* Make using custom filtering configs configurable

* Fix substitution

* Parse tasks with label finetune-student (#609)

* Add test case

* Update regex

* Add support for automatically continuing training from earlier runs of a Task (fixes #270) (#580)

* Add support for automatically continuing training from earlier runs of a Task.

* Automatically retry training tasks on exit code 17

This is the exit code when train_taskcluster.py fails to download an artifact when attempting to resume.

* Support news-crawl importer (#608)

* Ensure all of the task labels can be parsed in the task graph (#612)

* Show the stack trace when there is an error in tracking

* Rename parse_tag to parse_task_label

* Document the regexes

* Turn the parsed label into a NamedTuple for better typing hints

* Get the tag parsing working with the full task graph

* Allow for 3 letter language codes

* Temporarily disable an evaluate task that is failing

* Update code docs a bit

* Fix tests for finetune-student

* Use shorter names for URLs (#611)

* Change the caching strategy for teachers (#606)

* Fix comment

---------

Co-authored-by: Greg Tatum <gregtatum@users.noreply.github.com>
Co-authored-by: Valentin Rigal <rigal@teklia.com>
Co-authored-by: Ben Hearsum (he/him) <ben@mozilla.com>
2024-05-22 10:49:28 -07:00
Greg Tatum f5a6af8384
Change the caching strategy for teachers (#606) 2024-05-20 23:42:44 -05:00