firefox-translations-training

Commit graph

Author	SHA1	Message	Date
Greg Tatum	f1668c1a1c	Merge corpus rewrite to python (#851 )	2024-10-17 13:54:31 -05:00
Greg Tatum	a8883aaedc	Add environment variables to all artifacts and fetches (#852 )	2024-09-24 11:14:06 -05:00
Ben Hearsum (he/him)	be8bef73cd	feat: bump to taskgraph 11.1.0 (#858 ) Most notably, this includes a change that allows us to have up to 10,000 dependencies per task, which will fix #653.	2024-09-24 09:06:39 -04:00
Greg Tatum	78b92326b2	Use the config.ci.yml for the training defaults (#856 )	2024-09-23 13:59:01 -05:00
Greg Tatum	9d355d82fe	Rewrite train.sh to train.py (#842 ) * Add a run_pipeline utility * Add more tests for training * Rewrite train.sh into train.py * Add the pipeline to the PYTHONPATH * Ensure that the W&B tracker throws errors in CI * Add the Taskcluster environment variables so test-fast works on the train test * Address review comments	2024-09-18 09:04:48 -05:00
Greg Tatum	34b9c1d76c	Add an HPLT data importer (#837 ) * Add hplt test data * Add a way to hook into when read_lines switches locations * Provide a way to create a test fixture file with a list of strings * Add a min-fluency-threshold to monolingual datasets * Add an HPLT monolingual data importer * Add support for HPLT and OPUS monolingual data in the config generator * Unify the hash_line uses with a WeakStringSet implementation	2024-09-16 11:43:18 -05:00
Greg Tatum	98f8f1cff2	Splitter compression (#836 ) * Split run_task commands on ampersand * Remove the compression commands from splitter.py * Add split corpus test	2024-09-09 13:31:02 -05:00
Ben Hearsum (he/him)	603bf18349	chore: update to taskgraph 11 (#834 ) This includes https://github.com/taskcluster/taskgraph/pull/569, which is the upstreamed version of the fix for https://github.com/taskcluster/taskgraph/pull/569.	2024-09-06 09:56:06 -04:00
Ben Hearsum (he/him)	fe815e126f	fix: don't run evaluate tasks on pretrained models (#781 ) This is accomplished with a new transform used by the `evaluate` tasks, which avoids yielding any tasks whose `stage` matches a `pretrained-models` stage.	2024-09-04 19:08:20 -04:00
Greg Tatum	92c2b45263	Remove the compression command configuration, and only use zstandard (#824 ) * Remove compression options - Update all kind.yml * Remove compression options - Update .sh files * Remove compression options - Update .py files * Rename test_read_lines to test_common_downloads * Add unit tests for compress_file and decompress_file * Add requirements files for python code that uses requests	2024-09-04 14:40:57 -05:00
Gabriel Bustamante	68aa0a7377	Add an action to rebuild pipeline toolchains and docker images (#798 ) * Add an action to rebuild pipeline toolchains * Rename action to rebuild-docker-images-and-toolchains and include docker-image and fetch tasks * add documentation on how to manually rebuild cached tasks --------- Co-authored-by: Ben Hearsum <ben@mozilla.com>	2024-09-04 14:35:29 -04:00
Greg Tatum	cbd844db4b	Rename all commands to use $TASK_WORKDIR/artifacts rather than an absolute path (#825 )	2024-09-04 08:35:51 -05:00
Greg Tatum	3c54f4d6d8	Ensure we always export the VCS_PATH into the PYTHONPATH (#826 )	2024-09-04 08:35:01 -05:00
Greg Tatum	06001d6f8c	Rewrite merge mono and add support for an OPUS monolingual importer (#787 )	2024-08-30 10:35:02 -05:00
Evgeny Pavlov	15152efaaa	Bump disk for cefilter (#807 )	2024-08-29 13:03:01 -07:00
Gabriel Bustamante	87859b3a45	Change base image to trigger toolchain rebuilds (#797 )	2024-08-29 13:03:01 -07:00
Gabriel Bustamante	9e82e005b9	Change base image to trigger toolchain rebuilds (#794 )	2024-08-29 13:03:01 -07:00
Gabriel Bustamante	5effecd4bf	Change base image to trigger toolchain rebuilds (#791 )	2024-08-29 13:03:01 -07:00
Evgeny Pavlov	012a45c54b	Bump disk for student (#786 ) * Bump disk for student * Fix alias	2024-08-29 13:03:01 -07:00
Ben Hearsum (he/him)	82efecebd4	add 2tb gpu workers (#782 ) We're seeing out of disk issues in https://firefox-ci-tc.services.mozilla.com/tasks/DQeRyr1_TjmXhC0Z-5KWWw/runs/1, which is suspected to be an OpusTrainer bug. In the short term, we're going to workaround this by adding 2tb workers. In the medium term we'll fix the root cause and remove these.	2024-08-29 13:03:01 -07:00
Evgeny Pavlov	3021955fcf	Process alignments in chunks (#763 ) * Process alignments in chunks * Use chunking in tests * Switch to a smaller machine * Document chunk naming	2024-08-29 13:03:01 -07:00
Evgeny Pavlov	7f8e545a64	Use larger machine (#762 )	2024-08-29 13:03:01 -07:00
evgeny pavlov	78408b8837	Optimize alignments (#703 ) * Add fast Moses tokenizer * Tokenize corpus and remap alignments * Use moses tokenizer in taskcluster * Add tests for index mapping * Add packaged to build fast moses tokenizer * Fix an issue with LD_LIBRARY_PATH for fast moses tokenizer * Rename tokenization function * Rename chunking parameter * Relock poetry * Rerun linter	2024-08-29 13:03:01 -07:00
Valentin Rigal	9db196cc30	Publish experiments to W&B from the CI (#817 ) * Publish experiments from the CI * Disable cache for CI runs * Revert "Disable cache for CI runs" This reverts commit ca4593a39846a1a5cddf5ebf41a02fc698e23bea.	2024-08-29 09:13:43 -07:00
Ben Hearsum (he/him)	f66a7b67fa	feat: add scaffolding and basic tests for taskgraph generation (#776 ) This is prep work for https://github.com/mozilla/firefox-translations-training/issues/628, where I'd like to add some tests to avoid regressing that again in the future. The fixtures here are based on similar tests from Gecko: https://searchfox.org/mozilla-central/source/taskcluster/test. There's a bit of a terrible hack to make optimized task graphs testable, described more in the comments.	2024-08-07 13:13:07 -04:00
Greg Tatum	30adda4c10	Simplify translate mono kind files (#770 ) * Simplify translate-mono-src * Simplify translate-mono-trg	2024-08-01 13:57:23 -05:00
Greg Tatum	8ba3952136	Add a check that there are visible GPUs (#722 )	2024-07-29 17:00:12 -05:00
Ben Hearsum (he/him)	27f95c9849	fix: invalidate caches when fetch, docker, or toolchain tasks change (#761 ) This was an early hack I put in to avoid, eg: retraining entire models because a comment changed in a Dockerfile. We still want to avoid this sort of thing, but these days we do so explicitly with `previous_group_ids` and other techniques in the training config. Getting rid of this will improve our ability to cache issues in PRs. Eg: https://github.com/mozilla/firefox-translations-training/pull/738 had to be tricked into running things on a second iteration that only changed a Dockerfile.	2024-07-25 09:39:42 -04:00
Ben Hearsum (he/him)	c01034ee78	chore: bump taskgraph to 9.2.0 (#738 ) * chore: bump taskgraph to 10.0.1 This picks up some fixes that are expected to fix #680. I'm picking up other dependency updates as well, most notably to redo (2.x -> 3.x). That major bump is just because it's dropping Python 2.x support, which doesn't affect us. * fix: ensure mkdir /builds always succeeds in base docker image This may be implicitly done because it is referenced in a `VOLUME`. See https://taskcluster-taskgraph.readthedocs.io/en/latest/reference/migrations.html#x-10-x. * fix: don't try to decompress fetched python wheels or npz files	2024-07-24 19:25:02 -04:00
Evgeny Pavlov	6b6b64999e	Update ci config (#709 )	2024-07-08 10:54:42 -07:00
Ben Hearsum (he/him)	8f1a068cdc	fix: shorten non-URL dataset names that are more than 50 characters (fixes #654 ) (#712 ) * fix: shorten non-URL dataset names that are more than 50 characters * fix: add a lengthy dataset to default parameters This ensures that we hit the dataset shortening code in CI. Updating to a newer mtdata is required to have a dataset that is lengthy enough.	2024-07-08 13:18:58 -04:00
Bastien Abadie	5d35e4a30c	Expose Taskcluster task owner as author for Weights & Biases publication (#704 ) * Expose Taskcluster task owner as author for Weights & Biases publication * Publish author tag	2024-07-01 10:07:26 -07:00
Gabriel Bustamante	44d81c1c07	Configure generic-worker (d2g) pools for translation alignment tasks (#702 )	2024-06-26 10:46:59 -05:00
Evgeny Pavlov	ce17c62d84	Fix task expiration issue (#692 )	2024-06-25 15:38:23 -05:00
Gabriel Bustamante	a64a30cad8	Override the existing_tasks explicitly provided in the action's input (#683 ) * Override the existing_tasks explicitly provided in the action's input * Log the labels and task ids of any existing tasks that are overridden --------- Co-authored-by: Evgeny Pavlov <epavlov@mozilla.com>	2024-06-25 15:38:23 -05:00
Evgeny Pavlov	524378b80e	Bump memory for shortlist (#685 )	2024-06-25 15:38:23 -05:00
Evgeny Pavlov	9a467171d6	Make mono shuffling deterministic to fix caching issues (#677 ) * Make shuffling deterministic * Retrigger CI	2024-06-25 15:38:23 -05:00
Ben Hearsum (he/him)	78341eb398	fix: Don't abort a training task entirely if a 4xx error is encountered when fetching an artifact from a previous run (#673 ) * fix: remove unused 'reasons_created' in train taskcluster tests * fix: Don't abort a training task entirely if a 4xx error is encountered when fetching an artifact from a previous run	2024-06-25 15:38:23 -05:00
Evgeny Pavlov	0da64ba7cf	Bump disk for scoring (#668 )	2024-06-25 15:38:23 -05:00
Ben Hearsum (he/him)	2b9e53e0c5	chore: upgrade to Taskgraph 9 (#665 ) This is primarily to pick up https://github.com/taskcluster/taskgraph/pull/514, which will be needed for #466.	2024-06-19 10:15:13 -04:00
Ben Hearsum (he/him)	9ad9abaacf	Refactor b-cpu-xlargedisk worker pools to allow for experimentation with different configurations (#686 ) Same as #674, except for main. At the moment the pipeline doesn't run there because there's no `b-linux-large-gcp-1tb` workers anymore.	2024-06-19 09:40:16 -04:00
Ben Hearsum (he/him)	a5dd406ff4	Bump taskcluster-taskgraph to 8.2.0 in poetry (#672 ) * Bump taskcluster-taskgraph to 8.2.0 in poetry * fix: run tests when taskcluster configs changes Some of these files influence test outcomes	2024-06-19 09:11:46 -04:00
Ben Hearsum (he/him)	cb6c76f02c	Upgrade to Taskgraph 8 (#664 ) * chore: Upgrade to taskgraph 8 * Adjust all-pr kind to use all-pr in task names explicitly Prior to taskgraph 8, taskgraph was doing this itself based on the `kind`. * Use optimization to implement skip_unless_pipeline_changed This is in line with what upstream Taskgraph does, and has the added benefit of still generating task definitions in earlier phases even if they will be removed during optimization (makes for easier diffing/testing).	2024-06-11 12:43:12 -04:00
evgeny pavlov	fac7511f0f	Merge branch 'release'	2024-05-27 11:09:54 -07:00
Evgeny Pavlov	d41f07a561	Revert CPU generic workers (#629 ) * Revert CPU generic workers * Trigger CI	2024-05-27 09:44:15 -07:00
Greg Tatum	56040c94b9	Automatically generate training config files with the `task config-generator` (#620 ) * Create a util to automatically generate configs * Add the generated configs * Update the config generation script * Update the configs * Update the configs * Address review comments for the config generator * Fix find_corpus test	2024-05-24 16:09:05 -05:00
Ben Hearsum (he/him)	207a77379b	Retry tasks on exit code 128 to work around intermittent issues with DNS failures (#624 ) This should help make #549 less painful. I suggest we back it out once we get to the bottom of that.	2024-05-23 16:08:06 -04:00
Valentin Rigal	8a1d8ef2c3	Publish evaluation metrics (#598 ) * Configure evaluation tasks * Extract w&b code into module * Do not check Taskcluster when publication is disabled * Publish evaluation metrics to W&B * Fix running eval tracking on CI * Use args.wandb_run_name instead of default teacher * Remove duplicated arguments * Retrieve dataset from Taskcluster directly * Add missing calls to publisher and logging * Allow publishing metrics as a table on existing runs (i.e. previous trainings) * Update regex to parse labels ending with '-1' * Generic support for train/eval different naming * Update tests * Support disabled publication --------- Co-authored-by: Bastien Abadie <bastien@nextcairn.com> Co-authored-by: Bastien Abadie <abadie@teklia.com> Co-authored-by: Evgeny Pavlov <epavlov@mozilla.com>	2024-05-22 11:35:29 -07:00
Evgeny Pavlov	6411267b3c	Upgrade to Bicleaner 3 (#604 ) * Update bicleaner * Lock requirements * Use larger multilingual models * Fix requirements * Fix typo * Add libs for hunspell * Fix model downloading * Fix model downloading * Use a toolchain for cyhunspell * Remove a hard github dependency * Fix test * Add COMET to the evaluation (#587) * Custom cleaning (#547) * Update default config * Pre-download fast text model * Add custom filters * Add unit tests for config generation * Make using custom filtering configs configurable * Fix substitution * Parse tasks with label finetune-student (#609) * Add test case * Update regex * Add support for automatically continuing training from earlier runs of a Task (fixes #270) (#580) * Add support for automatically continuing training from earlier runs of a Task. * Automatically retry training tasks on exit code 17 This is the exit code when train_taskcluster.py fails to download an artifact when attempting to resume. * Support news-crawl importer (#608) * Ensure all of the task labels can be parsed in the task graph (#612) * Show the stack trace when there is an error in tracking * Rename parse_tag to parse_task_label * Document the regexes * Turn the parsed label into a NamedTuple for better typing hints * Get the tag parsing working with the full task graph * Allow for 3 letter language codes * Temporarily disable an evaluate task that is failing * Update code docs a bit * Fix tests for finetune-student * Use shorter names for URLs (#611) * Change the caching strategy for teachers (#606) * Fix comment --------- Co-authored-by: Greg Tatum <gregtatum@users.noreply.github.com> Co-authored-by: Valentin Rigal <rigal@teklia.com> Co-authored-by: Ben Hearsum (he/him) <ben@mozilla.com>	2024-05-22 10:49:28 -07:00
Greg Tatum	f5a6af8384	Change the caching strategy for teachers (#606 )	2024-05-20 23:42:44 -05:00

151 commits