firefox-translations-training

Commit graph

Author	SHA1	Message	Date
evgeny pavlov	233ea51a55	Relock poetry after conflicts	2024-08-29 13:03:01 -07:00
Greg Tatum	43d5680620	Add a train task (#812) * Add a task for triggering training * Update the training guide	2024-08-28 14:46:10 -05:00
Ben Hearsum (he/him)	f66a7b67fa	feat: add scaffolding and basic tests for taskgraph generation (#776) This is prep work for https://github.com/mozilla/firefox-translations-training/issues/628, where I'd like to add some tests to avoid regressing that again in the future. The fixtures here are based on similar tests from Gecko: https://searchfox.org/mozilla-central/source/taskcluster/test. There's a bit of a terrible hack to make optimized task graphs testable, described more in the comments.	2024-08-07 13:13:07 -04:00
Ben Hearsum (he/him)	c01034ee78	chore: bump taskgraph to 9.2.0 (#738) * chore: bump taskgraph to 10.0.1 This picks up some fixes that are expected to fix #680. I'm picking up other dependency updates as well, most notably to redo (2.x -> 3.x). That major bump is just because it's dropping Python 2.x support, which doesn't affect us. * fix: ensure mkdir /builds always succeeds in base docker image This may be implicitly done because it is referenced in a `VOLUME`. See https://taskcluster-taskgraph.readthedocs.io/en/latest/reference/migrations.html#x-10-x. * fix: don't try to decompress fetched python wheels or npz files	2024-07-24 19:25:02 -04:00
Valentin Rigal	794bdb2240	Rebase on main @5d35e4a3 (#696)	2024-07-02 12:04:49 -07:00
Evgeny Pavlov	61a2704711	Fix poetry lock (#706) * Revert "Use pip-compile for tracking dependencies (#695)" This reverts commit `24748d0608`. * Fix numpy issue	2024-06-27 10:34:19 -07:00
Ben Hearsum (he/him)	2b9e53e0c5	chore: upgrade to Taskgraph 9 (#665) This is primarily to pick up https://github.com/taskcluster/taskgraph/pull/514, which will be needed for #466.	2024-06-19 10:15:13 -04:00
Ben Hearsum (he/him)	a5dd406ff4	Bump taskcluster-taskgraph to 8.2.0 in poetry (#672) * Bump taskcluster-taskgraph to 8.2.0 in poetry * fix: run tests when taskcluster configs changes Some of these files influence test outcomes	2024-06-19 09:11:46 -04:00
Greg Tatum	56040c94b9	Automatically generate training config files with the `task config-generator` (#620) * Create a util to automatically generate configs * Add the generated configs * Update the config generation script * Update the configs * Update the configs * Address review comments for the config generator * Fix find_corpus test	2024-05-24 16:09:05 -05:00
Greg Tatum	da880da8be	Use a virtual environment per requirements.txt file in run_task (#568) * Add missing mtdata dependency * Remove pipeline python dependencies from pyproject.toml * Use the requirements.txt file for run_task * Add venv support to the CI Dockerfile for the testing image * Add timing information to the taskgraph generation and a flag to disable the generation	2024-05-13 12:31:51 -05:00
Greg Tatum	e8c6f2e8d3	Remove the Makefile and replace it with a Taskfile (#510)	2024-04-09 16:11:13 -05:00
Evgeny Pavlov	fab87a7a70	Add support of inline noise data augmentation (#502) * Add eflomal based aligner * Use new aligner for shortlist * Remove old aligner * Add Taskcluster steps for whitespace tokenized alignments * Move file to a renamed directory * Use Tags modifier in training * Update tests for alignments and shortlist * Add support of inline noise augmentation in data importer * Do not use slow inline noise augmentation in devset on CI * Remove the old alignments task * Add a test for student alignments * Fix alignments in training tests * Return matplotlib module after merge * Rename functions * Add more comments in the code * Remove compression env * Relock poetry	2024-03-28 18:10:02 -07:00
Evgeny Pavlov	3774779cb7	Add Marian server for model testing (#492) * Compile marian server * Add Marian server for testing * Reformat * Update utils/marian_client.py Co-authored-by: Greg Tatum <gregtatum@users.noreply.github.com> * Make port configurable * Relock poetry --------- Co-authored-by: Greg Tatum <gregtatum@users.noreply.github.com>	2024-03-28 15:53:16 -07:00
Greg Tatum	78977402e0	Analysis task that provides the word distribution (#477)	2024-03-26 13:28:43 -05:00
Valentin Rigal	3f135aa115	Taskcluster task group publication (#406) * Base taskcluster task group publication * Move tag parser to utils module * Support metrics * Support multiple teacher training * Fix parsing for evaluation folder * Generic group logs parser * Parse extra evaluation tasks and publish group_logs fake run * Publish Marian config on runs * Publish marian config on runs instead of experiment config * Rebase vrigal:publish-experiment-config * Publish experiment config on group_logs	2024-02-16 09:05:01 -08:00
Evgeny Pavlov	58cce071ef	Support typos and noise modifiers (#428) * Update opustrainer * Adjust configs * Add evaluation modifiers * Reduce noise * Add tests for typos and noise * Fix typos augmenter * Fix linting issues * Update docs * Update opustrainer * Adjust configs * Add evaluation modifiers * Reduce noise * Add tests for typos and noise * Fix typos augmenter * Fix linting issues * Update docs * Fix test * Update opus trainer * Remove noise parameters from config * Update opustrainer with fixes * Run linter * Fix tests after merge * Disable noise for student * Update lockfile * Fix formatting * Disable typos for student * Rename assert functions * Switch back to faster validation * Document decision on using augmentations * Fix typo	2024-02-15 15:33:24 -08:00
Evgeny Pavlov	190358a923	Fix linting for tracking (#441) * Fix linting pythonpath for tracking * Add pythonpath to the rest of the commands * Remove pythonpath * Update lockfile * Fix wandb directory in tests	2024-02-15 09:21:33 -08:00
Ben Hearsum (he/him)	70fede467f	Add the ability to run starting from a specific task (fixes #227) (#377) * Add the ability to run starting from a specific task (fixes #227) A couple of example runs with this: * https://firefox-ci-tc.services.mozilla.com/tasks/groups/YHAr0HzwSSe4pe5Yh9dIlg uses https://firefox-ci-tc.services.mozilla.com/tasks/groups/JjNp3KcyTUObUtOA9BgK5g as its `previous-group-id` with `start-stage: train-backwards` and `target-stage: train-teacher` - and ends up running `train-backwards`, `translate-mono-trg`, `collect-mono-trg`, and `train-teacher`. * https://firefox-ci-tc.services.mozilla.com/tasks/groups/Sm0YV_8LQP-EOE8Nz6G5Lw uses the above group as its `previous-group-id` with `start-stage: train-teacher` and `target-stage: all`. Note that it ended up depending on tasks from both the above group and the one that it was based on, and ended up scheduling `train-teacher` and everything after it (I didn't bother letting them all run - I think the scheduling is enough to verify this). Big thanks to @gabrielBusta for suggesting this implementation! * Update poetry dependencies to pull in newer taskgraph version	2024-02-14 09:07:07 -05:00
Greg Tatum	f4ded7d07f	Add huggingface to the find_corpus (#397)	2024-01-26 15:01:19 -06:00
Greg Tatum	673916fbf5	Make CI happy again (#362) * Add --no-root to fix linting issue * Add version number * Update ruff * Remove pytest clarity * Switch tests/test_tracking_cli.py to assert on unordered sets	2024-01-13 12:06:07 -06:00
Greg Tatum	f583322300	Add a preflight check utility (#353)	2024-01-12 09:35:18 -06:00
Valentin Rigal	d35f28e542	Add publication package (#309) * Add documentation * Move publication parser prototype From https://github.com/mozilla/translations-experiment-tracking/pull/4 Commit a06886e0 * Update parser package for translations main repo * Remove pre-commit rules * Apply black * Update parser code * Remove package and pin requirements * Nits/Fixes * Fix taskcluster naming * Move parser to 'tracking' root folder * Switch to pyproject.toml + pinned dependencies * Add a sample for experiments structure * Update metrics parser * Add speed metrics * Only publish metrics in a bar chart * Publish fake run at last * Linting and small fixes * Merge .gitignore * Handle pushing metrics when no logs are available * Add tests * Fix tests for CI job * rename Taskcluster sample file * Suggestions * Add type hints + parser refactoring * Improve typing + run static checker (Mypy) * Suggestions * Update tests * Invert metrics data order (bleu_detok, chrf) * Update CI tests task * Fix lint * Update poetry.lock * Fix tests in CI * Fix hardcoded path * Add missing experiments/logs folder (ignored by git) * Group experiments to analyze by alphabetic order --------- Co-authored-by: Bastien Abadie <bastien@nextcairn.com> Co-authored-by: Evgeny Pavlov <epavlov@mozilla.com>	2024-01-11 13:25:53 -08:00
Evgeny Pavlov	b253a1ce6b	Fix install opuscleaner (#350) * Update and enable opuscleaner * Remove comment	2024-01-10 12:02:22 -08:00
Greg Tatum	e48440fc2c	Fix the vocab training script for Taskcluster (#326)	2023-12-22 09:40:46 -06:00
Evgeny Pavlov	2d4530d0f5	Always split corpus to a fixed number of parts (#308) * Always split corpus to a fixed number of parts * Fix splitting * Rewrite corpus splitting in Python * Replace in taskcluster * Add tests * Unify compression tool with Taskcluster * Move zstd installation to docker image * Disable opuscleaner in CI * Compress chunks * Fix file names * Remove zeros from file index * Start file index with 1 * Fix corpus splitting * Add a link to an issue * Generate script description from doc * Use new test dir * Use new test dir * Test command line args * Clarify expected files * Add logging	2023-12-19 15:25:33 -08:00
Greg Tatum	d1be2bca4a	Update the find corpus tool to provide more information (#280) * Add pytest-clarity for better text diffs in tests * Add requests_mock for tests * Add the test_data artifact to the .gitignore * Use an underscore with find_corpus.py * Update the find corpus tool to provide more information * Add humanize to the dependency list	2023-12-12 15:08:59 -06:00
Evgeny Pavlov	0e757b0070	Integrate OpusTrainer (#219) * integrated OpusTrainer in train.sh * added dataset importer that can augment datasets for evaluation * removed teacher fine-tuning step. The pre-training and fine-tuning are now done in one step * removed merge-augmented step * adjusted pipeline settings to work with a higher amount of data * modified the Snakemake pipeline accordingly but didn't test * updated browsermt marian * added docs * added unit tests	2023-11-17 16:59:02 -08:00
Evgeny Pavlov	d79162cdd2	Add tensorboard util (#233) * Add tensorboard util * Use python TC lib for artifact downloading	2023-10-25 12:49:40 -07:00
Evgeny Pavlov	e9102a37ef	Integrate OpusCleaner (#163) * Initial integration of opus cleaner * Support custom filters * Use opus cleaner in pipeline * Fix env * Fix filter generation * Add more rules * Fix elrc filter * Fix env * Fix frequent patterns filter * Switch to reading from stdin * Add a feature flag for opus cleaner * Fix condition * Add extra test for non empty files * Integrate with TC * Run linter * Fix step config * Fix step config * Fix step config * Fix step config * Fix command * Fix path * Update OpusCleaner * Remove warning * Log filtered length * Add opuscleaner logs * Add comments * Fix using custom filters * Extract function * Change the CI target back * Fix file path * Replace conda with poetry * Add doc * Add more comments * Rename example filter * Test corpus * Fix filter name * Use opus dataset instead of mtdata * Make CI faster * Add sections to makefile * Fix custom filter search * Redirect stderr to stdout * Fix usage of custom config * Fix config name * Change back to all	2023-09-26 15:29:07 -07:00
Evgeny Pavlov	299d41c34b	Add TC test run to CI (#195) * Add snakemake test run to CI * Add toolchain * Add docker image * Reduce datasets * Move ci to a separate config * Add utils to poetry * Fix config * Fix config * Disable docker * Use test docker image * Fix artifacts dir * Fix tests * Fix profile setting * Fix root dir * Faster translation * Expose artifacts * Change default TC config * Fix default TC config * Disable snakemake run * Enable running on PR * Fix ci config * Add vocab size argument * Retrigger CI * Add a comment on snakemake run * Use a smaller teacher model for CI * Try to retrigger downloading * Use the same year for mono src and trg * Revert changes [skip ci] * Revert test config [skip ci] * Fix comment [skip ci]	2023-09-20 09:40:30 -07:00
Greg Tatum	96b9b3f16b	Add ruff and black linting to the CI (#187) * Add python's black formatter * Apply black formatting * Add the ruff linter * Run make lint-fix * Suppress or fix lint issues * Add a fix-all make command	2023-09-08 09:50:24 -05:00

31 commits