firefox-translations-training

Commit graph

Author	SHA1	Message	Date
Greg Tatum	fd2f7da7a4	Spring 2024 config fixes (#659 ) * Remove swedish_work_environment * Remove failing datasets --------- Co-authored-by: Evgeny Pavlov <epavlov@mozilla.com>	2024-06-11 15:20:40 -07:00
Evgeny Pavlov	7f12b1ad01	Prepare configs for training (#636 ) * Add prod configs * Add augmentation to the validation set * Fix validation datasets * Switch stage to teacher training * Use evaluate-teacher as target * Disable some mono data and add comments * Disable NLLB mono for back-translations * Add uk-en * Move manually edited configs to a separated folder * Fix datasets	2024-05-30 14:06:50 -07:00
Greg Tatum	cb88080703	Fix the monolingual nllb dataset to use zst and report the sentence count (#650 ) * Update the config generator * Update the configs	2024-05-30 10:20:59 -07:00
Greg Tatum	789bfa635f	Add mono nllb to the config generation (#647 ) * Add nllb mono to config generation * Update configs * Remove extra space from config endings	2024-05-29 14:22:37 -07:00
Evgeny Pavlov	3d592e570b	Add HPLT mono data to the configs (#646 ) * Add HPLT mono data * Revert dataset stats	2024-05-29 11:44:24 -07:00
Greg Tatum	4a86accaae	Update the config generator to remove 404s, and improve the Content-Length lookup (#643 )	2024-05-28 16:17:37 -05:00
Greg Tatum	b5ee6eeced	Fix config generation for ted talks and hide known inaccurate sizes (#641 )	2024-05-28 14:15:36 -05:00
Greg Tatum	56040c94b9	Automatically generate training config files with the `task config-generator` (#620 ) * Create a util to automatically generate configs * Add the generated configs * Update the config generation script * Update the configs * Update the configs * Address review comments for the config generator * Fix find_corpus test	2024-05-24 16:09:05 -05:00
Evgeny Pavlov	31311927ef	Move snakemake to a separate folder (#431 ) * Move snakemake code to a separate folder * Small fixes * Run linter * Revert formatting * Fix readme	2024-02-09 09:46:52 -08:00
Greg Tatum	7f43bd0c7d	Point to the docs for marian args (#381 )	2024-01-25 07:31:58 -06:00
Evgeny Pavlov	2d4530d0f5	Always split corpus to a fixed number of parts (#308 ) * Always split corpus to a fixed number of parts * Fix splitting * Rewrite corpus splitting in Python * Replace in taskcluster * Add tests * Unify compression tool with Taskcluster * Move zstd installation to docker image * Disable opuscleaner in CI * Compress chunks * Fix file names * Remove zeros from file index * Start file index with 1 * Fix corpus splitting * Add a link to an issue * Generate script description from doc * Use new test dir * Use new test dir * Test command line args * Clarify expected files * Add logging	2023-12-19 15:25:33 -08:00
Greg Tatum	742fb8f999	Add documentation to various parts of the scripts and pipeline (#298 )	2023-12-15 13:34:05 -06:00
Evgeny Pavlov	0e757b0070	Integrate OpusTrainer (#219 ) integrated OpusTrainer in train.sh added dataset importer that can augment datasets for evaluation removed teacher fine-tuning step. The pre-training and fine-tuning are now done in one step removed merge-augmented step adjusted pipeline settings to work with a higher amount of data modified the Snakemake pipeline accordingly but didn't test updated browsermt marian added docs added unit tests	2023-11-17 16:59:02 -08:00
Evgeny Pavlov	83d43bfcf6	Update docs (#224 ) * Update docs * Fix typos * Fix TC docs * Fix relative links	2023-10-16 16:33:29 -07:00
Evgeny Pavlov	de4218d8cf	Fixes after training on a full dataset (#221 ) * Increase workspace * Add example of TC prod config * Use 4 gpu worker for scoring * Use level 1 workers * Rollback * Sync task cluster yml with main * Use a worker with a larger disk * Increase workspace * Add example of TC prod config * Use 4 gpu worker for scoring * Rollback * Use a worker with a larger disk	2023-10-10 16:24:55 -07:00
Evgeny Pavlov	e9102a37ef	Integrate OpusCleaner (#163 ) * Initial integration of opus cleaner * Support custom filters * Use opus cleaner in pipeline * Fix env * Fix filter generation * Add more rules * Fix elrc filter * Fix env * Fix frequent patterns filter * Switch to reading from stdin * Add a feature flag for opus cleaner * Fix condition * Add extra test for non empty files * Integrate with TC * Run linter * Fix step config * Fix step config * Fix step config * Fix step config * Fix command * Fix path * Update OpusCleaner * Remove warning * Log filtered length * Add opuscleaner logs * Add comments * Fix using custom filters * Extract function * Change the CI target back * Fix file path * Replace conda with poetry * Add doc * Add more comments * Rename example filter * Test corpus * Fix filter name * Use opus dataset instead of mtdata * Make CI faster * Add sections to makefile * Fix custom filter search * Redirect stderr to stdout * Fix usage of custom config * Fix config name * Change back to all	2023-09-26 15:29:07 -07:00
Evgeny Pavlov	299d41c34b	Add TC test run to CI (#195 ) * Add snakemake test run to CI * Add toolchain * Add docker image * Reduce datasets * Move ci to a separate config * Add utils to poetry * Fix config * Fix config * Disable docker * Use test docker image * Fix artifacts dir * Fix tests * Fix profile setting * Fix root dir * Faster translation * Expose artifacts * Change default TC config * Fix default TC config * Disable snakemake run * Enable running on PR * Fix ci config * Add vocab size argument * Retrigger CI * Add a comment on snakemake run * Use a smaller teacher model for CI * Try to retrigger downloading * Use the same year for mono src and trg * Revert changes [skip ci] * Revert test config [skip ci] * Fix comment [skip ci]	2023-09-20 09:40:30 -07:00
Evgeny Pavlov	7c58f6558b	Move configuration to profiles (#96 ) * Move configuration settings to profiles * Use relative paths * Fix output formatting * Update dag * Update docs	2022-06-17 10:56:07 -07:00
Evgeny Pavlov	270d29b90a	Checkpointing training (#80 ) * Enable model checkpointing * Do not use memory limits * Reduce training time for testing * Reconfigure csd3 * Reduce test updates * Change final model name * Fix copying of decoder config	2022-03-31 11:03:51 -07:00
Evgeny Pavlov	22a3751a09	Add support of Mozilla slurm cluster (#72 )	2022-02-22 17:48:21 -08:00
Nikolay Bogoychev	7e58a6badd	Fix to use with the latest mtdata version (#60 ) * Fix to use with the latest mtdata version The latest mtdata changes corpus names, the functionality of the function and rearranges a number of internal functions. I hope this will work now. I also filed a few bug reports @mtdata * Bump mtdata version * MTdata fixes as per @eu9ene 's suggestions * Update prod.conf with the new MTDATA interface Also removed non-existing dataset and removed some duplicates. * Update test config * Sort entries by group to more easily see duplicates	2022-02-03 15:23:25 -08:00
Evgeny Pavlov	a4ada6ce1a	Fine-tuning and bicleaner bug fix (#51 ) - split teacher training and finetuning - fix student fine-tuning - add support of pretrained vocab - fix bicleaner	2022-01-24 13:56:49 -08:00
Evgeny Pavlov	174cceaa6f	Bugfix and optimization (#41 ) - bugfix - training and decoding optimization - evaluation refactoring - small usability improvements - moved marian configurations overriding back to configs	2022-01-05 13:24:05 -08:00
Evgeny Pavlov	3b3f33bf25	Quality improvements (#29 )	2021-12-06 15:03:35 -08:00
Evgeny Pavlov	ef8928b454	Snakemake integration (#24 ) - workflow management using Snakemake - parallelization to run on a cluster - Singularity containerization support - Slurm support - teacher ensemble support	2021-10-28 10:39:09 -07:00

25 commits