firefox-translations-training

Commit graph

Author	SHA1	Message	Date
Lisa Korotkova	96079ca258	Fix unbound LD_LIBRARY_PATH (#79)	2022-03-23 12:07:26 -07:00
Evgeny Pavlov	22a3751a09	Add support of Mozilla slurm cluster (#72)	2022-02-22 17:48:21 -08:00
Nikolay Bogoychev	75de987ea5	Integrate deduplication in the pipeline (#70) * bicleaner-ai parallelism fixes * i don't know bash * Use variables for src and trg columns * global variable doesn't appear in the function scope * Add deduplication to the pipeline Often times we would have duplicated sentences from various datasets. We should remove them before we start training. Furthermore, we should also not shuffle the training data, as marian does its own shuffling. It's redundant * Forgot snakemake targets * Try to get snakemake right * Forgot a comma * remove deduper binary from the list on the top * Add deduper target again * typo * Add a build directory * deduplication binary location * Remove hardcoded path to third_party_dir * Add deduplication at the cleaning step * output->input * Test array minus 1 * Revert "Test array minus 1" This reverts commit `2b9c79794a`. * new test * name corpora * Get deterministic shuffling in	2022-02-11 16:50:41 -08:00
Nikolay Bogoychev	20f6429ac7	hotfix: the biclean() function doesn't have access to global variables (#68)	2022-02-07 10:12:14 -08:00
Nikolay Bogoychev	8dcb6492ae	bicleaner-ai parallelism fixes (#67)	2022-02-04 14:39:39 -08:00
Nikolay Bogoychev	7e58a6badd	Fix to use with the latest mtdata version (#60) * Fix to use with the latest mtdata version The latest mtdata changes corpus names, the funcitonality of the function and rearranges a number of internal functions. I hope this will work now. I also filed a few bug reports @mtdata * Bump mtdata version * MTdata fixes as per @eu9ene's suggestions * Update prod.conf with the new MTDATA interface Also removed non-existing dataset and removed some duplicates. * Update test config * Sort entries by group to more easily see duplicates	2022-02-03 15:23:25 -08:00
Evgeny Pavlov	ddba8ebf25	Fix bicleaner skipping (#59)	2022-01-25 22:26:41 -08:00
Evgeny Pavlov	a4ada6ce1a	Fine-tuning and bicleaner bug fix (#51) - split teacher training and finetuning - fix student fine-tuning - add support of pretrained vocab - fix bicleaner	2022-01-24 13:56:49 -08:00
Kenneth Heafield	a0097a86b9	Make patience metric be chrf (#50)	2022-01-14 10:46:38 -08:00
Evgeny Pavlov	174cceaa6f	Bugfix and optimization (#41) - bugfix - training and decoding optimization - evaluation refactoring - small usability improvements - moved marian configurations overriding back to configs	2022-01-05 13:24:05 -08:00
Evgeny Pavlov	3b3f33bf25	Quality improvements (#29)	2021-12-06 15:03:35 -08:00
Evgeny Pavlov	a09b0ac7ac	Update README.md	2021-10-28 11:07:11 -07:00
Evgeny Pavlov	ef8928b454	Snakemake integration (#24) - workflow management using Snakemake - parallelization to run on a cluster - Singularity containerization support - Slurm support - teacher ensemble support	2021-10-28 10:39:09 -07:00
Evgeny Pavlov	4260ed5fcd	hotfix: remove corpus shuffling on merge	2021-08-23 22:05:06 -07:00
Evgeny Pavlov	0f6e64cf19	Minor improvements (#20) - Flores dataset importer - custom dataset importer - ability to use a pre-trained backward model - save experiment config on start - stubs for dataset caching (decided to sync implementation with workflow manager integration) - use best bleu models instead of best ce-mean-words - fix linting warnings	2021-08-17 13:20:34 -07:00
Evgeny Pavlov	ec783cfbbb	Bicleaner support + fixes (#13) SacreBLEU is a regular importer now and evaluation is not limited to sacrebleu datasets. fixes Added bicleaner-ai and bicleaner filtering (one or another based on available pretrained language packs). fixes Added script to find all datasets based on language pair and importer type, ready to use in config fixes Fixed conda environment activation to be reproducible on GCP Other minor reproducibility fixes	2021-07-26 10:00:49 -07:00
Evgeny Pavlov	af2abbf525	Add reference to bergamot project	2021-06-21 14:58:02 -07:00
Evgeny Pavlov	4b12dee551	Fix readme after renaming	2021-06-21 14:38:07 -07:00
Evgeny Pavlov	2bcdef2b36	Rename repo	2021-06-21 14:33:31 -07:00
Evgeny Pavlov	3bea08bf4a	Initial pipeline (#1)	2021-06-17 15:39:15 -07:00
Evgeny Pavlov	8d11fb1e97	Initial commit	2021-04-30 15:36:49 -07:00

171 commits (all branches)