firefox-translations-training

Commit graph

Author	SHA1	Message	Date
Evgeny Pavlov	0e757b0070	Integrate OpusTrainer (#219) - integrated OpusTrainer in train.sh - added dataset importer that can augment datasets for evaluation - removed teacher fine-tuning step. The pre-training and fine-tuning are now done in one step - removed merge-augmented step - adjusted pipeline settings to work with a higher amount of data - modified the Snakemake pipeline accordingly but didn't test - updated browsermt marian - added docs - added unit tests	2023-11-17 16:59:02 -08:00
Nikolay Bogoychev	75de987ea5	Integrate deduplication in the pipeline (#70) * bicleaner-ai parallelism fixes * i don't know bash * Use variables for src and trg columns * global variable doesn't appear in the function scope * Add deduplication to the pipeline Often times we would have duplicated sentences from various datasets. We should remove them before we start training. Furthermore, we should also not shuffle the training data, as marian does its own shuffling. It's redundant * Forgot snakemake targets * Try to get snakemake right * Forgot a comma * remove deduper binary from the list on the top * Add deduper target again * typo * Add a build directory * deduplication binary location * Remove hardcoded path to third_party_dir * Add deduplication at the cleaning step * output->input * Test array minus 1 * Revert "Test array minus 1" This reverts commit `2b9c79794a`. * new test * name corpora * Get deterministic shuffling in	2022-02-11 16:50:41 -08:00
Evgeny Pavlov	174cceaa6f	Bugfix and optimization (#41) - bugfix - training and decoding optimization - evaluation refactoring - small usability improvements - moved marian configurations overriding back to configs	2022-01-05 13:24:05 -08:00
Evgeny Pavlov	ec783cfbbb	Bicleaner support + fixes (#13) SacreBLEU is a regular importer now and evaluation is not limited to sacrebleu datasets. fixes Added bicleaner-ai and bicleaner filtering (one or another based on available pretrained language packs). fixes Added script to find all datasets based on language pair and importer type, ready to use in config fixes Fixed conda environment activation to be reproducible on GCP Other minor reproducibility fixes	2021-07-26 10:00:49 -07:00
Evgeny Pavlov	3bea08bf4a	Initial pipeline (#1)	2021-06-17 15:39:15 -07:00

5 commits