Commit graph

5 commits

Author SHA1 Message Date
Evgeny Pavlov 0e757b0070
Integrate OpusTrainer (#219)
integrated OpusTrainer in train.sh
added dataset importer that can augment datasets for evaluation
removed teacher fine-tuning step. The pre-training and fine-tuning are now done in one step
removed merge-augmented step
adjusted pipeline settings to work with a higher amount of data
modified the Snakemake pipeline accordingly but didn't test
updated browsermt marian
added docs
added unit tests
2023-11-17 16:59:02 -08:00
Nikolay Bogoychev 75de987ea5
Integrate deduplication in the pipeline (#70)
* bicleaner-ai parallelism fixes

* i don't know bash

* Use variables for src and trg columns

* global variable doesn't appear in the function scope

* Add deduplication to the pipeline

Oftentimes we would have duplicated sentences from various datasets. We should remove them before we start training. Furthermore, we should not shuffle the training data, as marian does its own shuffling, so shuffling it here is redundant.

* Forgot snakemake targets

* Try to get snakemake right

* Forgot a comma

* remove deduper binary from the list on the top

* Add deduper target again

* typo

* Add a build directory

* deduplication binary location

* Remove hardcoded path to third_party_dir

* Add deduplication at the cleaning step

* output->input

* Test array minus 1

* Revert "Test array minus 1"

This reverts commit 2b9c79794a.

* new test

* name corpora

* Get deterministic shuffling in
2022-02-11 16:50:41 -08:00
Evgeny Pavlov 174cceaa6f
Bugfix and optimization (#41)
- bugfix
- training and decoding optimization
- evaluation refactoring
- small usability improvements
- moved marian configurations overriding back to configs
2022-01-05 13:24:05 -08:00
Evgeny Pavlov ec783cfbbb
Bicleaner support + fixes (#13)
SacreBLEU is a regular importer now and evaluation is not limited to sacrebleu datasets.
fixes

Added bicleaner-ai and bicleaner filtering (one or another based on available pretrained language packs).
fixes


Added script to find all datasets based on language pair and importer type, ready to use in config
fixes


Fixed conda environment activation to be reproducible on GCP

Other minor reproducibility fixes
2021-07-26 10:00:49 -07:00
Evgeny Pavlov 3bea08bf4a
Initial pipeline (#1)
2021-06-17 15:39:15 -07:00