integrated OpusTrainer into train.sh
added a dataset importer that can augment datasets for evaluation
removed the teacher fine-tuning step; pre-training and fine-tuning are now done in one step
removed the merge-augmented step
adjusted pipeline settings to work with a larger amount of data
modified the Snakemake pipeline accordingly, but it is untested
updated browsermt marian
added docs
added unit tests
* bicleaner-ai parallelism fixes
* I don't know bash
* Use variables for src and trg columns
* global variable doesn't appear in the function scope
* Add deduplication to the pipeline
Oftentimes we end up with duplicated sentences across the various datasets, so we should remove them before training starts. Furthermore, we should not pre-shuffle the training data: marian does its own shuffling, so it's redundant.
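The deduplication step above can be sketched like this: join the source and target sides line by line so duplicates are detected on whole sentence pairs rather than on one side alone, then drop repeats. File names and contents here are illustrative, not the pipeline's actual paths.

```shell
#!/bin/bash
set -euo pipefail

# Illustrative parallel corpus with one duplicated pair
printf 'Hello\nHello\nBye\n'      > corpus.en
printf 'Hallo\nHallo\nTschuess\n' > corpus.de

# Pair up src/trg lines, keep each unique pair once
paste corpus.en corpus.de | LC_ALL=C sort -u > dedup.tsv
cut -f1 dedup.tsv > dedup.en
cut -f2 dedup.tsv > dedup.de

# Note: no shuffling here on purpose — marian shuffles during training
wc -l < dedup.en
```

`sort -u` changes the line order, but since marian shuffles the training data itself, the order of the deduplicated corpus doesn't matter.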
* Forgot snakemake targets
* Try to get snakemake right
* Forgot a comma
* remove deduper binary from the list on the top
* Add deduper target again
* typo
* Add a build directory
* deduplication binary location
* Remove hardcoded path to third_party_dir
* Add deduplication at the cleaning step
* output->input
* Test array minus 1
* Revert "Test array minus 1"
This reverts commit 2b9c79794a.
* new test
* name corpora
* Get deterministic shuffling in
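Deterministic shuffling can be done by feeding `shuf` a reproducible byte stream derived from a fixed seed instead of `/dev/urandom`; this is the recipe from the GNU coreutils manual. The seed value and file names are arbitrary examples.

```shell
#!/bin/bash
set -euo pipefail

# Reproducible pseudo-random byte stream: AES-CTR keystream keyed by the seed
get_seeded_random() {
  openssl enc -aes-256-ctr -pass pass:"$1" -nosalt </dev/zero 2>/dev/null
}

seq 1 100 > corpus.txt
# Same seed -> same permutation on every run
shuf --random-source=<(get_seeded_random 42) corpus.txt > shuf1.txt
shuf --random-source=<(get_seeded_random 42) corpus.txt > shuf2.txt
cmp -s shuf1.txt shuf2.txt && echo "identical order"
```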
- bugfix
- training and decoding optimization
- evaluation refactoring
- small usability improvements
- moved marian configuration overrides back to the configs
SacreBLEU is now a regular importer, and evaluation is no longer limited to SacreBLEU datasets.
fixes
Added bicleaner-ai and bicleaner filtering (one or the other, depending on which pretrained language packs are available).
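The selection logic can be sketched as follows: prefer bicleaner-ai when its pretrained pack exists for the language pair, fall back to classic bicleaner otherwise, and skip the filtering step if neither is available. The packs directory layout and tool names used here are assumptions for illustration, not the pipeline's actual paths.

```shell
#!/bin/bash
set -euo pipefail

src=en; trg=ru
packs=./bicleaner-packs
# Simulate an environment where only the classic bicleaner pack exists
mkdir -p "$packs/bicleaner/${src}-${trg}"

if [ -d "$packs/bicleaner-ai/${src}-${trg}" ]; then
  tool=bicleaner-ai-classify        # preferred: neural model
elif [ -d "$packs/bicleaner/${src}-${trg}" ]; then
  tool=bicleaner-classify           # fallback: classic model
else
  tool=none                         # no pack for this pair: skip filtering
fi
echo "using: $tool"
```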
fixes
Added a script that finds all datasets for a given language pair and importer type, formatted ready to use in the config
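The output format such a script would produce can be sketched as below: each dataset becomes an `<importer>_<name>/<version>` entry that can be pasted into the config. The listing here is a hardcoded stand-in for a real catalog query, and the exact naming convention is an assumption.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical catalog listing: "<name> <version>" per line
printf 'Books v1\nParaCrawl v9\n' |
while read -r name ver; do
  # Prefix with the importer type to get a config-ready entry
  echo "opus_${name}/${ver}"
done
```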
fixes
Fixed conda environment activation to be reproducible on GCP
Other minor reproducibility fixes