* bicleaner-ai parallelism fixes
* i don't know bash
* Use variables for src and trg columns
* global variable doesn't appear in the function scope
* Add deduplication to the pipeline
Different datasets often contain the same sentences, so we should remove the duplicates before we start training. We should also not shuffle the training data ourselves: Marian does its own shuffling, so ours is redundant.
* Forgot snakemake targets
* Try to get snakemake right
* Forgot a comma
* remove deduper binary from the list on the top
* Add deduper target again
* typo
* Add a build directory
* deduplication binary location
* Remove hardcoded path to third_party_dir
* Add deduplication at the cleaning step
* output->input
* Test array minus 1
* Revert "Test array minus 1"
This reverts commit 2b9c79794a.
* new test
* name corpora
* Get deterministic shuffling in
* Fix to use with the latest mtdata version
The latest mtdata changes corpus names and the functionality of the function, and rearranges a number of internal functions. I hope this will work now. I also filed a few bug reports with mtdata.
* Bump mtdata version
* MTdata fixes as per @eu9ene's suggestions
* Update prod.conf with the new MTDATA interface
Also removed a non-existent dataset and some duplicates.
* Update test config
* Sort entries by group to more easily see duplicates
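
For illustration only, the deduplication idea from the commits above (drop repeated sentence pairs, do not shuffle) could be sketched in Python as follows. This is not the deduper binary the pipeline actually builds and calls; the function name `dedupe_pairs` and the in-memory `set` are assumptions made for the sketch, and a real corpus may need a hashing or streaming approach to fit in memory.

```python
from typing import Iterable, Iterator, Tuple


def dedupe_pairs(pairs: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield each (source, target) pair once, in first-seen order.

    Hypothetical helper: the order of the input is preserved on purpose,
    since shuffling is left to Marian during training.
    """
    seen = set()
    for src, trg in pairs:
        # Ignore surrounding whitespace when comparing pairs.
        key = (src.strip(), trg.strip())
        if key in seen:
            continue
        seen.add(key)
        yield key


if __name__ == "__main__":
    # Toy corpus standing in for sentences merged from several datasets.
    corpus = [
        ("Hello", "Hallo"),
        ("Hello", "Hallo"),
        ("Good morning", "Guten Morgen"),
    ]
    for src, trg in dedupe_pairs(corpus):
        print(f"{src}\t{trg}")
```

Only exact matches on both sides are treated as duplicates here; a pair that shares a source sentence but has a different translation is kept.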
- bugfix
- training and decoding optimization
- evaluation refactoring
- small usability improvements
- moved marian configurations overriding back to configs
- workflow management using Snakemake
- parallelization to run on a cluster
- Singularity containerization support
- Slurm support
- teacher ensemble support
- Flores dataset importer
- custom dataset importer
- ability to use a pre-trained backward model
- save experiment config on start
- stubs for dataset caching (decided to sync implementation with workflow manager integration)
- use best bleu models instead of best ce-mean-words
- fix linting warnings
SacreBLEU is a regular importer now and evaluation is not limited to sacrebleu datasets.
fixes
Added bicleaner-ai and bicleaner filtering (one or another based on available pretrained language packs).
fixes
Added script to find all datasets based on language pair and importer type, ready to use in config
fixes
Fixed conda environment activation to be reproducible on GCP
Other minor reproducibility fixes