* Add prod configs
* Add augmentation to the validation set
* Fix validation datasets
* Switch stage to teacher training
* Use evaluate-teacher as target
* Disable some mono data and add comments
* Disable NLLB mono for back-translations
* Add uk-en
* Move manually edited configs to a separate folder
* Fix datasets
* Create a util to automatically generate configs
* Add the generated configs
* Update the config generation script
* Update the configs
* Update the configs
* Address review comments for the config generator
* Fix find_corpus test
* Always split corpus into a fixed number of parts
* Fix splitting
* Rewrite corpus splitting in Python
* Replace in taskcluster
* Add tests
* Unify compression tool with Taskcluster
* Move zstd installation to docker image
* Disable opuscleaner in CI
* Compress chunks
* Fix file names
* Remove zeros from file index
* Start file index with 1
* Fix corpus splitting
* Add a link to an issue
* Generate script description from doc
* Use new test dir
* Use new test dir
* Test command line args
* Clarify expected files
* Add logging
* Integrated OpusTrainer in train.sh
* Added a dataset importer that can augment datasets for evaluation
* Removed the teacher fine-tuning step; pre-training and fine-tuning are now done in one step
* Removed the merge-augmented step
* Adjusted pipeline settings to work with a larger amount of data
* Modified the Snakemake pipeline accordingly (untested)
* Updated browsermt marian
* Added docs
* Added unit tests
* Increase workspace
* Add example of TC prod config
* Use a 4-GPU worker for scoring
* Use level 1 workers
* Rollback
* Sync task cluster yml with main
* Use a worker with a larger disk
* Increase workspace
* Add example of TC prod config
* Use a 4-GPU worker for scoring
* Rollback
* Use a worker with a larger disk
* Initial integration of opus cleaner
* Support custom filters
* Use opus cleaner in pipeline
* Fix env
* Fix filter generation
* Add more rules
* Fix elrc filter
* Fix env
* Fix frequent patterns filter
* Switch to reading from stdin
* Add a feature flag for opus cleaner
* Fix condition
* Add extra test for non-empty files
* Integrate with TC
* Run linter
* Fix step config
* Fix step config
* Fix step config
* Fix step config
* Fix command
* Fix path
* Update OpusCleaner
* Remove warning
* Log filtered length
* Add opuscleaner logs
* Add comments
* Fix using custom filters
* Extract function
* Change the CI target back
* Fix file path
* Replace conda with poetry
* Add doc
* Add more comments
* Rename example filter
* Test corpus
* Fix filter name
* Use opus dataset instead of mtdata
* Make CI faster
* Add sections to makefile
* Fix custom filter search
* Redirect stderr to stdout
* Fix usage of custom config
* Fix config name
* Change back to all
* Add snakemake test run to CI
* Add toolchain
* Add docker image
* Reduce datasets
* Move ci to a separate config
* Add utils to poetry
* Fix config
* Fix config
* Disable docker
* Use test docker image
* Fix artifacts dir
* Fix tests
* Fix profile setting
* Fix root dir
* Faster translation
* Expose artifacts
* Change default TC config
* Fix default TC config
* Disable snakemake run
* Enable running on PR
* Fix ci config
* Add vocab size argument
* Retrigger CI
* Add a comment on snakemake run
* Use a smaller teacher model for CI
* Try to retrigger downloading
* Use the same year for mono src and trg
* Revert changes [skip ci]
* Revert test config [skip ci]
* Fix comment [skip ci]
* Enable model checkpointing
* Do not use memory limits
* Reduce training time for testing
* Reconfigure csd3
* Reduce test updates
* Change final model name
* Fix copying of decoder config
* Fix to use with the latest mtdata version
The latest mtdata changes corpus names and the functionality of the function, and rearranges a number of internal functions. I hope this will work now. I also filed a few bug reports @mtdata
* Bump mtdata version
* MTdata fixes as per @eu9ene's suggestions
* Update prod.conf with the new mtdata interface
Also removed a nonexistent dataset and some duplicates.
* Update test config
* Sort entries by group to more easily see duplicates
- bugfix
- training and decoding optimization
- evaluation refactoring
- small usability improvements
- moved marian configurations overriding back to configs
- workflow management using Snakemake
- parallelization to run on a cluster
- Singularity containerization support
- Slurm support
- teacher ensemble support