Commit graph

171 commits

Author SHA1 Message Date
Lisa Korotkova 96079ca258
Fix unbound LD_LIBRARY_PATH (#79) 2022-03-23 12:07:26 -07:00
Evgeny Pavlov 22a3751a09
Add support of Mozilla slurm cluster (#72) 2022-02-22 17:48:21 -08:00
Nikolay Bogoychev 75de987ea5
Integrate deduplication in the pipeline (#70)
* bicleaner-ai parallelism fixes

* i don't know bash

* Use variables for src and trg columns

* global variable doesn't appear in the function scope

* Add deduplication to the pipeline

Different datasets often contain the same sentences, so we should remove the duplicates before we start training. We should also not shuffle the training data ourselves: marian does its own shuffling, so ours is redundant.

* Forgot snakemake targets

* Try to get snakemake right

* Forgot a comma

* remove deduper binary from the list on the top

* Add deduper target again

* typo

* Add a build directory

* deduplication binary location

* Remove hardcoded path to third_party_dir

* Add deduplication at the cleaning step

* output->input

* Test array minus 1

* Revert "Test array minus 1"

This reverts commit 2b9c79794a.

* new test

* name corpora

* Get deterministic shuffling in
2022-02-11 16:50:41 -08:00
Nikolay Bogoychev 20f6429ac7
hotfix: the biclean() function doesn't have access to global variables (#68) 2022-02-07 10:12:14 -08:00
Nikolay Bogoychev 8dcb6492ae
bicleaner-ai parallelism fixes (#67) 2022-02-04 14:39:39 -08:00
Nikolay Bogoychev 7e58a6badd
Fix to use with the latest mtdata version (#60)
* Fix to use with the latest mtdata version

The latest mtdata changes corpus names, changes the functionality of the function, and rearranges a number of internal functions. I hope this will work now. I also filed a few bug reports @mtdata

* Bump mtdata version

* MTdata fixes as per @eu9ene 's suggestions

* Update prod.conf with the new MTDATA interface

Also removed non-existing dataset and removed some duplicates.

* Update test config

* Sort entries by group to more easily see duplicates
2022-02-03 15:23:25 -08:00
Evgeny Pavlov ddba8ebf25
Fix bicleaner skipping (#59) 2022-01-25 22:26:41 -08:00
Evgeny Pavlov a4ada6ce1a
Fine-tuning and bicleaner bug fix (#51)
- split teacher training and finetuning
- fix student fine-tuning
- add support of pretrained vocab
- fix bicleaner
2022-01-24 13:56:49 -08:00
Kenneth Heafield a0097a86b9
Make patience metric be chrf (#50) 2022-01-14 10:46:38 -08:00
Evgeny Pavlov 174cceaa6f
Bugfix and optimization (#41)
- bugfix
- training and decoding optimization
- evaluation refactoring
- small usability improvements
- moved marian configurations overriding back to configs
2022-01-05 13:24:05 -08:00
Evgeny Pavlov 3b3f33bf25
Quality improvements (#29) 2021-12-06 15:03:35 -08:00
Evgeny Pavlov a09b0ac7ac
Update README.md 2021-10-28 11:07:11 -07:00
Evgeny Pavlov ef8928b454
Snakemake integration (#24)
- workflow management using Snakemake
- parallelization to run on a cluster
- Singularity containerization support
- Slurm support
- teacher ensemble support
2021-10-28 10:39:09 -07:00
Evgeny Pavlov 4260ed5fcd
hotfix: remove corpus shuffling on merge 2021-08-23 22:05:06 -07:00
Evgeny Pavlov 0f6e64cf19
Minor improvements (#20)
- Flores dataset importer
- custom dataset importer
- ability to use a pre-trained backward model
- save experiment config on start
- stubs for dataset caching (decided to sync implementation with workflow manager integration)
- use best bleu models instead of best ce-mean-words
- fix linting warnings
2021-08-17 13:20:34 -07:00
Evgeny Pavlov ec783cfbbb
Bicleaner support + fixes (#13)
SacreBLEU is a regular importer now and evaluation is not limited to sacrebleu datasets.
fixes

Added bicleaner-ai and bicleaner filtering (one or another based on available pretrained language packs).
fixes


Added script to find all datasets based on language pair and importer type, ready to use in config
fixes


Fixed conda environment activation to be reproducible on GCP

Other minor reproducibility fixes
2021-07-26 10:00:49 -07:00
Evgeny Pavlov af2abbf525
Add reference to bergamot project 2021-06-21 14:58:02 -07:00
Evgeny Pavlov 4b12dee551
Fix readme after renaming 2021-06-21 14:38:07 -07:00
Evgeny Pavlov 2bcdef2b36
Rename repo 2021-06-21 14:33:31 -07:00
Evgeny Pavlov 3bea08bf4a
Initial pipeline (#1) 2021-06-17 15:39:15 -07:00
Evgeny Pavlov 8d11fb1e97
Initial commit 2021-04-30 15:36:49 -07:00