Commit Graph

25 Commits

Author SHA1 Message Date
Greg Tatum fd2f7da7a4
Spring 2024 config fixes (#659)
* Remove swedish_work_environment

* Remove failing datasets

---------

Co-authored-by: Evgeny Pavlov <epavlov@mozilla.com>
2024-06-11 15:20:40 -07:00
Evgeny Pavlov 7f12b1ad01
Prepare configs for training (#636)
* Add prod configs

* Add augmentation to the validation set

* Fix validation datasets

* Switch stage to teacher training

* Use evaluate-teacher as target

* Disable some mono data and add comments

* Disable NLLB mono for back-translations

* Add uk-en

* Move manually edited configs to a separated folder

* Fix datasets
2024-05-30 14:06:50 -07:00
Greg Tatum cb88080703
Fix the monolingual nllb dataset to use zst and report the sentence count (#650)
* Update the config generator

* Update the configs
2024-05-30 10:20:59 -07:00
Greg Tatum 789bfa635f
Add mono nllb to the config generation (#647)
* Add nllb mono to config generation

* Update configs

* Remove extra space from config endings
2024-05-29 14:22:37 -07:00
Evgeny Pavlov 3d592e570b
Add HPLT mono data to the configs (#646)
* Add HPLT mono data

* Revert dataset stats
2024-05-29 11:44:24 -07:00
Greg Tatum 4a86accaae
Update the config generator to remove 404s, and improve the Content-Length lookup (#643)
2024-05-28 16:17:37 -05:00
Greg Tatum b5ee6eeced
Fix config generation for ted talks and hide known inaccurate sizes (#641)
2024-05-28 14:15:36 -05:00
Greg Tatum 56040c94b9
Automatically generate training config files with the `task config-generator` (#620)
* Create a util to automatically generate configs

* Add the generated configs

* Update the config generation script

* Update the configs

* Update the configs

* Address review comments for the config generator

* Fix find_corpus test
2024-05-24 16:09:05 -05:00
Evgeny Pavlov 31311927ef
Move snakemake to a separate folder (#431)
* Move snakemake code to a separate folder

* Small fixes

* Run linter

* Revert formatting

* Fix readme
2024-02-09 09:46:52 -08:00
Greg Tatum 7f43bd0c7d
Point to the docs for marian args (#381)
2024-01-25 07:31:58 -06:00
Evgeny Pavlov 2d4530d0f5
Always split corpus to a fixed number of parts (#308)
* Always split corpus to a fixed number of parts

* Fix splitting

* Rewrite corpus splitting in Python

* Replace in taskcluster

* Add tests

* Unify compression tool with Taskcluster

* Move zstd installation to docker image

* Disable opuscleaner in CI

* Compress chunks

* Fix file names

* Remove zeros from file index

* Start file index with 1

* Fix corpus splitting

* Add a link to an issue

* Generate script description from doc

* Use new test dir

* Use new test dir

* Test command line args

* Clarify expected files

* Add logging
2023-12-19 15:25:33 -08:00
Greg Tatum 742fb8f999
Add documentation to various parts of the scripts and pipeline (#298)
2023-12-15 13:34:05 -06:00
Evgeny Pavlov 0e757b0070
Integrate OpusTrainer (#219)
- integrated OpusTrainer in train.sh
- added dataset importer that can augment datasets for evaluation
- removed teacher fine-tuning step. The pre-training and fine-tuning are now done in one step
- removed merge-augmented step
- adjusted pipeline settings to work with a higher amount of data
- modified the Snakemake pipeline accordingly but didn't test
- updated browsermt marian
- added docs
- added unit tests
2023-11-17 16:59:02 -08:00
Evgeny Pavlov 83d43bfcf6
Update docs (#224)
* Update docs

* Fix typos

* Fix TC docs

* Fix relative links
2023-10-16 16:33:29 -07:00
Evgeny Pavlov de4218d8cf
Fixes after training on a full dataset (#221)
* Increase workspace

* Add example of TC prod config

* Use 4 gpu worker for scoring

* Use level 1 workers

* Rollback

* Sync task cluster yml with main

* Use a worker with a larger disk

* Increase workspace

* Add example of TC prod config

* Use 4 gpu worker for scoring

* Rollback

* Use a worker with a larger disk
2023-10-10 16:24:55 -07:00
Evgeny Pavlov e9102a37ef
Integrate OpusCleaner (#163)
* Initial integration of opus cleaner

* Support custom filters

* Use opus cleaner in pipeline

* Fix env

* Fix filter generation

* Add more rules

* Fix elrc filter

* Fix env

* Fix frequent patterns filter

* Switch to reading from stdin

* Add a feature flag for opus cleaner

* Fix condition

* Add extra test for non empty files

* Integrate with TC

* Run linter

* Fix step config

* Fix step config

* Fix step config

* Fix step config

* Fix command

* Fix path

* Update OpusCleaner

* Remove warning

* Log filtered length

* Add opuscleaner logs

* Add comments

* Fix using custom filters

* Extract function

* Change the CI target back

* Fix file path

* Replace conda with poetry

* Add doc

* Add more comments

* Rename example filter

* Test corpus

* Fix filter name

* Use opus dataset instead of mtdata

* Make CI faster

* Add sections to makefile

* Fix custom filter search

* Redirect stderr to stdout

* Fix usage of custom config

* Fix config name

* Change back to all
2023-09-26 15:29:07 -07:00
Evgeny Pavlov 299d41c34b
Add TC test run to CI (#195)
* Add snakemake test run to CI

* Add toolchain

* Add docker image

* Reduce datasets

* Move ci to a separate config

* Add utils to poetry

* Fix config

* Fix config

* Disable docker

* Use test docker image

* Fix artifacts dir

* Fix tests

* Fix profile setting

* Fix root dir

* Faster translation

* Expose artifacts

* Change default TC config

* Fix default TC config

* Disable snakemake run

* Enable running on PR

* Fix ci config

* Add vocab size argument

* Retrigger CI

* Add a comment on snakemake run

* Use a smaller teacher model for CI

* Try to retrigger downloading

* Use the same year for mono src and trg

* Revert changes [skip ci]

* Revert test config [skip ci]

* Fix comment [skip ci]
2023-09-20 09:40:30 -07:00
Evgeny Pavlov 7c58f6558b
Move configuration to profiles (#96)
* Move configuration settings to profiles

* Use relative paths

* Fix output formatting

* Update dag

* Update docs
2022-06-17 10:56:07 -07:00
Evgeny Pavlov 270d29b90a
Checkpointing training (#80)
* Enable model checkpointing

* Do not use memory limits

* Reduce training time for testing

* Reconfigure csd3

* Reduce test updates

* Change final model name

* Fix copying of decoder config
2022-03-31 11:03:51 -07:00
Evgeny Pavlov 22a3751a09
Add support of Mozilla slurm cluster (#72)
2022-02-22 17:48:21 -08:00
Nikolay Bogoychev 7e58a6badd
Fix to use with the latest mtdata version (#60)
* Fix to use with the latest mtdata version

The latest mtdata changes corpus names and the functionality of the function, and rearranges a number of internal functions. I hope this will work now. I also filed a few bug reports @mtdata

* Bump mtdata version

* MTdata fixes as per @eu9ene's suggestions

* Update prod.conf with the new MTDATA interface

Also removed non-existing dataset and removed some duplicates.

* Update test config

* Sort entries by group to more easily see duplicates
2022-02-03 15:23:25 -08:00
Evgeny Pavlov a4ada6ce1a
Fine-tuning and bicleaner bug fix (#51)
- split teacher training and finetuning
- fix student fine-tuning
- add support of pretrained vocab
- fix bicleaner
2022-01-24 13:56:49 -08:00
Evgeny Pavlov 174cceaa6f
Bugfix and optimization (#41)
- bugfix
- training and decoding optimization
- evaluation refactoring
- small usability improvements
- moved marian configurations overriding back to configs
2022-01-05 13:24:05 -08:00
Evgeny Pavlov 3b3f33bf25
Quality improvements (#29)
2021-12-06 15:03:35 -08:00
Evgeny Pavlov ef8928b454
Snakemake integration (#24)
- workflow management using Snakemake
- parallelization to run on a cluster
- Singularity containerization support
- Slurm support
- teacher ensemble support
2021-10-28 10:39:09 -07:00