firefox-translations-training/utils
Evgeny Pavlov 28de0c8c2d
Configure vocab for CJK (#906)
* Add a converter for Chinese

* Convert imported datasets to simplified

* Add augmentation modifier for CJK

* Update tests

* Move constants to the beginning of the file

* Output tokenized text from alignments step

* Detokenize text in Tags modifier

* Add CJK OpusTrainer configs

* Update taskcluster kinds to use tokenized text and CJK configs

* Test training for Chinese

* Update docs

* Reduce chunk size for alignments

* Add python path env

* Fix comment

* Change character coverage for CJK

* Use larger vocab for CJK

* Use all items from the provided vocabulary

* Revert "Change character coverage for CJK"

This reverts commit a6c35bfe73.

* Use default sentencepiece character coverage

* Relock poetry

* Run linter
2024-11-06 20:45:55 -08:00
tasks                      Add --run-as-user flag to docker-run.py (#919)             2024-11-06 13:57:10 -06:00
build-mono-nllb.py         Remove max_words filtering from data importers (#901)      2024-11-06 14:44:42 -08:00
config_generator.py        Configure vocab for CJK (#906)                             2024-11-06 20:45:55 -08:00
download_hplt.py           Add HPLT mono bulk importer (#645)                         2024-05-29 14:25:08 -07:00
find_corpus.py             Switch bestbleu to chrF (#908)                             2024-11-04 13:49:49 -08:00
marian_client.py           Add Marian server for model testing (#492)                 2024-03-28 15:53:16 -07:00
preflight_check.py         Rename repo (#914)                                         2024-11-01 10:21:28 -05:00
run_model.py               Remove the Makefile and replace it with a Taskfile (#510)  2024-04-09 16:11:13 -05:00
taskcluster_downloader.py  Remove the Makefile and replace it with a Taskfile (#510)  2024-04-09 16:11:13 -05:00
tb_log_parser.py           Add ruff and black linting to the CI (#187)                2023-09-08 09:50:24 -05:00
trigger_training.py        Rename repo (#914)                                         2024-11-01 10:21:28 -05:00