Training pipelines for Firefox Translations neural machine translation models
Ben Hearsum (he/him) 9ad9abaacf
Refactor b-cpu-xlargedisk worker pools to allow for experimentation with different configurations (#686)
Same as #674, except for main. At the moment the pipeline doesn't run there because there's no `b-linux-large-gcp-1tb` workers anymore.
2024-06-19 09:40:16 -04:00
.github Migrate the github actions to taskcluster (#184) 2023-09-07 13:10:36 -05:00
3rd_party Integrate OpusTrainer (#219) 2023-11-17 16:59:02 -08:00
configs Spring 2024 config fixes (#659) 2024-06-11 15:20:40 -07:00
docker Remove the Makefile and replace it with a Taskfile (#510) 2024-04-09 16:11:13 -05:00
docs Parse stalled validation data (#637) 2024-05-30 10:45:50 -07:00
pipeline Fix data importing for sacrebleu (#658) 2024-06-03 15:56:21 -05:00
snakemake Move snakemake to a separate folder (#431) 2024-02-09 09:46:52 -08:00
taskcluster Refactor b-cpu-xlargedisk worker pools to allow for experimentation with different configurations (#686) 2024-06-19 09:40:16 -04:00
tests Parse stalled validation data (#637) 2024-05-30 10:45:50 -07:00
tracking Fix offline group publication (#638) 2024-05-30 17:01:29 -07:00
utils Spring 2024 config fixes (#659) 2024-06-11 15:20:40 -07:00
.dockerignore Remove the Makefile and replace it with a Taskfile (#510) 2024-04-09 16:11:13 -05:00
.gitignore Add publication package (#309) 2024-01-11 13:25:53 -08:00
.gitmodules Integrate deduplication in the pipeline (#70) 2022-02-11 16:50:41 -08:00
.taskcluster.yml Upgrade to Taskgraph 8 (#664) 2024-06-11 12:43:12 -04:00
CODE_OF_CONDUCT.md Initial pipeline (#1) 2021-06-17 15:39:15 -07:00
LICENSE Initial commit 2021-04-30 15:36:49 -07:00
README.md Move snakemake to a separate folder (#431) 2024-02-09 09:46:52 -08:00
Taskfile.yml Automatically generate training config files with the `task config-generator` (#620) 2024-05-24 16:09:05 -05:00
poetry.lock Bump taskcluster-taskgraph to 8.2.0 in poetry (#672) 2024-06-19 09:11:46 -04:00
pyproject.toml Bump taskcluster-taskgraph to 8.2.0 in poetry (#672) 2024-06-19 09:11:46 -04:00

README.md

Firefox Translations training

Training pipelines for Firefox Translations machine translation models.

The trained models are hosted in the firefox-translations-models repository. They are compatible with bergamot-translator and have powered Firefox web page translation since version 118.

The pipeline was originally developed as part of the Bergamot project, which focuses on improving client-side machine translation in a web browser.

Documentation

Pipeline

The pipeline can train a translation model for a language pair end to end. Translation quality depends on the chosen datasets, data cleaning procedures, and hyperparameters. Some settings, especially for low-resource languages, might require extra tuning.

We use Marian, a fast translation engine.

You can find more details about the pipeline steps in the documentation.

Orchestrators

An orchestrator is responsible for workflow management and parallelization.

  • Taskcluster - Mozilla's task execution framework, which is also used for Firefox CI. It provides access to hybrid cloud workers (GCP + on-prem) with increased scalability and observability. Usage instructions.
  • Snakemake - a file-based orchestrator that can run the pipeline locally or on a Slurm cluster. Usage instructions. (The integration has not been maintained since Mozilla switched to Taskcluster. Contributions are welcome.)
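
To illustrate what "workflow management and parallelization" means in practice, here is a minimal Python sketch of a toy orchestrator. It is not how Taskcluster or Snakemake are implemented; the step names and their dependencies are hypothetical and chosen only for illustration. Steps run in dependency order, and steps whose dependencies are all satisfied run together in one batch.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical steps: each key maps to the set of steps it depends on.
STEPS = {
    "download": set(),
    "clean": {"download"},
    "train-teacher": {"clean"},
    "evaluate": {"train-teacher"},
    "export": {"train-teacher"},
}


def run_step(name):
    """Placeholder for real work; a real step would launch a job."""
    return name


def run_pipeline(steps, max_workers=2):
    """Run steps in dependency order and return the batches that ran together."""
    sorter = TopologicalSorter(steps)
    sorter.prepare()
    batches = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Every step returned here has all of its dependencies finished.
            ready = sorted(sorter.get_ready())
            finished = list(pool.map(run_step, ready))
            for name in finished:
                sorter.done(name)
            batches.append(finished)
    return batches


if __name__ == "__main__":
    for batch in run_pipeline(STEPS):
        print(batch)
```

In this toy graph, `evaluate` and `export` both depend only on `train-teacher`, so they end up in the same batch and can run in parallel. A production orchestrator adds what this sketch leaves out, such as retries, caching of completed steps, and dispatching work to remote machines.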

Experiment tracking

Marian training metrics are parsed from logs and published using a custom module within the tracking directory. More information is available here.
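
As a rough illustration of parsing metrics from logs, the Python sketch below extracts the epoch, update count, and cost from a single training log line. It is not the code in the tracking directory: the function name `parse_training_line` is hypothetical, and the line shape (`Ep. N : Up. N : ... : Cost X`) is an assumption about how Marian formats training progress, so check it against real logs before relying on it. The sample line is synthetic.

```python
import re

# Assumed shape of a Marian training progress line (verify against real logs):
#   [timestamp] Ep. 1 : Up. 1000 : Sen. 276,635 : Cost 5.60 : Time 294.06s : ...
TRAIN_LINE = re.compile(
    r"Ep\.\s*(?P<epoch>\d+)\s*:\s*Up\.\s*(?P<update>\d+)\s*:.*?Cost\s+(?P<cost>[\d.]+)"
)


def parse_training_line(line):
    """Return a dict of metrics for a training line, or None if it does not match."""
    match = TRAIN_LINE.search(line)
    if match is None:
        return None
    return {
        "epoch": int(match.group("epoch")),
        "update": int(match.group("update")),
        "cost": float(match.group("cost")),
    }


if __name__ == "__main__":
    # Synthetic example line, not taken from a real training run.
    sample = "[2024-01-01 00:00:00] Ep. 1 : Up. 1000 : Sen. 276,635 : Cost 5.60 : Time 294.06s"
    print(parse_training_line(sample))
```

Lines that do not match the pattern return `None`, so a caller can skip unrelated log output without raising an error.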

Learning resources

Acknowledgements

This project uses materials developed by:

  • Bergamot project (github, website) that has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 825303
  • HPLT project (github, website) that has received funding from the European Union's Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government's Horizon Europe funding guarantee [grant number 10052546]
  • OPUS-MT project (github, website)
  • Many other open source projects and research papers (see References)