CPU-optimized Neural Machine Translation models for Firefox Translations

Evgeny Pavlov 6d33a2784c Add enru base (#169 ) * Add enru base * Update evaluation results [skip ci] * Update model registry [skip ci] --------- Co-authored-by: CircleCI evaluation job <ci-models-evaluation@firefox-translations>		2024-10-16 11:47:12 -05:00
.circleci	Add some languages we are planning to support to the comparison (#130 )	2024-01-30 21:39:07 +01:00
evals	Add JSON evaluation (#147 )	2024-06-27 13:37:51 -05:00
evaluation	Add enru base (#169 )	2024-10-16 11:47:12 -05:00
models	Add enru base (#169 )	2024-10-16 11:47:12 -05:00
remote_settings	Clarify the docs for uploading using remote settings (#143 )	2024-03-11 14:26:44 -05:00
scripts	Add English to Czech (#145 )	2024-04-04 11:08:49 +02:00
tests/remote_settings	Fix remote_settings_script staging server URL (#139 )	2024-03-06 14:02:16 -06:00
.gitattributes	Adding en<>es and es<>en models	2021-02-13 14:57:45 -08:00
.gitignore	Support --lang-pair in remote_settings script (#134 )	2024-02-02 11:31:36 -06:00
LICENSE	Add license	2021-07-15 10:43:58 -07:00
Makefile	Enable comet (#84 )	2023-10-23 14:07:43 -07:00
README.md	Release models 2024-10-01 (#164 )	2024-10-11 16:37:17 -07:00
poetry.lock	Add pytest-clarity for better error messages (#123 )	2023-12-08 11:49:13 -06:00
pyproject.toml	Add pytest-clarity for better error messages (#123 )	2023-12-08 11:49:13 -06:00
registry.json	Add enru base (#169 )	2024-10-16 11:47:12 -05:00

README.md

Firefox Translations models

CPU-optimized NMT models for Firefox Translations.

The model files are hosted using Git LFS.

prod - higher-quality models

dev - test models under development (quality or inference speed may be lower).

When a dev model has satisfactory quality, it is moved to prod.

Automatic quality evaluation

BLEU scores, COMET scores

The evaluation runs in CI as part of a pull request. The PR should include the models in either the models/dev or the models/prod category. The evaluation starts automatically and pushes its results back to the branch as additional commits (not available for forks). For comparison, it also runs the Microsoft and Google translation APIs, Argos Translate, NLLB and Opus-MT models. It is performed using the evals tool.

Model training

Use the Firefox Translations training pipeline or the browsermt/students recipe to train CPU-optimized models. New models should be similar in size and inference speed to the models already submitted.

Training data

Do not use SacreBLEU or Flores datasets as part of the training data. The evaluation relies on these datasets, so a model trained on them would produce misleading scores.

To list the SacreBLEU datasets, run sacrebleu --list.
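
One way to guard against this is to filter the training corpus against the evaluation sets before training. The sketch below is only an illustration and is not part of this repository: the names normalize and drop_test_overlap are hypothetical, and it assumes the sentences fit in memory as plain strings and that an exact match after case and whitespace normalization is a good enough overlap check.

```python
def normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(sentence.lower().split())


def drop_test_overlap(train_pairs, test_sources):
    """Return the training pairs whose source side does not occur in the test set.

    train_pairs: iterable of (source, target) tuples.
    test_sources: iterable of source sentences from the evaluation datasets.
    """
    held_out = {normalize(s) for s in test_sources}
    return [(src, trg) for src, trg in train_pairs if normalize(src) not in held_out]


if __name__ == "__main__":
    # Toy data for illustration only.
    train = [("Hello  world.", "Hallo Welt."), ("Good morning.", "Guten Morgen.")]
    test = ["hello world."]
    print(drop_test_overlap(train, test))
```

A real pipeline would likely need fuzzier matching and streaming over large files; the exact-match set lookup is used here only to keep the idea visible.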

Model contribution

All models should be contributed to the dev folder first.

Maintainers adding models

Create a pull request to the main branch from another branch in this repo (not a fork). The pull request should include the models; the CI task will add the evaluation results as extra commits.

Contributors adding models

Create a pull request to the contrib branch. When it is reviewed and merged, a maintainer should create a pull request from contrib to main. This second PR will run the automatic evaluation and add the evaluation commits.

Local testing

You can run the model evaluation locally with bash scripts/update-results.sh. To use the Google and Microsoft APIs, set the environment variables GCP_CREDS_PATH and AZURE_TRANSLATOR_KEY. To run the evaluation with bergamot only, remove the mentions of those variables from scripts/update-results.sh and remove microsoft,google from scripts/eval.sh.
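
Before starting a local run, it can help to confirm that both variables are actually set. The following is a minimal sketch, not a script from this repository; the function name missing_credentials is hypothetical, and only the two variable names come from the paragraph above.

```python
import os

# Variable names taken from the paragraph above.
REQUIRED_VARS = ("GCP_CREDS_PATH", "AZURE_TRANSLATOR_KEY")


def missing_credentials(env=None):
    """Return the required variables that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Set these before running scripts/update-results.sh:", ", ".join(missing))
    else:
        print("Credentials found; run: bash scripts/update-results.sh")
```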

Model types

Vocabulary

Prefix of the vocabulary file in the model registry:

vocab. - vocabulary is reused for the source and target languages
srcvocab. and trgvocab. - different vocabularies for the source and target languages
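
The prefix rule above can be expressed as a small check. This is a sketch under assumptions: the function name vocabulary_layout is hypothetical, and the file names in the usage note are placeholders rather than real registry entries.

```python
def vocabulary_layout(file_names):
    """Classify a model's vocabulary setup from its file names.

    Returns "shared" if a file starts with "vocab.", "separate" if both
    "srcvocab." and "trgvocab." files are present, otherwise "unknown".
    """
    names = list(file_names)
    if any(n.startswith("vocab.") for n in names):
        return "shared"
    has_src = any(n.startswith("srcvocab.") for n in names)
    has_trg = any(n.startswith("trgvocab.") for n in names)
    if has_src and has_trg:
        return "separate"
    return "unknown"
```

For example, vocabulary_layout(["srcvocab.xxyy.spm", "trgvocab.xxyy.spm"]) returns "separate", while a list containing a single vocab.xxyy.spm file returns "shared".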

GEMM precision

Suffix of the model file in the registry:

intgemm8.bin - supports gemm-precision: int8shiftAll inference setting
intgemm.alphas.bin - supports gemm-precision: int8shiftAlphaAll inference setting
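
A consumer of the registry could pick the inference setting from the file suffix roughly as follows. This is an illustrative sketch: the function name gemm_precision and the dictionary name are hypothetical, and only the two suffix-to-setting pairs come from the list above.

```python
# Suffix-to-setting pairs taken from the list above.
GEMM_PRECISION_BY_SUFFIX = {
    "intgemm.alphas.bin": "int8shiftAlphaAll",
    "intgemm8.bin": "int8shiftAll",
}


def gemm_precision(model_file_name):
    """Return the gemm-precision value for a model file, or None if unrecognized."""
    for suffix, setting in GEMM_PRECISION_BY_SUFFIX.items():
        if model_file_name.endswith(suffix):
            return setting
    return None
```

Returning None for an unknown suffix leaves the decision to the caller instead of silently falling back to a default precision.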

Downloading a model from Taskcluster

Example:

cd scripts
SRC=lt TRG=en TASK_ID=SjPZGW9CRYeb9PQr68jCUw bash pull_models.sh

Here TASK_ID is the Taskcluster ID of the export task.
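
When several models need to be pulled, the same call can be wrapped in Python. This is a sketch, not a tool shipped with this repository: the helper names pull_models_env and pull_models are hypothetical, and it assumes pull_models.sh reads SRC, TRG and TASK_ID from the environment exactly as in the shell example above.

```python
import os
import subprocess


def pull_models_env(src, trg, task_id, base_env=None):
    """Build the environment that pull_models.sh reads its inputs from."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({"SRC": src, "TRG": trg, "TASK_ID": task_id})
    return env


def pull_models(src, trg, task_id, scripts_dir="scripts"):
    """Run pull_models.sh from the scripts directory, as in the shell example."""
    subprocess.run(
        ["bash", "pull_models.sh"],
        cwd=scripts_dir,
        env=pull_models_env(src, trg, task_id),
        check=True,  # raise CalledProcessError if the script fails
    )
```

Calling pull_models("lt", "en", "SjPZGW9CRYeb9PQr68jCUw") from the repository root corresponds to the shell example above.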

Model deployment

Models are deployed to Remote Settings to be delivered to Firefox.

Records and attachments are uploaded via a CLI tool which lives in the remote_settings directory in this repository.

View the remote_settings README for more details on publishing models.

Currently supported languages

Prod models are available in all Firefox channels including Release. Dev models are available in Nightly only.

Prod

Bulgarian <-> English
Catalan <-> English
Croatian -> English
Czech <-> English
Danish <-> English
Dutch <-> English
Estonian <-> English
Finnish <-> English
French <-> English
German <-> English
Greek <-> English
Hungarian <-> English
Indonesian <-> English
Italian <-> English
Latvian (Lettish) -> English
Lithuanian -> English
Polish <-> English
Portuguese <-> English
Romanian <-> English
Russian -> English
Serbian -> English
Slovak -> English
Slovenian <-> English
Spanish <-> English
Swedish <-> English
Turkish <-> English
Ukrainian -> English
Vietnamese -> English

Dev

Bosnian -> English
Chinese (Simplified) -> English
Croatian <- English
Icelandic -> English
Latvian (Lettish) <- English
Maltese -> English
Norwegian Bokmål -> English
Norwegian Nynorsk -> English
Persian (Farsi) <-> English
Russian <- English
Slovak <- English
Ukrainian <- English