CPU-optimized Neural Machine Translation models for Firefox Translations
Erik Nordin f0ca3e8f5a Add Bosnian to README.md [skip ci] 2024-08-23 13:19:02 -05:00

Firefox Translations models

CPU-optimized NMT models for Firefox Translations.

The model files are hosted using Git LFS.

Models are split into two categories:

  • prod - higher-quality models
  • dev - test models under development (quality or speed may be low)

When a dev model reaches satisfactory quality, it is moved to prod.

Automatic quality evaluation

BLEU scores, COMET scores

The evaluation runs in CI as part of a pull request. The PR should include the models under models/dev or models/prod. The evaluation then runs automatically and pushes its results back to the branch as additional commits (not available for forks). It uses the Microsoft and Google translation APIs, Argos Translate, NLLB, and Opus-MT models, and is performed using the evals tool.

Model training

Use the Firefox Translations training pipeline or the browsermt/students recipe to train CPU-optimized models. They should be similar in size and inference speed to the models already submitted.

Training data

Do not use SacreBLEU or Flores datasets as part of the training data; otherwise the evaluation will not be correct.

To list the SacreBLEU datasets, run sacrebleu --list.

Model contribution

All models should be contributed to the dev folder first.

Maintainers adding models

Create a pull request to the main branch from another branch in this repo (not a fork). This pull request should include the models, and the evaluation will be added as extra commits in the CI task.

Contributors adding models

Create a pull request to the contrib branch. When it is reviewed and merged, a maintainer should create a pull request from contrib to main. This second PR will run the automatic evaluation and add the evaluation commits.

Local testing

You can run the model evaluation locally with bash scripts/update-results.sh. Set the environment variables GCP_CREDS_PATH and AZURE_TRANSLATOR_KEY to use the Google and Microsoft APIs. To run it with bergamot only, remove the mentions of those variables from scripts/update-results.sh and remove microsoft,google from scripts/eval.sh.
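Before starting a local run that uses the Google and Microsoft APIs, it can help to confirm that both variables are set. The following is a minimal sketch and not part of this repository; the helper name missing_api_variables is hypothetical, and only the two variable names come from the text above.

```python
import os

# Variable names taken from the README; everything else is illustrative.
REQUIRED_VARS = ("GCP_CREDS_PATH", "AZURE_TRANSLATOR_KEY")


def missing_api_variables(environ=None):
    """Return the required variables that are unset or empty.

    `environ` defaults to the real process environment; passing a dict
    makes the check easy to try out without touching the shell.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_api_variables()
    if missing:
        print("Missing variables:", ", ".join(missing))
    else:
        print("Both API variables are set.")
```

An empty result means both variables are present. It does not verify that the credentials themselves are valid.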

Model types

Vocabulary

Prefix of the vocabulary file in the model registry:

  • vocab. - vocabulary is reused for the source and target languages
  • srcvocab. and trgvocab. - different vocabularies for the source and target languages
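
The prefix convention above can be checked mechanically. This is a minimal sketch, assuming a model is described by a plain list of file names; the function name vocab_layout and the sample file names are hypothetical and are not taken from the registry.

```python
def vocab_layout(file_names):
    """Classify a model's vocabulary files by their name prefix.

    Returns "shared" when only a `vocab.` file is present, "separate"
    when both `srcvocab.` and `trgvocab.` files are present, and
    "unknown" for any other combination.
    """
    has_shared = any(name.startswith("vocab.") for name in file_names)
    has_src = any(name.startswith("srcvocab.") for name in file_names)
    has_trg = any(name.startswith("trgvocab.") for name in file_names)

    if has_shared and not (has_src or has_trg):
        return "shared"
    if has_src and has_trg and not has_shared:
        return "separate"
    return "unknown"


# Illustrative file names only.
print(vocab_layout(["model.bin", "vocab.spm"]))
print(vocab_layout(["model.bin", "srcvocab.spm", "trgvocab.spm"]))
```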

GEMM precision

Suffix of the model file in the registry:

  • intgemm8.bin - supports gemm-precision: int8shiftAll inference setting
  • intgemm.alphas.bin - supports gemm-precision: int8shiftAlphaAll inference setting
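
The suffix-to-setting mapping above can be written as a small lookup. This is a sketch only: the function name gemm_precision and the sample file names are hypothetical, and the two suffixes and setting values are the ones listed above.

```python
# Suffix -> gemm-precision value, as listed in this README.
GEMM_PRECISION_BY_SUFFIX = {
    "intgemm8.bin": "int8shiftAll",
    "intgemm.alphas.bin": "int8shiftAlphaAll",
}


def gemm_precision(model_file_name):
    """Return the gemm-precision setting implied by the model file suffix.

    Returns None when the file name matches neither known suffix.
    """
    for suffix, precision in GEMM_PRECISION_BY_SUFFIX.items():
        if model_file_name.endswith(suffix):
            return precision
    return None


# Illustrative file names only.
print(gemm_precision("model.intgemm8.bin"))
print(gemm_precision("model.intgemm.alphas.bin"))
```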

Downloading a model from Taskcluster

Example:

cd scripts
SRC=lt TRG=en TASK_ID=SjPZGW9CRYeb9PQr68jCUw bash pull_models.sh

Here TASK_ID is the Taskcluster ID of the export task.
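
If the download needs to be driven from Python, for example to fetch several language pairs in a loop, the same invocation can be wrapped as follows. This is a sketch under assumptions: the wrapper name pull_model and its run parameter are hypothetical, and it only reproduces the shell command shown above by running pull_models.sh from the scripts directory with SRC, TRG and TASK_ID set in the environment.

```python
import os
import subprocess


def pull_model(src, trg, task_id, scripts_dir="scripts", run=subprocess.run):
    """Run pull_models.sh with SRC, TRG and TASK_ID in its environment.

    Equivalent to: cd scripts && SRC=... TRG=... TASK_ID=... bash pull_models.sh
    `run` is injectable so the call can be inspected without executing it.
    """
    env = {**os.environ, "SRC": src, "TRG": trg, "TASK_ID": task_id}
    return run(["bash", "pull_models.sh"], cwd=scripts_dir, env=env, check=True)
```

Usage, with the values from the example above: pull_model("lt", "en", "SjPZGW9CRYeb9PQr68jCUw"). With check=True, a non-zero exit status from the script raises subprocess.CalledProcessError.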

Model deployment

Models are deployed to Remote Settings to be delivered to Firefox.

Records and attachments are uploaded via a CLI tool which lives in the remote_settings directory in this repository.

View the remote_settings README for more details on publishing models.

Currently supported languages

Prod models are available in all Firefox channels including Release. Dev models are available in Nightly only.

Prod

  • Bulgarian <-> English
  • Catalan <-> English
  • Croatian -> English
  • Czech -> English
  • Danish <-> English
  • Dutch <-> English
  • Estonian <-> English
  • Finnish -> English
  • French <-> English
  • German <-> English
  • Greek -> English
  • Hungarian <-> English
  • Indonesian -> English
  • Italian <-> English
  • Latvian (Lettish) -> English
  • Lithuanian -> English
  • Polish <-> English
  • Portuguese <-> English
  • Romanian -> English
  • Russian -> English
  • Serbian -> English
  • Slovak -> English
  • Slovenian <-> English
  • Spanish <-> English
  • Turkish -> English
  • Ukrainian -> English
  • Vietnamese -> English

Dev

  • Bosnian -> English
  • Croatian <- English
  • Czech <- English
  • Finnish <- English
  • Icelandic -> English
  • Latvian (Lettish) <- English
  • Maltese -> English
  • Norwegian Bokmål -> English
  • Norwegian Nynorsk -> English
  • Persian (Farsi) <-> English
  • Russian <- English
  • Slovak <- English
  • Turkish <- English
  • Ukrainian <- English
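
The arrows in the lists above give the translation direction: -> means into English, <- means from English, and <-> means both. As a minimal sketch of how an entry expands into (source, target) pairs, assuming entries are written exactly as above; the function name directions is hypothetical and not part of this repository.

```python
def directions(entry):
    """Expand one list entry into (source, target) language pairs."""
    # Check "<->" first so a two-way entry is never read as one-way.
    for arrow in ("<->", "->", "<-"):
        left, sep, right = entry.partition(f" {arrow} ")
        if sep:
            left, right = left.strip(), right.strip()
            if arrow == "->":
                return [(left, right)]
            if arrow == "<-":
                return [(right, left)]
            return [(left, right), (right, left)]
    raise ValueError(f"no direction arrow in {entry!r}")


print(directions("Bulgarian <-> English"))
print(directions("Croatian <- English"))
```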