CPU-optimized Neural Machine Translation models for Firefox Translations
Evgeny Pavlov 9c071beb27
Add a note on labels
2023-09-26 12:25:43 -07:00

README.md

Firefox Translations models

CPU-optimized NMT models for Firefox Translations.

The model files are hosted using Git LFS.

prod - higher-quality models

dev - test models still under development (translation quality or speed may be low).

When a dev model has satisfactory quality, it is moved to prod.

Automatic quality evaluation

Results for prod models: BLEU, COMET

Results for dev models: BLEU, COMET

Automatic evaluation is part of the pull request CI. It uses the Microsoft and Google translation APIs and pushes the results back to the branch (not available for forks). It is performed using the firefox-translations-evaluation tool.

Model training

Use the Firefox Translations training pipeline or the browsermt/students recipe to train CPU-optimized models. New models should be similar in size and inference speed to the models already submitted.

Training data

Do not use SacreBLEU or Flores datasets as part of the training data; otherwise the evaluation results will not be valid.

To list the SacreBLEU datasets, run sacrebleu --list.
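
A quick way to catch accidental overlap is to compare training lines against the evaluation sentences before training. The following is only a minimal sketch: the helper names `normalize` and `find_leaked_lines` are hypothetical and not part of this repo or of SacreBLEU, and the sample sentences are illustrative. It only detects exact matches after whitespace normalization.

```python
def normalize(line: str) -> str:
    """Collapse runs of whitespace so trivially different copies still match."""
    return " ".join(line.split())


def find_leaked_lines(train_lines, test_lines):
    """Return the normalized training lines that also occur in the test set."""
    test_set = {normalize(line) for line in test_lines if line.strip()}
    train_set = {normalize(line) for line in train_lines}
    return sorted(train_set & test_set)


# Illustrative data only; real inputs would be read from corpus files.
train = ["Hello  world", "Good morning", "See you soon"]
test = ["hello world", "Good morning "]
leaked = find_leaked_lines(train, test)
print(leaked)
```

The comparison is case-sensitive, so "Hello world" and "hello world" are treated as different sentences here; a stricter check might lowercase both sides.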

Model contribution

All models should be contributed to the dev folder first.

By maintainers

Create a pull request to the main branch from another branch in this repo.

From forks

Create a pull request to the contrib branch. Once it is reviewed and merged, a maintainer will create another pull request to the main branch to kick off automatic evaluation.

Local testing

You can run model evaluation locally by running bash scripts/update-results.sh. Make sure to set the environment variables GCP_CREDS_PATH and AZURE_TRANSLATOR_KEY to use the Google and Microsoft APIs. If you want to run it with bergamot only, remove the mentions of those variables from scripts/update-results.sh and remove microsoft,google from scripts/eval.sh.
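
Before starting a local run it can help to verify that both variables are set. The wrapper below is a minimal sketch, not part of this repo: the function names `missing_eval_vars` and `run_local_evaluation` are hypothetical, and it assumes the script is started from the repository root.

```python
import os
import subprocess

# Variables named in the instructions above.
REQUIRED_VARS = ("GCP_CREDS_PATH", "AZURE_TRANSLATOR_KEY")


def missing_eval_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def run_local_evaluation(env=None):
    """Run scripts/update-results.sh after checking the environment."""
    env = dict(os.environ if env is None else env)
    missing = missing_eval_vars(env)
    if missing:
        raise RuntimeError("Set these variables first: " + ", ".join(missing))
    # Same command as documented above; expects the repo root as the cwd.
    return subprocess.run(["bash", "scripts/update-results.sh"], env=env, check=True)
```

Calling `run_local_evaluation()` with no arguments uses the current process environment and raises before anything is executed if a variable is missing.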

Model deployment

Create a new release with a version tag x.y.z following semantic versioning.

The models will be automatically uploaded to the GCS bucket gs://bergamot-models-sandbox/x.y.z/.
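
The mapping from a release tag to its upload location can be sketched as below. This is an illustration under assumptions: the helper `release_upload_path` is hypothetical, and the pattern only accepts plain x.y.z tags without pre-release or build metadata, which is narrower than full semantic versioning.

```python
import re

BUCKET = "gs://bergamot-models-sandbox"

# Plain MAJOR.MINOR.PATCH with no leading zeros in any component.
VERSION_RE = re.compile(r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)")


def release_upload_path(tag: str) -> str:
    """Return the bucket prefix for a release tag of the form x.y.z."""
    if not VERSION_RE.fullmatch(tag):
        raise ValueError(f"not an x.y.z version tag: {tag!r}")
    return f"{BUCKET}/{tag}/"


print(release_upload_path("0.3.1"))
```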

Model types

Vocabulary

Prefix of the vocabulary file in the model registry:

  • vocab. - a single vocabulary is shared by the source and target languages
  • srcvocab. and trgvocab. - separate vocabularies for the source and target languages
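
The prefix convention above can be checked mechanically. The following is a minimal sketch: the function `vocab_kind` and its return labels are hypothetical, and the file names in the example are illustrative rather than taken from the registry.

```python
from pathlib import PurePosixPath


def vocab_kind(path: str):
    """Classify a vocabulary file by the prefix of its file name.

    Returns "shared", "source", "target", or None for other files.
    """
    name = PurePosixPath(path).name
    if name.startswith("vocab."):
        return "shared"
    if name.startswith("srcvocab."):
        return "source"
    if name.startswith("trgvocab."):
        return "target"
    return None


# Illustrative file names only.
for example in ["vocab.xxyy.spm", "dir/srcvocab.xxyy.spm", "trgvocab.xxyy.spm"]:
    print(example, "->", vocab_kind(example))
```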

GEMM precision

Suffix of the model file in the registry:

  • intgemm8.bin - supports the gemm-precision: int8shiftAll inference setting
  • intgemm.alphas.bin - supports the gemm-precision: int8shiftAlphaAll inference setting
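
Choosing the inference setting from a model file name can be sketched as a suffix lookup. This is an illustration only: the helper `gemm_precision_for` is hypothetical, the file names are made up for the example, and only the two suffixes listed above are recognized.

```python
# Suffix of the model file -> gemm-precision value, as listed above.
SUFFIX_TO_PRECISION = {
    "intgemm.alphas.bin": "int8shiftAlphaAll",
    "intgemm8.bin": "int8shiftAll",
}


def gemm_precision_for(model_file: str):
    """Return the gemm-precision setting for a model file, or None if unknown."""
    for suffix, precision in SUFFIX_TO_PRECISION.items():
        if model_file.endswith(suffix):
            return precision
    return None


# Illustrative file names only.
print(gemm_precision_for("model.xxyy.intgemm8.bin"))
print(gemm_precision_for("model.xxyy.intgemm.alphas.bin"))
```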

Currently supported languages

The prod/dev labels in this repo correspond to the labels in the legacy web extension and are not related to the native integration in Firefox.

Prod

  • Spanish <-> English
  • Estonian <-> English
  • English <-> German
  • Czech <-> English
  • Bulgarian <-> English
  • Norwegian Bokmål -> English
  • Portuguese <-> English
  • Italian <-> English
  • Polish <-> English
  • French <-> English

Dev

  • Russian <-> English
  • Persian (Farsi) <-> English
  • Icelandic -> English
  • Norwegian Nynorsk -> English
  • Ukrainian <-> English
  • Dutch <-> English
  • Catalan -> English
  • Hungarian -> English
  • Finnish -> English