# Firefox Translations models
CPU-optimized NMT models for Firefox Translations.

The model files are hosted using [Git LFS](https://docs.github.com/en/github/managing-large-files/versioning-large-files/about-git-large-file-storage).

[prod](models/prod) - higher quality models

[dev](models/dev) - test models under development (can be of low quality or speed).

When a dev model has satisfactory quality, it is moved to prod.
# Automatic quality evaluation
[BLEU scores](evaluation/bleu-results.md), [COMET scores](evaluation/comet-results.md)

The evaluation is run as part of a pull request in CI.
The PR should include the models in the `models/dev` or `models/prod` category.
The evaluation will automatically run, and then commits will be added to the pull request.
The evaluation uses Microsoft and Google translation APIs, Argos Translate, NLLB and Opus-MT models and pushes results back to the branch (not available for forks).
It is performed using the [evals](/evals) tool.
# Model training
Use the [Firefox Translations training pipeline](https://github.com/mozilla/firefox-translations-training) or the [browsermt/students](https://github.com/browsermt/students/tree/master/train-student) recipe to train CPU-optimized models. They should be similar in size and inference speed to the models already submitted.
## Training data
Do not use [SacreBLEU](https://github.com/mjpost/sacrebleu) or [Flores](https://github.com/facebookresearch/flores) datasets as part of the training data, otherwise the evaluation will not be correct.

To see the SacreBLEU datasets, run `sacrebleu --list`.
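
One way to sanity-check a training corpus against this rule is to count exact line overlaps with a test set. The sketch below assumes plain-text files with one sentence per line. The function name `count_overlap` and the file names are hypothetical, and exact matching will not catch near-duplicates.

```shell
# Count lines of a training file that also appear verbatim in a test set.
# Usage: count_overlap TEST_FILE TRAIN_FILE
count_overlap() {
  # -F: fixed strings, -x: whole-line match, -c: print the match count,
  # -f: read the patterns (test sentences) from the first file.
  # grep exits non-zero when the count is 0, so mask the status.
  grep -Fxc -f "$1" "$2" || true
}

# Example (assumes sacrebleu is installed; file names are placeholders):
#   sacrebleu -t wmt19 -l lt-en --echo src > wmt19.lt-en.src
#   count_overlap wmt19.lt-en.src corpus.lt
```

A non-zero count suggests that test sentences are present in the training data and should be filtered out before training.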
# Model contribution
All models should be contributed to the `dev` folder first.
## Maintainers adding models
Create a pull request to the `main` branch from another branch in this repo (not a fork).
This pull request should include the models, and the evaluation will be added as extra commits in the CI task.
## Contributors adding models
Create a pull request to the `contrib` branch.
When it is reviewed and merged, a maintainer should create a pull request from `contrib` to `main`.
This second PR will run the automatic evaluation and add the evaluation commits.
## Local testing
You can run model evaluation locally by running `bash scripts/update-results.sh`.
Make sure to set the environment variables `GCP_CREDS_PATH` and `AZURE_TRANSLATOR_KEY` to use the Google and Microsoft APIs.
If you want to run it with `bergamot` only, remove mentions of those variables from `scripts/update-results.sh` and remove `microsoft,google` from `scripts/eval.sh`.
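
Before launching the script, it can help to confirm that both variables are set. A minimal sketch follows. The helper name `check_eval_env` is hypothetical and is not part of this repository.

```shell
# Return non-zero and report on stderr if a required variable is unset or empty.
check_eval_env() {
  missing=0
  for name in GCP_CREDS_PATH AZURE_TRANSLATOR_KEY; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage: run the evaluation only when the check passes.
#   check_eval_env && bash scripts/update-results.sh
```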
# Model types
## Vocabulary
Prefix of the vocabulary file in the model registry:
- `vocab.` - vocabulary is reused for the source and target languages
- `srcvocab.` and `trgvocab.` - different vocabularies for the source and target languages
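
The prefix convention can be checked mechanically, for example when deciding whether a model ships one vocabulary file or two. This is an illustrative sketch only. The function name `vocab_kind` and the example file name are hypothetical.

```shell
# Classify a vocabulary file by the prefix of its base name.
vocab_kind() {
  case "$(basename "$1")" in
    vocab.*)    echo "shared" ;;   # one vocabulary for source and target
    srcvocab.*) echo "source" ;;   # source-language vocabulary
    trgvocab.*) echo "target" ;;   # target-language vocabulary
    *)          echo "unknown" ;;
  esac
}

# Hypothetical file name, for illustration only:
#   vocab_kind some/dir/vocab.example.spm
```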
## GEMM precision
Suffix of the model file in the registry:
- `intgemm8.bin` - supports `gemm-precision: int8shiftAll` inference setting
- `intgemm.alphas.bin` - supports `gemm-precision: int8shiftAlphaAll` inference setting
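
The suffix-to-setting mapping above can be expressed as a small lookup, which may be handy when generating an inference config for a given model file. This is a sketch under the assumption that the file name ends with exactly one of the two suffixes listed. The function name `gemm_precision` and the example file names are hypothetical.

```shell
# Print the gemm-precision setting that matches a model file name.
# Returns non-zero if the suffix is not one of the two listed above.
gemm_precision() {
  case "$1" in
    *intgemm.alphas.bin) echo "int8shiftAlphaAll" ;;
    *intgemm8.bin)       echo "int8shiftAll" ;;
    *)                   return 1 ;;
  esac
}

# Hypothetical file name, for illustration only:
#   gemm_precision model.example.intgemm.alphas.bin
```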
# Downloading a model from Taskcluster
Example:
```
cd scripts
SRC=lt TRG=en TASK_ID=SjPZGW9CRYeb9PQr68jCUw bash pull_models.sh
```
Here, `TASK_ID` is the Taskcluster ID of the `export` task.
# Model deployment
Models are deployed to Remote Settings to be delivered to Firefox.
Records and attachments are uploaded via a CLI tool which lives in the
`remote_settings` directory in this repository.
View the `remote_settings` [README](https://github.com/mozilla/firefox-translations-models/blob/main/remote_settings/README.md) for more details on publishing models.
# Currently supported Languages
Prod models are available in all Firefox channels including Release.
Dev models are available in Nightly only.
## Prod
- Bulgarian <-> English
- Catalan <-> English
- Croatian -> English
- Czech <-> English
- Danish <-> English
- Dutch <-> English
- Estonian <-> English
- Finnish <-> English
- French <-> English
- German <-> English
- Greek <-> English
- Hungarian <-> English
- Indonesian <-> English
- Italian <-> English
- Latvian (Lettish) -> English
- Lithuanian -> English
- Polish <-> English
- Portuguese <-> English
- Romanian <-> English
- Russian -> English
- Serbian -> English
- Slovak -> English
- Slovenian <-> English
- Spanish <-> English
- Swedish <-> English
- Turkish <-> English
- Ukrainian -> English
- Vietnamese -> English
## Dev
- Bosnian -> English
- Chinese (Simplified) -> English
- Croatian <- English
- Icelandic -> English
- Latvian (Lettish) <- English
- Maltese -> English
- Norwegian Bokmål -> English
- Norwegian Nynorsk -> English
- Persian (Farsi) <-> English
- Russian <- English
- Slovak <- English
- Ukrainian <- English