* Enable comet

* Update CI image

* Run lfs pull for models

* Move code from the evals repo

* Add data folder

* Update evaluation results [skip ci]

* Update model registry [skip ci]

---------

Co-authored-by: CircleCI evaluation job <ci-models-evaluation@firefox-translations>
This commit is contained in:
Evgeny Pavlov 2023-10-23 14:07:43 -07:00 committed by GitHub
Parent 9c071beb27
Commit 4bfeca6ce0
No key matching this signature was found
GPG key ID: 4AEE18F83AFDEB23
55 changed files with 1401 additions and 25 deletions


@ -19,7 +19,7 @@ jobs:
bash scripts/upload.sh
evaluate:
machine:
image: ubuntu-2004-cuda-11.4:202110-01
image: linux-cuda-11:default
resource_class: gpu.nvidia.large
working_directory: ~/mozilla/firefox-translations-models
steps:
@ -41,6 +41,7 @@ jobs:
- run:
name: Running evaluation
command: |
find ./models -type f -exec git lfs pull {} \;
bash scripts/update-results.sh
- run:
name: Showing results

2
.gitignore vendored

@ -131,4 +131,4 @@ dmypy.json
.idea
._.DS_Store
*.bin
*.spm
*.spm

18
Makefile Normal file

@ -0,0 +1,18 @@
### Evaluation
MODELS?=
GCP_CREDS_PATH?=
AZURE_TRANSLATOR_KEY?=
build-docker:
docker build -t bergamot-eval ./evals
run-docker:
docker run --name bergamot-eval -it --shm-size=16gb --rm \
--runtime=nvidia --gpus all \
-v $$(pwd)/models:/models \
-v $(GCP_CREDS_PATH):/.gcp_creds \
-e GOOGLE_APPLICATION_CREDENTIALS=/.gcp_creds \
-e AZURE_TRANSLATOR_KEY=$(AZURE_TRANSLATOR_KEY) \
bergamot-eval


@ -17,7 +17,7 @@ Results for dev models: [BLEU](evaluation/dev/bleu-results.md), [COMET](evaluati
Automatic evaluation is a part of pull request CI.
It uses Microsoft and Google translation APIs and pushes results back to the branch (not available for forks).
It is performed using [firefox-translations-evaluation](https://github.com/mozilla/firefox-translations-evaluation) tool.
It is performed using the [evals](/evals) tool.
# Model training

39
evals/Dockerfile Normal file

@ -0,0 +1,39 @@
FROM nvidia/cuda:11.4.3-runtime-ubuntu20.04
WORKDIR workspace
ARG DEBIAN_FRONTEND=noninteractive
RUN apt update && \
apt -y install sudo git cmake
# See https://marian-nmt.github.io/docs/#installation for Marian requirements
RUN apt-get install -y build-essential \
libboost-all-dev libprotobuf17 protobuf-compiler \
libprotobuf-dev libssl-dev libgoogle-perftools-dev
# Intel MKL - for Marian usage on CPU
RUN apt install -y wget && \
wget -qO- 'https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2023.PUB' | apt-key add -
RUN sh -c 'echo deb https://apt.repos.intel.com/mkl all main > /etc/apt/sources.list.d/intel-mkl.list' && \
apt-get update && \
apt-get install -y intel-mkl-64bit-2020.0-088
# Bergamot
# pcre2 is required to build bergamot-translator with -DUSE_WASM_COMPATIBLE_SOURCES=off
RUN apt-get install -y libpcre2-dev
# Compile bergamot translator
ADD ./install/install-bergamot-translator.sh ./
RUN bash ./install-bergamot-translator.sh
# SacreBLEU and python dependencies
RUN apt-get update && apt-get install -y python3 python3-venv python3-pip
ADD ./requirements.txt ./
RUN pip3 install -r requirements.txt
ADD ./eval ./eval
ADD ./translators ./translators
ADD ./data ./data
CMD ["/bin/bash"]

88
evals/README.md Normal file

@ -0,0 +1,88 @@
# Firefox Translations Evaluation
Calculates BLEU and COMET scores for Firefox Translations [models](https://github.com/mozilla/firefox-translations-models)
using [bergamot-translator](https://github.com/mozilla/bergamot-translator) and compares them to other translation systems.
## Running
We recommend running this on a Linux machine with at least one GPU, inside a Docker container.
If you intend to run it on macOS, run the `eval/evaluate.py` script standalone inside a virtualenv and skip the `Start docker` section below.
You might need to manually install the corresponding packages from the `Dockerfile` on your system and in your virtual environment.
### Install NVIDIA Container Toolkit
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
### Start docker
The recommended memory size for Docker is **16 GB**.
Run from the repo root directory:
```
export MODELS=<absolute path to a local directory with models>
# Specify Azure key and location if you want to add Azure Translator API for comparison
export AZURE_TRANSLATOR_KEY=<Azure translator resource API key>
# Optional, specify if it differs from the default 'global'
export AZURE_LOCATION=<location>
# Specify GCP credentials json path if you want to add Google Translator API for comparison
export GCP_CREDS_PATH=<absolute path to .json>
# Build and run docker container
make build-docker
make start-docker
```
On completion, your terminal should be attached to the launched container.
### Run evaluation
From inside the Docker container, run:
```
python3 eval/evaluate.py \
--translators=bergamot,microsoft,google \
--pairs=all \
--skip-existing \
--gpus=1 \
--evaluation-engine=comet,bleu \
--models-dir=/models/models/prod \
--results-dir=/models/evaluation/prod
```
If you don't have a GPU, use `0` in the `--gpus` argument.
More options:
```
python3 eval/evaluate.py --help
```
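For a quicker run you can restrict evaluation to a single language pair and translator. The invocation below is only an illustration (the pair and paths are examples); it scores the Bergamot en-fr model with BLEU on CPU:
```
python3 eval/evaluate.py \
    --translators=bergamot \
    --pairs=en-fr \
    --evaluation-engine=bleu \
    --gpus=0 \
    --skip-existing \
    --models-dir=/models/models/prod \
    --results-dir=/models/evaluation/prod
```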
## Details
### Installation scripts
`install/install-bergamot-translator.sh` - clones and compiles [bergamot-translator](https://github.com/mozilla/bergamot-translator) and [marian](https://github.com/marian-nmt/marian-dev) (run during the Docker image build).
`install/download-models.sh` - downloads current Mozilla production [models](https://github.com/mozilla/firefox-translations-models).
### Docker & CUDA
The COMET evaluation framework supports CUDA, and you can enable it by setting the `--gpus` argument of the `eval/evaluate.py` script to the number of GPUs you wish to utilize (`0` disables it).
If you are using it, make sure you have the [nvidia container toolkit enabled](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker) in your docker setup.
### Translators
1. **bergamot** - uses compiled [bergamot-translator](https://github.com/mozilla/bergamot-translator) in wasm mode
2. **google** - uses Google Translation [API](https://cloud.google.com/translate)
3. **microsoft** - uses Azure Cognitive Services Translator [API](https://azure.microsoft.com/en-us/services/cognitive-services/translator/)
### Reuse already calculated scores
Use the `--skip-existing` option to reuse already calculated scores saved as `results/xx-xx/*.bleu` (or `*.comet`) files.
This is useful for resuming an interrupted evaluation
or for rebuilding a full report while re-evaluating only selected translators.
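For reference, score files follow the naming scheme `<results-dir>/<src>-<trg>/<dataset>.<translator>.<trg>.<engine>`; an illustrative listing:
```
/models/evaluation/dev/fi-en/wmt19.bergamot.en.bleu
/models/evaluation/dev/fi-en/wmt19.bergamot.en.comet
```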
### Datasets
[SacreBLEU](https://github.com/mjpost/sacrebleu) - all available datasets for a language pair are used for evaluation.
[Flores](https://github.com/facebookresearch/flores) - parallel evaluation dataset for 101 languages.
### Language pairs
With option `--pairs=all`, language pairs will be discovered
in the specified models folder (option `--models-dir`)
and evaluation will run for all of them.
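Pair discovery simply derives the pair from each model folder name (for example, a folder named `enfr` is evaluated as `en-fr`). A minimal shell sketch of the same idea, assuming the prod models path used above:
```
for dir in /models/models/prod/*/; do
    name=$(basename "$dir")          # e.g. "enfr"
    echo "${name:0:2}-${name: -2}"   # prints "en-fr"
done
```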
### Results
Results will be written to the specified directory (option `--results-dir`).

18
evals/data/flores.sh Normal file

@ -0,0 +1,18 @@
#!/bin/bash
##
# Downloads flores dataset
#
set -x
set -euo pipefail
dir=$1
mkdir -p "${dir}"
test -s "${dir}/flores101_dataset.tar.gz" ||
wget -O "${dir}/flores101_dataset.tar.gz" "https://dl.fbaipublicfiles.com/flores101/dataset/flores101_dataset.tar.gz"
test -s "${dir}/flores101_dataset/dev/eng.dev" ||
tar -xzf "${dir}/flores101_dataset.tar.gz" -C "${dir}" --no-same-owner


@ -0,0 +1,55 @@
# What is BLEU
[BLEU (BiLingual Evaluation Understudy)](https://en.wikipedia.org/wiki/BLEU) is a metric for automatically evaluating machine-translated text. The BLEU score is a number between zero and one that measures the similarity of the machine-translated text to a set of high quality reference translations. A value of 0 means that the machine-translated output has no overlap with the reference translation (low quality) while a value of 1 means there is perfect overlap with the reference translations (high quality).
It has been shown that BLEU scores correlate well with human judgment of translation quality. Note that even human translators do not achieve a perfect score of 1.0.
BLEU scores are expressed as a percentage rather than a decimal between 0 and 1.
Trying to compare BLEU scores across different corpora and languages is strongly discouraged. Even comparing BLEU scores for the same corpus but with different numbers of reference translations can be highly misleading.
However, as a rough guideline, the following interpretation of BLEU scores (expressed as percentages rather than decimals) might be helpful.
BLEU Score | Interpretation
--- | ---
< 10 | Almost useless
10 - 19 | Hard to get the gist
20 - 29 | The gist is clear, but has significant grammatical errors
30 - 40 | Understandable to good translations
40 - 50 | High quality translations
50 - 60 | Very high quality, adequate, and fluent translations
\> 60 | Quality often better than human
[More mathematical details](https://cloud.google.com/translate/automl/docs/evaluate#the_mathematical_details)
Source: https://cloud.google.com/translate/automl/docs/evaluate#bleu
BLEU is the most popular benchmark in academia, so using BLEU also allows us to compare our results with research papers and competitions (see the [Conference on Machine Translation (WMT)](http://statmt.org/wmt21/)).
Read [this article](https://www.rws.com/blog/understanding-mt-quality-bleu-scores/) to better understand what BLEU is and why it is not perfect.
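As a toy illustration of how such scores are produced (not part of the report pipeline; file names are made up), SacreBLEU can score a hypothesis file against a reference file directly:
```
echo "the cat sat on the mat" > hyp.en
echo "the cat sat on a mat" > ref.en
sacrebleu ref.en -i hyp.en --score-only   # prints a BLEU score on the 0-100 scale
```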
# What are these benchmarks
## Translators
1. **bergamot** - uses compiled [bergamot-translator](https://github.com/mozilla/bergamot-translator) (wrapper for marian that is used by Firefox Translations web extension)
2. **google** - uses Google Translation [API](https://cloud.google.com/translate)
3. **microsoft** - uses Azure Cognitive Services Translator [API](https://azure.microsoft.com/en-us/services/cognitive-services/translator/)
Translation quality of Marian and Bergamot is supposed to be very similar.
## Method
We use official WMT ([Conference on Machine Translation](http://statmt.org/wmt21/)) parallel datasets. Available datasets are discovered automatically based on a language pair.
We translate from the source to the target language using one of the three translation systems, compare the result with the dataset reference, and calculate the BLEU score.
Evaluation is done with the [SacreBLEU](https://github.com/mjpost/sacrebleu) tool, which is reliable and widely used in the academic world.
Both absolute and relative differences in BLEU scores between Bergamot and other systems are reported.
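Concretely, each translator/dataset combination is scored with a pipeline of the following shape (see `evals/eval/eval.sh`; `$TRANSLATOR_CMD` stands for the translator wrapper, and the dataset and pair are placeholders):
```
sacrebleu -t wmt19 -l fi-en --echo src \
    | $TRANSLATOR_CMD \
    | sacrebleu --score-only -q -t wmt19 -l fi-en
```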
# Evaluation results
`avg` = average on all datasets

7
evals/eval/clean-cache.sh Executable file

@ -0,0 +1,7 @@
#!/bin/bash
set -e
# clean SacreBLEU cache to fix error
# "This could be a problem with your system output or with sacreBLEU's reference database"
test -e /root/.sacrebleu && rm -r /root/.sacrebleu


@ -0,0 +1,47 @@
# What is COMET
COMET is a neural framework for training multilingual machine translation evaluation models. The framework has been reported to achieve new state-of-the-art levels of correlation with human judgments. It leverages recent breakthroughs in cross-lingual pre-trained language modeling, resulting in highly multilingual and adaptable MT evaluation models.
To showcase the framework, three models were trained on different types of human judgments: Direct Assessments, Human-mediated Translation Edit Rate, and Multidimensional Quality Metrics. These models are designed to exploit information from both the source input and a target-language reference translation to more accurately predict MT quality.
The models have achieved new state-of-the-art performance on the WMT 2019 Metrics shared task, demonstrating robustness to high-performing systems.
## Interpreting Scores:
When using COMET to evaluate machine translation, it's important to understand how to interpret the scores it produces.
In general, COMET models are trained to predict quality scores for translations. These scores are typically normalized using a z-score transformation to account for individual differences among annotators. While the raw score itself does not have a direct interpretation, it is useful for ranking translations and systems according to their quality.
However, for the latest COMET models like Unbabel/wmt22-comet-da, we have introduced a new training approach that scales the scores between 0 and 1. This makes it easier to interpret the scores: a score close to 1 indicates a high-quality translation, while a score close to 0 indicates a translation that is no better than random chance.
It's worth noting that when using COMET to compare the performance of two different translation systems, it's important to run the comet-compare command to obtain statistical significance measures. This command compares the output of two systems using a statistical hypothesis test, providing an estimate of the probability that the observed difference in scores between the systems is due to chance. This is an important step to ensure that any differences in scores between systems are statistically significant.
Overall, the added interpretability of scores in the latest COMET models, combined with the ability to assess statistical significance between systems using comet-compare, makes COMET a valuable tool for evaluating machine translation.
Source: https://aclanthology.org/2020.emnlp-main.213.pdf
Tool: https://github.com/Unbabel/COMET
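For reference, the system-level score for a standard dataset is produced with a `comet-score` call of roughly this shape (mirroring `evals/eval/eval-comet.sh`; dataset, pair, and file name are placeholders):
```
comet-score --gpus 1 --quiet --only_system \
    -d wmt19:fi-en \
    -t wmt19.bergamot.en
```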
# What are these benchmarks
## Translators
1. **bergamot** - uses compiled [bergamot-translator](https://github.com/mozilla/bergamot-translator) (wrapper for marian that is used by Firefox Translations web extension)
2. **google** - uses Google Translation [API](https://cloud.google.com/translate)
3. **microsoft** - uses Azure Cognitive Services Translator [API](https://azure.microsoft.com/en-us/services/cognitive-services/translator/)
## Method
We use official WMT ([Conference on Machine Translation](http://statmt.org/wmt21/)) parallel datasets. Available datasets are discovered automatically based on a language pair.
We perform the translation from source to target language using one of the three translation systems, compare the result with the dataset reference, and then calculate the [COMET](https://github.com/Unbabel/COMET) score.
Both absolute and relative differences in the scores between Bergamot and other systems are reported.
We also compare the systems using the `comet-compare` tool that calculates the statistical significance with Paired T-Test and bootstrap resampling.
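The comparison is a `comet-compare` invocation of roughly this shape (as built in `run_comet_compare` in `evals/eval/evaluate.py`; dataset, pair, and file names are placeholders):
```
comet-compare --gpus 1 \
    -d wmt19:fi-en \
    -t wmt19.bergamot.en wmt19.google.en wmt19.microsoft.en
```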
# Evaluation results
`avg` = average on all datasets

15
evals/eval/eval-comet.sh Executable file

@ -0,0 +1,15 @@
#!/bin/bash
set -e
set -o pipefail
mkdir -p $(dirname "${EVAL_PREFIX}")
sacrebleu -t "$DATASET" -l "$SRC-$TRG" --echo src \
| tee "$EVAL_PREFIX.$SRC" \
| $TRANSLATOR_CMD \
| tee "$EVAL_PREFIX.$TRANSLATOR.$TRG" \
; comet-score --gpus "$GPUS" --quiet --only_system -d "$DATASET:$SRC-$TRG" -t "$EVAL_PREFIX.$TRANSLATOR.$TRG" \
| awk -F"score: " '{print $2}' \
| tee "$EVAL_PREFIX.$TRANSLATOR.$TRG.comet"

11
evals/eval/eval-custom-comet.sh Executable file

@ -0,0 +1,11 @@
#!/bin/bash
set -e
set -o pipefail
$TRANSLATOR_CMD < "$EVAL_PREFIX.$SRC" \
| tee "$EVAL_PREFIX.$TRANSLATOR.$TRG" \
; comet-score --quiet --only_system --gpus "$GPUS" -s "$EVAL_PREFIX.$SRC" -t "$EVAL_PREFIX.$TRANSLATOR.$TRG" -r "$EVAL_PREFIX.$TRG" \
| awk -F"score: " '{print $2}' \
| tee "$EVAL_PREFIX.$TRANSLATOR.$TRG.comet"

12
evals/eval/eval-custom.sh Executable file

@ -0,0 +1,12 @@
#!/bin/bash
set -e
set -o pipefail
$TRANSLATOR_CMD < "$EVAL_PREFIX.$SRC" \
| tee "$EVAL_PREFIX.$TRANSLATOR.$TRG" \
| sacrebleu --score-only -q -l "$SRC-$TRG" "$EVAL_PREFIX.$TRG" \
| tee "$EVAL_PREFIX.$TRANSLATOR.$TRG.bleu"

14
evals/eval/eval.sh Executable file

@ -0,0 +1,14 @@
#!/bin/bash
set -e
set -o pipefail
mkdir -p $(dirname "${EVAL_PREFIX}")
sacrebleu -t "$DATASET" -l "$SRC-$TRG" --echo src \
| tee "$EVAL_PREFIX.$SRC" \
| $TRANSLATOR_CMD \
| tee "$EVAL_PREFIX.$TRANSLATOR.$TRG" \
| sacrebleu --score-only -q -t "$DATASET" -l "$SRC-$TRG" \
| tee "$EVAL_PREFIX.$TRANSLATOR.$TRG.bleu"

391
evals/eval/evaluate.py Normal file

@ -0,0 +1,391 @@
import shutil
import subprocess
import os
from collections import defaultdict
import statistics
import traceback
from sacrebleu import dataset
import click
from toolz import groupby
from glob import glob
import pandas as pd
from mtdata import iso
from os.path import exists
HOME_DIR = './'
EVAL_DIR = os.path.join(HOME_DIR, 'eval')
EVAL_PATH = os.path.join(EVAL_DIR, 'eval.sh')
EVAL_PATH_COMET = os.path.join(EVAL_DIR, 'eval-comet.sh')
EVAL_CUSTOM_PATH = os.path.join(EVAL_DIR, 'eval-custom.sh')
EVAL_CUSTOM_PATH_COMET = os.path.join(EVAL_DIR, 'eval-custom-comet.sh')
CLEAN_CACHE_PATH = os.path.join(EVAL_DIR, 'clean-cache.sh')
CUSTOM_DATASETS = ['flores-dev', 'flores-test']
CUSTOM_DATA_DIR = os.path.join(HOME_DIR, 'data')
FLORES_PATH = os.path.join(CUSTOM_DATA_DIR, 'flores.sh')
BERGAMOT_APP_PATH = os.path.join(HOME_DIR, 'bergamot-translator', 'build', 'app', 'bergamot')
BERGAMOT_EVAL_PATH = os.path.join(HOME_DIR, 'translators', 'bergamot.sh')
TRANS_ORDER = {'bergamot': 0,
'google': 1,
'microsoft': 2}
def get_dataset_prefix(dataset_name, pair, results_dir):
dataset_name = dataset_name.replace('/', '_')
return os.path.join(results_dir, f'{pair[0]}-{pair[1]}', f'{dataset_name}')
def get_bleu_path(dataset_name, pair, results_dir, translator, evaluation_engine):
prefix = get_dataset_prefix(dataset_name, pair, results_dir)
return f'{prefix}.{translator}.{pair[1]}.{evaluation_engine}'
# Custom data
def download_custom_data():
print('Downloading Flores dataset')
os.makedirs(CUSTOM_DATA_DIR, exist_ok=True)
subprocess.run(['bash', FLORES_PATH, CUSTOM_DATA_DIR])
def copy_flores_lang(dataset_name, lang, eval_prefix):
flores_dataset = 'dev' if dataset_name == 'flores-dev' else 'devtest'
if lang == 'zh' or lang == 'zh-Hans':
lang_code = 'zho_simpl'
elif lang == 'zh-Hant':
lang_code = 'zho_trad'
elif lang == 'nb':
lang_code = 'nob'
else:
lang_code = iso.iso3_code(lang)
os.makedirs(os.path.dirname(eval_prefix), exist_ok=True)
shutil.copy(os.path.join(CUSTOM_DATA_DIR, 'flores101_dataset', flores_dataset, f'{lang_code}.{flores_dataset}'),
f'{eval_prefix}.{lang}')
def copy_custom_data(dataset_name, pair, results_dir):
src, trg = pair
eval_prefix = get_dataset_prefix(dataset_name, pair, results_dir)
if dataset_name.startswith('flores'):
copy_flores_lang(dataset_name, src, eval_prefix)
copy_flores_lang(dataset_name, trg, eval_prefix)
else:
raise ValueError(f'Unsupported custom dataset: {dataset_name}')
# Evaluation
def find_datasets(pair):
formatted_pair = f'{pair[0]}-{pair[1]}'
datasets = []
datasets += CUSTOM_DATASETS
for dataset_name, descr in dataset.DATASETS.items():
is_wmt_official = dataset_name.startswith('wmt') and len(dataset_name) == 5
is_other_accepted = dataset_name == 'iwslt17' or dataset_name == 'mtedx/test'
if not (is_wmt_official or is_other_accepted) or formatted_pair not in descr.langpairs:
continue
datasets.append(dataset_name)
return datasets
def evaluate(pair, set_name, translator, evaluation_engine, gpus, models_dir, results_dir):
source, target = pair
my_env = os.environ.copy()
my_env['SRC'] = source
my_env['TRG'] = target
my_env['DATASET'] = set_name
my_env['EVAL_PREFIX'] = get_dataset_prefix(set_name, pair, results_dir)
my_env['TRANSLATOR'] = translator
my_env['GPUS'] = gpus
if translator == 'bergamot':
my_env['MODEL_DIR'] = os.path.join(models_dir, f'{source}{target}')
my_env['APP_PATH'] = BERGAMOT_APP_PATH
cmd = f'bash {BERGAMOT_EVAL_PATH}'
elif translator == 'google':
cmd = f"python3 {os.path.join(HOME_DIR, 'translators', 'google_translate.py')}"
elif translator == 'microsoft':
cmd = f"python3 {os.path.join(HOME_DIR, 'translators', 'microsoft.py')}"
else:
raise ValueError(f'Translator is not supported: {translator}')
my_env['TRANSLATOR_CMD'] = cmd
eval_path = EVAL_CUSTOM_PATH if set_name in CUSTOM_DATASETS else EVAL_PATH
if set_name in CUSTOM_DATASETS and evaluation_engine == 'bleu':
eval_path = EVAL_CUSTOM_PATH
elif set_name in CUSTOM_DATASETS and evaluation_engine == 'comet':
eval_path = EVAL_CUSTOM_PATH_COMET
elif set_name not in CUSTOM_DATASETS and evaluation_engine == 'bleu':
eval_path = EVAL_PATH
elif set_name not in CUSTOM_DATASETS and evaluation_engine == 'comet':
eval_path = EVAL_PATH_COMET
retries = 3
while True:
try:
res = subprocess.run(['bash', eval_path], env=my_env, stdout=subprocess.PIPE,
stderr=subprocess.PIPE)
print("stdout: ", res.stdout.decode('utf-8'))
print("stderr: ", res.stderr.decode('utf-8'))
if evaluation_engine == "bleu":
float_res = float(res.stdout.decode('utf-8').strip())
elif evaluation_engine == "comet":
float_res = float(res.stdout.decode('utf-8').strip().split("\n")[-1])
return float_res
except:
traceback.print_exc()
if retries == 0:
raise
retries -= 1
subprocess.run(['bash', CLEAN_CACHE_PATH])
print('Attempt failed, retrying')
def run_dir(lang_pairs, skip_existing, translators, evaluation_engines, gpus, results_dir, models_dir):
reordered = sorted(translators.split(','), key=lambda x: TRANS_ORDER[x])
for evaluation_engine in evaluation_engines.split(','):
for pair in lang_pairs:
if 'nn' in pair:
print('There are no evaluation datasets for Norwegian Nynorsk '
'and it is not supported by Google and Microsoft API. Skipping evaluation')
continue
for dataset_name in find_datasets(pair):
for translator in reordered:
print(f'Evaluation for dataset: {dataset_name}, translator: {translator}, pair: {pair[0]}-{pair[1]}, evaluation engine: {evaluation_engine}')
res_path = get_bleu_path(dataset_name, pair, results_dir, translator, evaluation_engine)
print(f'Searching for {res_path}')
if skip_existing and os.path.isfile(res_path) and os.stat(res_path).st_size > 0:
print(f"Already exists, skipping ({res_path})")
with open(res_path) as f:
score = float(f.read().strip())
else:
print('Not found, running evaluation...')
if dataset_name in CUSTOM_DATASETS:
copy_custom_data(dataset_name, pair, results_dir)
score = evaluate(pair, dataset_name, translator, evaluation_engine, gpus, results_dir=results_dir, models_dir=models_dir)
print(f'Result {evaluation_engine}: {score}\n')
def run_comet_compare(lang_pairs, skip_existing, translators, gpus, models_dir, results_dir):
for pair in lang_pairs:
if 'nn' in pair:
print('There are no evaluation datasets for Norwegian Nynorsk '
'and it is not supported by Google and Microsoft API. Skipping comparison')
continue
source, target = pair
for dataset_name in find_datasets(pair):
original_dataset_name = dataset_name
dataset_name = dataset_name.replace('/', '_')
print(f'Comparison for dataset: {dataset_name}, pair: {source}-{target}')
working_folder = f'{results_dir}/{source}-{target}/'
output_filename = f'{working_folder}/{dataset_name}.{source}-{target}.cometcompare'
if skip_existing and os.path.isfile(output_filename) and os.stat(output_filename).st_size > 0:
print(f'Comparison exists. Skipping...')
continue
source_dataset = f'{dataset_name}.{source}'
targets = ""
for translator in translators.split(','):
targets += f'{dataset_name}.{translator}.{target} '
command = ""
if dataset_name in CUSTOM_DATASETS:
reference = f'{dataset_name}.{target}'
command = f'comet-compare --gpus {gpus} -s {source_dataset} -t {targets.strip()} -r {reference}'
else:
command = f'comet-compare --gpus {gpus} -d {original_dataset_name}:{source}-{target} -t {targets.strip()}'
res = subprocess.run(command.split(' '), cwd=working_folder,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE)
stdout = res.stdout.decode('utf-8')
with open(output_filename, 'w') as f:
f.write(stdout)
print("stdout: ", res.stdout.decode('utf-8'))
print("stderr: ", res.stderr.decode('utf-8'))
# Report generation
def build_report(res_dir, evaluation_engines):
os.makedirs(os.path.join(res_dir, 'img'), exist_ok=True)
for evaluation_engine in evaluation_engines.split(","):
results = read_results(res_dir, evaluation_engine)
with open(os.path.join(EVAL_DIR, evaluation_engine + '-results.md')) as f:
lines = [l.strip() for l in f.readlines()]
avg_results = get_avg_scores(results)
build_section(avg_results, 'avg', lines, res_dir, evaluation_engine)
for lang_pair, datasets in results.items():
build_section(datasets, lang_pair, lines, res_dir, evaluation_engine)
results_path = os.path.join(res_dir, evaluation_engine + '-results.md')
with open(results_path, 'w+') as f:
f.write('\n'.join(lines))
print(f'Results are written to {results_path}')
def build_section(datasets, key, lines, res_dir, evaluation_engine):
lines.append(f'\n## {key}\n')
lines.append(f'| Translator/Dataset | {" | ".join(datasets.keys())} |')
lines.append(f"| {' | '.join(['---' for _ in range(len(datasets) + 1)])} |")
inverted_formatted = defaultdict(dict)
inverted_scores = defaultdict(dict)
comet_comparisons = defaultdict(dict)
for dataset_name, translators in datasets.items():
bergamot_res = translators.get('bergamot')
reordered = sorted(translators.items(), key=lambda x: TRANS_ORDER[x[0]])
for translator, score in reordered:
if score == 0:
formatted_score = 'N/A'
elif translator != 'bergamot' and bergamot_res:
change_perc = (score - bergamot_res) / bergamot_res * 100
change = score - bergamot_res
sign = '+' if change > 0 else ''
formatted_score = f'{score:.2f} ({sign}{change:.2f}, {sign}{change_perc:.2f}%)'
else:
formatted_score = f'{score:.2f}'
inverted_formatted[translator][dataset_name] = formatted_score
inverted_scores[translator][dataset_name] = score
# if this is a non-avg comet report, and a cometcompare report exists, we print it
cometcompare_path = "{}/{}/{}.{}.cometcompare".format(res_dir,key,dataset_name,key)
if evaluation_engine == "comet" and key != "avg" and "{}.{}".format(dataset_name, key) not in comet_comparisons and exists(cometcompare_path):
cometcompare_file = open(cometcompare_path)
filelines = cometcompare_file.readlines()
final_report = ""
for line in filelines:
if "outperforms" in line:
final_report += f'- {line}'
comet_comparisons["{}.{}".format(dataset_name, key)] = final_report
for translator, scores in inverted_formatted.items():
lines.append(f'| {translator} | {" | ".join(scores.values())} |')
img_path = os.path.join(res_dir, 'img', f'{key}-{evaluation_engine}.png')
plot_lang_pair(datasets, inverted_scores, img_path, evaluation_engine)
img_relative_path = '/'.join(img_path.split("/")[-2:])
lines.append(f'\n![Results]({img_relative_path})')
printed_header = False
for dataset in comet_comparisons:
if (not printed_header):
lines.append("### Comparisons between systems")
lines.append("*If a comparison is omitted, the systems have equal averages (tie). Click on the dataset for a complete report*")
printed_header = True
lines.append(f'#### [{dataset}]({key}/{dataset}.cometcompare)')
lines.append(f'{comet_comparisons[dataset]}')
lines.append("---")
def read_results(res_dir, evaluation_engine):
results = defaultdict(dict)
all_translators = set()
for bleu_file in glob(res_dir + '/*/*.' + evaluation_engine):
dataset_name, translator, = os.path.basename(bleu_file).split('.')[:2]
pair = bleu_file.split('/')[-2]
with open(bleu_file) as f:
score = float(f.read().strip())
if dataset_name not in results[pair]:
results[pair][dataset_name] = {}
results[pair][dataset_name][translator] = score
all_translators.add(translator)
# fix missing translators
for _, datasets in results.items():
for _, translators in datasets.items():
for translator in all_translators:
if translator not in translators:
translators[translator] = 0
return results
def get_avg_scores(results):
scores = {}
for lang_pair, datasets in results.items():
tran_scores = [(tran, score)
for data, trans in datasets.items()
for tran, score in trans.items()]
avg_scores = {tran: statistics.mean([s for _, s in scores])
for tran, scores in groupby(lambda x: x[0], tran_scores).items()}
scores[lang_pair] = avg_scores
return scores
def plot_lang_pair(datasets, inverted_scores, img_path, evaluation_engine):
trans_scores = {t: s.values() for t, s in inverted_scores.items()}
translators = [t for t in TRANS_ORDER.keys() if t in inverted_scores]
df = pd.DataFrame(trans_scores, index=datasets, columns=translators)
fig = df.plot.bar(ylabel=evaluation_engine).get_figure()
fig.set_size_inches(18.5, 10.5)
fig.savefig(img_path, bbox_inches="tight")
# Main
@click.command()
@click.option('--pairs',
default='all',
help='Comma separated language pairs or `all`. Example: es-en,de-et')
@click.option('--translators',
default='bergamot',
help='Comma separated translators. Example: bergamot,google')
@click.option('--results-dir',
help='Directory for results')
@click.option('--models-dir',
help='Directory with models')
@click.option('--skip-existing',
default=False,
is_flag=True,
help='Whether to skip already calculated scores. '
'They are located in `results/xx-xx` folders as *.bleu or *.comet files.')
@click.option('--evaluation-engine',
default="bleu",
help='Determine which evaluation engine to use: bleu or comet')
@click.option('--comet-compare',
default=True,
help='Determine if comet-compare should be executed or not. Default: True')
@click.option('--gpus',
default="0",
help='Determine the number of GPUs used by the comet engine (if applicable). Default: 0')
def run(pairs, translators, results_dir, models_dir, skip_existing, evaluation_engine, gpus, comet_compare):
lang_pairs = [(pair[:2], pair[-2:])
for pair in (os.listdir(models_dir) if pairs == 'all' else pairs.split(','))]
print(f'Language pairs to evaluate: {lang_pairs}')
download_custom_data()
run_dir(lang_pairs, skip_existing, translators, evaluation_engine, gpus, models_dir=models_dir, results_dir=results_dir)
if comet_compare:
run_comet_compare(lang_pairs, skip_existing, translators, gpus, models_dir=models_dir, results_dir=results_dir)
build_report(results_dir,evaluation_engine)
if __name__ == '__main__':
run()


@ -0,0 +1,21 @@
#!/bin/bash
# Downloads and compiles Bergamot translator and Marian
# See https://marian-nmt.github.io/docs/#installation for requirements
set -e
if [ -e "bergamot-translator" ]; then
echo "already cloned"
else
echo "Cloning https://github.com/mozilla/bergamot-translator.git"
git clone https://github.com/mozilla/bergamot-translator.git
fi
cd bergamot-translator
echo "Compiling bergamot-translator"
mkdir -p build
cd build
cmake ../ -DUSE_WASM_COMPATIBLE_SOURCES=off -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

10
evals/requirements.txt Normal file

@ -0,0 +1,10 @@
sacrebleu==2.3.1
click==8.0.1
google-cloud-translate==3.2.1
requests==2.26.0
toolz==0.11.1
tqdm==4.61.2
pandas==1.1.5
matplotlib==3.4.2
mtdata==0.2.9
unbabel-comet==2.1.1


@ -0,0 +1,28 @@
# These Marian options are set according to
# https://github.com/mozilla/firefox-translations/blob/main/extension/controller/translation/translationWorker.js
# to imitate production setting
bergamot-mode: wasm
models:
- MODEL
vocabs:
- SRCVOCAB
- TRGVOCAB
shortlist:
- SHORTLIST
- false
beam-size: 1
normalize: 1.0
word-penalty: 0
max-length-break: 128
mini-batch-words: 1024
workspace: 128
max-length-factor: 2.0
skip-cost: true
cpu-threads: 0
quiet: false
quiet-translation: false
gemm-precision: PRECISION
alignment: soft

29
evals/translators/bergamot.sh Executable file

@ -0,0 +1,29 @@
#!/bin/bash
set -x
set -euo pipefail
SRCVOCAB=$(find "${MODEL_DIR}" -name "srcvocab.*.spm")
TRGVOCAB=$(find "${MODEL_DIR}" -name "trgvocab.*.spm")
VOCAB=$(find "${MODEL_DIR}" -name "vocab.*.spm")
MODEL=$(find "${MODEL_DIR}" -name "model.${SRC}${TRG}.*.bin")
SHORTLIST=$(find "${MODEL_DIR}" -name "*${SRC}${TRG}.s2t.bin")
CONFIG="/tmp/bergamot.config.${SRC}${TRG}.yml"
if [[ ${MODEL} == *.intgemm.alphas.bin ]]; then
PRECISION=int8shiftAlphaAll
elif [[ ${MODEL} == *.intgemm8.bin ]]; then
PRECISION=int8shiftAll
else
echo "Unknown model name pattern: ${MODEL}"
exit 1
fi
cp translators/bergamot.config.yml "${CONFIG}"
sed -i -e "s+MODEL+${MODEL}+g" "${CONFIG}"
sed -i -e "s+SRCVOCAB+${SRCVOCAB:-$VOCAB}+g" "${CONFIG}"
sed -i -e "s+TRGVOCAB+${TRGVOCAB:-$VOCAB}+g" "${CONFIG}"
sed -i -e "s+SHORTLIST+${SHORTLIST}+g" "${CONFIG}"
sed -i -e "s+PRECISION+${PRECISION}+g" "${CONFIG}"
$APP_PATH --model-config-paths "${CONFIG}" --log-level=info


@ -0,0 +1,37 @@
# Pricing
# https://cloud.google.com/translate/pricing
import os
from google.cloud import translate_v2
import sys
from tqdm import tqdm
import toolz
translate_client = translate_v2.Client()
def translate(texts):
"""Translates text into the target language.
    The target language must be an ISO 639-1 language code.
See https://g.co/cloud/translate/v2/translate-reference#supported_languages
"""
source = os.environ['SRC']
target = os.environ['TRG']
results = []
# decrease partition size if hitting limit of max 204800 bytes per request
for partition in tqdm(list(toolz.partition_all(100, texts))):
response = translate_client.translate(partition, target_language=target, source_language=source)
results += [r['translatedText'] for r in response]
return results
if __name__ == '__main__':
texts = [line.strip() for line in sys.stdin]
translations = translate(texts)
sys.stdout.write('\n'.join(translations))
sys.stdout.write('\n')


@ -0,0 +1,50 @@
# Pricing:
# https://azure.microsoft.com/en-us/pricing/details/cognitive-services/translator/
import requests, uuid
import os
import sys
import toolz
from tqdm import tqdm
subscription_key = os.environ['AZURE_TRANSLATOR_KEY']
location = os.getenv("AZURE_LOCATION", 'global')
url = "https://api.cognitive.microsofttranslator.com/translate"
headers = {
'Ocp-Apim-Subscription-Key': subscription_key,
'Ocp-Apim-Subscription-Region': location,
'Content-type': 'application/json',
'X-ClientTraceId': str(uuid.uuid4())
}
def translate(texts):
source = os.environ['SRC']
target = os.environ['TRG']
params = {
'api-version': '3.0',
'from': source,
'to': [target]
}
results = []
# decrease partition size if hitting limit of max 10000 characters per request
for partition in tqdm(list(toolz.partition_all(20, texts))):
body = [{'text': text} for text in partition]
response = requests.post(url, params=params, headers=headers, json=body)
if response.status_code != 200:
raise ValueError(f'Incorrect response. code: {response.status_code} body: {response.json()}')
results += [r['translations'][0]['text'] for r in response.json()]
return results
if __name__ == '__main__':
texts = [line.strip() for line in sys.stdin]
translations = translate(texts)
sys.stdout.write('\n'.join(translations))
sys.stdout.write('\n')


@ -48,11 +48,11 @@ We also compare the systems using the `comet-compare` tool that calculates the s
## avg
| Translator/Dataset | hu-en | ru-en | en-nl | en-ru | en-fa | nl-en | uk-en | fa-en | ca-en | en-uk | is-en |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bergamot | 0.56 | 0.49 | 0.58 | 0.54 | 0.31 | 0.63 | 0.52 | 0.50 | 0.65 | 0.51 | 0.15 |
| google | 0.66 (+0.10, +17.32%) | 0.59 (+0.10, +20.83%) | 0.67 (+0.08, +14.30%) | 0.76 (+0.21, +39.38%) | 0.70 (+0.39, +126.54%) | 0.70 (+0.07, +10.71%) | 0.67 (+0.15, +28.26%) | 0.74 (+0.24, +48.00%) | 0.82 (+0.16, +24.78%) | 0.79 (+0.27, +53.31%) | 0.70 (+0.55, +370.91%) |
| microsoft | 0.66 (+0.10, +17.85%) | 0.60 (+0.11, +22.13%) | 0.65 (+0.06, +11.05%) | 0.72 (+0.18, +32.36%) | 0.41 (+0.10, +31.65%) | 0.69 (+0.06, +9.12%) | 0.64 (+0.12, +23.16%) | 0.66 (+0.16, +32.78%) | 0.79 (+0.14, +21.22%) | 0.75 (+0.23, +45.60%) | 0.67 (+0.52, +353.71%) |
| Translator/Dataset | hu-en | ru-en | fi-en | en-nl | en-ru | en-fa | nl-en | uk-en | fa-en | ca-en | en-uk | is-en |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bergamot | 0.56 | 0.49 | 0.86 | 0.58 | 0.54 | 0.31 | 0.63 | 0.52 | 0.50 | 0.65 | 0.51 | 0.15 |
| google | 0.66 (+0.10, +17.32%) | 0.59 (+0.10, +20.83%) | 0.89 (+0.03, +3.42%) | 0.67 (+0.08, +14.30%) | 0.76 (+0.21, +39.38%) | 0.70 (+0.39, +126.54%) | 0.70 (+0.07, +10.71%) | 0.67 (+0.15, +28.26%) | 0.74 (+0.24, +48.00%) | 0.82 (+0.16, +24.78%) | 0.79 (+0.27, +53.31%) | 0.70 (+0.55, +370.91%) |
| microsoft | 0.66 (+0.10, +17.85%) | 0.60 (+0.11, +22.13%) | 0.89 (+0.03, +3.83%) | 0.65 (+0.06, +11.05%) | 0.72 (+0.18, +32.36%) | 0.41 (+0.10, +31.65%) | 0.69 (+0.06, +9.12%) | 0.64 (+0.12, +23.16%) | 0.66 (+0.16, +32.78%) | 0.79 (+0.14, +21.22%) | 0.75 (+0.23, +45.60%) | 0.67 (+0.52, +353.71%) |
![Results](img/avg-comet.png)
---
@ -157,6 +157,52 @@ We also compare the systems using the `comet-compare` tool that calculates the s
---
## fi-en
| Translator/Dataset | wmt17 | flores-test | wmt15 | wmt18 | wmt16 | wmt19 | flores-dev |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bergamot | 0.86 | 0.87 | 0.85 | 0.84 | 0.85 | 0.85 | 0.87 |
| google | 0.89 (+0.03, +3.33%) | 0.90 (+0.03, +3.46%) | 0.88 (+0.03, +3.40%) | 0.86 (+0.02, +2.89%) | 0.88 (+0.03, +3.48%) | 0.88 (+0.03, +3.89%) | 0.90 (+0.03, +3.50%) |
| microsoft | 0.90 (+0.03, +3.68%) | 0.90 (+0.03, +3.57%) | 0.89 (+0.04, +4.14%) | 0.87 (+0.03, +3.79%) | 0.89 (+0.03, +3.93%) | 0.89 (+0.04, +4.16%) | 0.90 (+0.03, +3.54%) |
![Results](img/fi-en-comet.png)
### Comparisons between systems
*If a comparison is omitted, the systems have equal averages (tie). Click on the dataset for a complete report*
#### [wmt17.fi-en](fi-en/wmt17.fi-en.cometcompare)
- wmt17.microsoft.en outperforms wmt17.bergamot.en.
- wmt17.google.en outperforms wmt17.bergamot.en.
- wmt17.microsoft.en outperforms wmt17.google.en.
#### [flores-test.fi-en](fi-en/flores-test.fi-en.cometcompare)
- flores-test.microsoft.en outperforms flores-test.bergamot.en.
- flores-test.google.en outperforms flores-test.bergamot.en.
#### [wmt15.fi-en](fi-en/wmt15.fi-en.cometcompare)
- wmt15.microsoft.en outperforms wmt15.bergamot.en.
- wmt15.google.en outperforms wmt15.bergamot.en.
- wmt15.microsoft.en outperforms wmt15.google.en.
#### [wmt18.fi-en](fi-en/wmt18.fi-en.cometcompare)
- wmt18.microsoft.en outperforms wmt18.bergamot.en.
- wmt18.google.en outperforms wmt18.bergamot.en.
- wmt18.microsoft.en outperforms wmt18.google.en.
#### [wmt16.fi-en](fi-en/wmt16.fi-en.cometcompare)
- wmt16.microsoft.en outperforms wmt16.bergamot.en.
- wmt16.google.en outperforms wmt16.bergamot.en.
- wmt16.microsoft.en outperforms wmt16.google.en.
#### [wmt19.fi-en](fi-en/wmt19.fi-en.cometcompare)
- wmt19.microsoft.en outperforms wmt19.bergamot.en.
- wmt19.google.en outperforms wmt19.bergamot.en.
- wmt19.microsoft.en outperforms wmt19.google.en.
#### [flores-dev.fi-en](fi-en/flores-dev.fi-en.cometcompare)
- flores-dev.microsoft.en outperforms flores-dev.bergamot.en.
- flores-dev.google.en outperforms flores-dev.bergamot.en.
---
## en-nl
| Translator/Dataset | flores-dev | flores-test |


@ -0,0 +1 @@
0.8733


@ -0,0 +1,60 @@
==========================
x_name: flores-dev.bergamot.en
y_name: flores-dev.microsoft.en
Bootstrap Resampling Results:
x-mean: 0.8735
y-mean: 0.9043
ties (%): 0.0000
x_wins (%): 0.0000
y_wins (%): 1.0000
Paired T-Test Results:
statistic: -18.7835
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
flores-dev.microsoft.en outperforms flores-dev.bergamot.en.
==========================
x_name: flores-dev.bergamot.en
y_name: flores-dev.google.en
Bootstrap Resampling Results:
x-mean: 0.8735
y-mean: 0.9039
ties (%): 0.0000
x_wins (%): 0.0000
y_wins (%): 1.0000
Paired T-Test Results:
statistic: -18.0695
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
flores-dev.google.en outperforms flores-dev.bergamot.en.
==========================
x_name: flores-dev.microsoft.en
y_name: flores-dev.google.en
Bootstrap Resampling Results:
x-mean: 0.9043
y-mean: 0.9039
ties (%): 0.6400
x_wins (%): 0.2667
y_wins (%): 0.0933
Paired T-Test Results:
statistic: 0.4615
p_value: 0.6446
Null hypothesis can't be rejected.
Both systems have equal averages.
Summary
If system_x is better than system_y then:
Null hypothesis rejected according to t-test with p_value=0.05.
Scores differ significantly across samples.
system_x \ system_y flores-dev.bergamot.en flores-dev.microsoft.en flores-dev.google.en
----------------------- ------------------------ ------------------------- ----------------------
flores-dev.bergamot.en False False
flores-dev.microsoft.en True False
flores-dev.google.en True False


@ -0,0 +1 @@
0.9039


@ -0,0 +1 @@
0.9042


@ -0,0 +1 @@
0.8702


@ -0,0 +1,60 @@
==========================
x_name: flores-test.bergamot.en
y_name: flores-test.microsoft.en
Bootstrap Resampling Results:
x-mean: 0.8700
y-mean: 0.9013
ties (%): 0.0000
x_wins (%): 0.0000
y_wins (%): 1.0000
Paired T-Test Results:
statistic: -20.3147
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
flores-test.microsoft.en outperforms flores-test.bergamot.en.
==========================
x_name: flores-test.bergamot.en
y_name: flores-test.google.en
Bootstrap Resampling Results:
x-mean: 0.8700
y-mean: 0.9003
ties (%): 0.0000
x_wins (%): 0.0000
y_wins (%): 1.0000
Paired T-Test Results:
statistic: -19.7405
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
flores-test.google.en outperforms flores-test.bergamot.en.
==========================
x_name: flores-test.microsoft.en
y_name: flores-test.google.en
Bootstrap Resampling Results:
x-mean: 0.9013
y-mean: 0.9003
ties (%): 0.4400
x_wins (%): 0.5033
y_wins (%): 0.0567
Paired T-Test Results:
statistic: 1.3298
p_value: 0.1839
Null hypothesis can't be rejected.
Both systems have equal averages.
Summary
If system_x is better than system_y then:
Null hypothesis rejected according to t-test with p_value=0.05.
Scores differ significantly across samples.
system_x \ system_y flores-test.bergamot.en flores-test.microsoft.en flores-test.google.en
------------------------ ------------------------- -------------------------- -----------------------
flores-test.bergamot.en False False
flores-test.microsoft.en True False
flores-test.google.en True False


@ -0,0 +1 @@
0.9003


@ -0,0 +1 @@
0.9013


@ -0,0 +1 @@
0.8498


@ -0,0 +1,61 @@
==========================
x_name: wmt15.bergamot.en
y_name: wmt15.microsoft.en
Bootstrap Resampling Results:
x-mean: 0.8495
y-mean: 0.8849
ties (%): 0.0000
x_wins (%): 0.0000
y_wins (%): 1.0000
Paired T-Test Results:
statistic: -21.2660
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
wmt15.microsoft.en outperforms wmt15.bergamot.en.
==========================
x_name: wmt15.bergamot.en
y_name: wmt15.google.en
Bootstrap Resampling Results:
x-mean: 0.8495
y-mean: 0.8786
ties (%): 0.0000
x_wins (%): 0.0000
y_wins (%): 1.0000
Paired T-Test Results:
statistic: -16.7812
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
wmt15.google.en outperforms wmt15.bergamot.en.
==========================
x_name: wmt15.microsoft.en
y_name: wmt15.google.en
Bootstrap Resampling Results:
x-mean: 0.8849
y-mean: 0.8786
ties (%): 0.0000
x_wins (%): 1.0000
y_wins (%): 0.0000
Paired T-Test Results:
statistic: 6.4529
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
wmt15.microsoft.en outperforms wmt15.google.en.
Summary
If system_x is better than system_y then:
Null hypothesis rejected according to t-test with p_value=0.05.
Scores differ significantly across samples.
system_x \ system_y wmt15.bergamot.en wmt15.microsoft.en wmt15.google.en
--------------------- ------------------- -------------------- -----------------
wmt15.bergamot.en False False
wmt15.microsoft.en True True
wmt15.google.en True False


@ -0,0 +1 @@
0.8787


@ -0,0 +1 @@
0.8850


@ -0,0 +1 @@
0.8539


@ -0,0 +1,61 @@
==========================
x_name: wmt16.bergamot.en
y_name: wmt16.microsoft.en
Bootstrap Resampling Results:
x-mean: 0.8541
y-mean: 0.8875
ties (%): 0.0000
x_wins (%): 0.0000
y_wins (%): 1.0000
Paired T-Test Results:
statistic: -31.1694
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
wmt16.microsoft.en outperforms wmt16.bergamot.en.
==========================
x_name: wmt16.bergamot.en
y_name: wmt16.google.en
Bootstrap Resampling Results:
x-mean: 0.8541
y-mean: 0.8837
ties (%): 0.0000
x_wins (%): 0.0000
y_wins (%): 1.0000
Paired T-Test Results:
statistic: -26.1908
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
wmt16.google.en outperforms wmt16.bergamot.en.
==========================
x_name: wmt16.microsoft.en
y_name: wmt16.google.en
Bootstrap Resampling Results:
x-mean: 0.8875
y-mean: 0.8837
ties (%): 0.0033
x_wins (%): 0.9967
y_wins (%): 0.0000
Paired T-Test Results:
statistic: 5.6877
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
wmt16.microsoft.en outperforms wmt16.google.en.
Summary
If system_x is better than system_y then:
Null hypothesis rejected according to t-test with p_value=0.05.
Scores differ significantly across samples.
system_x \ system_y wmt16.bergamot.en wmt16.microsoft.en wmt16.google.en
--------------------- ------------------- -------------------- -----------------
wmt16.bergamot.en False False
wmt16.microsoft.en True True
wmt16.google.en True False


@ -0,0 +1 @@
0.8836


@ -0,0 +1 @@
0.8875


@ -0,0 +1 @@
0.8647


@ -0,0 +1,61 @@
==========================
x_name: wmt17.bergamot.en
y_name: wmt17.microsoft.en
Bootstrap Resampling Results:
x-mean: 0.8647
y-mean: 0.8963
ties (%): 0.0000
x_wins (%): 0.0000
y_wins (%): 1.0000
Paired T-Test Results:
statistic: -29.2289
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
wmt17.microsoft.en outperforms wmt17.bergamot.en.
==========================
x_name: wmt17.bergamot.en
y_name: wmt17.google.en
Bootstrap Resampling Results:
x-mean: 0.8647
y-mean: 0.8932
ties (%): 0.0000
x_wins (%): 0.0000
y_wins (%): 1.0000
Paired T-Test Results:
statistic: -26.0983
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
wmt17.google.en outperforms wmt17.bergamot.en.
==========================
x_name: wmt17.microsoft.en
y_name: wmt17.google.en
Bootstrap Resampling Results:
x-mean: 0.8963
y-mean: 0.8932
ties (%): 0.0167
x_wins (%): 0.9833
y_wins (%): 0.0000
Paired T-Test Results:
statistic: 4.4332
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
wmt17.microsoft.en outperforms wmt17.google.en.
Summary
If system_x is better than system_y then:
Null hypothesis rejected according to t-test with p_value=0.05.
Scores differ significantly across samples.
system_x \ system_y wmt17.bergamot.en wmt17.microsoft.en wmt17.google.en
--------------------- ------------------- -------------------- -----------------
wmt17.bergamot.en False False
wmt17.microsoft.en True True
wmt17.google.en True False


@ -0,0 +1 @@
0.8935


@ -0,0 +1 @@
0.8965


@ -0,0 +1 @@
0.8386


@ -0,0 +1,61 @@
==========================
x_name: wmt18.bergamot.en
y_name: wmt18.microsoft.en
Bootstrap Resampling Results:
x-mean: 0.8384
y-mean: 0.8702
ties (%): 0.0000
x_wins (%): 0.0000
y_wins (%): 1.0000
Paired T-Test Results:
statistic: -28.2352
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
wmt18.microsoft.en outperforms wmt18.bergamot.en.
==========================
x_name: wmt18.bergamot.en
y_name: wmt18.google.en
Bootstrap Resampling Results:
x-mean: 0.8384
y-mean: 0.8627
ties (%): 0.0000
x_wins (%): 0.0000
y_wins (%): 1.0000
Paired T-Test Results:
statistic: -21.0238
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
wmt18.google.en outperforms wmt18.bergamot.en.
==========================
x_name: wmt18.microsoft.en
y_name: wmt18.google.en
Bootstrap Resampling Results:
x-mean: 0.8702
y-mean: 0.8627
ties (%): 0.0000
x_wins (%): 1.0000
y_wins (%): 0.0000
Paired T-Test Results:
statistic: 11.1206
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
wmt18.microsoft.en outperforms wmt18.google.en.
Summary
If system_x is better than system_y then:
Null hypothesis rejected according to t-test with p_value=0.05.
Scores differ significantly across samples.
system_x \ system_y wmt18.bergamot.en wmt18.microsoft.en wmt18.google.en
--------------------- ------------------- -------------------- -----------------
wmt18.bergamot.en False False
wmt18.microsoft.en True True
wmt18.google.en True False


@ -0,0 +1 @@
0.8628


@ -0,0 +1 @@
0.8704


@ -0,0 +1 @@
0.8517


@ -0,0 +1,61 @@
==========================
x_name: wmt19.bergamot.en
y_name: wmt19.microsoft.en
Bootstrap Resampling Results:
x-mean: 0.8516
y-mean: 0.8870
ties (%): 0.0000
x_wins (%): 0.0000
y_wins (%): 1.0000
Paired T-Test Results:
statistic: -25.2125
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
wmt19.microsoft.en outperforms wmt19.bergamot.en.
==========================
x_name: wmt19.bergamot.en
y_name: wmt19.google.en
Bootstrap Resampling Results:
x-mean: 0.8516
y-mean: 0.8846
ties (%): 0.0000
x_wins (%): 0.0000
y_wins (%): 1.0000
Paired T-Test Results:
statistic: -22.9932
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
wmt19.google.en outperforms wmt19.bergamot.en.
==========================
x_name: wmt19.microsoft.en
y_name: wmt19.google.en
Bootstrap Resampling Results:
x-mean: 0.8870
y-mean: 0.8846
ties (%): 0.1267
x_wins (%): 0.8700
y_wins (%): 0.0033
Paired T-Test Results:
statistic: 2.7813
p_value: 0.0055
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
wmt19.microsoft.en outperforms wmt19.google.en.
Summary
If system_x is better than system_y then:
Null hypothesis rejected according to t-test with p_value=0.05.
Scores differ significantly across samples.
system_x \ system_y wmt19.bergamot.en wmt19.microsoft.en wmt19.google.en
--------------------- ------------------- -------------------- -----------------
wmt19.bergamot.en False False
wmt19.microsoft.en True True
wmt19.google.en True False


@ -0,0 +1 @@
0.8848


@ -0,0 +1 @@
0.8871

Binary data
evaluation/dev/img/avg-comet.png
Binary file not shown. Before: 25 KiB | After: 23 KiB

Binary data
evaluation/dev/img/fi-en-comet.png Normal file
Binary file not shown. After: 22 KiB


@ -205,13 +205,6 @@
}
},
"enfr": {
"model": {
"name": "model.enfr.intgemm.alphas.bin",
"size": 17140961,
"estimatedCompressedSize": 12293754,
"expectedSha256Hash": "0678019c4d74c8c81d2de17e3e58d3aba5f5eb48f5595d9240c17f69d30461de",
"modelType": "prod"
},
"lex": {
"name": "lex.50.50.enfr.s2t.bin",
"size": 7886500,
@ -219,6 +212,13 @@
"expectedSha256Hash": "38fb44bad1fd5f1e6bfdcf15cc8baa09d61aad2a4f9c587914e24e7b5c25c32c",
"modelType": "prod"
},
"model": {
"name": "model.enfr.intgemm.alphas.bin",
"size": 17140961,
"estimatedCompressedSize": 12293754,
"expectedSha256Hash": "0678019c4d74c8c81d2de17e3e58d3aba5f5eb48f5595d9240c17f69d30461de",
"modelType": "prod"
},
"vocab": {
"name": "vocab.fren.spm",
"size": 831382,


@ -6,10 +6,10 @@ python3 eval/evaluate.py \
--translators=bergamot,microsoft,google \
--pairs=all --skip-existing \
--results-dir=/models/evaluation/dev --models-dir=/models/models/dev \
--gpus=1 --evaluation-engine=bleu
--gpus=1 --evaluation-engine=comet,bleu
python3 eval/evaluate.py \
--translators=bergamot,microsoft,google \
--pairs=all --skip-existing \
--results-dir=/models/evaluation/prod --models-dir=/models/models/prod \
--gpus=1 --evaluation-engine=bleu
--gpus=1 --evaluation-engine=comet,bleu


@ -21,15 +21,8 @@ xargs rm -f
echo "Extracting models"
gzip -drf models/*/*/*
echo "Cloning evaluation repo"
if [ ! -e firefox-translations-evaluation ]; then
git clone https://github.com/mozilla/firefox-translations-evaluation.git
fi
echo "Building docker image"
cd firefox-translations-evaluation
docker build -t bergamot-eval .
cd ..
make build-docker
echo "Running evaluation"
GCP_CREDS_PATH="/tmp/.gcp_creds"