* Enable comet

* Update CI image

* Run lfs pull for models

* Move code from the evals repo

* Add data folder

* Update evaluation results [skip ci]

* Update model registry [skip ci]

---------

Co-authored-by: CircleCI evaluation job <ci-models-evaluation@firefox-translations>
This commit is contained in:
Evgeny Pavlov 2023-10-23 14:07:43 -07:00 committed by GitHub
Parent 9c071beb27
Commit 4bfeca6ce0
No key matching this signature was found
GPG key ID: 4AEE18F83AFDEB23
55 changed files with 1401 additions and 25 deletions


@ -19,7 +19,7 @@ jobs:
bash scripts/upload.sh
evaluate:
machine:
image: ubuntu-2004-cuda-11.4:202110-01
image: linux-cuda-11:default
resource_class: gpu.nvidia.large
working_directory: ~/mozilla/firefox-translations-models
steps:
@ -41,6 +41,7 @@ jobs:
- run:
name: Running evaluation
command: |
find ./models -type f -exec git lfs pull {} \;
bash scripts/update-results.sh
- run:
name: Showing results

2
.gitignore vendored

@ -131,4 +131,4 @@ dmypy.json
.idea
._.DS_Store
*.bin
*.spm
*.spm

18
Makefile Normal file

@ -0,0 +1,18 @@
### Evaluation
MODELS?=
GCP_CREDS_PATH?=
AZURE_TRANSLATOR_KEY?=
build-docker:
docker build -t bergamot-eval ./evals
run-docker:
docker run --name bergamot-eval -it --shm-size=16gb --rm \
--runtime=nvidia --gpus all \
-v $$(pwd)/models:/models \
-v $(GCP_CREDS_PATH):/.gcp_creds \
-e GOOGLE_APPLICATION_CREDENTIALS=/.gcp_creds \
-e AZURE_TRANSLATOR_KEY=$(AZURE_TRANSLATOR_KEY) \
bergamot-eval


@ -17,7 +17,7 @@ Results for dev models: [BLEU](evaluation/dev/bleu-results.md), [COMET](evaluati
Automatic evaluation is a part of pull request CI.
It uses Microsoft and Google translation APIs and pushes results back to the branch (not available for forks).
It is performed using [firefox-translations-evaluation](https://github.com/mozilla/firefox-translations-evaluation) tool.
It is performed using the [evals](/evals) tool.
# Model training

39
evals/Dockerfile Normal file

@ -0,0 +1,39 @@
FROM nvidia/cuda:11.4.3-runtime-ubuntu20.04
WORKDIR workspace
ARG DEBIAN_FRONTEND=noninteractive
RUN apt update && \
apt -y install sudo git cmake
# See https://marian-nmt.github.io/docs/#installation for Marian requirements
RUN apt-get install -y build-essential \
libboost-all-dev libprotobuf17 protobuf-compiler \
libprotobuf-dev libssl-dev libgoogle-perftools-dev
# Intel MKL - for Marian usage on CPU
RUN apt install -y wget && \
wget -qO- 'https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2023.PUB' | apt-key add -
RUN sh -c 'echo deb https://apt.repos.intel.com/mkl all main > /etc/apt/sources.list.d/intel-mkl.list' && \
apt-get update && \
apt-get install -y intel-mkl-64bit-2020.0-088
# Bergamot
# pcre2 is required to build bergamot-translator with -DUSE_WASM_COMPATIBLE_SOURCES=off
RUN apt-get install -y libpcre2-dev
# Compile bergamot translator
ADD ./install/install-bergamot-translator.sh ./
RUN bash ./install-bergamot-translator.sh
# SacreBLEU and python dependencies
RUN apt-get update && apt-get install -y python3 python3-venv python3-pip
ADD ./requirements.txt ./
RUN pip3 install -r requirements.txt
ADD ./eval ./eval
ADD ./translators ./translators
ADD ./data ./data
CMD ["/bin/bash"]

88
evals/README.md Normal file

@ -0,0 +1,88 @@
# Firefox Translations Evaluation
Calculates BLEU and COMET scores for Firefox Translations [models](https://github.com/mozilla/firefox-translations-models)
using [bergamot-translator](https://github.com/mozilla/bergamot-translator) and compares them to other translation systems.
## Running
We recommend running this on a Linux machine with at least one GPU, inside a Docker container.
If you intend to run it on macOS, run the `eval/evaluate.py` script standalone inside a virtualenv and skip the `Start docker` section below.
You might need to manually install the corresponding packages from the `Dockerfile` on your system and in your virtual environment.
### Install NVIDIA Container Toolkit
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
### Start docker
The recommended memory size for Docker is **16 GB**.
Run from the repo root directory:
```
export MODELS=<absolute path to a local directory with models>
# Specify Azure key and location if you want to add Azure Translator API for comparison
export AZURE_TRANSLATOR_KEY=<Azure translator resource API key>
# Optional, specify if it differs from the default 'global'
export AZURE_LOCATION=<location>
# Specify GCP credentials json path if you want to add Google Translator API for comparison
export GCP_CREDS_PATH=<absolute path to .json>
# Build and run docker container
make build-docker
make start-docker
```
On completion, your terminal should be attached to the launched container.
### Run evaluation
From inside the Docker container, run:
```
python3 eval/evaluate.py \
--translators=bergamot,microsoft,google \
--pairs=all \
--skip-existing \
--gpus=1 \
--evaluation-engine=comet,bleu \
--models-dir=/models/models/prod \
--results-dir=/models/evaluation/prod
```
If you don't have a GPU, use `0` in the `--gpus` argument.
More options:
```
python3 eval/evaluate.py --help
```
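For a quicker run you can restrict evaluation to a single language pair and translator. The invocation below is only an illustration (the pair and paths are examples); it scores the Bergamot en-fr model with BLEU on CPU:
```
python3 eval/evaluate.py \
    --translators=bergamot \
    --pairs=en-fr \
    --evaluation-engine=bleu \
    --gpus=0 \
    --skip-existing \
    --models-dir=/models/models/prod \
    --results-dir=/models/evaluation/prod
```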
## Details
### Installation scripts
`install/install-bergamot-translator.sh` - clones and compiles [bergamot-translator](https://github.com/mozilla/bergamot-translator) and [marian](https://github.com/marian-nmt/marian-dev) (run during the Docker image build).
`install/download-models.sh` - downloads current Mozilla production [models](https://github.com/mozilla/firefox-translations-models).
### Docker & CUDA
The COMET evaluation framework supports CUDA, and you can enable it by setting the `--gpus` argument of the `eval/evaluate.py` script to the number of GPUs you wish to utilize (`0` disables it).
If you are using it, make sure you have the [nvidia container toolkit enabled](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker) in your docker setup.
### Translators
1. **bergamot** - uses compiled [bergamot-translator](https://github.com/mozilla/bergamot-translator) in wasm mode
2. **google** - uses Google Translation [API](https://cloud.google.com/translate)
3. **microsoft** - uses Azure Cognitive Services Translator [API](https://azure.microsoft.com/en-us/services/cognitive-services/translator/)
### Reuse already calculated scores
Use the `--skip-existing` option to reuse already calculated scores saved as `results/xx-xx/*.bleu` (or `*.comet`) files.
This is useful for resuming an interrupted evaluation
or for rebuilding a full report while re-evaluating only selected translators.
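For reference, score files follow the naming scheme `<results-dir>/<src>-<trg>/<dataset>.<translator>.<trg>.<engine>`; an illustrative listing:
```
/models/evaluation/dev/fi-en/wmt19.bergamot.en.bleu
/models/evaluation/dev/fi-en/wmt19.bergamot.en.comet
```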
### Datasets
[SacreBLEU](https://github.com/mjpost/sacrebleu) - all available datasets for a language pair are used for evaluation.
[Flores](https://github.com/facebookresearch/flores) - parallel evaluation dataset for 101 languages.
### Language pairs
With option `--pairs=all`, language pairs will be discovered
in the specified models folder (option `--models-dir`)
and evaluation will run for all of them.
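Pair discovery simply derives the pair from each model folder name (for example, a folder named `enfr` is evaluated as `en-fr`). A minimal shell sketch of the same idea, assuming the prod models path used above:
```
for dir in /models/models/prod/*/; do
    name=$(basename "$dir")          # e.g. "enfr"
    echo "${name:0:2}-${name: -2}"   # prints "en-fr"
done
```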
### Results
Results will be written to the specified directory (option `--results-dir`).

18
evals/data/flores.sh Normal file

@ -0,0 +1,18 @@
#!/bin/bash
##
# Downloads flores dataset
#
set -x
set -euo pipefail
dir=$1
mkdir -p "${dir}"
test -s "${dir}/flores101_dataset.tar.gz" ||
wget -O "${dir}/flores101_dataset.tar.gz" "https://dl.fbaipublicfiles.com/flores101/dataset/flores101_dataset.tar.gz"
test -s "${dir}/flores101_dataset/dev/eng.dev" ||
tar -xzf "${dir}/flores101_dataset.tar.gz" -C "${dir}" --no-same-owner


@ -0,0 +1,55 @@
# What is BLEU
[BLEU (BiLingual Evaluation Understudy)](https://en.wikipedia.org/wiki/BLEU) is a metric for automatically evaluating machine-translated text. The BLEU score is a number between zero and one that measures the similarity of the machine-translated text to a set of high quality reference translations. A value of 0 means that the machine-translated output has no overlap with the reference translation (low quality) while a value of 1 means there is perfect overlap with the reference translations (high quality).
It has been shown that BLEU scores correlate well with human judgment of translation quality. Note that even human translators do not achieve a perfect score of 1.0.
BLEU scores are expressed as a percentage rather than a decimal between 0 and 1.
Trying to compare BLEU scores across different corpora and languages is strongly discouraged. Even comparing BLEU scores for the same corpus but with different numbers of reference translations can be highly misleading.
However, as a rough guideline, the following interpretation of BLEU scores (expressed as percentages rather than decimals) might be helpful.
BLEU Score | Interpretation
--- | ---
< 10 | Almost useless
10 - 19 | Hard to get the gist
20 - 29 | The gist is clear, but has significant grammatical errors
30 - 40 | Understandable to good translations
40 - 50 | High quality translations
50 - 60 | Very high quality, adequate, and fluent translations
\> 60 | Quality often better than human
[More mathematical details](https://cloud.google.com/translate/automl/docs/evaluate#the_mathematical_details)
Source: https://cloud.google.com/translate/automl/docs/evaluate#bleu
BLEU is the most popular benchmark in academia, so using BLEU also allows us to compare our results with research papers and competitions (see the [Conference on Machine Translation (WMT)](http://statmt.org/wmt21/)).
Read [this article](https://www.rws.com/blog/understanding-mt-quality-bleu-scores/) to better understand what BLEU is and why it is not perfect.
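As a toy illustration of how such scores are produced (not part of the report pipeline; file names are made up), SacreBLEU can score a hypothesis file against a reference file directly:
```
echo "the cat sat on the mat" > hyp.en
echo "the cat sat on a mat" > ref.en
sacrebleu ref.en -i hyp.en --score-only   # prints a BLEU score on the 0-100 scale
```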
# What are these benchmarks
## Translators
1. **bergamot** - uses compiled [bergamot-translator](https://github.com/mozilla/bergamot-translator) (wrapper for marian that is used by Firefox Translations web extension)
2. **google** - uses Google Translation [API](https://cloud.google.com/translate)
3. **microsoft** - uses Azure Cognitive Services Translator [API](https://azure.microsoft.com/en-us/services/cognitive-services/translator/)
Translation quality of Marian and Bergamot is supposed to be very similar.
## Method
We use official WMT ([Conference on Machine Translation](http://statmt.org/wmt21/)) parallel datasets. Available datasets are discovered automatically based on a language pair.
We translate from the source to the target language using one of the three translation systems, compare the result with the dataset reference, and calculate the BLEU score.
Evaluation is done with the [SacreBLEU](https://github.com/mjpost/sacrebleu) tool, which is reliable and widely used in the academic world.
Both absolute and relative differences in BLEU scores between Bergamot and other systems are reported.
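Concretely, each translator/dataset combination is scored with a pipeline of the following shape (see `evals/eval/eval.sh`; `$TRANSLATOR_CMD` stands for the translator wrapper, and the dataset and pair are placeholders):
```
sacrebleu -t wmt19 -l fi-en --echo src \
    | $TRANSLATOR_CMD \
    | sacrebleu --score-only -q -t wmt19 -l fi-en
```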
# Evaluation results
`avg` = average on all datasets

7
evals/eval/clean-cache.sh Executable file

@ -0,0 +1,7 @@
#!/bin/bash
set -e
# clean SacreBLEU cache to fix error
# "This could be a problem with your system output or with sacreBLEU's reference database"
test -e /root/.sacrebleu && rm -r /root/.sacrebleu


@ -0,0 +1,47 @@
# What is COMET
COMET is a neural framework for training multilingual machine translation evaluation models. The framework has been reported to achieve new state-of-the-art levels of correlation with human judgments. It leverages recent breakthroughs in cross-lingual pre-trained language modeling, resulting in highly multilingual and adaptable MT evaluation models.
To showcase the framework, three models were trained on different types of human judgments: Direct Assessments, Human-mediated Translation Edit Rate, and Multidimensional Quality Metrics. These models are designed to exploit information from both the source input and a target-language reference translation to more accurately predict MT quality.
The models have achieved new state-of-the-art performance on the WMT 2019 Metrics shared task, demonstrating robustness to high-performing systems.
## Interpreting Scores:
When using COMET to evaluate machine translation, it's important to understand how to interpret the scores it produces.
In general, COMET models are trained to predict quality scores for translations. These scores are typically normalized using a z-score transformation to account for individual differences among annotators. While the raw score itself does not have a direct interpretation, it is useful for ranking translations and systems according to their quality.
However, for the latest COMET models like Unbabel/wmt22-comet-da, we have introduced a new training approach that scales the scores between 0 and 1. This makes it easier to interpret the scores: a score close to 1 indicates a high-quality translation, while a score close to 0 indicates a translation that is no better than random chance.
It's worth noting that when using COMET to compare the performance of two different translation systems, it's important to run the comet-compare command to obtain statistical significance measures. This command compares the output of two systems using a statistical hypothesis test, providing an estimate of the probability that the observed difference in scores between the systems is due to chance. This is an important step to ensure that any differences in scores between systems are statistically significant.
Overall, the added interpretability of scores in the latest COMET models, combined with the ability to assess statistical significance between systems using comet-compare, makes COMET a valuable tool for evaluating machine translation.
Source: https://aclanthology.org/2020.emnlp-main.213.pdf
Tool: https://github.com/Unbabel/COMET
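For reference, the system-level score for a standard dataset is produced with a `comet-score` call of roughly this shape (mirroring `evals/eval/eval-comet.sh`; dataset, pair, and file name are placeholders):
```
comet-score --gpus 1 --quiet --only_system \
    -d wmt19:fi-en \
    -t wmt19.bergamot.en
```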
# What are these benchmarks
## Translators
1. **bergamot** - uses compiled [bergamot-translator](https://github.com/mozilla/bergamot-translator) (wrapper for marian that is used by Firefox Translations web extension)
2. **google** - uses Google Translation [API](https://cloud.google.com/translate)
3. **microsoft** - uses Azure Cognitive Services Translator [API](https://azure.microsoft.com/en-us/services/cognitive-services/translator/)
## Method
We use official WMT ([Conference on Machine Translation](http://statmt.org/wmt21/)) parallel datasets. Available datasets are discovered automatically based on a language pair.
We perform the translation from source to target language using one of the three translation systems, compare the result with the dataset reference, and then calculate the [COMET](https://github.com/Unbabel/COMET) score.
Both absolute and relative differences in the scores between Bergamot and other systems are reported.
We also compare the systems using the `comet-compare` tool that calculates the statistical significance with Paired T-Test and bootstrap resampling.
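The comparison is a `comet-compare` invocation of roughly this shape (as built in `run_comet_compare` in `evals/eval/evaluate.py`; dataset, pair, and file names are placeholders):
```
comet-compare --gpus 1 \
    -d wmt19:fi-en \
    -t wmt19.bergamot.en wmt19.google.en wmt19.microsoft.en
```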
# Evaluation results
`avg` = average on all datasets

15
evals/eval/eval-comet.sh Executable file

@ -0,0 +1,15 @@
#!/bin/bash
set -e
set -o pipefail
mkdir -p $(dirname "${EVAL_PREFIX}")
sacrebleu -t "$DATASET" -l "$SRC-$TRG" --echo src \
| tee "$EVAL_PREFIX.$SRC" \
| $TRANSLATOR_CMD \
| tee "$EVAL_PREFIX.$TRANSLATOR.$TRG" \
; comet-score --gpus "$GPUS" --quiet --only_system -d "$DATASET:$SRC-$TRG" -t "$EVAL_PREFIX.$TRANSLATOR.$TRG" \
| awk -F"score: " '{print $2}' \
| tee "$EVAL_PREFIX.$TRANSLATOR.$TRG.comet"

11
evals/eval/eval-custom-comet.sh Executable file

@ -0,0 +1,11 @@
#!/bin/bash
set -e
set -o pipefail
$TRANSLATOR_CMD < "$EVAL_PREFIX.$SRC" \
| tee "$EVAL_PREFIX.$TRANSLATOR.$TRG" \
; comet-score --quiet --only_system --gpus "$GPUS" -s "$EVAL_PREFIX.$SRC" -t "$EVAL_PREFIX.$TRANSLATOR.$TRG" -r "$EVAL_PREFIX.$TRG" \
| awk -F"score: " '{print $2}' \
| tee "$EVAL_PREFIX.$TRANSLATOR.$TRG.comet"

12
evals/eval/eval-custom.sh Executable file

@ -0,0 +1,12 @@
#!/bin/bash
set -e
set -o pipefail
$TRANSLATOR_CMD < "$EVAL_PREFIX.$SRC" \
| tee "$EVAL_PREFIX.$TRANSLATOR.$TRG" \
| sacrebleu --score-only -q -l "$SRC-$TRG" "$EVAL_PREFIX.$TRG" \
| tee "$EVAL_PREFIX.$TRANSLATOR.$TRG.bleu"

14
evals/eval/eval.sh Executable file

@ -0,0 +1,14 @@
#!/bin/bash
set -e
set -o pipefail
mkdir -p $(dirname "${EVAL_PREFIX}")
sacrebleu -t "$DATASET" -l "$SRC-$TRG" --echo src \
| tee "$EVAL_PREFIX.$SRC" \
| $TRANSLATOR_CMD \
| tee "$EVAL_PREFIX.$TRANSLATOR.$TRG" \
| sacrebleu --score-only -q -t "$DATASET" -l "$SRC-$TRG" \
| tee "$EVAL_PREFIX.$TRANSLATOR.$TRG.bleu"

391
evals/eval/evaluate.py Normal file

@ -0,0 +1,391 @@
import shutil
import subprocess
import os
from collections import defaultdict
import statistics
import traceback
from sacrebleu import dataset
import click
from toolz import groupby
from glob import glob
import pandas as pd
from mtdata import iso
from os.path import exists
HOME_DIR = './'
EVAL_DIR = os.path.join(HOME_DIR, 'eval')
EVAL_PATH = os.path.join(EVAL_DIR, 'eval.sh')
EVAL_PATH_COMET = os.path.join(EVAL_DIR, 'eval-comet.sh')
EVAL_CUSTOM_PATH = os.path.join(EVAL_DIR, 'eval-custom.sh')
EVAL_CUSTOM_PATH_COMET = os.path.join(EVAL_DIR, 'eval-custom-comet.sh')
CLEAN_CACHE_PATH = os.path.join(EVAL_DIR, 'clean-cache.sh')
CUSTOM_DATASETS = ['flores-dev', 'flores-test']
CUSTOM_DATA_DIR = os.path.join(HOME_DIR, 'data')
FLORES_PATH = os.path.join(CUSTOM_DATA_DIR, 'flores.sh')
BERGAMOT_APP_PATH = os.path.join(HOME_DIR, 'bergamot-translator', 'build', 'app', 'bergamot')
BERGAMOT_EVAL_PATH = os.path.join(HOME_DIR, 'translators', 'bergamot.sh')
TRANS_ORDER = {'bergamot': 0,
'google': 1,
'microsoft': 2}
def get_dataset_prefix(dataset_name, pair, results_dir):
dataset_name = dataset_name.replace('/', '_')
return os.path.join(results_dir, f'{pair[0]}-{pair[1]}', f'{dataset_name}')
def get_bleu_path(dataset_name, pair, results_dir, translator, evaluation_engine):
prefix = get_dataset_prefix(dataset_name, pair, results_dir)
return f'{prefix}.{translator}.{pair[1]}.{evaluation_engine}'
# Custom data
def download_custom_data():
print('Downloading Flores dataset')
os.makedirs(CUSTOM_DATA_DIR, exist_ok=True)
subprocess.run(['bash', FLORES_PATH, CUSTOM_DATA_DIR])
def copy_flores_lang(dataset_name, lang, eval_prefix):
flores_dataset = 'dev' if dataset_name == 'flores-dev' else 'devtest'
if lang == 'zh' or lang == 'zh-Hans':
lang_code = 'zho_simpl'
elif lang == 'zh-Hant':
lang_code = 'zho_trad'
elif lang == 'nb':
lang_code = 'nob'
else:
lang_code = iso.iso3_code(lang)
os.makedirs(os.path.dirname(eval_prefix), exist_ok=True)
shutil.copy(os.path.join(CUSTOM_DATA_DIR, 'flores101_dataset', flores_dataset, f'{lang_code}.{flores_dataset}'),
f'{eval_prefix}.{lang}')
def copy_custom_data(dataset_name, pair, results_dir):
src, trg = pair
eval_prefix = get_dataset_prefix(dataset_name, pair, results_dir)
if dataset_name.startswith('flores'):
copy_flores_lang(dataset_name, src, eval_prefix)
copy_flores_lang(dataset_name, trg, eval_prefix)
else:
raise ValueError(f'Unsupported custom dataset: {dataset_name}')
# Evaluation
def find_datasets(pair):
formatted_pair = f'{pair[0]}-{pair[1]}'
datasets = []
datasets += CUSTOM_DATASETS
for dataset_name, descr in dataset.DATASETS.items():
is_wmt_official = dataset_name.startswith('wmt') and len(dataset_name) == 5
is_other_accepted = dataset_name == 'iwslt17' or dataset_name == 'mtedx/test'
if not (is_wmt_official or is_other_accepted) or formatted_pair not in descr.langpairs:
continue
datasets.append(dataset_name)
return datasets
def evaluate(pair, set_name, translator, evaluation_engine, gpus, models_dir, results_dir):
source, target = pair
my_env = os.environ.copy()
my_env['SRC'] = source
my_env['TRG'] = target
my_env['DATASET'] = set_name
my_env['EVAL_PREFIX'] = get_dataset_prefix(set_name, pair, results_dir)
my_env['TRANSLATOR'] = translator
my_env['GPUS'] = gpus
if translator == 'bergamot':
my_env['MODEL_DIR'] = os.path.join(models_dir, f'{source}{target}')
my_env['APP_PATH'] = BERGAMOT_APP_PATH
cmd = f'bash {BERGAMOT_EVAL_PATH}'
elif translator == 'google':
cmd = f"python3 {os.path.join(HOME_DIR, 'translators', 'google_translate.py')}"
elif translator == 'microsoft':
cmd = f"python3 {os.path.join(HOME_DIR, 'translators', 'microsoft.py')}"
else:
raise ValueError(f'Translator is not supported: {translator}')
my_env['TRANSLATOR_CMD'] = cmd
eval_path = EVAL_CUSTOM_PATH if set_name in CUSTOM_DATASETS else EVAL_PATH
if set_name in CUSTOM_DATASETS and evaluation_engine == 'bleu':
eval_path = EVAL_CUSTOM_PATH
elif set_name in CUSTOM_DATASETS and evaluation_engine == 'comet':
eval_path = EVAL_CUSTOM_PATH_COMET
elif set_name not in CUSTOM_DATASETS and evaluation_engine == 'bleu':
eval_path = EVAL_PATH
elif set_name not in CUSTOM_DATASETS and evaluation_engine == 'comet':
eval_path = EVAL_PATH_COMET
retries = 3
while True:
try:
res = subprocess.run(['bash', eval_path], env=my_env, stdout=subprocess.PIPE,
stderr=subprocess.PIPE)
print("stdout: ", res.stdout.decode('utf-8'))
print("stderr: ", res.stderr.decode('utf-8'))
if evaluation_engine == "bleu":
float_res = float(res.stdout.decode('utf-8').strip())
elif evaluation_engine == "comet":
float_res = float(res.stdout.decode('utf-8').strip().split("\n")[-1])
return float_res
except:
traceback.print_exc()
if retries == 0:
raise
retries -= 1
subprocess.run(['bash', CLEAN_CACHE_PATH])
print('Attempt failed, retrying')
def run_dir(lang_pairs, skip_existing, translators, evaluation_engines, gpus, results_dir, models_dir):
reordered = sorted(translators.split(','), key=lambda x: TRANS_ORDER[x])
for evaluation_engine in evaluation_engines.split(','):
for pair in lang_pairs:
if 'nn' in pair:
print('There are no evaluation datasets for Norwegian Nynorsk '
'and it is not supported by Google and Microsoft API. Skipping evaluation')
continue
for dataset_name in find_datasets(pair):
for translator in reordered:
print(f'Evaluation for dataset: {dataset_name}, translator: {translator}, pair: {pair[0]}-{pair[1]}, evaluation engine: {evaluation_engine}')
res_path = get_bleu_path(dataset_name, pair, results_dir, translator, evaluation_engine)
print(f'Searching for {res_path}')
if skip_existing and os.path.isfile(res_path) and os.stat(res_path).st_size > 0:
print(f"Already exists, skipping ({res_path})")
with open(res_path) as f:
score = float(f.read().strip())
else:
print('Not found, running evaluation...')
if dataset_name in CUSTOM_DATASETS:
copy_custom_data(dataset_name, pair, results_dir)
score = evaluate(pair, dataset_name, translator, evaluation_engine, gpus, results_dir=results_dir, models_dir=models_dir)
print(f'Result {evaluation_engine}: {score}\n')
def run_comet_compare(lang_pairs, skip_existing, translators, gpus, models_dir, results_dir):
for pair in lang_pairs:
if 'nn' in pair:
print('There are no evaluation datasets for Norwegian Nynorsk '
'and it is not supported by Google and Microsoft API. Skipping comparison')
continue
source, target = pair
for dataset_name in find_datasets(pair):
original_dataset_name = dataset_name
dataset_name = dataset_name.replace('/', '_')
print(f'Comparison for dataset: {dataset_name}, pair: {source}-{target}')
working_folder = f'{results_dir}/{source}-{target}/'
output_filename = f'{working_folder}/{dataset_name}.{source}-{target}.cometcompare'
if skip_existing and os.path.isfile(output_filename) and os.stat(output_filename).st_size > 0:
print(f'Comparison exists. Skipping...')
continue
source_dataset = f'{dataset_name}.{source}'
targets = ""
for translator in translators.split(','):
targets += f'{dataset_name}.{translator}.{target} '
command = ""
if dataset_name in CUSTOM_DATASETS:
reference = f'{dataset_name}.{target}'
command = f'comet-compare --gpus {gpus} -s {source_dataset} -t {targets.strip()} -r {reference}'
else:
command = f'comet-compare --gpus {gpus} -d {original_dataset_name}:{source}-{target} -t {targets.strip()}'
res = subprocess.run(command.split(' '), cwd=working_folder,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE)
stdout = res.stdout.decode('utf-8')
with open(output_filename, 'w') as f:
f.write(stdout)
print("stdout: ", res.stdout.decode('utf-8'))
print("stderr: ", res.stderr.decode('utf-8'))
# Report generation
def build_report(res_dir, evaluation_engines):
os.makedirs(os.path.join(res_dir, 'img'), exist_ok=True)
for evaluation_engine in evaluation_engines.split(","):
results = read_results(res_dir, evaluation_engine)
with open(os.path.join(EVAL_DIR, evaluation_engine + '-results.md')) as f:
lines = [l.strip() for l in f.readlines()]
avg_results = get_avg_scores(results)
build_section(avg_results, 'avg', lines, res_dir, evaluation_engine)
for lang_pair, datasets in results.items():
build_section(datasets, lang_pair, lines, res_dir, evaluation_engine)
results_path = os.path.join(res_dir, evaluation_engine + '-results.md')
with open(results_path, 'w+') as f:
f.write('\n'.join(lines))
print(f'Results are written to {results_path}')
def build_section(datasets, key, lines, res_dir, evaluation_engine):
lines.append(f'\n## {key}\n')
lines.append(f'| Translator/Dataset | {" | ".join(datasets.keys())} |')
lines.append(f"| {' | '.join(['---' for _ in range(len(datasets) + 1)])} |")
inverted_formatted = defaultdict(dict)
inverted_scores = defaultdict(dict)
comet_comparisons = defaultdict(dict)
for dataset_name, translators in datasets.items():
bergamot_res = translators.get('bergamot')
reordered = sorted(translators.items(), key=lambda x: TRANS_ORDER[x[0]])
for translator, score in reordered:
if score == 0:
formatted_score = 'N/A'
elif translator != 'bergamot' and bergamot_res:
change_perc = (score - bergamot_res) / bergamot_res * 100
change = score - bergamot_res
sign = '+' if change > 0 else ''
formatted_score = f'{score:.2f} ({sign}{change:.2f}, {sign}{change_perc:.2f}%)'
else:
formatted_score = f'{score:.2f}'
inverted_formatted[translator][dataset_name] = formatted_score
inverted_scores[translator][dataset_name] = score
# if this is a non-avg comet report, and a cometcompare report exists, we print it
cometcompare_path = "{}/{}/{}.{}.cometcompare".format(res_dir,key,dataset_name,key)
if evaluation_engine == "comet" and key != "avg" and "{}.{}".format(dataset_name, key) not in comet_comparisons and exists(cometcompare_path):
cometcompare_file = open(cometcompare_path)
filelines = cometcompare_file.readlines()
final_report = ""
for line in filelines:
if "outperforms" in line:
final_report += f'- {line}'
comet_comparisons["{}.{}".format(dataset_name, key)] = final_report
for translator, scores in inverted_formatted.items():
lines.append(f'| {translator} | {" | ".join(scores.values())} |')
img_path = os.path.join(res_dir, 'img', f'{key}-{evaluation_engine}.png')
plot_lang_pair(datasets, inverted_scores, img_path, evaluation_engine)
img_relative_path = '/'.join(img_path.split("/")[-2:])
lines.append(f'\n![Results]({img_relative_path})')
printed_header = False
for dataset in comet_comparisons:
if (not printed_header):
lines.append("### Comparisons between systems")
lines.append("*If a comparison is omitted, the systems have equal averages (tie). Click on the dataset for a complete report*")
printed_header = True
lines.append(f'#### [{dataset}]({key}/{dataset}.cometcompare)')
lines.append(f'{comet_comparisons[dataset]}')
lines.append("---")
def read_results(res_dir, evaluation_engine):
results = defaultdict(dict)
all_translators = set()
for bleu_file in glob(res_dir + '/*/*.' + evaluation_engine):
dataset_name, translator, = os.path.basename(bleu_file).split('.')[:2]
pair = bleu_file.split('/')[-2]
with open(bleu_file) as f:
score = float(f.read().strip())
if dataset_name not in results[pair]:
results[pair][dataset_name] = {}
results[pair][dataset_name][translator] = score
all_translators.add(translator)
# fix missing translators
for _, datasets in results.items():
for _, translators in datasets.items():
for translator in all_translators:
if translator not in translators:
translators[translator] = 0
return results
def get_avg_scores(results):
scores = {}
for lang_pair, datasets in results.items():
tran_scores = [(tran, score)
for data, trans in datasets.items()
for tran, score in trans.items()]
avg_scores = {tran: statistics.mean([s for _, s in scores])
for tran, scores in groupby(lambda x: x[0], tran_scores).items()}
scores[lang_pair] = avg_scores
return scores
def plot_lang_pair(datasets, inverted_scores, img_path, evaluation_engine):
trans_scores = {t: s.values() for t, s in inverted_scores.items()}
translators = [t for t in TRANS_ORDER.keys() if t in inverted_scores]
df = pd.DataFrame(trans_scores, index=datasets, columns=translators)
fig = df.plot.bar(ylabel=evaluation_engine).get_figure()
fig.set_size_inches(18.5, 10.5)
fig.savefig(img_path, bbox_inches="tight")
# Main
@click.command()
@click.option('--pairs',
default='all',
help='Comma separated language pairs or `all`. Example: es-en,de-et')
@click.option('--translators',
default='bergamot',
help='Comma separated translators. Example: bergamot,google')
@click.option('--results-dir',
help='Directory for results')
@click.option('--models-dir',
help='Directory with models')
@click.option('--skip-existing',
default=False,
is_flag=True,
help='Whether to skip already calculated scores. '
'They are located in `results/xx-xx` folders as *.bleu or *.comet files.')
@click.option('--evaluation-engine',
default="bleu",
help='Determine which evaluation engine to use: bleu or comet')
@click.option('--comet-compare',
default=True,
help='Determine if comet-compare should be executed or not. Default: True')
@click.option('--gpus',
default="0",
help='Determine the number of GPUs used by the comet engine (if applicable). Default: 0')
def run(pairs, translators, results_dir, models_dir, skip_existing, evaluation_engine, gpus, comet_compare):
lang_pairs = [(pair[:2], pair[-2:])
for pair in (os.listdir(models_dir) if pairs == 'all' else pairs.split(','))]
print(f'Language pairs to evaluate: {lang_pairs}')
download_custom_data()
run_dir(lang_pairs, skip_existing, translators, evaluation_engine, gpus, models_dir=models_dir, results_dir=results_dir)
if comet_compare:
run_comet_compare(lang_pairs, skip_existing, translators, gpus, models_dir=models_dir, results_dir=results_dir)
build_report(results_dir,evaluation_engine)
if __name__ == '__main__':
run()


@ -0,0 +1,21 @@
#!/bin/bash
# Downloads and compiles Bergamot translator and Marian
# See https://marian-nmt.github.io/docs/#installation for requirements
set -e
if [ -e "bergamot-translator" ]; then
echo "already cloned"
else
echo "Cloning https://github.com/mozilla/bergamot-translator.git"
git clone https://github.com/mozilla/bergamot-translator.git
fi
cd bergamot-translator
echo "Compiling bergamot-translator"
mkdir -p build
cd build
cmake ../ -DUSE_WASM_COMPATIBLE_SOURCES=off -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

10
evals/requirements.txt Normal file

@ -0,0 +1,10 @@
sacrebleu==2.3.1
click==8.0.1
google-cloud-translate==3.2.1
requests==2.26.0
toolz==0.11.1
tqdm==4.61.2
pandas==1.1.5
matplotlib==3.4.2
mtdata==0.2.9
unbabel-comet==2.1.1


@ -0,0 +1,28 @@
# These Marian options are set according to
# https://github.com/mozilla/firefox-translations/blob/main/extension/controller/translation/translationWorker.js
# to imitate production setting
bergamot-mode: wasm
models:
- MODEL
vocabs:
- SRCVOCAB
- TRGVOCAB
shortlist:
- SHORTLIST
- false
beam-size: 1
normalize: 1.0
word-penalty: 0
max-length-break: 128
mini-batch-words: 1024
workspace: 128
max-length-factor: 2.0
skip-cost: true
cpu-threads: 0
quiet: false
quiet-translation: false
gemm-precision: PRECISION
alignment: soft

29
evals/translators/bergamot.sh Executable file

@ -0,0 +1,29 @@
#!/bin/bash
set -x
set -euo pipefail
SRCVOCAB=$(find "${MODEL_DIR}" -name "srcvocab.*.spm")
TRGVOCAB=$(find "${MODEL_DIR}" -name "trgvocab.*.spm")
VOCAB=$(find "${MODEL_DIR}" -name "vocab.*.spm")
MODEL=$(find "${MODEL_DIR}" -name "model.${SRC}${TRG}.*.bin")
SHORTLIST=$(find "${MODEL_DIR}" -name "*${SRC}${TRG}.s2t.bin")
CONFIG="/tmp/bergamot.config.${SRC}${TRG}.yml"
if [[ ${MODEL} == *.intgemm.alphas.bin ]]; then
PRECISION=int8shiftAlphaAll
elif [[ ${MODEL} == *.intgemm8.bin ]]; then
PRECISION=int8shiftAll
else
echo "Unknown model name pattern: ${MODEL}"
exit 1
fi
cp translators/bergamot.config.yml "${CONFIG}"
sed -i -e "s+MODEL+${MODEL}+g" "${CONFIG}"
sed -i -e "s+SRCVOCAB+${SRCVOCAB:-$VOCAB}+g" "${CONFIG}"
sed -i -e "s+TRGVOCAB+${TRGVOCAB:-$VOCAB}+g" "${CONFIG}"
sed -i -e "s+SHORTLIST+${SHORTLIST}+g" "${CONFIG}"
sed -i -e "s+PRECISION+${PRECISION}+g" "${CONFIG}"
$APP_PATH --model-config-paths "${CONFIG}" --log-level=info


@ -0,0 +1,37 @@
# Pricing
# https://cloud.google.com/translate/pricing
import os
from google.cloud import translate_v2
import sys
from tqdm import tqdm
import toolz
translate_client = translate_v2.Client()
def translate(texts):
"""Translates text into the target language.
    The target language must be an ISO 639-1 language code.
See https://g.co/cloud/translate/v2/translate-reference#supported_languages
"""
source = os.environ['SRC']
target = os.environ['TRG']
results = []
# decrease partition size if hitting limit of max 204800 bytes per request
for partition in tqdm(list(toolz.partition_all(100, texts))):
response = translate_client.translate(partition, target_language=target, source_language=source)
results += [r['translatedText'] for r in response]
return results
if __name__ == '__main__':
texts = [line.strip() for line in sys.stdin]
translations = translate(texts)
sys.stdout.write('\n'.join(translations))
sys.stdout.write('\n')


@ -0,0 +1,50 @@
# Pricing:
# https://azure.microsoft.com/en-us/pricing/details/cognitive-services/translator/
import requests, uuid
import os
import sys
import toolz
from tqdm import tqdm
subscription_key = os.environ['AZURE_TRANSLATOR_KEY']
location = os.getenv("AZURE_LOCATION", 'global')
url = "https://api.cognitive.microsofttranslator.com/translate"
headers = {
'Ocp-Apim-Subscription-Key': subscription_key,
'Ocp-Apim-Subscription-Region': location,
'Content-type': 'application/json',
'X-ClientTraceId': str(uuid.uuid4())
}
def translate(texts):
source = os.environ['SRC']
target = os.environ['TRG']
params = {
'api-version': '3.0',
'from': source,
'to': [target]
}
results = []
# decrease partition size if hitting limit of max 10000 characters per request
for partition in tqdm(list(toolz.partition_all(20, texts))):
body = [{'text': text} for text in partition]
response = requests.post(url, params=params, headers=headers, json=body)
if response.status_code != 200:
raise ValueError(f'Incorrect response. code: {response.status_code} body: {response.json()}')
results += [r['translations'][0]['text'] for r in response.json()]
return results
if __name__ == '__main__':
texts = [line.strip() for line in sys.stdin]
translations = translate(texts)
sys.stdout.write('\n'.join(translations))
sys.stdout.write('\n')


@ -48,11 +48,11 @@ We also compare the systems using the `comet-compare` tool that calculates the s
## avg
| Translator/Dataset | hu-en | ru-en | en-nl | en-ru | en-fa | nl-en | uk-en | fa-en | ca-en | en-uk | is-en |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bergamot | 0.56 | 0.49 | 0.58 | 0.54 | 0.31 | 0.63 | 0.52 | 0.50 | 0.65 | 0.51 | 0.15 |
| google | 0.66 (+0.10, +17.32%) | 0.59 (+0.10, +20.83%) | 0.67 (+0.08, +14.30%) | 0.76 (+0.21, +39.38%) | 0.70 (+0.39, +126.54%) | 0.70 (+0.07, +10.71%) | 0.67 (+0.15, +28.26%) | 0.74 (+0.24, +48.00%) | 0.82 (+0.16, +24.78%) | 0.79 (+0.27, +53.31%) | 0.70 (+0.55, +370.91%) |
| microsoft | 0.66 (+0.10, +17.85%) | 0.60 (+0.11, +22.13%) | 0.65 (+0.06, +11.05%) | 0.72 (+0.18, +32.36%) | 0.41 (+0.10, +31.65%) | 0.69 (+0.06, +9.12%) | 0.64 (+0.12, +23.16%) | 0.66 (+0.16, +32.78%) | 0.79 (+0.14, +21.22%) | 0.75 (+0.23, +45.60%) | 0.67 (+0.52, +353.71%) |
| Translator/Dataset | hu-en | ru-en | fi-en | en-nl | en-ru | en-fa | nl-en | uk-en | fa-en | ca-en | en-uk | is-en |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bergamot | 0.56 | 0.49 | 0.86 | 0.58 | 0.54 | 0.31 | 0.63 | 0.52 | 0.50 | 0.65 | 0.51 | 0.15 |
| google | 0.66 (+0.10, +17.32%) | 0.59 (+0.10, +20.83%) | 0.89 (+0.03, +3.42%) | 0.67 (+0.08, +14.30%) | 0.76 (+0.21, +39.38%) | 0.70 (+0.39, +126.54%) | 0.70 (+0.07, +10.71%) | 0.67 (+0.15, +28.26%) | 0.74 (+0.24, +48.00%) | 0.82 (+0.16, +24.78%) | 0.79 (+0.27, +53.31%) | 0.70 (+0.55, +370.91%) |
| microsoft | 0.66 (+0.10, +17.85%) | 0.60 (+0.11, +22.13%) | 0.89 (+0.03, +3.83%) | 0.65 (+0.06, +11.05%) | 0.72 (+0.18, +32.36%) | 0.41 (+0.10, +31.65%) | 0.69 (+0.06, +9.12%) | 0.64 (+0.12, +23.16%) | 0.66 (+0.16, +32.78%) | 0.79 (+0.14, +21.22%) | 0.75 (+0.23, +45.60%) | 0.67 (+0.52, +353.71%) |
![Results](img/avg-comet.png)
---
@ -157,6 +157,52 @@ We also compare the systems using the `comet-compare` tool that calculates the s
---
## fi-en
| Translator/Dataset | wmt17 | flores-test | wmt15 | wmt18 | wmt16 | wmt19 | flores-dev |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bergamot | 0.86 | 0.87 | 0.85 | 0.84 | 0.85 | 0.85 | 0.87 |
| google | 0.89 (+0.03, +3.33%) | 0.90 (+0.03, +3.46%) | 0.88 (+0.03, +3.40%) | 0.86 (+0.02, +2.89%) | 0.88 (+0.03, +3.48%) | 0.88 (+0.03, +3.89%) | 0.90 (+0.03, +3.50%) |
| microsoft | 0.90 (+0.03, +3.68%) | 0.90 (+0.03, +3.57%) | 0.89 (+0.04, +4.14%) | 0.87 (+0.03, +3.79%) | 0.89 (+0.03, +3.93%) | 0.89 (+0.04, +4.16%) | 0.90 (+0.03, +3.54%) |
![Results](img/fi-en-comet.png)
### Comparisons between systems
*If a comparison is omitted, the systems have equal averages (tie). Click on the dataset for a complete report*
#### [wmt17.fi-en](fi-en/wmt17.fi-en.cometcompare)
- wmt17.microsoft.en outperforms wmt17.bergamot.en.
- wmt17.google.en outperforms wmt17.bergamot.en.
- wmt17.microsoft.en outperforms wmt17.google.en.
#### [flores-test.fi-en](fi-en/flores-test.fi-en.cometcompare)
- flores-test.microsoft.en outperforms flores-test.bergamot.en.
- flores-test.google.en outperforms flores-test.bergamot.en.
#### [wmt15.fi-en](fi-en/wmt15.fi-en.cometcompare)
- wmt15.microsoft.en outperforms wmt15.bergamot.en.
- wmt15.google.en outperforms wmt15.bergamot.en.
- wmt15.microsoft.en outperforms wmt15.google.en.
#### [wmt18.fi-en](fi-en/wmt18.fi-en.cometcompare)
- wmt18.microsoft.en outperforms wmt18.bergamot.en.
- wmt18.google.en outperforms wmt18.bergamot.en.
- wmt18.microsoft.en outperforms wmt18.google.en.
#### [wmt16.fi-en](fi-en/wmt16.fi-en.cometcompare)
- wmt16.microsoft.en outperforms wmt16.bergamot.en.
- wmt16.google.en outperforms wmt16.bergamot.en.
- wmt16.microsoft.en outperforms wmt16.google.en.
#### [wmt19.fi-en](fi-en/wmt19.fi-en.cometcompare)
- wmt19.microsoft.en outperforms wmt19.bergamot.en.
- wmt19.google.en outperforms wmt19.bergamot.en.
- wmt19.microsoft.en outperforms wmt19.google.en.
#### [flores-dev.fi-en](fi-en/flores-dev.fi-en.cometcompare)
- flores-dev.microsoft.en outperforms flores-dev.bergamot.en.
- flores-dev.google.en outperforms flores-dev.bergamot.en.
---
## en-nl
| Translator/Dataset | flores-dev | flores-test |


@ -0,0 +1 @@
0.8733


@ -0,0 +1,60 @@
==========================
x_name: flores-dev.bergamot.en
y_name: flores-dev.microsoft.en
Bootstrap Resampling Results:
x-mean: 0.8735
y-mean: 0.9043
ties (%): 0.0000
x_wins (%): 0.0000
y_wins (%): 1.0000
Paired T-Test Results:
statistic: -18.7835
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
flores-dev.microsoft.en outperforms flores-dev.bergamot.en.
==========================
x_name: flores-dev.bergamot.en
y_name: flores-dev.google.en
Bootstrap Resampling Results:
x-mean: 0.8735
y-mean: 0.9039
ties (%): 0.0000
x_wins (%): 0.0000
y_wins (%): 1.0000
Paired T-Test Results:
statistic: -18.0695
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
flores-dev.google.en outperforms flores-dev.bergamot.en.
==========================
x_name: flores-dev.microsoft.en
y_name: flores-dev.google.en
Bootstrap Resampling Results:
x-mean: 0.9043
y-mean: 0.9039
ties (%): 0.6400
x_wins (%): 0.2667
y_wins (%): 0.0933
Paired T-Test Results:
statistic: 0.4615
p_value: 0.6446
Null hypothesis can't be rejected.
Both systems have equal averages.
Summary
If system_x is better than system_y then:
Null hypothesis rejected according to t-test with p_value=0.05.
Scores differ significantly across samples.
system_x \ system_y flores-dev.bergamot.en flores-dev.microsoft.en flores-dev.google.en
----------------------- ------------------------ ------------------------- ----------------------
flores-dev.bergamot.en False False
flores-dev.microsoft.en True False
flores-dev.google.en True False


@ -0,0 +1 @@
0.9039


@ -0,0 +1 @@
0.9042


@ -0,0 +1 @@
0.8702


@ -0,0 +1,60 @@
==========================
x_name: flores-test.bergamot.en
y_name: flores-test.microsoft.en
Bootstrap Resampling Results:
x-mean: 0.8700
y-mean: 0.9013
ties (%): 0.0000
x_wins (%): 0.0000
y_wins (%): 1.0000
Paired T-Test Results:
statistic: -20.3147
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
flores-test.microsoft.en outperforms flores-test.bergamot.en.
==========================
x_name: flores-test.bergamot.en
y_name: flores-test.google.en
Bootstrap Resampling Results:
x-mean: 0.8700
y-mean: 0.9003
ties (%): 0.0000
x_wins (%): 0.0000
y_wins (%): 1.0000
Paired T-Test Results:
statistic: -19.7405
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
flores-test.google.en outperforms flores-test.bergamot.en.
==========================
x_name: flores-test.microsoft.en
y_name: flores-test.google.en
Bootstrap Resampling Results:
x-mean: 0.9013
y-mean: 0.9003
ties (%): 0.4400
x_wins (%): 0.5033
y_wins (%): 0.0567
Paired T-Test Results:
statistic: 1.3298
p_value: 0.1839
Null hypothesis can't be rejected.
Both systems have equal averages.
Summary
If system_x is better than system_y then:
Null hypothesis rejected according to t-test with p_value=0.05.
Scores differ significantly across samples.
system_x \ system_y flores-test.bergamot.en flores-test.microsoft.en flores-test.google.en
------------------------ ------------------------- -------------------------- -----------------------
flores-test.bergamot.en False False
flores-test.microsoft.en True False
flores-test.google.en True False


@ -0,0 +1 @@
0.9003


@ -0,0 +1 @@
0.9013


@ -0,0 +1 @@
0.8498


@ -0,0 +1,61 @@
==========================
x_name: wmt15.bergamot.en
y_name: wmt15.microsoft.en
Bootstrap Resampling Results:
x-mean: 0.8495
y-mean: 0.8849
ties (%): 0.0000
x_wins (%): 0.0000
y_wins (%): 1.0000
Paired T-Test Results:
statistic: -21.2660
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
wmt15.microsoft.en outperforms wmt15.bergamot.en.
==========================
x_name: wmt15.bergamot.en
y_name: wmt15.google.en
Bootstrap Resampling Results:
x-mean: 0.8495
y-mean: 0.8786
ties (%): 0.0000
x_wins (%): 0.0000
y_wins (%): 1.0000
Paired T-Test Results:
statistic: -16.7812
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
wmt15.google.en outperforms wmt15.bergamot.en.
==========================
x_name: wmt15.microsoft.en
y_name: wmt15.google.en
Bootstrap Resampling Results:
x-mean: 0.8849
y-mean: 0.8786
ties (%): 0.0000
x_wins (%): 1.0000
y_wins (%): 0.0000
Paired T-Test Results:
statistic: 6.4529
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
wmt15.microsoft.en outperforms wmt15.google.en.
Summary
If system_x is better than system_y then:
Null hypothesis rejected according to t-test with p_value=0.05.
Scores differ significantly across samples.
system_x \ system_y wmt15.bergamot.en wmt15.microsoft.en wmt15.google.en
--------------------- ------------------- -------------------- -----------------
wmt15.bergamot.en False False
wmt15.microsoft.en True True
wmt15.google.en True False


@ -0,0 +1 @@
0.8787


@ -0,0 +1 @@
0.8850


@ -0,0 +1 @@
0.8539


@ -0,0 +1,61 @@
==========================
x_name: wmt16.bergamot.en
y_name: wmt16.microsoft.en
Bootstrap Resampling Results:
x-mean: 0.8541
y-mean: 0.8875
ties (%): 0.0000
x_wins (%): 0.0000
y_wins (%): 1.0000
Paired T-Test Results:
statistic: -31.1694
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
wmt16.microsoft.en outperforms wmt16.bergamot.en.
==========================
x_name: wmt16.bergamot.en
y_name: wmt16.google.en
Bootstrap Resampling Results:
x-mean: 0.8541
y-mean: 0.8837
ties (%): 0.0000
x_wins (%): 0.0000
y_wins (%): 1.0000
Paired T-Test Results:
statistic: -26.1908
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
wmt16.google.en outperforms wmt16.bergamot.en.
==========================
x_name: wmt16.microsoft.en
y_name: wmt16.google.en
Bootstrap Resampling Results:
x-mean: 0.8875
y-mean: 0.8837
ties (%): 0.0033
x_wins (%): 0.9967
y_wins (%): 0.0000
Paired T-Test Results:
statistic: 5.6877
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
wmt16.microsoft.en outperforms wmt16.google.en.
Summary
If system_x is better than system_y then:
Null hypothesis rejected according to t-test with p_value=0.05.
Scores differ significantly across samples.
system_x \ system_y wmt16.bergamot.en wmt16.microsoft.en wmt16.google.en
--------------------- ------------------- -------------------- -----------------
wmt16.bergamot.en False False
wmt16.microsoft.en True True
wmt16.google.en True False


@ -0,0 +1 @@
0.8836


@ -0,0 +1 @@
0.8875


@ -0,0 +1 @@
0.8647


@ -0,0 +1,61 @@
==========================
x_name: wmt17.bergamot.en
y_name: wmt17.microsoft.en
Bootstrap Resampling Results:
x-mean: 0.8647
y-mean: 0.8963
ties (%): 0.0000
x_wins (%): 0.0000
y_wins (%): 1.0000
Paired T-Test Results:
statistic: -29.2289
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
wmt17.microsoft.en outperforms wmt17.bergamot.en.
==========================
x_name: wmt17.bergamot.en
y_name: wmt17.google.en
Bootstrap Resampling Results:
x-mean: 0.8647
y-mean: 0.8932
ties (%): 0.0000
x_wins (%): 0.0000
y_wins (%): 1.0000
Paired T-Test Results:
statistic: -26.0983
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
wmt17.google.en outperforms wmt17.bergamot.en.
==========================
x_name: wmt17.microsoft.en
y_name: wmt17.google.en
Bootstrap Resampling Results:
x-mean: 0.8963
y-mean: 0.8932
ties (%): 0.0167
x_wins (%): 0.9833
y_wins (%): 0.0000
Paired T-Test Results:
statistic: 4.4332
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
wmt17.microsoft.en outperforms wmt17.google.en.
Summary
If system_x is better than system_y then:
Null hypothesis rejected according to t-test with p_value=0.05.
Scores differ significantly across samples.
system_x \ system_y wmt17.bergamot.en wmt17.microsoft.en wmt17.google.en
--------------------- ------------------- -------------------- -----------------
wmt17.bergamot.en False False
wmt17.microsoft.en True True
wmt17.google.en True False


@ -0,0 +1 @@
0.8935


@ -0,0 +1 @@
0.8965


@ -0,0 +1 @@
0.8386


@ -0,0 +1,61 @@
==========================
x_name: wmt18.bergamot.en
y_name: wmt18.microsoft.en
Bootstrap Resampling Results:
x-mean: 0.8384
y-mean: 0.8702
ties (%): 0.0000
x_wins (%): 0.0000
y_wins (%): 1.0000
Paired T-Test Results:
statistic: -28.2352
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
wmt18.microsoft.en outperforms wmt18.bergamot.en.
==========================
x_name: wmt18.bergamot.en
y_name: wmt18.google.en
Bootstrap Resampling Results:
x-mean: 0.8384
y-mean: 0.8627
ties (%): 0.0000
x_wins (%): 0.0000
y_wins (%): 1.0000
Paired T-Test Results:
statistic: -21.0238
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
wmt18.google.en outperforms wmt18.bergamot.en.
==========================
x_name: wmt18.microsoft.en
y_name: wmt18.google.en
Bootstrap Resampling Results:
x-mean: 0.8702
y-mean: 0.8627
ties (%): 0.0000
x_wins (%): 1.0000
y_wins (%): 0.0000
Paired T-Test Results:
statistic: 11.1206
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
wmt18.microsoft.en outperforms wmt18.google.en.
Summary
If system_x is better than system_y then:
Null hypothesis rejected according to t-test with p_value=0.05.
Scores differ significantly across samples.
system_x \ system_y wmt18.bergamot.en wmt18.microsoft.en wmt18.google.en
--------------------- ------------------- -------------------- -----------------
wmt18.bergamot.en False False
wmt18.microsoft.en True True
wmt18.google.en True False


@ -0,0 +1 @@
0.8628


@ -0,0 +1 @@
0.8704


@ -0,0 +1 @@
0.8517


@ -0,0 +1,61 @@
==========================
x_name: wmt19.bergamot.en
y_name: wmt19.microsoft.en
Bootstrap Resampling Results:
x-mean: 0.8516
y-mean: 0.8870
ties (%): 0.0000
x_wins (%): 0.0000
y_wins (%): 1.0000
Paired T-Test Results:
statistic: -25.2125
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
wmt19.microsoft.en outperforms wmt19.bergamot.en.
==========================
x_name: wmt19.bergamot.en
y_name: wmt19.google.en
Bootstrap Resampling Results:
x-mean: 0.8516
y-mean: 0.8846
ties (%): 0.0000
x_wins (%): 0.0000
y_wins (%): 1.0000
Paired T-Test Results:
statistic: -22.9932
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
wmt19.google.en outperforms wmt19.bergamot.en.
==========================
x_name: wmt19.microsoft.en
y_name: wmt19.google.en
Bootstrap Resampling Results:
x-mean: 0.8870
y-mean: 0.8846
ties (%): 0.1267
x_wins (%): 0.8700
y_wins (%): 0.0033
Paired T-Test Results:
statistic: 2.7813
p_value: 0.0055
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
wmt19.microsoft.en outperforms wmt19.google.en.
Summary
If system_x is better than system_y then:
Null hypothesis rejected according to t-test with p_value=0.05.
Scores differ significantly across samples.
system_x \ system_y wmt19.bergamot.en wmt19.microsoft.en wmt19.google.en
--------------------- ------------------- -------------------- -----------------
wmt19.bergamot.en False False
wmt19.microsoft.en True True
wmt19.google.en True False


@ -0,0 +1 @@
0.8848


@ -0,0 +1 @@
0.8871

Binary data
evaluation/dev/img/avg-comet.png
Binary file not shown. Before: 25 KiB | After: 23 KiB

Binary data
evaluation/dev/img/fi-en-comet.png Normal file
Binary file not shown. After: 22 KiB


@ -205,13 +205,6 @@
}
},
"enfr": {
"model": {
"name": "model.enfr.intgemm.alphas.bin",
"size": 17140961,
"estimatedCompressedSize": 12293754,
"expectedSha256Hash": "0678019c4d74c8c81d2de17e3e58d3aba5f5eb48f5595d9240c17f69d30461de",
"modelType": "prod"
},
"lex": {
"name": "lex.50.50.enfr.s2t.bin",
"size": 7886500,
@ -219,6 +212,13 @@
"expectedSha256Hash": "38fb44bad1fd5f1e6bfdcf15cc8baa09d61aad2a4f9c587914e24e7b5c25c32c",
"modelType": "prod"
},
"model": {
"name": "model.enfr.intgemm.alphas.bin",
"size": 17140961,
"estimatedCompressedSize": 12293754,
"expectedSha256Hash": "0678019c4d74c8c81d2de17e3e58d3aba5f5eb48f5595d9240c17f69d30461de",
"modelType": "prod"
},
"vocab": {
"name": "vocab.fren.spm",
"size": 831382,


@ -6,10 +6,10 @@ python3 eval/evaluate.py \
--translators=bergamot,microsoft,google \
--pairs=all --skip-existing \
--results-dir=/models/evaluation/dev --models-dir=/models/models/dev \
--gpus=1 --evaluation-engine=bleu
--gpus=1 --evaluation-engine=comet,bleu
python3 eval/evaluate.py \
--translators=bergamot,microsoft,google \
--pairs=all --skip-existing \
--results-dir=/models/evaluation/prod --models-dir=/models/models/prod \
--gpus=1 --evaluation-engine=bleu
--gpus=1 --evaluation-engine=comet,bleu


@ -21,15 +21,8 @@ xargs rm -f
echo "Extracting models"
gzip -drf models/*/*/*
echo "Cloning evaluation repo"
if [ ! -e firefox-translations-evaluation ]; then
git clone https://github.com/mozilla/firefox-translations-evaluation.git
fi
echo "Building docker image"
cd firefox-translations-evaluation
docker build -t bergamot-eval .
cd ..
make build-docker
echo "Running evaluation"
GCP_CREDS_PATH="/tmp/.gcp_creds"