# Break Evaluator

Evaluator for the Break dataset (AI2 Israel).
Used in both the Break and Break High-level leaderboards.

## Example

```sh
% PYTHONPATH="." python3.7 scripts/evaluate_predictions.py \
    --dataset_file=/labels/labels.csv \
    --preds_file=/predictions/predictions.csv \
    --no_cache \
    --output_file_base=/results/results \
    --metrics ged_scores exact_match sari normalized_exact_match

% cat results/results_metrics.json
{"exact_match": 0.24242424242424243, "sari": 0.7061778423719823, "ged": 0.4089606835211786, "normalized_exact_match": 0.32323232323232326}
```

## Usage

### Input

The evaluation script receives as input a Break `dataset_file`, a CSV file containing the correct labels. Additionally, it receives `preds_file`, a CSV file containing a model's predictions, ordered according to `dataset_file`. The `output_file_base` indicates the path prefix to which the evaluation output will be saved. Lastly, `metrics` indicates which evaluation metrics to compute, out of `ged_scores`, `exact_match`, `sari` and `normalized_exact_match`.

The `tmp` directory contains examples of `dataset_file` and `preds_file`.
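
Since `preds_file` must follow the row order of `dataset_file`, a quick sanity check before evaluating is to compare row counts. A minimal sketch (the `tmp/...` paths below mirror the example layout; adjust them to your own files):

```python
import csv

def row_count(path):
    """Count data rows in a CSV file (excluding the header)."""
    with open(path, newline="") as f:
        return sum(1 for _ in csv.reader(f)) - 1

n_labels = row_count("tmp/labels/labels.csv")
n_preds = row_count("tmp/predictions/predictions.csv")
assert n_labels == n_preds, f"{n_labels} labels vs. {n_preds} predictions"
```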

### Output

The evaluation output will be saved to `<output_file_base>_metrics.json` (e.g., `--output_file_base=/results/results` yields `/results/results_metrics.json`).
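
The scores can then be read back from the JSON file, for example like so (key names taken from the example output above):

```python
import json

with open("results/results_metrics.json") as f:
    metrics = json.load(f)

# Keys: exact_match, sari, ged, normalized_exact_match
for name, score in metrics.items():
    print(f"{name}: {score:.4f}")
```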

## Setup

To run the evaluation script locally in a conda virtual environment, do the following:

1. Create a virtual environment:

   ```sh
   conda create -n [ENV_NAME] python=3.7
   conda activate [ENV_NAME]
   ```

2. Install the requirements:

   ```sh
   pip install -r requirements.txt
   python -m spacy download en_core_web_sm
   ```

3. Run in a shell:

   ```sh
   PYTHONPATH="." python3.7 scripts/evaluate_predictions.py \
       --dataset_file=/labels/labels.csv \
       --preds_file=/predictions/predictions.csv \
       --no_cache \
       --output_file_base=/results/results \
       --metrics ged_scores exact_match sari normalized_exact_match
   ```
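
Optionally, before step 3, you can verify that the spaCy model from step 2 installed correctly with a quick check like this:

```python
import spacy

# Raises OSError if en_core_web_sm was not downloaded in step 2.
nlp = spacy.load("en_core_web_sm")
print(nlp("Return the decomposed questions.")[3].lemma_)  # prints "question"
```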

## Docker

We build an evaluator image with Docker, using the provided Dockerfile.

### Build

To build the break-evaluator image:

```sh
docker build --tag break-evaluator .
```

### Run

Our evaluator receives three paths as input: the dataset's true labels, the model's predictions file, and the output location. We therefore bind-mount the relevant files when using `docker run`. Given that our files are stored in `tmp`, the specific volume mounts are:

-v "$(pwd)"/tmp/results/:/results:rw
-v "$(pwd)"/tmp/predictions/:/predictions:ro
-v "$(pwd)"/tmp/labels/:/labels:ro

The full run command is:

```sh
sudo docker run -it \
    -v "$(pwd)"/tmp/results/:/results:rw \
    -v "$(pwd)"/tmp/predictions/:/predictions:ro \
    -v "$(pwd)"/tmp/labels/:/labels:ro \
    break-evaluator \
    bash -c "python3.7 scripts/evaluate_predictions.py --dataset_file=/labels/labels.csv --preds_file=/predictions/predictions.csv --no_cache --output_file_base=/results/results --metrics ged_scores exact_match sari normalized_exact_match"
```

## Beaker

To add a Beaker image of the evaluator, run:

```sh
beaker image create -n break-evaluator-YYYY-MM-DD break-evaluator:latest
```

## Evaluation Metrics

To learn more about the evaluation metrics used for Break, please refer to the paper "Break It Down: A Question Understanding Benchmark" (Wolfson et al., TACL 2020).
The "Normalized Exact Match" metric, is a newly introduced evaluation metric for QDMR that will be included in future work. It compares two QDMRs by normalizing their respective graphs: further decomposing steps; ordering chains of "filter" operations; lemmatizing step noun phrases; etc.