This repository contains information about the general language generation evaluation benchmark GLGE, which is composed of 8 language generation tasks, covering Abstractive Text Summarization (CNN/DailyMail, Gigaword, XSUM, MSNews), Answer-aware Question Generation (SQuAD 1.1, MSQG), Conversational Question Answering (CoQA), and Personalizing Dialogue (PersonaChat). To provide a diverse range of difficulty, we provide 3 difficulty versions (**easy**, **medium**, and **hard**) of each task.
The 8 tasks in GLGE can be categorized into 4 groups: Abstractive Text Summarization tasks, Answer-aware Question Generation tasks, Conversational Question Answering task, and Personalizing Dialogue task.
### Abstractive Text Summarization Tasks

#### CNN/DailyMail

The CNN/DailyMail `\cite{hermann2015cnndm}` dataset contains 220K articles from the Daily Mail newspaper and 93K articles from CNN. Each article is paired with a bullet-point summary. GLGE uses the non-anonymized variant from `\cite{see2017get}`. After pre-processing, there are 311,971 <article,summary> data pairs, where the source input is the article and the target output is the summary, which consists of multiple sentences. ROUGE-1, ROUGE-2, and ROUGE-L are used as the evaluation metrics.
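For reference, the snippet below is a minimal sketch of how ROUGE-1/2/L could be computed with the `rouge-score` Python package; the official GLGE evaluation scripts live in the baselines repository and may tokenize and aggregate differently.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

# A minimal sketch, not the official GLGE evaluation script.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "police arrested five climate activists in london on monday ."
prediction = "five activists were arrested in london ."

scores = scorer.score(reference, prediction)
for name, score in scores.items():
    print(f"{name}: F1 = {score.fmeasure:.4f}")
```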
#### Gigaword

Gigaword `\cite{rush2015neural}` contains 4M examples extracted from the news articles of the Gigaword corpus `\cite{graff2003gigaword}`. After pre-processing, there are 3,995,559 <article,summary> data pairs, where the source input is the first sentence of the article and the target output is the headline, which often contains only a single sentence. ROUGE-1, ROUGE-2, and ROUGE-L are used as the evaluation metrics.
#### XSUM
XSUM `\cite{narayan2018don}` consists of 227K online articles from the British Broadcasting Corporation (BBC), each paired with a professionally written single-sentence summary. After pre-processing, there are 226,677 <article,summary> data pairs, where the source input is the news article and the target output is a single summary sentence. ROUGE-1, ROUGE-2, and ROUGE-L are used as the evaluation metrics.
#### MSNews

We collect 202K online news articles from a real-world search engine. Each article contains a professionally written single-sentence headline. After pre-processing, there are 201,947 <article,headline> data pairs, where the source input is the news article and the target output is the news headline. ROUGE-1, ROUGE-2, and ROUGE-L are used as the evaluation metrics.
### Answer-aware Question Generation Tasks

#### SQuAD 1.1

The SQuAD 1.1 `\cite{rajpurkar2016squad}` dataset contains 536 Wikipedia articles with over 100K questions created by Amazon Mechanical Turk crowd-workers about these articles, each annotated with the corresponding answer span. Since the original test set of SQuAD 1.1 is hidden, we re-split the dataset using the examples from the original training set and dev set. After pre-processing, there are 98,169 <answer,passage,question> data triples, where the source input is a Wikipedia passage along with an answer span, and the target output is a question. ROUGE-L, BLEU-4, and METEOR are used as the evaluation metrics.
#### MSQG

We collect 222K passages from a real-world search engine. Each passage contains a highlight span and a related query. After pre-processing, there are 221,657 <highlightspan,passage,question> data triples, where the source input is a news passage along with its highlight span, and the target output is a user question. ROUGE-L, BLEU-4, and METEOR are used as the evaluation metrics.
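As a rough illustration, sentence-level BLEU-4 and METEOR could be computed with NLTK as below; this is only a sketch that assumes whitespace tokenization, and the official GLGE evaluation scripts in the baselines repository may tokenize, smooth, and aggregate differently.

```python
# pip install nltk
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # METEOR relies on WordNet

reference = "when was the university of notre dame founded ?".split()
prediction = "when was notre dame founded ?".split()

# BLEU-4: uniform weights over 1- to 4-grams; smoothing avoids zero scores
# on short sentences.
bleu4 = sentence_bleu([reference], prediction,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)

# Recent NLTK versions expect pre-tokenized references and hypothesis here.
meteor = meteor_score([reference], prediction)

print(f"BLEU-4: {bleu4:.4f}  METEOR: {meteor:.4f}")
```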
### Conversational Question Answering Task

#### CoQA

The CoQA `\cite{reddy2019coqa}` dataset contains 127K questions with answers, obtained from 8K conversations about text passages from seven diverse domains. After pre-processing, there are 116,630 <conversationhistory,passage,question,answer> data 4-tuples, where the source input is a sequence of conversation history along with a given question and a given passage, and the target output is a free-form answer text. F1-score is used as the evaluation metric.
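The F1-score here is typically the token-overlap F1 used in SQuAD/CoQA-style evaluation. A minimal sketch follows; the official GLGE evaluation scripts in the baselines repository additionally normalize the text (e.g., lowercasing and removing punctuation and articles) before comparing.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer.
    Sketch only: skips the text normalization the official script applies."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("in the garden", "the garden"))  # 0.8
```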
### Personalizing Dialogue Task
#### PersonaChat
The PersonaChat `\cite{zhang2018personalizing}` dataset consists of 162,064 utterances and requires models to generate responses conditioned on a multi-turn conversation and a persona profile. After pre-processing, there are 151,157 <personaprofiledescriptiontext,conversationhistory,response> data triples, where the source input is a sequence of conversation history along with several sentences of persona profile description text, and the target output is a response. BLEU-1, BLEU-2, Distinct-1, and Distinct-2 are used as the evaluation metrics.
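Distinct-1 and Distinct-2 measure response diversity as the ratio of unique n-grams to total n-grams over all generated responses. A minimal sketch, assuming simple whitespace tokenization (the official GLGE scripts may tokenize differently):

```python
def distinct_n(responses, n):
    """Ratio of unique n-grams to total n-grams across all responses."""
    unique_ngrams, total = set(), 0
    for response in responses:
        tokens = response.split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        unique_ngrams.update(ngrams)
        total += len(ngrams)
    return len(unique_ngrams) / total if total > 0 else 0.0

responses = ["i like reading books", "i like cooking"]
print(distinct_n(responses, 1))  # 5 unique unigrams / 7 total ≈ 0.714
print(distinct_n(responses, 2))  # 4 unique bigrams / 5 total = 0.8
```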
## Get the Dataset

In order to use our dataset, please navigate to the [GLGE Leaderboard](https://microsoft.github.io/glge/) and agree to our terms of service. After you do so, a download link will be made available.
The baselines are provided in the ProphetNet [repo](https://github.com/microsoft/ProphetNet/GLGE_baselines), which contains the pre-trained models, fine-tuning scripts, and evaluation scripts for GLGE.
## Leaderboard Submission

To submit your predictions for evaluation, please create a single folder that contains the prediction files (see [submission_examples](submission_examples/) for an example).
Each prediction file should be named with the following format: `{task}.{version}.test`, where `{version}` is the difficulty version (**easy**, **medium**, or **hard**) and `{task}` is the task name (**cnndm**, **gigaword**, **xsum**, **msnews**, **sqaudqg**, **msqg**, **coqa**, or **personachat**).
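As a quick sanity check, a hypothetical helper like the one below can verify that a submission folder contains one correctly named prediction file per task and difficulty version (24 files in total). It is only a sketch based on the naming scheme above, not an official validation tool.

```python
from pathlib import Path

# Task and version names as listed in this README.
TASKS = ["cnndm", "gigaword", "xsum", "msnews",
         "sqaudqg", "msqg", "coqa", "personachat"]
VERSIONS = ["easy", "medium", "hard"]

def check_submission(folder: str) -> None:
    """Report any missing prediction files in the submission folder."""
    expected = {f"{task}.{version}.test" for task in TASKS for version in VERSIONS}
    present = {p.name for p in Path(folder).iterdir() if p.is_file()}
    missing = sorted(expected - present)
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All 24 prediction files are present.")

check_submission("my_submission")  # hypothetical folder name
```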
Please validate that you have done this correctly by evaluating against the development file. Once that is done, <a href='mailto:glge@microsoft.com'>email your submission</a>. We will reply with your model's performance.
## Paper
If you use our benchmark or dataset, please cite our paper `\cite{Liu2020GLGE}`.
```
GLGE: A New General Language Generation Evaluation Benchmark
```
Additionally, since GLGE is also built out of 6 existing datasets, please ensure that you cite all of them.
We evaluate our model using the GLGE benchmark `\cite{Liu2020GLGE}`, a general language generation evaluation benchmark consisting of CNN/DailyMail `\cite{hermann2015cnndm}` `\cite{see2017get}`, Gigaword `\cite{rush2015neural}` `\cite{graff2003gigaword}`, XSum `\cite{narayan2018don}`, MSNews, SQuAD 1.1 `\cite{rajpurkar2016squad}`, MSQG, CoQA `\cite{reddy2019coqa}`, and PersonaChat `\cite{zhang2018personalizing}`.
## License

Microsoft and any contributors grant you a license to the Microsoft documentation and other content
in this repository under the [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/legalcode),
see the [LICENSE](LICENSE) file, and grant you a license to any code in the repository under the [MIT License](https://opensource.org/licenses/MIT), see the
[LICENSE-CODE](LICENSE-CODE) file.
Microsoft, Windows, Microsoft Azure and/or other Microsoft products and services referenced in the documentation
may be either trademarks or registered trademarks of Microsoft in the United States and/or other countries.
The licenses for this project do not grant you rights to use any Microsoft names, logos, or trademarks.
Microsoft's general trademark guidelines can be found at http://go.microsoft.com/fwlink/?LinkID=254653.
Privacy information can be found at https://privacy.microsoft.com/en-us/
Microsoft and any contributors reserve all other rights, whether under their respective copyrights, patents,
or trademarks, whether by implication, estoppel or otherwise.