diff --git a/.circleci/config.yml b/.circleci/config.yml
index b6a1b50bd..b7a9059b8 100644
--- a/.circleci/config.yml
+++ b/.circleci/config.yml
@@ -309,7 +309,7 @@ jobs:
- v0.4-{{ checksum "setup.py" }}
- run: pip install --upgrade pip
- run: pip install .[sklearn,torch,sentencepiece,testing]
- - run: pip install -r examples/requirements.txt
+ - run: pip install -r examples/_tests_requirements.txt
- save_cache:
key: v0.4-torch_examples-{{ checksum "setup.py" }}
paths:
diff --git a/examples/README.md b/examples/README.md
index b9a3acca7..8349206ab 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -16,59 +16,58 @@ limitations under the License.
# Examples
-Version 2.9 of 🤗 Transformers introduced a new [`Trainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) class for PyTorch, and its equivalent [`TFTrainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer_tf.py) for TF 2.
-Running the examples requires PyTorch 1.3.1+ or TensorFlow 2.2+.
-
-Here is the list of all our examples:
-- **grouped by task** (all official examples work for multiple models)
-- with information on whether they are **built on top of `Trainer`/`TFTrainer`** (if not, they still work, they might
- just lack some features),
-- whether or not they leverage the [🤗 Datasets](https://github.com/huggingface/datasets) library.
-- links to **Colab notebooks** to walk through the scripts and run them easily,
-- links to **Cloud deployments** to be able to deploy large-scale trainings in the Cloud with little to no setup.
-
+This folder contains actively maintained examples of the use of 🤗 Transformers, organized by NLP task. If you are looking for an example that used to
+be in this folder, it may have moved to our [research projects](https://github.com/huggingface/transformers/tree/master/examples/research_projects) subfolder (which contains frozen snapshots of research projects).
## Important note
**Important**
-To make sure you can successfully run the latest versions of the example scripts, you have to **install the library from source** and install some example-specific requirements.
-Execute the following steps in a new virtual environment:
-
+To make sure you can successfully run the latest versions of the example scripts, you have to **install the library from source** and install some example-specific requirements. To do this, execute the following steps in a new virtual environment:
```bash
git clone https://github.com/huggingface/transformers
cd transformers
pip install .
-pip install -r ./examples/requirements.txt
+```
+Then cd into the example folder of your choice and run
+```bash
+pip install -r requirements.txt
```
-Alternatively, you can run the version of the examples as they were for your current version of Transformers via (for instance with v3.4.0):
+Alternatively, you can run the version of the examples as they were for your current version of Transformers via the following command (for instance with v3.5.1):
```bash
-git checkout tags/v3.4.0
+git checkout tags/v3.5.1
```
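+Either way, a quick sanity check (not part of the official steps) to see which version of the library your environment actually picks up is:
+```python
+import transformers
+
+# Prints the installed version, e.g. a dev version when installed from source
+print(transformers.__version__)
+```
+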
## The Big Table of Tasks
+Here is the list of all our examples:
+- with information on whether they are **built on top of `Trainer`/`TFTrainer`** (if not, they still work, they might
+  just lack some features),
+- whether or not they leverage the [🤗 Datasets](https://github.com/huggingface/datasets) library,
+- links to **Colab notebooks** to walk through the scripts and run them easily.
+
+
| Task | Example datasets | Trainer support | TFTrainer support | 🤗 Datasets | Colab
|---|---|:---:|:---:|:---:|:---:|
| [**`language-modeling`**](https://github.com/huggingface/transformers/tree/master/examples/language-modeling) | Raw text | ✅ | - | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)
-| [**`text-classification`**](https://github.com/huggingface/transformers/tree/master/examples/text-classification) | GLUE, XNLI | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb)
-| [**`token-classification`**](https://github.com/huggingface/transformers/tree/master/examples/token-classification) | CoNLL NER | ✅ | ✅ | ✅ | -
| [**`multiple-choice`**](https://github.com/huggingface/transformers/tree/master/examples/multiple-choice) | SWAG, RACE, ARC | ✅ | ✅ | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ViktorAlm/notebooks/blob/master/MPC_GPU_Demo_for_TF_and_PT.ipynb)
| [**`question-answering`**](https://github.com/huggingface/transformers/tree/master/examples/question-answering) | SQuAD | ✅ | ✅ | ✅ | -
-| [**`text-generation`**](https://github.com/huggingface/transformers/tree/master/examples/text-generation) | - | n/a | n/a | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)
-| [**`distillation`**](https://github.com/huggingface/transformers/tree/master/examples/distillation) | All | - | - | - | -
| [**`summarization`**](https://github.com/huggingface/transformers/tree/master/examples/seq2seq) | CNN/Daily Mail | ✅ | - | - | -
+| [**`text-classification`**](https://github.com/huggingface/transformers/tree/master/examples/text-classification) | GLUE, XNLI | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb)
+| [**`text-generation`**](https://github.com/huggingface/transformers/tree/master/examples/text-generation) | - | n/a | n/a | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)
+| [**`token-classification`**](https://github.com/huggingface/transformers/tree/master/examples/token-classification) | CoNLL NER | ✅ | ✅ | ✅ | -
| [**`translation`**](https://github.com/huggingface/transformers/tree/master/examples/seq2seq) | WMT | ✅ | - | - | -
-| [**`bertology`**](https://github.com/huggingface/transformers/tree/master/examples/bertology) | - | - | - | - | -
-| [**`adversarial`**](https://github.com/huggingface/transformers/tree/master/examples/adversarial) | HANS | ✅ | - | - | -
-
-
+
## Running on TPUs
diff --git a/examples/_tests_requirements.txt b/examples/_tests_requirements.txt
new file mode 100644
index 000000000..e40aef179
--- /dev/null
+++ b/examples/_tests_requirements.txt
@@ -0,0 +1,20 @@
+tensorboard
+scikit-learn
+seqeval
+psutil
+sacrebleu
+rouge-score
+tensorflow_datasets
+matplotlib
+git-python==1.0.3
+faiss-cpu
+streamlit
+elasticsearch
+nltk
+pandas
+datasets >= 1.1.3
+fire
+pytest
+conllu
+sentencepiece != 0.1.92
+protobuf
diff --git a/examples/benchmarking/README.md b/examples/benchmarking/README.md
index d2d7ac91c..7099ed9f6 100644
--- a/examples/benchmarking/README.md
+++ b/examples/benchmarking/README.md
@@ -1,3 +1,19 @@
+
+
# 🤗 Benchmark results
Here, you can find a list of the different benchmark results created by the community.
diff --git a/examples/benchmarking/plot_csv_file.py b/examples/benchmarking/plot_csv_file.py
index 6614df0a9..58dc50bb8 100644
--- a/examples/benchmarking/plot_csv_file.py
+++ b/examples/benchmarking/plot_csv_file.py
@@ -1,3 +1,17 @@
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
import csv
from collections import defaultdict
from dataclasses import dataclass, field
diff --git a/examples/bert-loses-patience/pabee/__init__.py b/examples/benchmarking/requirements.txt
similarity index 100%
rename from examples/bert-loses-patience/pabee/__init__.py
rename to examples/benchmarking/requirements.txt
diff --git a/examples/conftest.py b/examples/conftest.py
index 75f5667f3..2415ae8db 100644
--- a/examples/conftest.py
+++ b/examples/conftest.py
@@ -1,3 +1,17 @@
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
# tests directory-specific settings - this file is run automatically
# by pytest before any tests are run
diff --git a/examples/contrib/README.md b/examples/contrib/README.md
deleted file mode 100644
index f2d0616e6..000000000
--- a/examples/contrib/README.md
+++ /dev/null
@@ -1,5 +0,0 @@
-# Community contributed examples
-
-This folder contains examples which are not actively maintained (mostly contributed by the community).
-
-Using these examples together with a recent version of the library usually requires to make small (sometimes big) adaptations to get the scripts working.
diff --git a/examples/language-modeling/README.md b/examples/language-modeling/README.md
index 2a62f7e4a..d9cc2a72c 100644
--- a/examples/language-modeling/README.md
+++ b/examples/language-modeling/README.md
@@ -1,3 +1,19 @@
+
+
## Language model training
Fine-tuning (or training from scratch) the library models for language modeling on a text dataset for GPT, GPT-2,
diff --git a/examples/language-modeling/requirements.txt b/examples/language-modeling/requirements.txt
new file mode 100644
index 000000000..0f5c38bd4
--- /dev/null
+++ b/examples/language-modeling/requirements.txt
@@ -0,0 +1,3 @@
+datasets >= 1.1.3
+sentencepiece != 0.1.92
+protobuf
diff --git a/examples/legacy/README.md b/examples/legacy/README.md
new file mode 100644
index 000000000..eaf64f624
--- /dev/null
+++ b/examples/legacy/README.md
@@ -0,0 +1,21 @@
+
+
+# Legacy examples
+
+This folder contains examples which are not actively maintained (mostly contributed by the community).
+
+Using these examples together with a recent version of the library usually requires making small (sometimes big) adaptations to get the scripts working.
diff --git a/examples/lightning_base.py b/examples/legacy/pytorch-lightning/lightning_base.py
similarity index 100%
rename from examples/lightning_base.py
rename to examples/legacy/pytorch-lightning/lightning_base.py
diff --git a/examples/requirements.txt b/examples/legacy/pytorch-lightning/requirements.txt
similarity index 100%
rename from examples/requirements.txt
rename to examples/legacy/pytorch-lightning/requirements.txt
diff --git a/examples/text-classification/run_pl_glue.py b/examples/legacy/pytorch-lightning/run_glue.py
similarity index 100%
rename from examples/text-classification/run_pl_glue.py
rename to examples/legacy/pytorch-lightning/run_glue.py
diff --git a/examples/text-classification/run_pl.sh b/examples/legacy/pytorch-lightning/run_glue.sh
similarity index 93%
rename from examples/text-classification/run_pl.sh
rename to examples/legacy/pytorch-lightning/run_glue.sh
index d407478ff..7cd57306d 100755
--- a/examples/text-classification/run_pl.sh
+++ b/examples/legacy/pytorch-lightning/run_glue.sh
@@ -21,7 +21,7 @@ mkdir -p $OUTPUT_DIR
# Add parent directory to python path to access lightning_base.py
export PYTHONPATH="../":"${PYTHONPATH}"
-python3 run_pl_glue.py --gpus 1 --data_dir $DATA_DIR \
+python3 run_glue.py --gpus 1 --data_dir $DATA_DIR \
--task $TASK \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR \
diff --git a/examples/token-classification/run_pl_ner.py b/examples/legacy/pytorch-lightning/run_ner.py
similarity index 100%
rename from examples/token-classification/run_pl_ner.py
rename to examples/legacy/pytorch-lightning/run_ner.py
diff --git a/examples/token-classification/run_pl.sh b/examples/legacy/pytorch-lightning/run_ner.sh
similarity index 97%
rename from examples/token-classification/run_pl.sh
rename to examples/legacy/pytorch-lightning/run_ner.sh
index 5abcd981b..2913473eb 100755
--- a/examples/token-classification/run_pl.sh
+++ b/examples/legacy/pytorch-lightning/run_ner.sh
@@ -31,7 +31,7 @@ mkdir -p $OUTPUT_DIR
# Add parent directory to python path to access lightning_base.py
export PYTHONPATH="../":"${PYTHONPATH}"
-python3 run_pl_ner.py --data_dir ./ \
+python3 run_ner.py --data_dir ./ \
--labels ./labels.txt \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR \
diff --git a/examples/token-classification/run_pos_pl.sh b/examples/legacy/pytorch-lightning/run_pos.sh
similarity index 96%
rename from examples/token-classification/run_pos_pl.sh
rename to examples/legacy/pytorch-lightning/run_pos.sh
index e2539ea71..93765366c 100755
--- a/examples/token-classification/run_pos_pl.sh
+++ b/examples/legacy/pytorch-lightning/run_pos.sh
@@ -26,7 +26,7 @@ export SEED=1
# Add parent directory to python path to access lightning_base.py
export PYTHONPATH="../":"${PYTHONPATH}"
-python3 run_pl_ner.py --data_dir ./ \
+python3 run_ner.py --data_dir ./ \
--task_type POS \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR \
diff --git a/examples/question-answering/run_squad.py b/examples/legacy/question-answering/run_squad.py
similarity index 100%
rename from examples/question-answering/run_squad.py
rename to examples/legacy/question-answering/run_squad.py
diff --git a/examples/question-answering/run_squad_trainer.py b/examples/legacy/question-answering/run_squad_trainer.py
similarity index 100%
rename from examples/question-answering/run_squad_trainer.py
rename to examples/legacy/question-answering/run_squad_trainer.py
diff --git a/examples/contrib/run_camembert.py b/examples/legacy/run_camembert.py
similarity index 100%
rename from examples/contrib/run_camembert.py
rename to examples/legacy/run_camembert.py
diff --git a/examples/contrib/run_chinese_ref.py b/examples/legacy/run_chinese_ref.py
similarity index 100%
rename from examples/contrib/run_chinese_ref.py
rename to examples/legacy/run_chinese_ref.py
diff --git a/examples/contrib/legacy/run_language_modeling.py b/examples/legacy/run_language_modeling.py
similarity index 100%
rename from examples/contrib/legacy/run_language_modeling.py
rename to examples/legacy/run_language_modeling.py
diff --git a/examples/contrib/run_openai_gpt.py b/examples/legacy/run_openai_gpt.py
similarity index 100%
rename from examples/contrib/run_openai_gpt.py
rename to examples/legacy/run_openai_gpt.py
diff --git a/examples/contrib/run_swag.py b/examples/legacy/run_swag.py
similarity index 100%
rename from examples/contrib/run_swag.py
rename to examples/legacy/run_swag.py
diff --git a/examples/contrib/run_transfo_xl.py b/examples/legacy/run_transfo_xl.py
similarity index 100%
rename from examples/contrib/run_transfo_xl.py
rename to examples/legacy/run_transfo_xl.py
diff --git a/examples/legacy/token-classification/README.md b/examples/legacy/token-classification/README.md
new file mode 100644
index 000000000..3411a6154
--- /dev/null
+++ b/examples/legacy/token-classification/README.md
@@ -0,0 +1,229 @@
+## Token classification
+
+Based on the script [`run_ner.py`](https://github.com/huggingface/transformers/blob/master/examples/legacy/token-classification/run_ner.py).
+
+The following examples are covered in this section:
+
+* NER on the GermEval 2014 (German NER) dataset
+* Emerging and Rare Entities task: WNUT’17 (English NER) dataset
+
+Details and results for the fine-tuning are provided by @stefan-it.
+
+### GermEval 2014 (German NER) dataset
+
+#### Data (Download and pre-processing steps)
+
+Data can be obtained from the [GermEval 2014](https://sites.google.com/site/germeval2014ner/data) shared task page.
+
+Here are the commands for downloading and pre-processing the train, dev and test datasets. The original data format has four (tab-separated) columns; in a pre-processing step, only the two relevant columns (token and outer span NER annotation) are extracted:
+
+```bash
+curl -L 'https://drive.google.com/uc?export=download&id=1Jjhbal535VVz2ap4v4r_rN1UEHTdLK5P' \
+| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > train.txt.tmp
+curl -L 'https://drive.google.com/uc?export=download&id=1ZfRcQThdtAR5PPRjIDtrVP7BtXSCUBbm' \
+| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > dev.txt.tmp
+curl -L 'https://drive.google.com/uc?export=download&id=1u9mb7kNJHWQCWyweMDRMuTFoOHOfeBTH' \
+| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > test.txt.tmp
+```
+
+The GermEval 2014 dataset contains some strange "control character" tokens like `'\x96', '\u200e', '\x95', '\xad' or '\x80'`.
+One problem with these tokens is that `BertTokenizer` returns an empty token for them, resulting in misaligned `InputExample`s.
+The `preprocess.py` script, located in the `scripts` folder, a) filters these tokens and b) splits longer sentences into smaller ones (once the max. subtoken length is reached).
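+
+The sketch below is only a rough, hypothetical illustration of that filtering and splitting logic (the reference implementation is `scripts/preprocess.py` in this folder); it assumes the same three positional arguments (dataset file, model name and max. length) used by the commands further down:
+
+```python
+import sys
+
+from transformers import AutoTokenizer
+
+dataset, model_name_or_path, max_len = sys.argv[1], sys.argv[2], int(sys.argv[3])
+tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
+# Leave room for special tokens such as [CLS] and [SEP]
+max_len -= tokenizer.num_special_tokens_to_add()
+
+subword_len_counter = 0
+with open(dataset, encoding="utf-8") as f:
+    for line in f:
+        line = line.rstrip()
+        if not line:
+            # An empty line marks a sentence boundary: reset the subtoken counter
+            print(line)
+            subword_len_counter = 0
+            continue
+        token = line.split()[0]
+        current_subwords_len = len(tokenizer.tokenize(token))
+        # a) filter "control character" tokens for which the tokenizer returns nothing
+        if current_subwords_len == 0:
+            continue
+        # b) start a new (smaller) sentence once the max. subtoken length would be exceeded
+        if (subword_len_counter + current_subwords_len) > max_len:
+            print("")
+            subword_len_counter = 0
+        subword_len_counter += current_subwords_len
+        print(line)
+```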
+
+Let's define some variables that we need for further pre-processing steps and training the model:
+
+```bash
+export MAX_LENGTH=128
+export BERT_MODEL=bert-base-multilingual-cased
+```
+
+Run the pre-processing script on training, dev and test datasets:
+
+```bash
+python3 scripts/preprocess.py train.txt.tmp $BERT_MODEL $MAX_LENGTH > train.txt
+python3 scripts/preprocess.py dev.txt.tmp $BERT_MODEL $MAX_LENGTH > dev.txt
+python3 scripts/preprocess.py test.txt.tmp $BERT_MODEL $MAX_LENGTH > test.txt
+```
+
+The GermEval 2014 dataset has many more labels than the CoNLL-2002/2003 datasets, so a dedicated set of labels must be used:
+
+```bash
+cat train.txt dev.txt test.txt | cut -d " " -f 2 | grep -v "^$"| sort | uniq > labels.txt
+```
+
+#### Prepare the run
+
+Additional environment variables must be set:
+
+```bash
+export OUTPUT_DIR=germeval-model
+export BATCH_SIZE=32
+export NUM_EPOCHS=3
+export SAVE_STEPS=750
+export SEED=1
+```
+
+#### Run the PyTorch version
+
+To start training, just run:
+
+```bash
+python3 run_ner.py --data_dir ./ \
+--labels ./labels.txt \
+--model_name_or_path $BERT_MODEL \
+--output_dir $OUTPUT_DIR \
+--max_seq_length $MAX_LENGTH \
+--num_train_epochs $NUM_EPOCHS \
+--per_device_train_batch_size $BATCH_SIZE \
+--save_steps $SAVE_STEPS \
+--seed $SEED \
+--do_train \
+--do_eval \
+--do_predict
+```
+
+If your GPU supports half-precision training, just add the `--fp16` flag. After training, the model will be evaluated on both the development and test datasets.
+
+#### JSON-based configuration file
+
+Instead of passing all parameters via command-line arguments, the `run_ner.py` script also supports reading parameters from a JSON-based configuration file:
+
+```json
+{
+ "data_dir": ".",
+ "labels": "./labels.txt",
+ "model_name_or_path": "bert-base-multilingual-cased",
+ "output_dir": "germeval-model",
+ "max_seq_length": 128,
+ "num_train_epochs": 3,
+ "per_device_train_batch_size": 32,
+ "save_steps": 750,
+ "seed": 1,
+ "do_train": true,
+ "do_eval": true,
+ "do_predict": true
+}
+```
+
+It must be saved with a `.json` extension and can be used by running `python3 run_ner.py config.json`.
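+
+For reference, here is a minimal, hypothetical sketch of how such JSON parsing can be done with `HfArgumentParser` (the dataclass below is illustrative only, not the script's actual argument set):
+
+```python
+import sys
+from dataclasses import dataclass, field
+
+from transformers import HfArgumentParser, TrainingArguments
+
+
+@dataclass
+class ModelArguments:
+    # Illustrative argument; the real script defines its own argument dataclasses
+    model_name_or_path: str = field(default="bert-base-multilingual-cased")
+
+
+parser = HfArgumentParser((ModelArguments, TrainingArguments))
+if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
+    # All parameters are read from the JSON file instead of the command line
+    model_args, training_args = parser.parse_json_file(json_file=sys.argv[1])
+else:
+    model_args, training_args = parser.parse_args_into_dataclasses()
+```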
+
+#### Evaluation
+
+Evaluation on the development dataset outputs the following for our example:
+
+```bash
+10/04/2019 00:42:06 - INFO - __main__ - ***** Eval results *****
+10/04/2019 00:42:06 - INFO - __main__ - f1 = 0.8623348017621146
+10/04/2019 00:42:06 - INFO - __main__ - loss = 0.07183869666975543
+10/04/2019 00:42:06 - INFO - __main__ - precision = 0.8467916366258111
+10/04/2019 00:42:06 - INFO - __main__ - recall = 0.8784592370979806
+```
+
+On the test dataset, the following results could be achieved:
+
+```bash
+10/04/2019 00:42:42 - INFO - __main__ - ***** Eval results *****
+10/04/2019 00:42:42 - INFO - __main__ - f1 = 0.8614389652384803
+10/04/2019 00:42:42 - INFO - __main__ - loss = 0.07064602487454782
+10/04/2019 00:42:42 - INFO - __main__ - precision = 0.8604651162790697
+10/04/2019 00:42:42 - INFO - __main__ - recall = 0.8624150210424085
+```
+
+### Emerging and Rare Entities task: WNUT’17 (English NER) dataset
+
+Description of the WNUT’17 task from the [shared task website](http://noisy-text.github.io/2017/index.html):
+
+> The WNUT’17 shared task focuses on identifying unusual, previously-unseen entities in the context of emerging discussions.
+> Named entities form the basis of many modern approaches to other tasks (like event clustering and summarization), but recall on
+> them is a real problem in noisy text - even among annotators. This drop tends to be due to novel entities and surface forms.
+
+Six labels are available in the dataset. An overview can be found on this [page](http://noisy-text.github.io/2017/files/).
+
+#### Data (Download and pre-processing steps)
+
+The dataset can be downloaded from the [official GitHub](https://github.com/leondz/emerging_entities_17) repository.
+
+The following commands show how to prepare the dataset for fine-tuning:
+
+```bash
+mkdir -p data_wnut_17
+
+curl -L 'https://github.com/leondz/emerging_entities_17/raw/master/wnut17train.conll' | tr '\t' ' ' > data_wnut_17/train.txt.tmp
+curl -L 'https://github.com/leondz/emerging_entities_17/raw/master/emerging.dev.conll' | tr '\t' ' ' > data_wnut_17/dev.txt.tmp
+curl -L 'https://raw.githubusercontent.com/leondz/emerging_entities_17/master/emerging.test.annotated' | tr '\t' ' ' > data_wnut_17/test.txt.tmp
+```
+
+Let's define some variables that we need for further pre-processing steps:
+
+```bash
+export MAX_LENGTH=128
+export BERT_MODEL=bert-large-cased
+```
+
+Here we use the English BERT large model for fine-tuning.
+The `preprocess.py` script splits longer sentences into smaller ones (once the max. subtoken length is reached):
+
+```bash
+python3 scripts/preprocess.py data_wnut_17/train.txt.tmp $BERT_MODEL $MAX_LENGTH > data_wnut_17/train.txt
+python3 scripts/preprocess.py data_wnut_17/dev.txt.tmp $BERT_MODEL $MAX_LENGTH > data_wnut_17/dev.txt
+python3 scripts/preprocess.py data_wnut_17/test.txt.tmp $BERT_MODEL $MAX_LENGTH > data_wnut_17/test.txt
+```
+
+In the last pre-processing step, the `labels.txt` file needs to be generated. This file contains all available labels:
+
+```bash
+cat data_wnut_17/train.txt data_wnut_17/dev.txt data_wnut_17/test.txt | cut -d " " -f 2 | grep -v "^$"| sort | uniq > data_wnut_17/labels.txt
+```
+
+#### Run the PyTorch version
+
+Fine-tuning with the PyTorch version can be started using the `run_ner.py` script. In this example we use a JSON-based configuration file.
+
+This configuration file looks like:
+
+```json
+{
+ "data_dir": "./data_wnut_17",
+ "labels": "./data_wnut_17/labels.txt",
+ "model_name_or_path": "bert-large-cased",
+ "output_dir": "wnut-17-model-1",
+ "max_seq_length": 128,
+ "num_train_epochs": 3,
+ "per_device_train_batch_size": 32,
+ "save_steps": 425,
+ "seed": 1,
+ "do_train": true,
+ "do_eval": true,
+ "do_predict": true,
+ "fp16": false
+}
+```
+
+If your GPU supports half-precision training, please set `fp16` to `true`.
+
+Save this JSON-based configuration under `wnut_17.json`. The fine-tuning can be started with `python3 run_ner.py wnut_17.json`.
+
+#### Evaluation
+
+Evaluation on the development dataset outputs the following:
+
+```bash
+05/29/2020 23:33:44 - INFO - __main__ - ***** Eval results *****
+05/29/2020 23:33:44 - INFO - __main__ - eval_loss = 0.26505235286212275
+05/29/2020 23:33:44 - INFO - __main__ - eval_precision = 0.7008264462809918
+05/29/2020 23:33:44 - INFO - __main__ - eval_recall = 0.507177033492823
+05/29/2020 23:33:44 - INFO - __main__ - eval_f1 = 0.5884802220680084
+05/29/2020 23:33:44 - INFO - __main__ - epoch = 3.0
+```
+
+On the test dataset, the following results could be achieved:
+
+```bash
+05/29/2020 23:33:44 - INFO - transformers.trainer - ***** Running Prediction *****
+05/29/2020 23:34:02 - INFO - __main__ - eval_loss = 0.30948806500973547
+05/29/2020 23:34:02 - INFO - __main__ - eval_precision = 0.5840108401084011
+05/29/2020 23:34:02 - INFO - __main__ - eval_recall = 0.3994439295644115
+05/29/2020 23:34:02 - INFO - __main__ - eval_f1 = 0.47440836543753434
+```
+
+WNUT’17 is a very difficult task. Current state-of-the-art results on this dataset can be found [here](http://nlpprogress.com/english/named_entity_recognition.html).
diff --git a/examples/token-classification/run_old.sh b/examples/legacy/token-classification/run.sh
similarity index 98%
rename from examples/token-classification/run_old.sh
rename to examples/legacy/token-classification/run.sh
index 90cb4484d..f5cbf0d50 100755
--- a/examples/token-classification/run_old.sh
+++ b/examples/legacy/token-classification/run.sh
@@ -20,7 +20,7 @@ export NUM_EPOCHS=3
export SAVE_STEPS=750
export SEED=1
-python3 run_ner_old.py \
+python3 run_ner.py \
--task_type NER \
--data_dir . \
--labels ./labels.txt \
diff --git a/examples/token-classification/run_chunk.sh b/examples/legacy/token-classification/run_chunk.sh
similarity index 97%
rename from examples/token-classification/run_chunk.sh
rename to examples/legacy/token-classification/run_chunk.sh
index 3dbb03306..13341555b 100755
--- a/examples/token-classification/run_chunk.sh
+++ b/examples/legacy/token-classification/run_chunk.sh
@@ -21,7 +21,7 @@ export NUM_EPOCHS=3
export SAVE_STEPS=750
export SEED=1
-python3 run_ner_old.py \
+python3 run_ner.py \
--task_type Chunk \
--data_dir . \
--model_name_or_path $BERT_MODEL \
diff --git a/examples/token-classification/run_ner_old.py b/examples/legacy/token-classification/run_ner.py
similarity index 100%
rename from examples/token-classification/run_ner_old.py
rename to examples/legacy/token-classification/run_ner.py
diff --git a/examples/token-classification/run_pos.sh b/examples/legacy/token-classification/run_pos.sh
similarity index 97%
rename from examples/token-classification/run_pos.sh
rename to examples/legacy/token-classification/run_pos.sh
index 50aed87d4..7d76ed8a2 100755
--- a/examples/token-classification/run_pos.sh
+++ b/examples/legacy/token-classification/run_pos.sh
@@ -21,7 +21,7 @@ export NUM_EPOCHS=3
export SAVE_STEPS=750
export SEED=1
-python3 run_ner_old.py \
+python3 run_ner.py \
--task_type POS \
--data_dir . \
--model_name_or_path $BERT_MODEL \
diff --git a/examples/token-classification/scripts/preprocess.py b/examples/legacy/token-classification/scripts/preprocess.py
similarity index 100%
rename from examples/token-classification/scripts/preprocess.py
rename to examples/legacy/token-classification/scripts/preprocess.py
diff --git a/examples/token-classification/tasks.py b/examples/legacy/token-classification/tasks.py
similarity index 100%
rename from examples/token-classification/tasks.py
rename to examples/legacy/token-classification/tasks.py
diff --git a/examples/token-classification/utils_ner.py b/examples/legacy/token-classification/utils_ner.py
similarity index 100%
rename from examples/token-classification/utils_ner.py
rename to examples/legacy/token-classification/utils_ner.py
diff --git a/examples/multiple-choice/README.md b/examples/multiple-choice/README.md
index 75430fea6..3d0a643cd 100644
--- a/examples/multiple-choice/README.md
+++ b/examples/multiple-choice/README.md
@@ -1,3 +1,19 @@
+
+
## Multiple Choice
Based on the script [`run_multiple_choice.py`]().
diff --git a/examples/multiple-choice/requirements.txt b/examples/multiple-choice/requirements.txt
new file mode 100644
index 000000000..013c579bc
--- /dev/null
+++ b/examples/multiple-choice/requirements.txt
@@ -0,0 +1,2 @@
+sentencepiece != 0.1.92
+protobuf
diff --git a/examples/question-answering/README.md b/examples/question-answering/README.md
index fffa72d01..ff800edc7 100644
--- a/examples/question-answering/README.md
+++ b/examples/question-answering/README.md
@@ -1,8 +1,29 @@
+
## SQuAD
-Based on the script [`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py).
+Based on the script [`run_qa.py`](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_qa.py).
+
+**Note:** This script only works with models that have a fast tokenizer (backed by the 🤗 Tokenizers library) as it
+uses special features of those tokenizers. You can check if your favorite model has a fast tokenizer in
+[this table](https://huggingface.co/transformers/index.html#bigtable); if it doesn't, you can still use the old version
+of the script.
+
+The old version of this script can be found [here](https://github.com/huggingface/transformers/blob/master/examples/legacy/question-answering/run_squad.py).
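+
+If you are not sure whether a given checkpoint ships a fast tokenizer, a quick check (the model name below is only an example) is:
+
+```python
+from transformers import AutoTokenizer
+
+# "bert-base-uncased" is an illustrative checkpoint; substitute the model you want to fine-tune
+tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
+print(tokenizer.is_fast)  # True -> works with run_qa.py, False -> use the old run_squad.py script
+```
+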
#### Fine-tuning BERT on SQuAD1.0
diff --git a/examples/question-answering/requirements.txt b/examples/question-answering/requirements.txt
new file mode 100644
index 000000000..ff72fc841
--- /dev/null
+++ b/examples/question-answering/requirements.txt
@@ -0,0 +1 @@
+datasets >= 1.1.3
diff --git a/examples/research_projects/README.md b/examples/research_projects/README.md
new file mode 100644
index 000000000..32d7fee04
--- /dev/null
+++ b/examples/research_projects/README.md
@@ -0,0 +1,28 @@
+
+
+# Research projects
+
+This folder contains various research projects using 🤗 Transformers. They are not maintained and require a specific
+version of 🤗 Transformers that is indicated in the requirements file of each folder. Updating them to the most recent version of the library will require some work.
+
+To use any of them, just run the command
+```
+pip install -r requirements.txt
+```
+inside the folder of your choice.
+
+If you need help with any of those, contact the author(s), indicated at the top of the `README` of each folder.
diff --git a/examples/adversarial/README.md b/examples/research_projects/adversarial/README.md
similarity index 100%
rename from examples/adversarial/README.md
rename to examples/research_projects/adversarial/README.md
diff --git a/examples/research_projects/adversarial/requirements.txt b/examples/research_projects/adversarial/requirements.txt
new file mode 100644
index 000000000..f6332785e
--- /dev/null
+++ b/examples/research_projects/adversarial/requirements.txt
@@ -0,0 +1 @@
+transformers == 3.5.1
diff --git a/examples/adversarial/run_hans.py b/examples/research_projects/adversarial/run_hans.py
similarity index 100%
rename from examples/adversarial/run_hans.py
rename to examples/research_projects/adversarial/run_hans.py
diff --git a/examples/adversarial/utils_hans.py b/examples/research_projects/adversarial/utils_hans.py
similarity index 100%
rename from examples/adversarial/utils_hans.py
rename to examples/research_projects/adversarial/utils_hans.py
diff --git a/examples/bert-loses-patience/README.md b/examples/research_projects/bert-loses-patience/README.md
similarity index 100%
rename from examples/bert-loses-patience/README.md
rename to examples/research_projects/bert-loses-patience/README.md
diff --git a/examples/deebert/src/__init__.py b/examples/research_projects/bert-loses-patience/pabee/__init__.py
similarity index 100%
rename from examples/deebert/src/__init__.py
rename to examples/research_projects/bert-loses-patience/pabee/__init__.py
diff --git a/examples/bert-loses-patience/pabee/modeling_pabee_albert.py b/examples/research_projects/bert-loses-patience/pabee/modeling_pabee_albert.py
similarity index 100%
rename from examples/bert-loses-patience/pabee/modeling_pabee_albert.py
rename to examples/research_projects/bert-loses-patience/pabee/modeling_pabee_albert.py
diff --git a/examples/bert-loses-patience/pabee/modeling_pabee_bert.py b/examples/research_projects/bert-loses-patience/pabee/modeling_pabee_bert.py
similarity index 100%
rename from examples/bert-loses-patience/pabee/modeling_pabee_bert.py
rename to examples/research_projects/bert-loses-patience/pabee/modeling_pabee_bert.py
diff --git a/examples/research_projects/bert-loses-patience/requirements.txt b/examples/research_projects/bert-loses-patience/requirements.txt
new file mode 100644
index 000000000..3c01e97e7
--- /dev/null
+++ b/examples/research_projects/bert-loses-patience/requirements.txt
@@ -0,0 +1 @@
+transformers == 3.5.1
\ No newline at end of file
diff --git a/examples/bert-loses-patience/run_glue_with_pabee.py b/examples/research_projects/bert-loses-patience/run_glue_with_pabee.py
similarity index 100%
rename from examples/bert-loses-patience/run_glue_with_pabee.py
rename to examples/research_projects/bert-loses-patience/run_glue_with_pabee.py
diff --git a/examples/bert-loses-patience/test_run_glue_with_pabee.py b/examples/research_projects/bert-loses-patience/test_run_glue_with_pabee.py
similarity index 100%
rename from examples/bert-loses-patience/test_run_glue_with_pabee.py
rename to examples/research_projects/bert-loses-patience/test_run_glue_with_pabee.py
diff --git a/examples/seq2seq/bertabs/README.md b/examples/research_projects/bertabs/README.md
similarity index 100%
rename from examples/seq2seq/bertabs/README.md
rename to examples/research_projects/bertabs/README.md
diff --git a/examples/seq2seq/bertabs/__init__.py b/examples/research_projects/bertabs/__init__.py
similarity index 100%
rename from examples/seq2seq/bertabs/__init__.py
rename to examples/research_projects/bertabs/__init__.py
diff --git a/examples/seq2seq/bertabs/configuration_bertabs.py b/examples/research_projects/bertabs/configuration_bertabs.py
similarity index 100%
rename from examples/seq2seq/bertabs/configuration_bertabs.py
rename to examples/research_projects/bertabs/configuration_bertabs.py
diff --git a/examples/seq2seq/bertabs/convert_bertabs_original_pytorch_checkpoint.py b/examples/research_projects/bertabs/convert_bertabs_original_pytorch_checkpoint.py
similarity index 100%
rename from examples/seq2seq/bertabs/convert_bertabs_original_pytorch_checkpoint.py
rename to examples/research_projects/bertabs/convert_bertabs_original_pytorch_checkpoint.py
diff --git a/examples/seq2seq/bertabs/modeling_bertabs.py b/examples/research_projects/bertabs/modeling_bertabs.py
similarity index 100%
rename from examples/seq2seq/bertabs/modeling_bertabs.py
rename to examples/research_projects/bertabs/modeling_bertabs.py
diff --git a/examples/seq2seq/bertabs/requirements.txt b/examples/research_projects/bertabs/requirements.txt
similarity index 55%
rename from examples/seq2seq/bertabs/requirements.txt
rename to examples/research_projects/bertabs/requirements.txt
index f984af489..cdbfb260c 100644
--- a/examples/seq2seq/bertabs/requirements.txt
+++ b/examples/research_projects/bertabs/requirements.txt
@@ -1,4 +1,4 @@
-transformers
+transformers == 3.5.1
# For ROUGE
nltk
diff --git a/examples/seq2seq/bertabs/run_summarization.py b/examples/research_projects/bertabs/run_summarization.py
similarity index 100%
rename from examples/seq2seq/bertabs/run_summarization.py
rename to examples/research_projects/bertabs/run_summarization.py
diff --git a/examples/seq2seq/bertabs/test_utils_summarization.py b/examples/research_projects/bertabs/test_utils_summarization.py
similarity index 100%
rename from examples/seq2seq/bertabs/test_utils_summarization.py
rename to examples/research_projects/bertabs/test_utils_summarization.py
diff --git a/examples/seq2seq/bertabs/utils_summarization.py b/examples/research_projects/bertabs/utils_summarization.py
similarity index 100%
rename from examples/seq2seq/bertabs/utils_summarization.py
rename to examples/research_projects/bertabs/utils_summarization.py
diff --git a/examples/research_projects/bertology/requirements.txt b/examples/research_projects/bertology/requirements.txt
new file mode 100644
index 000000000..f6332785e
--- /dev/null
+++ b/examples/research_projects/bertology/requirements.txt
@@ -0,0 +1 @@
+transformers == 3.5.1
diff --git a/examples/bertology/run_bertology.py b/examples/research_projects/bertology/run_bertology.py
similarity index 100%
rename from examples/bertology/run_bertology.py
rename to examples/research_projects/bertology/run_bertology.py
diff --git a/examples/deebert/README.md b/examples/research_projects/deebert/README.md
similarity index 100%
rename from examples/deebert/README.md
rename to examples/research_projects/deebert/README.md
diff --git a/examples/deebert/entropy_eval.sh b/examples/research_projects/deebert/entropy_eval.sh
similarity index 100%
rename from examples/deebert/entropy_eval.sh
rename to examples/research_projects/deebert/entropy_eval.sh
diff --git a/examples/deebert/eval_deebert.sh b/examples/research_projects/deebert/eval_deebert.sh
similarity index 100%
rename from examples/deebert/eval_deebert.sh
rename to examples/research_projects/deebert/eval_deebert.sh
diff --git a/examples/research_projects/deebert/requirements.txt b/examples/research_projects/deebert/requirements.txt
new file mode 100644
index 000000000..f6332785e
--- /dev/null
+++ b/examples/research_projects/deebert/requirements.txt
@@ -0,0 +1 @@
+transformers == 3.5.1
diff --git a/examples/deebert/run_glue_deebert.py b/examples/research_projects/deebert/run_glue_deebert.py
similarity index 100%
rename from examples/deebert/run_glue_deebert.py
rename to examples/research_projects/deebert/run_glue_deebert.py
diff --git a/examples/research_projects/deebert/src/__init__.py b/examples/research_projects/deebert/src/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/examples/deebert/src/modeling_highway_bert.py b/examples/research_projects/deebert/src/modeling_highway_bert.py
similarity index 100%
rename from examples/deebert/src/modeling_highway_bert.py
rename to examples/research_projects/deebert/src/modeling_highway_bert.py
diff --git a/examples/deebert/src/modeling_highway_roberta.py b/examples/research_projects/deebert/src/modeling_highway_roberta.py
similarity index 100%
rename from examples/deebert/src/modeling_highway_roberta.py
rename to examples/research_projects/deebert/src/modeling_highway_roberta.py
diff --git a/examples/deebert/test_glue_deebert.py b/examples/research_projects/deebert/test_glue_deebert.py
similarity index 100%
rename from examples/deebert/test_glue_deebert.py
rename to examples/research_projects/deebert/test_glue_deebert.py
diff --git a/examples/deebert/train_deebert.sh b/examples/research_projects/deebert/train_deebert.sh
similarity index 100%
rename from examples/deebert/train_deebert.sh
rename to examples/research_projects/deebert/train_deebert.sh
diff --git a/examples/distillation/README.md b/examples/research_projects/distillation/README.md
similarity index 99%
rename from examples/distillation/README.md
rename to examples/research_projects/distillation/README.md
index 766ce217a..3dc2c53a1 100644
--- a/examples/distillation/README.md
+++ b/examples/research_projects/distillation/README.md
@@ -1,5 +1,7 @@
# Distil*
+Author: @VictorSanh
+
This folder contains the original code used to train Distil* as well as examples showcasing how to use DistilBERT, DistilRoBERTa and DistilGPT2.
**January 20, 2020 - Bug fixing** We have recently discovered and fixed [a bug](https://github.com/huggingface/transformers/commit/48cbf267c988b56c71a2380f748a3e6092ccaed3) in the evaluation of our `run_*.py` scripts that caused the reported metrics to be over-estimated on average. We have updated all the metrics with the latest runs.
diff --git a/examples/distillation/distiller.py b/examples/research_projects/distillation/distiller.py
similarity index 100%
rename from examples/distillation/distiller.py
rename to examples/research_projects/distillation/distiller.py
diff --git a/examples/distillation/grouped_batch_sampler.py b/examples/research_projects/distillation/grouped_batch_sampler.py
similarity index 100%
rename from examples/distillation/grouped_batch_sampler.py
rename to examples/research_projects/distillation/grouped_batch_sampler.py
diff --git a/examples/distillation/lm_seqs_dataset.py b/examples/research_projects/distillation/lm_seqs_dataset.py
similarity index 100%
rename from examples/distillation/lm_seqs_dataset.py
rename to examples/research_projects/distillation/lm_seqs_dataset.py
diff --git a/examples/distillation/requirements.txt b/examples/research_projects/distillation/requirements.txt
similarity index 100%
rename from examples/distillation/requirements.txt
rename to examples/research_projects/distillation/requirements.txt
diff --git a/examples/distillation/run_squad_w_distillation.py b/examples/research_projects/distillation/run_squad_w_distillation.py
similarity index 100%
rename from examples/distillation/run_squad_w_distillation.py
rename to examples/research_projects/distillation/run_squad_w_distillation.py
diff --git a/examples/distillation/scripts/binarized_data.py b/examples/research_projects/distillation/scripts/binarized_data.py
similarity index 100%
rename from examples/distillation/scripts/binarized_data.py
rename to examples/research_projects/distillation/scripts/binarized_data.py
diff --git a/examples/distillation/scripts/extract.py b/examples/research_projects/distillation/scripts/extract.py
similarity index 100%
rename from examples/distillation/scripts/extract.py
rename to examples/research_projects/distillation/scripts/extract.py
diff --git a/examples/distillation/scripts/extract_distilbert.py b/examples/research_projects/distillation/scripts/extract_distilbert.py
similarity index 100%
rename from examples/distillation/scripts/extract_distilbert.py
rename to examples/research_projects/distillation/scripts/extract_distilbert.py
diff --git a/examples/distillation/scripts/token_counts.py b/examples/research_projects/distillation/scripts/token_counts.py
similarity index 100%
rename from examples/distillation/scripts/token_counts.py
rename to examples/research_projects/distillation/scripts/token_counts.py
diff --git a/examples/distillation/train.py b/examples/research_projects/distillation/train.py
similarity index 100%
rename from examples/distillation/train.py
rename to examples/research_projects/distillation/train.py
diff --git a/examples/distillation/training_configs/distilbert-base-cased.json b/examples/research_projects/distillation/training_configs/distilbert-base-cased.json
similarity index 100%
rename from examples/distillation/training_configs/distilbert-base-cased.json
rename to examples/research_projects/distillation/training_configs/distilbert-base-cased.json
diff --git a/examples/distillation/training_configs/distilbert-base-multilingual-cased.json b/examples/research_projects/distillation/training_configs/distilbert-base-multilingual-cased.json
similarity index 100%
rename from examples/distillation/training_configs/distilbert-base-multilingual-cased.json
rename to examples/research_projects/distillation/training_configs/distilbert-base-multilingual-cased.json
diff --git a/examples/distillation/training_configs/distilbert-base-uncased.json b/examples/research_projects/distillation/training_configs/distilbert-base-uncased.json
similarity index 100%
rename from examples/distillation/training_configs/distilbert-base-uncased.json
rename to examples/research_projects/distillation/training_configs/distilbert-base-uncased.json
diff --git a/examples/distillation/training_configs/distilgpt2.json b/examples/research_projects/distillation/training_configs/distilgpt2.json
similarity index 100%
rename from examples/distillation/training_configs/distilgpt2.json
rename to examples/research_projects/distillation/training_configs/distilgpt2.json
diff --git a/examples/distillation/training_configs/distilroberta-base.json b/examples/research_projects/distillation/training_configs/distilroberta-base.json
similarity index 100%
rename from examples/distillation/training_configs/distilroberta-base.json
rename to examples/research_projects/distillation/training_configs/distilroberta-base.json
diff --git a/examples/distillation/utils.py b/examples/research_projects/distillation/utils.py
similarity index 100%
rename from examples/distillation/utils.py
rename to examples/research_projects/distillation/utils.py
diff --git a/examples/longform-qa/README.md b/examples/research_projects/longform-qa/README.md
similarity index 97%
rename from examples/longform-qa/README.md
rename to examples/research_projects/longform-qa/README.md
index 888d5a782..eaa29d454 100644
--- a/examples/longform-qa/README.md
+++ b/examples/research_projects/longform-qa/README.md
@@ -1,5 +1,7 @@
# Long Form Question Answering
+Author: @yjernite
+
This folder contains the code for the Long Form Question answering [demo](http://35.226.96.115:8080/) as well as methods to train and use a fully end-to-end Long Form Question Answering system using the [🤗transformers](https://github.com/huggingface/transformers) and [🤗datasets](https://github.com/huggingface/datasets) libraries.
You can use these methods to train your own system by following along with the associated [notebook](https://github.com/huggingface/notebooks/blob/master/longform-qa/Long_Form_Question_Answering_with_ELI5_and_Wikipedia.ipynb) or [blog post](https://yjernite.github.io/lfqa.html).
diff --git a/examples/longform-qa/eli5_app.py b/examples/research_projects/longform-qa/eli5_app.py
similarity index 100%
rename from examples/longform-qa/eli5_app.py
rename to examples/research_projects/longform-qa/eli5_app.py
diff --git a/examples/longform-qa/eli5_utils.py b/examples/research_projects/longform-qa/eli5_utils.py
similarity index 100%
rename from examples/longform-qa/eli5_utils.py
rename to examples/research_projects/longform-qa/eli5_utils.py
diff --git a/examples/research_projects/longform-qa/requirements.txt b/examples/research_projects/longform-qa/requirements.txt
new file mode 100644
index 000000000..a21b64d33
--- /dev/null
+++ b/examples/research_projects/longform-qa/requirements.txt
@@ -0,0 +1,4 @@
+datasets >= 1.1.3
+faiss-cpu
+streamlit
+elasticsearch
diff --git a/examples/contrib/mm-imdb/README.md b/examples/research_projects/mm-imdb/README.md
similarity index 100%
rename from examples/contrib/mm-imdb/README.md
rename to examples/research_projects/mm-imdb/README.md
diff --git a/examples/contrib/mm-imdb/run_mmimdb.py b/examples/research_projects/mm-imdb/run_mmimdb.py
similarity index 100%
rename from examples/contrib/mm-imdb/run_mmimdb.py
rename to examples/research_projects/mm-imdb/run_mmimdb.py
diff --git a/examples/contrib/mm-imdb/utils_mmimdb.py b/examples/research_projects/mm-imdb/utils_mmimdb.py
similarity index 100%
rename from examples/contrib/mm-imdb/utils_mmimdb.py
rename to examples/research_projects/mm-imdb/utils_mmimdb.py
diff --git a/examples/movement-pruning/README.md b/examples/research_projects/movement-pruning/README.md
similarity index 99%
rename from examples/movement-pruning/README.md
rename to examples/research_projects/movement-pruning/README.md
index fd6c0085e..38c11c015 100644
--- a/examples/movement-pruning/README.md
+++ b/examples/research_projects/movement-pruning/README.md
@@ -1,5 +1,7 @@
# Movement Pruning: Adaptive Sparsity by Fine-Tuning
+Author: @VictorSanh
+
*Magnitude pruning is a widely used strategy for reducing model size in pure supervised learning; however, it is less effective in the transfer learning regime that has become standard for state-of-the-art natural language processing applications. We propose the use of *movement pruning*, a simple, deterministic first-order weight pruning method that is more adaptive to pretrained model fine-tuning. Experiments show that when pruning large pretrained language models, movement pruning shows significant improvements in high-sparsity regimes. When combined with distillation, the approach achieves minimal accuracy loss with down to only 3% of the model parameters:*
| Fine-pruning+Distillation (Teacher=BERT-base fine-tuned) | BERT base fine-tuned | Remaining Weights (%) | Magnitude Pruning | L0 Regularization | Movement Pruning | Soft Movement Pruning |
diff --git a/examples/movement-pruning/Saving_PruneBERT.ipynb b/examples/research_projects/movement-pruning/Saving_PruneBERT.ipynb
similarity index 100%
rename from examples/movement-pruning/Saving_PruneBERT.ipynb
rename to examples/research_projects/movement-pruning/Saving_PruneBERT.ipynb
diff --git a/examples/movement-pruning/bertarize.py b/examples/research_projects/movement-pruning/bertarize.py
similarity index 100%
rename from examples/movement-pruning/bertarize.py
rename to examples/research_projects/movement-pruning/bertarize.py
diff --git a/examples/movement-pruning/counts_parameters.py b/examples/research_projects/movement-pruning/counts_parameters.py
similarity index 100%
rename from examples/movement-pruning/counts_parameters.py
rename to examples/research_projects/movement-pruning/counts_parameters.py
diff --git a/examples/movement-pruning/emmental/__init__.py b/examples/research_projects/movement-pruning/emmental/__init__.py
similarity index 100%
rename from examples/movement-pruning/emmental/__init__.py
rename to examples/research_projects/movement-pruning/emmental/__init__.py
diff --git a/examples/movement-pruning/emmental/configuration_bert_masked.py b/examples/research_projects/movement-pruning/emmental/configuration_bert_masked.py
similarity index 100%
rename from examples/movement-pruning/emmental/configuration_bert_masked.py
rename to examples/research_projects/movement-pruning/emmental/configuration_bert_masked.py
diff --git a/examples/movement-pruning/emmental/modeling_bert_masked.py b/examples/research_projects/movement-pruning/emmental/modeling_bert_masked.py
similarity index 100%
rename from examples/movement-pruning/emmental/modeling_bert_masked.py
rename to examples/research_projects/movement-pruning/emmental/modeling_bert_masked.py
diff --git a/examples/movement-pruning/emmental/modules/__init__.py b/examples/research_projects/movement-pruning/emmental/modules/__init__.py
similarity index 100%
rename from examples/movement-pruning/emmental/modules/__init__.py
rename to examples/research_projects/movement-pruning/emmental/modules/__init__.py
diff --git a/examples/movement-pruning/emmental/modules/binarizer.py b/examples/research_projects/movement-pruning/emmental/modules/binarizer.py
similarity index 100%
rename from examples/movement-pruning/emmental/modules/binarizer.py
rename to examples/research_projects/movement-pruning/emmental/modules/binarizer.py
diff --git a/examples/movement-pruning/emmental/modules/masked_nn.py b/examples/research_projects/movement-pruning/emmental/modules/masked_nn.py
similarity index 100%
rename from examples/movement-pruning/emmental/modules/masked_nn.py
rename to examples/research_projects/movement-pruning/emmental/modules/masked_nn.py
diff --git a/examples/lxmert/README.md b/examples/research_projects/movement-pruning/lxmert/README.md
similarity index 100%
rename from examples/lxmert/README.md
rename to examples/research_projects/movement-pruning/lxmert/README.md
diff --git a/examples/lxmert/demo.ipynb b/examples/research_projects/movement-pruning/lxmert/demo.ipynb
similarity index 100%
rename from examples/lxmert/demo.ipynb
rename to examples/research_projects/movement-pruning/lxmert/demo.ipynb
diff --git a/examples/lxmert/extracting_data.py b/examples/research_projects/movement-pruning/lxmert/extracting_data.py
similarity index 100%
rename from examples/lxmert/extracting_data.py
rename to examples/research_projects/movement-pruning/lxmert/extracting_data.py
diff --git a/examples/lxmert/modeling_frcnn.py b/examples/research_projects/movement-pruning/lxmert/modeling_frcnn.py
similarity index 100%
rename from examples/lxmert/modeling_frcnn.py
rename to examples/research_projects/movement-pruning/lxmert/modeling_frcnn.py
diff --git a/examples/lxmert/processing_image.py b/examples/research_projects/movement-pruning/lxmert/processing_image.py
similarity index 100%
rename from examples/lxmert/processing_image.py
rename to examples/research_projects/movement-pruning/lxmert/processing_image.py
diff --git a/examples/lxmert/requirements.txt b/examples/research_projects/movement-pruning/lxmert/requirements.txt
similarity index 96%
rename from examples/lxmert/requirements.txt
rename to examples/research_projects/movement-pruning/lxmert/requirements.txt
index bd7dada2d..a45405320 100644
--- a/examples/lxmert/requirements.txt
+++ b/examples/research_projects/movement-pruning/lxmert/requirements.txt
@@ -90,7 +90,7 @@ torchvision==0.7.0
tornado==6.0.4
tqdm==4.48.2
traitlets
-git+https://github.com/huggingface/transformers.git
+transformers==3.5.1
urllib3==1.25.8
wcwidth==0.2.5
webencodings==0.5.1
diff --git a/examples/lxmert/utils.py b/examples/research_projects/movement-pruning/lxmert/utils.py
similarity index 100%
rename from examples/lxmert/utils.py
rename to examples/research_projects/movement-pruning/lxmert/utils.py
diff --git a/examples/lxmert/visualizing_image.py b/examples/research_projects/movement-pruning/lxmert/visualizing_image.py
similarity index 100%
rename from examples/lxmert/visualizing_image.py
rename to examples/research_projects/movement-pruning/lxmert/visualizing_image.py
diff --git a/examples/movement-pruning/masked_run_glue.py b/examples/research_projects/movement-pruning/masked_run_glue.py
similarity index 100%
rename from examples/movement-pruning/masked_run_glue.py
rename to examples/research_projects/movement-pruning/masked_run_glue.py
diff --git a/examples/movement-pruning/masked_run_squad.py b/examples/research_projects/movement-pruning/masked_run_squad.py
similarity index 100%
rename from examples/movement-pruning/masked_run_squad.py
rename to examples/research_projects/movement-pruning/masked_run_squad.py
diff --git a/examples/movement-pruning/requirements.txt b/examples/research_projects/movement-pruning/requirements.txt
similarity index 100%
rename from examples/movement-pruning/requirements.txt
rename to examples/research_projects/movement-pruning/requirements.txt
diff --git a/examples/text-generation/pplm/README.md b/examples/research_projects/pplm/README.md
similarity index 100%
rename from examples/text-generation/pplm/README.md
rename to examples/research_projects/pplm/README.md
diff --git a/examples/text-generation/pplm/imgs/headfigure.png b/examples/research_projects/pplm/imgs/headfigure.png
similarity index 100%
rename from examples/text-generation/pplm/imgs/headfigure.png
rename to examples/research_projects/pplm/imgs/headfigure.png
diff --git a/examples/text-generation/pplm/imgs/wooly.png b/examples/research_projects/pplm/imgs/wooly.png
similarity index 100%
rename from examples/text-generation/pplm/imgs/wooly.png
rename to examples/research_projects/pplm/imgs/wooly.png
diff --git a/examples/text-generation/pplm/pplm_classification_head.py b/examples/research_projects/pplm/pplm_classification_head.py
similarity index 100%
rename from examples/text-generation/pplm/pplm_classification_head.py
rename to examples/research_projects/pplm/pplm_classification_head.py
diff --git a/examples/research_projects/pplm/requirements.txt b/examples/research_projects/pplm/requirements.txt
new file mode 100644
index 000000000..62092cc30
--- /dev/null
+++ b/examples/research_projects/pplm/requirements.txt
@@ -0,0 +1,22 @@
+tensorboard
+scikit-learn
+seqeval
+psutil
+sacrebleu
+rouge-score
+tensorflow_datasets
+pytorch-lightning==1.0.4
+matplotlib
+git-python==1.0.3
+faiss-cpu
+streamlit
+elasticsearch
+nltk
+pandas
+datasets >= 1.1.3
+fire
+pytest
+conllu
+sentencepiece != 0.1.92
+protobuf
+transformers==3.5.1
diff --git a/examples/text-generation/pplm/run_pplm.py b/examples/research_projects/pplm/run_pplm.py
similarity index 100%
rename from examples/text-generation/pplm/run_pplm.py
rename to examples/research_projects/pplm/run_pplm.py
diff --git a/examples/text-generation/pplm/run_pplm_discrim_train.py b/examples/research_projects/pplm/run_pplm_discrim_train.py
similarity index 100%
rename from examples/text-generation/pplm/run_pplm_discrim_train.py
rename to examples/research_projects/pplm/run_pplm_discrim_train.py
diff --git a/examples/rag/README.md b/examples/research_projects/rag/README.md
similarity index 99%
rename from examples/rag/README.md
rename to examples/research_projects/rag/README.md
index 38e2071f2..12da66fa7 100644
--- a/examples/rag/README.md
+++ b/examples/research_projects/rag/README.md
@@ -1,4 +1,7 @@
# Intro
+
+Authors: @patrickvonplaten and @lhoestq
+
Aimed at tackling the knowledge-intensive NLP tasks (think tasks a human wouldn't be expected to solve without access to external knowledge sources), RAG models are seq2seq models with access to a retrieval mechanism providing relevant context documents at training and evaluation time.
A RAG model encapsulates two core components: a question encoder and a generator.
diff --git a/examples/rag/__init__.py b/examples/research_projects/rag/__init__.py
similarity index 100%
rename from examples/rag/__init__.py
rename to examples/research_projects/rag/__init__.py
diff --git a/examples/rag/test_finetune_rag.py b/examples/research_projects/rag/_test_finetune_rag.py
similarity index 100%
rename from examples/rag/test_finetune_rag.py
rename to examples/research_projects/rag/_test_finetune_rag.py
diff --git a/examples/rag/callbacks_rag.py b/examples/research_projects/rag/callbacks_rag.py
similarity index 100%
rename from examples/rag/callbacks_rag.py
rename to examples/research_projects/rag/callbacks_rag.py
diff --git a/examples/rag/consolidate_rag_checkpoint.py b/examples/research_projects/rag/consolidate_rag_checkpoint.py
similarity index 100%
rename from examples/rag/consolidate_rag_checkpoint.py
rename to examples/research_projects/rag/consolidate_rag_checkpoint.py
diff --git a/examples/rag/distributed_retriever.py b/examples/research_projects/rag/distributed_retriever.py
similarity index 100%
rename from examples/rag/distributed_retriever.py
rename to examples/research_projects/rag/distributed_retriever.py
diff --git a/examples/rag/eval_rag.py b/examples/research_projects/rag/eval_rag.py
similarity index 99%
rename from examples/rag/eval_rag.py
rename to examples/research_projects/rag/eval_rag.py
index cf858e6b3..d479537ff 100644
--- a/examples/rag/eval_rag.py
+++ b/examples/research_projects/rag/eval_rag.py
@@ -15,7 +15,7 @@ from transformers import logging as transformers_logging
sys.path.append(os.path.join(os.getcwd())) # noqa: E402 # isort:skip
-from utils import exact_match_score, f1_score # noqa: E402 # isort:skip
+from utils_rag import exact_match_score, f1_score # noqa: E402 # isort:skip
logger = logging.getLogger(__name__)
diff --git a/examples/rag/finetune_rag.py b/examples/research_projects/rag/finetune_rag.py
similarity index 100%
rename from examples/rag/finetune_rag.py
rename to examples/research_projects/rag/finetune_rag.py
diff --git a/examples/rag/finetune_rag.sh b/examples/research_projects/rag/finetune_rag.sh
similarity index 100%
rename from examples/rag/finetune_rag.sh
rename to examples/research_projects/rag/finetune_rag.sh
diff --git a/examples/research_projects/rag/lightning_base.py b/examples/research_projects/rag/lightning_base.py
new file mode 100644
index 000000000..a9a05fbf9
--- /dev/null
+++ b/examples/research_projects/rag/lightning_base.py
@@ -0,0 +1,391 @@
+import argparse
+import logging
+import os
+from pathlib import Path
+from typing import Any, Dict
+
+import pytorch_lightning as pl
+from pytorch_lightning.utilities import rank_zero_info
+
+from transformers import (
+ AdamW,
+ AutoConfig,
+ AutoModel,
+ AutoModelForPreTraining,
+ AutoModelForQuestionAnswering,
+ AutoModelForSeq2SeqLM,
+ AutoModelForSequenceClassification,
+ AutoModelForTokenClassification,
+ AutoModelWithLMHead,
+ AutoTokenizer,
+ PretrainedConfig,
+ PreTrainedTokenizer,
+)
+from transformers.optimization import (
+ Adafactor,
+ get_cosine_schedule_with_warmup,
+ get_cosine_with_hard_restarts_schedule_with_warmup,
+ get_linear_schedule_with_warmup,
+ get_polynomial_decay_schedule_with_warmup,
+)
+from transformers.utils.versions import require_version_examples
+
+
+logger = logging.getLogger(__name__)
+
+require_version_examples("pytorch_lightning>=1.0.4")
+
+MODEL_MODES = {
+ "base": AutoModel,
+ "sequence-classification": AutoModelForSequenceClassification,
+ "question-answering": AutoModelForQuestionAnswering,
+ "pretraining": AutoModelForPreTraining,
+ "token-classification": AutoModelForTokenClassification,
+ "language-modeling": AutoModelWithLMHead,
+ "summarization": AutoModelForSeq2SeqLM,
+ "translation": AutoModelForSeq2SeqLM,
+}
+
+
+# update this and the import above to support new schedulers from transformers.optimization
+arg_to_scheduler = {
+ "linear": get_linear_schedule_with_warmup,
+ "cosine": get_cosine_schedule_with_warmup,
+ "cosine_w_restarts": get_cosine_with_hard_restarts_schedule_with_warmup,
+ "polynomial": get_polynomial_decay_schedule_with_warmup,
+ # '': get_constant_schedule, # not supported for now
+ # '': get_constant_schedule_with_warmup, # not supported for now
+}
+arg_to_scheduler_choices = sorted(arg_to_scheduler.keys())
+arg_to_scheduler_metavar = "{" + ", ".join(arg_to_scheduler_choices) + "}"
+
+
+class BaseTransformer(pl.LightningModule):
+ def __init__(
+ self,
+ hparams: argparse.Namespace,
+ num_labels=None,
+ mode="base",
+ config=None,
+ tokenizer=None,
+ model=None,
+ **config_kwargs
+ ):
+ """Initialize a model, tokenizer and config."""
+ super().__init__()
+ # TODO: move to self.save_hyperparameters()
+ # self.save_hyperparameters()
+ # can also expand arguments into trainer signature for easier reading
+
+ self.save_hyperparameters(hparams)
+ self.step_count = 0
+ self.output_dir = Path(self.hparams.output_dir)
+ cache_dir = self.hparams.cache_dir if self.hparams.cache_dir else None
+ if config is None:
+ self.config = AutoConfig.from_pretrained(
+ self.hparams.config_name if self.hparams.config_name else self.hparams.model_name_or_path,
+ **({"num_labels": num_labels} if num_labels is not None else {}),
+ cache_dir=cache_dir,
+ **config_kwargs,
+ )
+ else:
+ self.config: PretrainedConfig = config
+
+ extra_model_params = ("encoder_layerdrop", "decoder_layerdrop", "dropout", "attention_dropout")
+ for p in extra_model_params:
+ if getattr(self.hparams, p, None):
+ assert hasattr(self.config, p), f"model config doesn't have a `{p}` attribute"
+ setattr(self.config, p, getattr(self.hparams, p))
+
+ if tokenizer is None:
+ self.tokenizer = AutoTokenizer.from_pretrained(
+ self.hparams.tokenizer_name if self.hparams.tokenizer_name else self.hparams.model_name_or_path,
+ cache_dir=cache_dir,
+ )
+ else:
+ self.tokenizer: PreTrainedTokenizer = tokenizer
+ self.model_type = MODEL_MODES[mode]
+ if model is None:
+ self.model = self.model_type.from_pretrained(
+ self.hparams.model_name_or_path,
+ from_tf=bool(".ckpt" in self.hparams.model_name_or_path),
+ config=self.config,
+ cache_dir=cache_dir,
+ )
+ else:
+ self.model = model
+
+ def load_hf_checkpoint(self, *args, **kwargs):
+ self.model = self.model_type.from_pretrained(*args, **kwargs)
+
+ def get_lr_scheduler(self):
+ get_schedule_func = arg_to_scheduler[self.hparams.lr_scheduler]
+ scheduler = get_schedule_func(
+ self.opt, num_warmup_steps=self.hparams.warmup_steps, num_training_steps=self.total_steps()
+ )
+ scheduler = {"scheduler": scheduler, "interval": "step", "frequency": 1}
+ return scheduler
+
+ def configure_optimizers(self):
+ """Prepare optimizer and schedule (linear warmup and decay)"""
+ model = self.model
+ no_decay = ["bias", "LayerNorm.weight"]
+ optimizer_grouped_parameters = [
+ {
+ "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
+ "weight_decay": self.hparams.weight_decay,
+ },
+ {
+ "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
+ "weight_decay": 0.0,
+ },
+ ]
+ if self.hparams.adafactor:
+ optimizer = Adafactor(
+ optimizer_grouped_parameters, lr=self.hparams.learning_rate, scale_parameter=False, relative_step=False
+ )
+
+ else:
+ optimizer = AdamW(
+ optimizer_grouped_parameters, lr=self.hparams.learning_rate, eps=self.hparams.adam_epsilon
+ )
+ self.opt = optimizer
+
+ scheduler = self.get_lr_scheduler()
+
+ return [optimizer], [scheduler]
+
+ def test_step(self, batch, batch_nb):
+ return self.validation_step(batch, batch_nb)
+
+ def test_epoch_end(self, outputs):
+ return self.validation_end(outputs)
+
+ def total_steps(self) -> int:
+ """The number of total training steps that will be run. Used for lr scheduler purposes."""
+ num_devices = max(1, self.hparams.gpus) # TODO: consider num_tpu_cores
+ effective_batch_size = self.hparams.train_batch_size * self.hparams.accumulate_grad_batches * num_devices
+ return (self.dataset_size / effective_batch_size) * self.hparams.max_epochs
+
+ def setup(self, mode):
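+        # cache the dataset size so total_steps() can compute the length of the LR schedule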
+ if mode == "test":
+ self.dataset_size = len(self.test_dataloader().dataset)
+ else:
+ self.train_loader = self.get_dataloader("train", self.hparams.train_batch_size, shuffle=True)
+ self.dataset_size = len(self.train_dataloader().dataset)
+
+ def get_dataloader(self, type_path: str, batch_size: int, shuffle: bool = False):
+ raise NotImplementedError("You must implement this for your task")
+
+ def train_dataloader(self):
+ return self.train_loader
+
+ def val_dataloader(self):
+ return self.get_dataloader("dev", self.hparams.eval_batch_size, shuffle=False)
+
+ def test_dataloader(self):
+ return self.get_dataloader("test", self.hparams.eval_batch_size, shuffle=False)
+
+ def _feature_file(self, mode):
+ return os.path.join(
+ self.hparams.data_dir,
+ "cached_{}_{}_{}".format(
+ mode,
+ list(filter(None, self.hparams.model_name_or_path.split("/"))).pop(),
+ str(self.hparams.max_seq_length),
+ ),
+ )
+
+ @pl.utilities.rank_zero_only
+ def on_save_checkpoint(self, checkpoint: Dict[str, Any]) -> None:
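+        # save a transformers-format copy of the model and tokenizer ("best_tfmr") alongside the PL .ckpt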
+ save_path = self.output_dir.joinpath("best_tfmr")
+ self.model.config.save_step = self.step_count
+ self.model.save_pretrained(save_path)
+ self.tokenizer.save_pretrained(save_path)
+
+ @staticmethod
+ def add_model_specific_args(parser, root_dir):
+ parser.add_argument(
+ "--model_name_or_path",
+ default=None,
+ type=str,
+ required=True,
+ help="Path to pretrained model or model identifier from huggingface.co/models",
+ )
+ parser.add_argument(
+ "--config_name", default="", type=str, help="Pretrained config name or path if not the same as model_name"
+ )
+ parser.add_argument(
+ "--tokenizer_name",
+ default=None,
+ type=str,
+ help="Pretrained tokenizer name or path if not the same as model_name",
+ )
+ parser.add_argument(
+ "--cache_dir",
+ default="",
+ type=str,
+ help="Where do you want to store the pre-trained models downloaded from huggingface.co",
+ )
+ parser.add_argument(
+ "--encoder_layerdrop",
+ type=float,
+ help="Encoder layer dropout probability (Optional). Goes into model.config",
+ )
+ parser.add_argument(
+ "--decoder_layerdrop",
+ type=float,
+ help="Decoder layer dropout probability (Optional). Goes into model.config",
+ )
+ parser.add_argument(
+ "--dropout",
+ type=float,
+ help="Dropout probability (Optional). Goes into model.config",
+ )
+ parser.add_argument(
+ "--attention_dropout",
+ type=float,
+ help="Attention dropout probability (Optional). Goes into model.config",
+ )
+ parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.")
+ parser.add_argument(
+ "--lr_scheduler",
+ default="linear",
+ choices=arg_to_scheduler_choices,
+ metavar=arg_to_scheduler_metavar,
+ type=str,
+ help="Learning rate scheduler",
+ )
+ parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.")
+ parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.")
+ parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.")
+ parser.add_argument("--num_workers", default=4, type=int, help="kwarg passed to DataLoader")
+ parser.add_argument("--num_train_epochs", dest="max_epochs", default=3, type=int)
+ parser.add_argument("--train_batch_size", default=32, type=int)
+ parser.add_argument("--eval_batch_size", default=32, type=int)
+ parser.add_argument("--adafactor", action="store_true")
+
+
+class LoggingCallback(pl.Callback):
+ def on_batch_end(self, trainer, pl_module):
+ lr_scheduler = trainer.lr_schedulers[0]["scheduler"]
+ lrs = {f"lr_group_{i}": lr for i, lr in enumerate(lr_scheduler.get_lr())}
+ pl_module.logger.log_metrics(lrs)
+
+ def on_validation_end(self, trainer: pl.Trainer, pl_module: pl.LightningModule):
+ rank_zero_info("***** Validation results *****")
+ metrics = trainer.callback_metrics
+ # Log results
+ for key in sorted(metrics):
+ if key not in ["log", "progress_bar"]:
+ rank_zero_info("{} = {}\n".format(key, str(metrics[key])))
+
+ def on_test_end(self, trainer: pl.Trainer, pl_module: pl.LightningModule):
+ rank_zero_info("***** Test results *****")
+ metrics = trainer.callback_metrics
+ # Log and save results to file
+ output_test_results_file = os.path.join(pl_module.hparams.output_dir, "test_results.txt")
+ with open(output_test_results_file, "w") as writer:
+ for key in sorted(metrics):
+ if key not in ["log", "progress_bar"]:
+ rank_zero_info("{} = {}\n".format(key, str(metrics[key])))
+ writer.write("{} = {}\n".format(key, str(metrics[key])))
+
+
+def add_generic_args(parser, root_dir) -> None:
+ # To allow all pl args uncomment the following line
+ # parser = pl.Trainer.add_argparse_args(parser)
+ parser.add_argument(
+ "--output_dir",
+ default=None,
+ type=str,
+ required=True,
+ help="The output directory where the model predictions and checkpoints will be written.",
+ )
+ parser.add_argument(
+ "--fp16",
+ action="store_true",
+ help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit",
+ )
+
+ parser.add_argument(
+ "--fp16_opt_level",
+ type=str,
+ default="O2",
+ help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
+ "See details at https://nvidia.github.io/apex/amp.html",
+ )
+ parser.add_argument("--n_tpu_cores", dest="tpu_cores", type=int)
+ parser.add_argument("--max_grad_norm", dest="gradient_clip_val", default=1.0, type=float, help="Max gradient norm")
+ parser.add_argument("--do_train", action="store_true", help="Whether to run training.")
+ parser.add_argument("--do_predict", action="store_true", help="Whether to run predictions on the test set.")
+ parser.add_argument(
+ "--gradient_accumulation_steps",
+ dest="accumulate_grad_batches",
+ type=int,
+ default=1,
+ help="Number of updates steps to accumulate before performing a backward/update pass.",
+ )
+ parser.add_argument("--seed", type=int, default=42, help="random seed for initialization")
+ parser.add_argument(
+ "--data_dir",
+ default=None,
+ type=str,
+ required=True,
+ help="The input data dir. Should contain the training files for the CoNLL-2003 NER task.",
+ )
+
+
+def generic_train(
+ model: BaseTransformer,
+ args: argparse.Namespace,
+ early_stopping_callback=None,
+ logger=True, # can pass WandbLogger() here
+ extra_callbacks=[],
+ checkpoint_callback=None,
+ logging_callback=None,
+ **extra_train_kwargs
+):
+ pl.seed_everything(args.seed)
+
+ # init model
+ odir = Path(model.hparams.output_dir)
+ odir.mkdir(exist_ok=True)
+
+ # add custom checkpoints
+ if checkpoint_callback is None:
+ checkpoint_callback = pl.callbacks.ModelCheckpoint(
+ filepath=args.output_dir, prefix="checkpoint", monitor="val_loss", mode="min", save_top_k=1
+ )
+ if early_stopping_callback:
+ extra_callbacks.append(early_stopping_callback)
+ if logging_callback is None:
+ logging_callback = LoggingCallback()
+
+ train_params = {}
+
+ # TODO: remove with PyTorch 1.6 since pl uses native amp
+ if args.fp16:
+ train_params["precision"] = 16
+ train_params["amp_level"] = args.fp16_opt_level
+
+ if args.gpus > 1:
+ train_params["distributed_backend"] = "ddp"
+
+ train_params["accumulate_grad_batches"] = args.accumulate_grad_batches
+ train_params["accelerator"] = extra_train_kwargs.get("accelerator", None)
+ train_params["profiler"] = extra_train_kwargs.get("profiler", None)
+
+ trainer = pl.Trainer.from_argparse_args(
+ args,
+ weights_summary=None,
+ callbacks=[logging_callback] + extra_callbacks,
+ logger=logger,
+ checkpoint_callback=checkpoint_callback,
+ **train_params,
+ )
+
+ if args.do_train:
+ trainer.fit(model)
+
+ return trainer
diff --git a/examples/rag/parse_dpr_relevance_data.py b/examples/research_projects/rag/parse_dpr_relevance_data.py
similarity index 100%
rename from examples/rag/parse_dpr_relevance_data.py
rename to examples/research_projects/rag/parse_dpr_relevance_data.py
diff --git a/examples/rag/requirements.txt b/examples/research_projects/rag/requirements.txt
similarity index 50%
rename from examples/rag/requirements.txt
rename to examples/research_projects/rag/requirements.txt
index 9f754bf2b..8bed6ba90 100644
--- a/examples/rag/requirements.txt
+++ b/examples/research_projects/rag/requirements.txt
@@ -1,4 +1,6 @@
faiss-cpu >= 1.6.3
datasets >= 1.0.1
psutil >= 5.7.0
-torch >= 1.4.0
\ No newline at end of file
+torch >= 1.4.0
+transformers
+pytorch-lightning==1.0.4
diff --git a/examples/rag/test_data/my_knowledge_dataset.csv b/examples/research_projects/rag/test_data/my_knowledge_dataset.csv
similarity index 100%
rename from examples/rag/test_data/my_knowledge_dataset.csv
rename to examples/research_projects/rag/test_data/my_knowledge_dataset.csv
diff --git a/examples/rag/test_distributed_retriever.py b/examples/research_projects/rag/test_distributed_retriever.py
similarity index 100%
rename from examples/rag/test_distributed_retriever.py
rename to examples/research_projects/rag/test_distributed_retriever.py
diff --git a/examples/rag/use_own_knowledge_dataset.py b/examples/research_projects/rag/use_own_knowledge_dataset.py
similarity index 100%
rename from examples/rag/use_own_knowledge_dataset.py
rename to examples/research_projects/rag/use_own_knowledge_dataset.py
diff --git a/examples/rag/utils_rag.py b/examples/research_projects/rag/utils_rag.py
similarity index 100%
rename from examples/rag/utils_rag.py
rename to examples/research_projects/rag/utils_rag.py
diff --git a/examples/research_projects/seq2seq-distillation/README.md b/examples/research_projects/seq2seq-distillation/README.md
new file mode 100644
index 000000000..8157f753f
--- /dev/null
+++ b/examples/research_projects/seq2seq-distillation/README.md
@@ -0,0 +1,430 @@
+## Sequence to Sequence Training and Evaluation
+
+This directory contains examples for finetuning and evaluating transformers on summarization and translation tasks.
+
+Author: Sam Shleifer (https://github.com/sshleifer)
+
+### Supported Architectures
+
+- `BartForConditionalGeneration` (and anything that inherits from it)
+- `MarianMTModel`
+- `PegasusForConditionalGeneration`
+- `MBartForConditionalGeneration`
+- `FSMTForConditionalGeneration`
+- `T5ForConditionalGeneration`
+
+## Datasets
+
+#### XSUM
+
+```bash
+cd examples/contrib/pytorch-lightning/seq2seq
+wget https://cdn-datasets.huggingface.co/summarization/xsum.tar.gz
+tar -xzvf xsum.tar.gz
+export XSUM_DIR=${PWD}/xsum
+```
+This should create a directory called `xsum/` with files like `test.source`.
+To use your own data, copy that file format. Each article to be summarized is on its own line.
+
+#### CNN/DailyMail
+
+```bash
+cd examples/contrib/pytorch-lightning/seq2seq
+wget https://cdn-datasets.huggingface.co/summarization/cnn_dm_v2.tgz
+tar -xzvf cnn_dm_v2.tgz # empty lines removed
+mv cnn_cln cnn_dm
+export CNN_DIR=${PWD}/cnn_dm
+```
+This should create a directory called `cnn_dm/` with 6 files.
+
+#### WMT16 English-Romanian Translation Data
+
+Download with this command:
+```bash
+wget https://cdn-datasets.huggingface.co/translation/wmt_en_ro.tar.gz
+tar -xzvf wmt_en_ro.tar.gz
+export ENRO_DIR=${PWD}/wmt_en_ro
+```
+This should create a directory called `wmt_en_ro/` with 6 files.
+
+#### WMT English-German
+
+```bash
+wget https://cdn-datasets.huggingface.co/translation/wmt_en_de.tgz
+tar -xzvf wmt_en_de.tgz
+export DATA_DIR=${PWD}/wmt_en_de
+```
+
+#### FSMT datasets (wmt)
+
+Refer to the scripts starting with `eval_` under:
+https://github.com/huggingface/transformers/tree/master/scripts/fsmt
+
+#### Pegasus (multiple datasets)
+
+Multiple eval datasets are available for download from:
+https://github.com/stas00/porting/tree/master/datasets/pegasus
+
+
+#### Your Data
+
+If you are using your own data, it must be formatted as one directory with 6 files:
+```
+train.source
+train.target
+val.source
+val.target
+test.source
+test.target
+```
+The `.source` files are the input, the `.target` files are the desired output.
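+
+As a sanity check, here is a minimal sketch (with made-up toy data and a hypothetical `my_dataset` directory) that writes a tiny dataset in this layout:
+
+```python
+from pathlib import Path
+
+# purely illustrative toy articles and summaries
+articles = ["Sam ate lunch today.", "Sams lunch ingredients."]
+summaries = ["What Sam ate for lunch.", "A list of ingredients."]
+
+data_dir = Path("my_dataset")
+data_dir.mkdir(exist_ok=True)
+for split in ["train", "val", "test"]:
+    # one example per line; .source is the input, .target is the desired output
+    (data_dir / f"{split}.source").write_text("\n".join(articles) + "\n")
+    (data_dir / f"{split}.target").write_text("\n".join(summaries) + "\n")
+```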
+
+### Potential issues
+
+- native AMP (`--fp16` and no apex) may lead to a huge memory leak and require 10x gpu memory. This has been fixed in pytorch-nightly and the minimal official version to have this fix will be pytorch-1.8. Until then if you have to use mixed precision please use AMP only with pytorch-nightly or NVIDIA's apex. Reference: https://github.com/huggingface/transformers/issues/8403
+
+
+### Tips and Tricks
+
+General Tips:
+- since you need to run from this folder, and likely need to modify code, the easiest workflow is to fork transformers, clone your fork, and run `pip install -e .` before you get started.
+- try `--freeze_encoder` or `--freeze_embeds` for faster training/larger batch size. (3hr per epoch with bs=8, see the "xsum_shared_task" command below)
+- `fp16_opt_level=O1` (the default works best).
+- In addition to the pytorch-lightning .ckpt checkpoint, a transformers checkpoint will be saved.
+Load it with `BartForConditionalGeneration.from_pretrained(f'{output_dir}/best_tfmr')`.
+- At the moment, `--do_predict` does not work in a multi-gpu setting. You need to use `evaluate_checkpoint` or the `run_eval.py` code.
+- This warning can be safely ignored:
+ > "Some weights of BartForConditionalGeneration were not initialized from the model checkpoint at facebook/bart-large-xsum and are newly initialized: ['final_logits_bias']"
+- Both finetuning and eval are 30% faster with `--fp16`. For that you need to [install apex](https://github.com/NVIDIA/apex#quick-start).
+- Read scripts before you run them!
+
+Summarization Tips:
+- (summ) 1 epoch at batch size 1 for bart-large takes 24 hours and requires 13GB GPU RAM with fp16 on an NVIDIA-V100.
+- If you want to run experiments on improving the summarization finetuning process, try the XSUM Shared Task (below). It's faster to train than CNNDM because the summaries are shorter.
+- For CNN/DailyMail, the default `val_max_target_length` and `test_max_target_length` will truncate the ground truth labels, resulting in slightly higher rouge scores. To get accurate rouge scores, you should rerun `calculate_rouge` on the `{output_dir}/test_generations.txt` file saved by `trainer.test()` (see the sketch after these tips).
+- `--max_target_length=60 --val_max_target_length=60 --test_max_target_length=100 ` is a reasonable setting for XSUM.
+- `wandb` can be used by specifying `--logger_name wandb`. It is useful for reproducibility. Specify the environment variable `WANDB_PROJECT='hf_xsum'` to do the XSUM shared task.
+- If you are finetuning on your own dataset, start from `distilbart-cnn-12-6` if you want long summaries and `distilbart-xsum-12-6` if you want short summaries.
+(It rarely makes sense to start from `bart-large` unless you are researching finetuning methods.)
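+
+To recompute ROUGE yourself, here is a minimal sketch; it assumes `calculate_rouge` from this folder's `utils.py` accepts a list of predictions and a list of references, and the file paths shown are hypothetical:
+
+```python
+from utils import calculate_rouge  # utils.py in this folder
+
+# hypothetical paths: generations saved by trainer.test() and the reference targets
+preds = [line.rstrip("\n") for line in open("cnn_dm_results/test_generations.txt")]
+refs = [line.rstrip("\n") for line in open("cnn_dm/test.target")]
+print(calculate_rouge(preds, refs))  # dict of ROUGE scores, e.g. rouge1 / rouge2 / rougeL
+```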
+
+**Update 2018-07-18**
+Datasets: `LegacySeq2SeqDataset` will be used for all tokenizers without a `prepare_seq2seq_batch` method. Otherwise, `Seq2SeqDataset` will be used.
+Future work/help wanted: A new dataset to support multilingual tasks.
+
+
+### Finetuning Scripts
+All finetuning bash scripts call finetune.py (or distillation.py) with reasonable command line arguments. They usually require extra command line arguments to work.
+
+To see all the possible command line options, run:
+
+```bash
+./finetune.py --help
+```
+
+### Finetuning Training Params
+
+To override the pretrained model's training params, you can pass them to `./finetune.sh`:
+
+```bash
+./finetune.sh \
+ [...]
+ --encoder_layerdrop 0.1 \
+ --decoder_layerdrop 0.1 \
+ --dropout 0.1 \
+ --attention_dropout 0.1 \
+```
+
+### Summarization Finetuning
+Run/modify `finetune.sh`
+
+The following command should work on a 16GB GPU:
+```bash
+./finetune.sh \
+ --data_dir $XSUM_DIR \
+ --train_batch_size=1 \
+ --eval_batch_size=1 \
+ --output_dir=xsum_results \
+ --num_train_epochs 6 \
+ --model_name_or_path facebook/bart-large
+```
+
+There is a starter finetuning script for pegasus at `finetune_pegasus_xsum.sh`.
+
+### Translation Finetuning
+
+First, follow the wmt_en_ro download instructions.
+Then you can finetune mbart_cc25 on english-romanian with the following command.
+**Recommendation:** Read and potentially modify the fairly opinionated defaults in the `train_mbart_cc25_enro.sh` script before running it.
+
+Best performing command:
+```bash
+# optionally
+export ENRO_DIR='wmt_en_ro' # Download instructions above
+# export WANDB_PROJECT="MT" # optional
+export MAX_LEN=128
+export BS=4
+./train_mbart_cc25_enro.sh --output_dir enro_finetune_baseline --label_smoothing 0.1 --fp16_opt_level=O1 --logger_name wandb --sortish_sampler
+```
+This should take < 6h/epoch on a 16GB V100 and achieve a test BLEU above 26.
+To get results in line with fairseq, you need to do some postprocessing (see `romanian_postprocessing.md`).
+
+Multi-GPU command
+(using 8 GPUs as an example)
+```bash
+export ENRO_DIR='wmt_en_ro' # Download instructions above
+ # export WANDB_PROJECT="MT" # optional
+export MAX_LEN=128
+export BS=4
+./train_mbart_cc25_enro.sh --output_dir enro_finetune_baseline --gpus 8 --logger_name wandb
+```
+### Finetuning Outputs
+As you train, `output_dir` will be filled with files that look something like this (comments are mine).
+Some of them are metrics, some of them are checkpoints, some of them are metadata. Here is a quick tour:
+
+```bash
+output_dir
+├── best_tfmr # this is a huggingface checkpoint generated by save_pretrained. It is the same model as the PL .ckpt file below
+│  ├── config.json
+│  ├── merges.txt
+│  ├── pytorch_model.bin
+│  ├── special_tokens_map.json
+│  ├── tokenizer_config.json
+│  └── vocab.json
+├── git_log.json # repo, branch, and commit hash
+├── val_avg_rouge2=0.1984-step_count=11.ckpt # this is a pytorch lightning checkpoint associated with the best val score. (it will be called BLEU for MT)
+├── metrics.json # new validation metrics will continually be appended to this
+├── student # this is a huggingface checkpoint generated by SummarizationDistiller. It is the student before it gets finetuned.
+│  ├── config.json
+│  └── pytorch_model.bin
+├── test_generations.txt
+# ^^ are the summaries or translations produced by your best checkpoint on the test data. Populated when training is done
+├── test_results.txt # a convenience file with the test set metrics. This data is also in metrics.json['test']
+├── hparams.pkl # the command line args passed after some light preprocessing. Should be saved fairly quickly.
+```
+After training, you can recover the best checkpoint by running
+```python
+from transformers import AutoModelForSeq2SeqLM
+model = AutoModelForSeq2SeqLM.from_pretrained(f'{output_dir}/best_tfmr')
+```
+
+### Converting pytorch-lightning checkpoints
+PyTorch Lightning `--do_predict` often fails; after you are done training, the best way to evaluate your model is to convert it.
+
+This should be done for you, with a file called `{save_dir}/best_tfmr`.
+
+If that file doesn't exist but you have a lightning `.ckpt` file, you can run
+```bash
+python convert_pl_checkpoint_to_hf.py PATH_TO_CKPT randomly_initialized_hf_model_path save_dir/best_tfmr
+```
+Then run either `run_eval` or `run_distributed_eval` with `save_dir/best_tfmr` (see previous sections).
+
+
+# Experimental Features
+These features are harder to use and not always useful.
+
+### Dynamic Batch Size for MT
+`finetune.py` has a command line arg `--max_tokens_per_batch` that allows batches to be dynamically sized.
+This feature can only be used:
+- with fairseq installed
+- on 1 GPU
+- without sortish sampler
+- after calling `./save_len_file.py $tok $data_dir`
+
+For example,
+```bash
+./save_len_file.py Helsinki-NLP/opus-mt-en-ro wmt_en_ro
+./dynamic_bs_example.sh --max_tokens_per_batch=2000 --output_dir benchmark_dynamic_bs
+```
+splits `wmt_en_ro/train` into 11,197 unevenly sized batches and can finish 1 epoch in 8 minutes on a V100.
+
+For comparison,
+```bash
+./dynamic_bs_example.sh --sortish_sampler --train_batch_size 48
+```
+uses 12,723 batches of length 48 and takes slightly longer, 9.5 minutes.
+
+The feature is still experimental, because:
++ we can make it much more robust if we have memory mapped/preprocessed datasets.
++ The speedup over sortish sampler is not that large at the moment.
+
+# DistilBART
+
+This section describes all code and artifacts from our [Paper](http://arxiv.org/abs/2010.13002)
+
+![DBART](https://huggingface.co/front/thumbnails/distilbart_large.png)
+
++ For the CNN/DailyMail dataset (relatively longer, more extractive summaries), we found a simple technique that works, which we call "Shrink and Fine-tune", or SFT.
+You just copy alternating layers from `facebook/bart-large-cnn` and fine-tune further on the CNN/DM data. `sshleifer/distill-pegasus-cnn-16-4`, `sshleifer/distilbart-cnn-12-6` and all other checkpoints under `sshleifer` that start with `distilbart-cnn` were trained this way.
++ For the XSUM dataset, training on pseudo-labels worked best for Pegasus (`sshleifer/distill-pegasus-16-4`), while training with KD worked best for `distilbart-xsum-12-6`.
++ For `sshleifer/dbart-xsum-12-3`
++ We ran 100s experiments, and didn't want to document 100s of commands. If you want a command to replicate a figure from the paper that is not documented below, feel free to ask on the [forums](https://discuss.huggingface.co/t/seq2seq-distillation-methodology-questions/1270) and tag `@sshleifer`.
++ You can see the performance tradeoffs of model sizes [here](https://docs.google.com/spreadsheets/d/1EkhDMwVO02m8jCD1cG3RoFPLicpcL1GQHTQjfvDYgIM/edit#gid=0).
+and more granular timing results [here](https://docs.google.com/spreadsheets/d/1EkhDMwVO02m8jCD1cG3RoFPLicpcL1GQHTQjfvDYgIM/edit#gid=1753259047&range=B2:I23).
+
+### Evaluation
+
+Use [run_distributed_eval](./run_distributed_eval.py), with the following convenient alias:
+```bash
+deval () {
+ proc=$1
+ m=$2
+ dd=$3
+ sd=$4
+ shift
+ shift
+ shift
+ shift
+ python -m torch.distributed.launch --nproc_per_node=$proc run_distributed_eval.py \
+ --model_name $m --save_dir $sd --data_dir $dd $@
+}
+```
+On a 1 GPU system, here are four commands (they assume `xsum` and `cnn_dm` are downloaded; cmd-F for those links in this file).
+
+`distilBART`:
+```bash
+deval 1 sshleifer/distilbart-xsum-12-3 xsum dbart_12_3_xsum_eval --fp16 # --help for more choices.
+deval 1 sshleifer/distilbart-cnn_dm-12-6 cnn_dm dbart_12_6_cnn_eval --fp16
+```
+
+`distill-pegasus`:
+```bash
+deval 1 sshleifer/distill-pegasus-cnn-16-4 cnn_dm dpx_cnn_eval
+deval 1 sshleifer/distill-pegasus-xsum-16-4 xsum dpx_xsum_eval
+```
+
+### Distillation
++ For all of the following commands, you can get roughly equivalent results and faster run times by passing `--num_beams=4`. That's not what we did for the paper.
++ Besides the KD section, you can also run commands with the built-in transformers trainer. See, for example, [builtin_trainer/train_distilbart_cnn.sh](./builtin_trainer/train_distilbart_cnn.sh).
++ Large performance deviations (> 5X slower or more than 0.5 ROUGE-2 worse) should be reported.
++ Multi-GPU (controlled with `--gpus`) should work, but might require more epochs.
+
+#### Recommended Workflow
++ Get your dataset in the right format. (see 6 files above).
++ Find a teacher model: [Pegasus](https://huggingface.co/models?search=pegasus) (slower, better ROUGE) or `facebook/bart-large-xsum`/`facebook/bart-large-cnn` (faster, slightly lower ROUGE).
+Choose the checkpoint whose corresponding dataset is most similar (or identical) to your dataset.
++ Follow the sections in order below. You can stop after SFT if you are satisfied, or move on to pseudo-labeling if you want more performance.
++ Student size: if you want a close-to-free 50% speedup, cut the decoder in half. If you want a larger speedup, cut it to a quarter of its size.
++ If your SFT run starts at a validation ROUGE-2 that is more than 10 pts below the teacher's validation ROUGE-2, you have a bug. Switching to a more expensive technique will not help. Try setting a breakpoint and looking at generation and truncation defaults/hyper-parameters, and share your experience on the forums!
+
+
+#### Initialization
+We use [make_student.py](./make_student.py) to copy alternating layers from the teacher and save the resulting model to disk:
+```bash
+python make_student.py facebook/bart-large-xsum --save_path dbart_xsum_12_3 -e 12 -d 3
+```
+or for `pegasus-xsum`
+```bash
+python make_student.py google/pegasus-xsum --save_path dpx_xsum_16_4 --e 16 --d 4
+```
+We now have an initialized student saved to `dbart_xsum_12_3`, which we will use for the following commands.
++ Extension: To replicate the more complicated initialization experiments in Section 6.1, or to try your own, use the `create_student_by_copying_alternating_layers` function (a sketch follows below).
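+
+A minimal sketch of calling it directly, assuming `create_student_by_copying_alternating_layers` in this folder's `make_student.py` takes the teacher name plus `save_path`/`e`/`d` keywords and returns the student along with the copied layer indices:
+
+```python
+from make_student import create_student_by_copying_alternating_layers
+
+# copy 12 encoder layers and 3 decoder layers from the teacher into a new student
+student, e_layers_copied, d_layers_copied = create_student_by_copying_alternating_layers(
+    "facebook/bart-large-xsum", save_path="dbart_xsum_12_3", e=12, d=3
+)
+print(student.config.encoder_layers, student.config.decoder_layers)  # 12, 3
+```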
+
+#### Pegasus
++ The following commands are written for BART and will require, at minimum, the following modifications:
++ reduce the batch size and increase gradient accumulation steps so that the product `gpus * batch_size * gradient_accumulation_steps = 256`. We used `--learning_rate` = 1e-4 * gradient accumulation steps (see the worked example below).
++ don't use fp16
++ `--tokenizer_name google/pegasus-large`
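+
+For example, here is a quick check of that rule of thumb with purely illustrative numbers:
+
+```python
+# hypothetical single-GPU Pegasus setup satisfying gpus * batch_size * gradient_accumulation_steps = 256
+gpus, train_batch_size, gradient_accumulation_steps = 1, 8, 32
+assert gpus * train_batch_size * gradient_accumulation_steps == 256
+learning_rate = 1e-4 * gradient_accumulation_steps  # 3.2e-3
+```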
+
+### SFT (No Teacher Distillation)
+You don't need `distillation.py`; you can just run:
+
+```bash
+python finetune.py \
+ --data_dir xsum \
+ --freeze_encoder --freeze_embeds \
+ --learning_rate=3e-4 \
+ --do_train \
+ --do_predict \
+ --fp16 --fp16_opt_level=O1 \
+ --val_check_interval 0.1 --n_val 1000 --eval_beams 2 --length_penalty=0.5 \
+ --max_target_length=60 --val_max_target_length=60 --test_max_target_length=100 \
+ --model_name_or_path dbart_xsum_12_3 \
+ --train_batch_size=64 --eval_batch_size=64 \
+ --sortish_sampler \
+ --num_train_epochs=6 \
+ --warmup_steps 500 \
+ --output_dir distilbart_xsum_sft_12_3 --gpus 1
+```
+
++ Note: The command that produced `sshleifer/distilbart-cnn-12-6` is at [train_distilbart_cnn.sh](./train_distilbart_cnn.sh):
+
+```bash
+./train_distilbart_cnn.sh
+```
+
++ Tip: You can get the same simple distillation logic by using `distillation.py --no_teacher` followed by the same arguments as in `train_distilbart_cnn.sh`.
+If you are using `wandb` and comparing the two distillation methods, using this entry point will make your logs consistent,
+because you will have the same hyper-parameters logged in every run.
+
+### Pseudo-Labeling
++ You don't need `distillation.py`.
++ Instructions to generate pseudo-labels and use pre-computed pseudo-labels can be found [here](./precomputed_pseudo_labels.md).
+Simply run `finetune.py` with one of those pseudo-label datasets as `--data_dir` (`DATA`, below).
+
+```bash
+python finetune.py \
+ --teacher facebook/bart-large-xsum --data_dir DATA \
+ --freeze_encoder --freeze_embeds \
+ --learning_rate=3e-4 \
+ --do_train \
+ --do_predict \
+ --fp16 --fp16_opt_level=O1 \
+ --val_check_interval 0.1 --n_val 1000 --eval_beams 2 --length_penalty=0.5 \
+ --max_target_length=60 --val_max_target_length=60 --test_max_target_length=100 \
+ --model_name_or_path dbart_xsum_12_3 \
+ --train_batch_size=32 --eval_batch_size=32 \
+ --sortish_sampler \
+ --num_train_epochs=5 \
+ --warmup_steps 500 \
+ --output_dir dbart_xsum_12_3_PL --gpus 1 --logger_name wandb
+```
+
+
+
+To combine datasets, as in Section 6.2, try something like:
+```bash
+curl -S https://cdn-datasets.huggingface.co/pseudo/xsum/bart_xsum_pl.tgz | tar -xvz -C .
+curl -S https://cdn-datasets.huggingface.co/pseudo/xsum/pegasus_xsum.tgz | tar -xvz -C .
+curl -S https://cdn-datasets.huggingface.co/summarization/xsum.tar.gz | tar -xvz -C .
+mkdir all_pl
+cat bart_xsum_pl/train.source pegasus_xsum/train.source xsum/train.source > all_pl/train.source
+cat bart_xsum_pl/train.target pegasus_xsum/train.target xsum/train.target > all_pl/train.target
+cp xsum/val* all_pl
+cp xsum/test* all_pl
+```
+then use `all_pl` as DATA in the command above.
+
+#### Direct Knowledge Distillation (KD)
++ In this method, we try to enforce that the student and teacher produce similar encoder_outputs, logits, and hidden_states, using `SummarizationDistiller`.
++ This method was used to produce the `sshleifer/distilbart-xsum-12-6`, `6-6`, and `9-6` checkpoints.
++ You must use [`distillation.py`](./distillation.py). Note that this command initializes the student for you.
+
+The command that produced `sshleifer/distilbart-xsum-12-6` is at [train_distilbart_xsum.sh](./train_distilbart_xsum.sh):
+```bash
+./train_distilbart_xsum.sh --logger_name wandb --gpus 1
+```
+
++ Expected ROUGE-2 is between 21.3 and 21.6; run time is ~13h.
++ Direct KD + Pegasus is VERY slow and works best with `--supervise_forward --normalize_hidden`.
+
+
+
+### Citation
+
+```bibtex
+@misc{shleifer2020pretrained,
+ title={Pre-trained Summarization Distillation},
+ author={Sam Shleifer and Alexander M. Rush},
+ year={2020},
+ eprint={2010.13002},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL}
+}
+@article{Wolf2019HuggingFacesTS,
+ title={HuggingFace's Transformers: State-of-the-art Natural Language Processing},
+ author={Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin Lhoest and Alexander M. Rush},
+ journal={ArXiv},
+ year={2019},
+ volume={abs/1910.03771}
+}
+```
diff --git a/examples/seq2seq/test_bash_script.py b/examples/research_projects/seq2seq-distillation/_test_bash_script.py
similarity index 100%
rename from examples/seq2seq/test_bash_script.py
rename to examples/research_projects/seq2seq-distillation/_test_bash_script.py
diff --git a/examples/seq2seq/test_make_student.py b/examples/research_projects/seq2seq-distillation/_test_make_student.py
similarity index 100%
rename from examples/seq2seq/test_make_student.py
rename to examples/research_projects/seq2seq-distillation/_test_make_student.py
diff --git a/examples/research_projects/seq2seq-distillation/_test_seq2seq_examples.py b/examples/research_projects/seq2seq-distillation/_test_seq2seq_examples.py
new file mode 100644
index 000000000..57e99e30e
--- /dev/null
+++ b/examples/research_projects/seq2seq-distillation/_test_seq2seq_examples.py
@@ -0,0 +1,443 @@
+import argparse
+import logging
+import os
+import sys
+import tempfile
+from pathlib import Path
+
+import pytest
+import pytorch_lightning as pl
+import torch
+
+import lightning_base
+from convert_pl_checkpoint_to_hf import convert_pl_to_hf
+from distillation import distill_main
+from finetune import SummarizationModule, main
+from parameterized import parameterized
+from run_eval import generate_summaries_or_translations
+from transformers import AutoConfig, AutoModelForSeq2SeqLM
+from transformers.hf_api import HfApi
+from transformers.testing_utils import CaptureStderr, CaptureStdout, TestCasePlus, require_torch_gpu, slow
+from utils import label_smoothed_nll_loss, lmap, load_json
+
+
+logging.basicConfig(level=logging.DEBUG)
+
+logger = logging.getLogger()
+CUDA_AVAILABLE = torch.cuda.is_available()
+CHEAP_ARGS = {
+ "max_tokens_per_batch": None,
+ "supervise_forward": True,
+ "normalize_hidden": True,
+ "label_smoothing": 0.2,
+ "eval_max_gen_length": None,
+ "eval_beams": 1,
+ "val_metric": "loss",
+ "save_top_k": 1,
+ "adafactor": True,
+ "early_stopping_patience": 2,
+ "logger_name": "default",
+ "length_penalty": 0.5,
+ "cache_dir": "",
+ "task": "summarization",
+ "num_workers": 2,
+ "alpha_hid": 0,
+ "freeze_embeds": True,
+ "enc_only": False,
+ "tgt_suffix": "",
+ "resume_from_checkpoint": None,
+ "sortish_sampler": True,
+ "student_decoder_layers": 1,
+ "val_check_interval": 1.0,
+ "output_dir": "",
+ "fp16": False, # TODO(SS): set this to CUDA_AVAILABLE if ci installs apex or start using native amp
+ "no_teacher": False,
+ "fp16_opt_level": "O1",
+ "gpus": 1 if CUDA_AVAILABLE else 0,
+ "n_tpu_cores": 0,
+ "max_grad_norm": 1.0,
+ "do_train": True,
+ "do_predict": True,
+ "accumulate_grad_batches": 1,
+ "server_ip": "",
+ "server_port": "",
+ "seed": 42,
+ "model_name_or_path": "sshleifer/bart-tiny-random",
+ "config_name": "",
+ "tokenizer_name": "facebook/bart-large",
+ "do_lower_case": False,
+ "learning_rate": 0.3,
+ "lr_scheduler": "linear",
+ "weight_decay": 0.0,
+ "adam_epsilon": 1e-08,
+ "warmup_steps": 0,
+ "max_epochs": 1,
+ "train_batch_size": 2,
+ "eval_batch_size": 2,
+ "max_source_length": 12,
+ "max_target_length": 12,
+ "val_max_target_length": 12,
+ "test_max_target_length": 12,
+ "fast_dev_run": False,
+ "no_cache": False,
+ "n_train": -1,
+ "n_val": -1,
+ "n_test": -1,
+ "student_encoder_layers": 1,
+ "freeze_encoder": False,
+ "auto_scale_batch_size": False,
+ "overwrite_output_dir": False,
+ "student": None,
+}
+
+
+def _dump_articles(path: Path, articles: list):
+ content = "\n".join(articles)
+ Path(path).open("w").writelines(content)
+
+
+ARTICLES = [" Sam ate lunch today.", "Sams lunch ingredients."]
+SUMMARIES = ["A very interesting story about what I ate for lunch.", "Avocado, celery, turkey, coffee"]
+T5_TINY = "patrickvonplaten/t5-tiny-random"
+T5_TINIER = "sshleifer/t5-tinier-random"
+BART_TINY = "sshleifer/bart-tiny-random"
+MBART_TINY = "sshleifer/tiny-mbart"
+MARIAN_TINY = "sshleifer/tiny-marian-en-de"
+FSMT_TINY = "stas/tiny-wmt19-en-de"
+
+
+stream_handler = logging.StreamHandler(sys.stdout)
+logger.addHandler(stream_handler)
+logging.disable(logging.CRITICAL) # remove noisy download output from tracebacks
+
+
+def make_test_data_dir(tmp_dir):
+ for split in ["train", "val", "test"]:
+ _dump_articles(os.path.join(tmp_dir, f"{split}.source"), ARTICLES)
+ _dump_articles(os.path.join(tmp_dir, f"{split}.target"), SUMMARIES)
+ return tmp_dir
+
+
+class TestSummarizationDistiller(TestCasePlus):
+ @classmethod
+ def setUpClass(cls):
+ logging.disable(logging.CRITICAL) # remove noisy download output from tracebacks
+ return cls
+
+ @slow
+ @require_torch_gpu
+ def test_hub_configs(self):
+ """I put require_torch_gpu cause I only want this to run with self-scheduled."""
+
+ model_list = HfApi().model_list()
+ org = "sshleifer"
+ model_ids = [x.modelId for x in model_list if x.modelId.startswith(org)]
+ allowed_to_be_broken = ["sshleifer/blenderbot-3B", "sshleifer/blenderbot-90M"]
+ failures = []
+ for m in model_ids:
+ if m in allowed_to_be_broken:
+ continue
+ try:
+ AutoConfig.from_pretrained(m)
+ except Exception:
+ failures.append(m)
+ assert not failures, f"The following models could not be loaded through AutoConfig: {failures}"
+
+ def test_distill_no_teacher(self):
+ updates = dict(student_encoder_layers=2, student_decoder_layers=1, no_teacher=True)
+ self._test_distiller_cli(updates)
+
+ def test_distill_checkpointing_with_teacher(self):
+ updates = dict(
+ student_encoder_layers=2,
+ student_decoder_layers=1,
+ max_epochs=4,
+ val_check_interval=0.25,
+ alpha_hid=2.0,
+ model_name_or_path="IGNORE_THIS_IT_DOESNT_GET_USED",
+ )
+ model = self._test_distiller_cli(updates, check_contents=False)
+
+ ckpts = list(Path(model.output_dir).glob("*.ckpt"))
+ self.assertEqual(1, len(ckpts))
+ transformer_ckpts = list(Path(model.output_dir).glob("**/*.bin"))
+ self.assertEqual(len(transformer_ckpts), 2)
+ examples = lmap(str.strip, Path(model.hparams.data_dir).joinpath("test.source").open().readlines())
+ out_path = tempfile.mktemp() # XXX: not being cleaned up
+ generate_summaries_or_translations(examples, out_path, str(model.output_dir / "best_tfmr"))
+ self.assertTrue(Path(out_path).exists())
+
+ out_path_new = self.get_auto_remove_tmp_dir()
+ convert_pl_to_hf(ckpts[0], transformer_ckpts[0].parent, out_path_new)
+ assert os.path.exists(os.path.join(out_path_new, "pytorch_model.bin"))
+
+ def test_loss_fn(self):
+ model = AutoModelForSeq2SeqLM.from_pretrained(BART_TINY)
+ input_ids, mask = model.dummy_inputs["input_ids"], model.dummy_inputs["attention_mask"]
+ target_ids = torch.tensor([[0, 4, 8, 2], [0, 8, 2, 1]], dtype=torch.long, device=model.device)
+ decoder_input_ids = target_ids[:, :-1].contiguous() # Why this line?
+ lm_labels = target_ids[:, 1:].clone() # why clone?
+ model_computed_loss = model(
+ input_ids, attention_mask=mask, decoder_input_ids=decoder_input_ids, labels=lm_labels, use_cache=False
+ ).loss
+
+ logits = model(input_ids, attention_mask=mask, decoder_input_ids=decoder_input_ids, use_cache=False).logits
+
+ lprobs = torch.nn.functional.log_softmax(logits, dim=-1)
+ smoothed_loss, nll_loss = label_smoothed_nll_loss(
+ lprobs, lm_labels, 0.1, ignore_index=model.config.pad_token_id
+ )
+ with self.assertRaises(AssertionError):
+ # TODO: understand why this breaks
+ self.assertEqual(nll_loss, model_computed_loss)
+
+ def test_distill_mbart(self):
+ updates = dict(
+ student_encoder_layers=2,
+ student_decoder_layers=1,
+ num_train_epochs=4,
+ val_check_interval=0.25,
+ alpha_hid=2.0,
+ task="translation",
+ model_name_or_path="IGNORE_THIS_IT_DOESNT_GET_USED",
+ tokenizer_name=MBART_TINY,
+ teacher=MBART_TINY,
+ src_lang="en_XX",
+ tgt_lang="ro_RO",
+ )
+ model = self._test_distiller_cli(updates, check_contents=False)
+ assert model.model.config.model_type == "mbart"
+
+ ckpts = list(Path(model.output_dir).glob("*.ckpt"))
+ self.assertEqual(1, len(ckpts))
+ transformer_ckpts = list(Path(model.output_dir).glob("**/*.bin"))
+ all_files = list(Path(model.output_dir).glob("best_tfmr/*"))
+ assert len(all_files) > 2
+ self.assertEqual(len(transformer_ckpts), 2)
+
+ def test_distill_t5(self):
+ updates = dict(
+ student_encoder_layers=1,
+ student_decoder_layers=1,
+ alpha_hid=2.0,
+ teacher=T5_TINY,
+ model_name_or_path=T5_TINY,
+ tokenizer_name=T5_TINY,
+ )
+ self._test_distiller_cli(updates)
+
+ def test_distill_different_base_models(self):
+ updates = dict(
+ teacher=T5_TINY,
+ student=T5_TINIER,
+ model_name_or_path=T5_TINIER,
+ tokenizer_name=T5_TINIER,
+ )
+ self._test_distiller_cli(updates)
+
+ def _test_distiller_cli(self, updates, check_contents=True):
+ default_updates = dict(
+ label_smoothing=0.0,
+ early_stopping_patience=-1,
+ train_batch_size=1,
+ eval_batch_size=2,
+ max_epochs=2,
+ alpha_mlm=0.2,
+ alpha_ce=0.8,
+ do_predict=True,
+ model_name_or_path="sshleifer/tinier_bart",
+ teacher=CHEAP_ARGS["model_name_or_path"],
+ val_check_interval=0.5,
+ )
+ default_updates.update(updates)
+ args_d: dict = CHEAP_ARGS.copy()
+ tmp_dir = make_test_data_dir(tmp_dir=self.get_auto_remove_tmp_dir())
+ output_dir = self.get_auto_remove_tmp_dir()
+
+ args_d.update(data_dir=tmp_dir, output_dir=output_dir, **default_updates)
+ model = distill_main(argparse.Namespace(**args_d))
+ if not check_contents:
+ return model
+ contents = os.listdir(output_dir)
+ contents = {os.path.basename(p) for p in contents}
+ ckpt_files = [p for p in contents if p.endswith("ckpt")]
+ assert len(ckpt_files) > 0
+
+ self.assertIn("test_generations.txt", contents)
+ self.assertIn("test_results.txt", contents)
+
+ metrics = load_json(model.metrics_save_path)
+ last_step_stats = metrics["val"][-1]
+ self.assertGreaterEqual(last_step_stats["val_avg_gen_time"], 0.01)
+ self.assertGreaterEqual(1.0, last_step_stats["val_avg_gen_time"])
+ self.assertIsInstance(last_step_stats[f"val_avg_{model.val_metric}"], float)
+ desired_n_evals = int(args_d["max_epochs"] * (1 / args_d["val_check_interval"]) + 1)
+ self.assertEqual(len(metrics["val"]), desired_n_evals)
+ self.assertEqual(len(metrics["test"]), 1)
+ return model
+
+
+class TestTheRest(TestCasePlus):
+ @parameterized.expand(
+ [T5_TINY, BART_TINY, MBART_TINY, MARIAN_TINY, FSMT_TINY],
+ )
+ def test_finetune(self, model):
+ args_d: dict = CHEAP_ARGS.copy()
+ task = "translation" if model in [MBART_TINY, MARIAN_TINY, FSMT_TINY] else "summarization"
+ args_d["label_smoothing"] = 0.1 if task == "translation" else 0
+
+ tmp_dir = make_test_data_dir(tmp_dir=self.get_auto_remove_tmp_dir())
+ output_dir = self.get_auto_remove_tmp_dir()
+ args_d.update(
+ data_dir=tmp_dir,
+ model_name_or_path=model,
+ tokenizer_name=None,
+ train_batch_size=2,
+ eval_batch_size=2,
+ output_dir=output_dir,
+ do_predict=True,
+ task=task,
+ src_lang="en_XX",
+ tgt_lang="ro_RO",
+ freeze_encoder=True,
+ freeze_embeds=True,
+ )
+ assert "n_train" in args_d
+ args = argparse.Namespace(**args_d)
+ module = main(args)
+
+ input_embeds = module.model.get_input_embeddings()
+ assert not input_embeds.weight.requires_grad
+ if model == T5_TINY:
+ lm_head = module.model.lm_head
+ assert not lm_head.weight.requires_grad
+ assert (lm_head.weight == input_embeds.weight).all().item()
+ elif model == FSMT_TINY:
+ fsmt = module.model.model
+ embed_pos = fsmt.decoder.embed_positions
+ assert not embed_pos.weight.requires_grad
+ assert not fsmt.decoder.embed_tokens.weight.requires_grad
+ # check that embeds are not the same
+ assert fsmt.decoder.embed_tokens != fsmt.encoder.embed_tokens
+ else:
+ bart = module.model.model
+ embed_pos = bart.decoder.embed_positions
+ assert not embed_pos.weight.requires_grad
+ assert not bart.shared.weight.requires_grad
+ # check that embeds are the same
+ assert bart.decoder.embed_tokens == bart.encoder.embed_tokens
+ assert bart.decoder.embed_tokens == bart.shared
+
+ example_batch = load_json(module.output_dir / "text_batch.json")
+ assert isinstance(example_batch, dict)
+ assert len(example_batch) >= 4
+
+ def test_finetune_extra_model_args(self):
+ args_d: dict = CHEAP_ARGS.copy()
+
+ task = "summarization"
+ tmp_dir = make_test_data_dir(tmp_dir=self.get_auto_remove_tmp_dir())
+
+ args_d.update(
+ data_dir=tmp_dir,
+ tokenizer_name=None,
+ train_batch_size=2,
+ eval_batch_size=2,
+ do_predict=False,
+ task=task,
+ src_lang="en_XX",
+ tgt_lang="ro_RO",
+ freeze_encoder=True,
+ freeze_embeds=True,
+ )
+
+ # test models whose config includes the extra_model_args
+ model = BART_TINY
+ output_dir = self.get_auto_remove_tmp_dir()
+ args_d1 = args_d.copy()
+ args_d1.update(
+ model_name_or_path=model,
+ output_dir=output_dir,
+ )
+ extra_model_params = ("encoder_layerdrop", "decoder_layerdrop", "dropout", "attention_dropout")
+ for p in extra_model_params:
+ args_d1[p] = 0.5
+ args = argparse.Namespace(**args_d1)
+ model = main(args)
+ for p in extra_model_params:
+ assert getattr(model.config, p) == 0.5, f"failed to override the model config for param {p}"
+
+ # test models whose config doesn't include the extra_model_args
+ model = T5_TINY
+ output_dir = self.get_auto_remove_tmp_dir()
+ args_d2 = args_d.copy()
+ args_d2.update(
+ model_name_or_path=model,
+ output_dir=output_dir,
+ )
+ unsupported_param = "encoder_layerdrop"
+ args_d2[unsupported_param] = 0.5
+ args = argparse.Namespace(**args_d2)
+ with pytest.raises(Exception) as excinfo:
+ model = main(args)
+ assert str(excinfo.value) == f"model config doesn't have a `{unsupported_param}` attribute"
+
+ def test_finetune_lr_schedulers(self):
+ args_d: dict = CHEAP_ARGS.copy()
+
+ task = "summarization"
+ tmp_dir = make_test_data_dir(tmp_dir=self.get_auto_remove_tmp_dir())
+
+ model = BART_TINY
+ output_dir = self.get_auto_remove_tmp_dir()
+
+ args_d.update(
+ data_dir=tmp_dir,
+ model_name_or_path=model,
+ output_dir=output_dir,
+ tokenizer_name=None,
+ train_batch_size=2,
+ eval_batch_size=2,
+ do_predict=False,
+ task=task,
+ src_lang="en_XX",
+ tgt_lang="ro_RO",
+ freeze_encoder=True,
+ freeze_embeds=True,
+ )
+
+ # emulate finetune.py
+ parser = argparse.ArgumentParser()
+ parser = pl.Trainer.add_argparse_args(parser)
+ parser = SummarizationModule.add_model_specific_args(parser, os.getcwd())
+ args = {"--help": True}
+
+ # --help test
+ with pytest.raises(SystemExit) as excinfo:
+ with CaptureStdout() as cs:
+ args = parser.parse_args(args)
+ assert False, "--help is expected to sys.exit"
+ assert excinfo.type == SystemExit
+ expected = lightning_base.arg_to_scheduler_metavar
+ assert expected in cs.out, "--help is expected to list the supported schedulers"
+
+ # --lr_scheduler=non_existing_scheduler test
+ unsupported_param = "non_existing_scheduler"
+ args = {f"--lr_scheduler={unsupported_param}"}
+ with pytest.raises(SystemExit) as excinfo:
+ with CaptureStderr() as cs:
+ args = parser.parse_args(args)
+ assert False, "invalid argument is expected to sys.exit"
+ assert excinfo.type == SystemExit
+ expected = f"invalid choice: '{unsupported_param}'"
+ assert expected in cs.err, f"should have bailed on invalid choice of scheduler {unsupported_param}"
+
+ # --lr_scheduler=existing_scheduler test
+ supported_param = "cosine"
+ args_d1 = args_d.copy()
+ args_d1["lr_scheduler"] = supported_param
+ args = argparse.Namespace(**args_d1)
+ model = main(args)
+ assert (
+ getattr(model.hparams, "lr_scheduler") == supported_param
+ ), f"lr_scheduler={supported_param} shouldn't fail"
diff --git a/examples/research_projects/seq2seq-distillation/_test_seq2seq_examples_multi_gpu.py b/examples/research_projects/seq2seq-distillation/_test_seq2seq_examples_multi_gpu.py
new file mode 100644
index 000000000..af6ae24bf
--- /dev/null
+++ b/examples/research_projects/seq2seq-distillation/_test_seq2seq_examples_multi_gpu.py
@@ -0,0 +1,164 @@
+# Because of their complexity, multi-GPU tests could impact other tests, and to aid debugging we keep them in a separate module.
+
+import os
+import sys
+from pathlib import Path
+
+import torch
+
+from transformers.testing_utils import TestCasePlus, execute_subprocess_async, require_torch_multi_gpu
+from utils import load_json
+
+
+CUDA_AVAILABLE = torch.cuda.is_available()
+ARTICLES = [" Sam ate lunch today.", "Sams lunch ingredients."]
+SUMMARIES = ["A very interesting story about what I ate for lunch.", "Avocado, celery, turkey, coffee"]
+CHEAP_ARGS = {
+ "max_tokens_per_batch": None,
+ "supervise_forward": True,
+ "normalize_hidden": True,
+ "label_smoothing": 0.2,
+ "eval_max_gen_length": None,
+ "eval_beams": 1,
+ "val_metric": "loss",
+ "save_top_k": 1,
+ "adafactor": True,
+ "early_stopping_patience": 2,
+ "logger_name": "default",
+ "length_penalty": 0.5,
+ "cache_dir": "",
+ "task": "summarization",
+ "num_workers": 2,
+ "alpha_hid": 0,
+ "freeze_embeds": True,
+ "enc_only": False,
+ "tgt_suffix": "",
+ "resume_from_checkpoint": None,
+ "sortish_sampler": True,
+ "student_decoder_layers": 1,
+ "val_check_interval": 1.0,
+ "output_dir": "",
+ "fp16": False, # TODO(SS): set this to CUDA_AVAILABLE if ci installs apex or start using native amp
+ "no_teacher": False,
+ "fp16_opt_level": "O1",
+ "gpus": 1 if CUDA_AVAILABLE else 0,
+ "n_tpu_cores": 0,
+ "max_grad_norm": 1.0,
+ "do_train": True,
+ "do_predict": True,
+ "accumulate_grad_batches": 1,
+ "server_ip": "",
+ "server_port": "",
+ "seed": 42,
+ "model_name_or_path": "sshleifer/bart-tiny-random",
+ "config_name": "",
+ "tokenizer_name": "facebook/bart-large",
+ "do_lower_case": False,
+ "learning_rate": 0.3,
+ "lr_scheduler": "linear",
+ "weight_decay": 0.0,
+ "adam_epsilon": 1e-08,
+ "warmup_steps": 0,
+ "max_epochs": 1,
+ "train_batch_size": 2,
+ "eval_batch_size": 2,
+ "max_source_length": 12,
+ "max_target_length": 12,
+ "val_max_target_length": 12,
+ "test_max_target_length": 12,
+ "fast_dev_run": False,
+ "no_cache": False,
+ "n_train": -1,
+ "n_val": -1,
+ "n_test": -1,
+ "student_encoder_layers": 1,
+ "freeze_encoder": False,
+ "auto_scale_batch_size": False,
+ "overwrite_output_dir": False,
+ "student": None,
+}
+
+
+def _dump_articles(path: Path, articles: list):
+ content = "\n".join(articles)
+ Path(path).open("w").writelines(content)
+
+
+def make_test_data_dir(tmp_dir):
+ for split in ["train", "val", "test"]:
+ _dump_articles(os.path.join(tmp_dir, f"{split}.source"), ARTICLES)
+ _dump_articles(os.path.join(tmp_dir, f"{split}.target"), SUMMARIES)
+ return tmp_dir
+
+
+class TestSummarizationDistillerMultiGPU(TestCasePlus):
+ @classmethod
+ def setUpClass(cls):
+ return cls
+
+ @require_torch_multi_gpu
+ def test_multi_gpu(self):
+
+ updates = dict(
+ no_teacher=True,
+ freeze_encoder=True,
+ gpus=2,
+ overwrite_output_dir=True,
+ sortish_sampler=True,
+ )
+ self._test_distiller_cli_fork(updates, check_contents=False)
+
+ def _test_distiller_cli_fork(self, updates, check_contents=True):
+ default_updates = dict(
+ label_smoothing=0.0,
+ early_stopping_patience=-1,
+ train_batch_size=1,
+ eval_batch_size=2,
+ max_epochs=2,
+ alpha_mlm=0.2,
+ alpha_ce=0.8,
+ do_predict=True,
+ model_name_or_path="sshleifer/tinier_bart",
+ teacher=CHEAP_ARGS["model_name_or_path"],
+ val_check_interval=0.5,
+ )
+ default_updates.update(updates)
+ args_d: dict = CHEAP_ARGS.copy()
+ tmp_dir = make_test_data_dir(tmp_dir=self.get_auto_remove_tmp_dir())
+ output_dir = self.get_auto_remove_tmp_dir()
+ args_d.update(data_dir=tmp_dir, output_dir=output_dir, **default_updates)
+
+ def convert(k, v):
+ if k in ["tgt_suffix", "server_ip", "server_port", "out", "n_tpu_cores"]:
+ return ""
+ if v is False or v is None:
+ return ""
+ if v is True: # or len(str(v))==0:
+ return f"--{k}"
+ return f"--{k}={v}"
+
+ cli_args = [x for x in (convert(k, v) for k, v in args_d.items()) if len(x)]
+ cmd = [sys.executable, f"{self.test_file_dir}/distillation.py"] + cli_args
+ execute_subprocess_async(cmd, env=self.get_env())
+
+ contents = os.listdir(output_dir)
+ contents = {os.path.basename(p) for p in contents}
+ ckpt_files = [p for p in contents if p.endswith("ckpt")]
+ assert len(ckpt_files) > 0
+
+ self.assertIn("test_generations.txt", contents)
+ self.assertIn("test_results.txt", contents)
+
+        # get the following from the module (we don't have access to `model` here)
+ metrics_save_path = os.path.join(output_dir, "metrics.json")
+ val_metric = "rouge2"
+
+ metrics = load_json(metrics_save_path)
+ # {'test': [{'test_avg_loss': 10.63731575012207, 'test_avg_rouge1': 0.0, 'test_avg_rouge2': 0.0, 'test_avg_rougeL': 0.0, 'test_avg_gen_time': 0.1822289228439331, 'test_avg_gen_len': 142.0, 'step_count': 1}]}
+ print(metrics)
+ last_step_stats = metrics["val"][-1]
+ self.assertGreaterEqual(last_step_stats["val_avg_gen_time"], 0.01)
+ self.assertIsInstance(last_step_stats[f"val_avg_{val_metric}"], float)
+ self.assertEqual(len(metrics["test"]), 1)
+ desired_n_evals = int(args_d["max_epochs"] * (1 / args_d["val_check_interval"]) / 2 + 1)
+ self.assertEqual(len(metrics["val"]), desired_n_evals)
diff --git a/examples/seq2seq/callbacks.py b/examples/research_projects/seq2seq-distillation/callbacks.py
similarity index 100%
rename from examples/seq2seq/callbacks.py
rename to examples/research_projects/seq2seq-distillation/callbacks.py
diff --git a/examples/seq2seq/convert_pl_checkpoint_to_hf.py b/examples/research_projects/seq2seq-distillation/convert_pl_checkpoint_to_hf.py
similarity index 100%
rename from examples/seq2seq/convert_pl_checkpoint_to_hf.py
rename to examples/research_projects/seq2seq-distillation/convert_pl_checkpoint_to_hf.py
diff --git a/examples/seq2seq/distil_marian_enro_teacher.sh b/examples/research_projects/seq2seq-distillation/distil_marian_enro_teacher.sh
similarity index 100%
rename from examples/seq2seq/distil_marian_enro_teacher.sh
rename to examples/research_projects/seq2seq-distillation/distil_marian_enro_teacher.sh
diff --git a/examples/seq2seq/distil_marian_no_teacher.sh b/examples/research_projects/seq2seq-distillation/distil_marian_no_teacher.sh
similarity index 100%
rename from examples/seq2seq/distil_marian_no_teacher.sh
rename to examples/research_projects/seq2seq-distillation/distil_marian_no_teacher.sh
diff --git a/examples/seq2seq/distillation.py b/examples/research_projects/seq2seq-distillation/distillation.py
similarity index 100%
rename from examples/seq2seq/distillation.py
rename to examples/research_projects/seq2seq-distillation/distillation.py
diff --git a/examples/seq2seq/dynamic_bs_example.sh b/examples/research_projects/seq2seq-distillation/dynamic_bs_example.sh
similarity index 100%
rename from examples/seq2seq/dynamic_bs_example.sh
rename to examples/research_projects/seq2seq-distillation/dynamic_bs_example.sh
diff --git a/examples/seq2seq/finetune.py b/examples/research_projects/seq2seq-distillation/finetune.py
similarity index 100%
rename from examples/seq2seq/finetune.py
rename to examples/research_projects/seq2seq-distillation/finetune.py
diff --git a/examples/research_projects/seq2seq-distillation/finetune.sh b/examples/research_projects/seq2seq-distillation/finetune.sh
new file mode 100755
index 000000000..683c2d775
--- /dev/null
+++ b/examples/research_projects/seq2seq-distillation/finetune.sh
@@ -0,0 +1,11 @@
+# The proper usage is documented in the README; you need to specify data_dir, output_dir and model_name_or_path.
+# run ./finetune.sh --help to see all the possible options
+python finetune.py \
+ --learning_rate=3e-5 \
+ --fp16 \
+ --gpus 1 \
+ --do_train \
+ --do_predict \
+ --n_val 1000 \
+ --val_check_interval 0.1 \
+ "$@"
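+# illustrative invocation (paths and model are placeholders):
+#   ./finetune.sh --data_dir cnn_dm --output_dir finetune_logs --model_name_or_path facebook/bart-large --task summarization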
diff --git a/examples/seq2seq/finetune_bart_tiny.sh b/examples/research_projects/seq2seq-distillation/finetune_bart_tiny.sh
similarity index 100%
rename from examples/seq2seq/finetune_bart_tiny.sh
rename to examples/research_projects/seq2seq-distillation/finetune_bart_tiny.sh
diff --git a/examples/seq2seq/finetune_pegasus_xsum.sh b/examples/research_projects/seq2seq-distillation/finetune_pegasus_xsum.sh
similarity index 100%
rename from examples/seq2seq/finetune_pegasus_xsum.sh
rename to examples/research_projects/seq2seq-distillation/finetune_pegasus_xsum.sh
diff --git a/examples/seq2seq/finetune_t5.sh b/examples/research_projects/seq2seq-distillation/finetune_t5.sh
similarity index 100%
rename from examples/seq2seq/finetune_t5.sh
rename to examples/research_projects/seq2seq-distillation/finetune_t5.sh
diff --git a/examples/research_projects/seq2seq-distillation/lightning_base.py b/examples/research_projects/seq2seq-distillation/lightning_base.py
new file mode 100644
index 000000000..a9a05fbf9
--- /dev/null
+++ b/examples/research_projects/seq2seq-distillation/lightning_base.py
@@ -0,0 +1,391 @@
+import argparse
+import logging
+import os
+from pathlib import Path
+from typing import Any, Dict
+
+import pytorch_lightning as pl
+from pytorch_lightning.utilities import rank_zero_info
+
+from transformers import (
+ AdamW,
+ AutoConfig,
+ AutoModel,
+ AutoModelForPreTraining,
+ AutoModelForQuestionAnswering,
+ AutoModelForSeq2SeqLM,
+ AutoModelForSequenceClassification,
+ AutoModelForTokenClassification,
+ AutoModelWithLMHead,
+ AutoTokenizer,
+ PretrainedConfig,
+ PreTrainedTokenizer,
+)
+from transformers.optimization import (
+ Adafactor,
+ get_cosine_schedule_with_warmup,
+ get_cosine_with_hard_restarts_schedule_with_warmup,
+ get_linear_schedule_with_warmup,
+ get_polynomial_decay_schedule_with_warmup,
+)
+from transformers.utils.versions import require_version_examples
+
+
+logger = logging.getLogger(__name__)
+
+require_version_examples("pytorch_lightning>=1.0.4")
+
+MODEL_MODES = {
+ "base": AutoModel,
+ "sequence-classification": AutoModelForSequenceClassification,
+ "question-answering": AutoModelForQuestionAnswering,
+ "pretraining": AutoModelForPreTraining,
+ "token-classification": AutoModelForTokenClassification,
+ "language-modeling": AutoModelWithLMHead,
+ "summarization": AutoModelForSeq2SeqLM,
+ "translation": AutoModelForSeq2SeqLM,
+}
+
+
+# update this and the import above to support new schedulers from transformers.optimization
+arg_to_scheduler = {
+ "linear": get_linear_schedule_with_warmup,
+ "cosine": get_cosine_schedule_with_warmup,
+ "cosine_w_restarts": get_cosine_with_hard_restarts_schedule_with_warmup,
+ "polynomial": get_polynomial_decay_schedule_with_warmup,
+ # '': get_constant_schedule, # not supported for now
+ # '': get_constant_schedule_with_warmup, # not supported for now
+}
+arg_to_scheduler_choices = sorted(arg_to_scheduler.keys())
+arg_to_scheduler_metavar = "{" + ", ".join(arg_to_scheduler_choices) + "}"
+
+
+class BaseTransformer(pl.LightningModule):
+ def __init__(
+ self,
+ hparams: argparse.Namespace,
+ num_labels=None,
+ mode="base",
+ config=None,
+ tokenizer=None,
+ model=None,
+ **config_kwargs
+ ):
+ """Initialize a model, tokenizer and config."""
+ super().__init__()
+ # TODO: move to self.save_hyperparameters()
+ # self.save_hyperparameters()
+ # can also expand arguments into trainer signature for easier reading
+
+ self.save_hyperparameters(hparams)
+ self.step_count = 0
+ self.output_dir = Path(self.hparams.output_dir)
+ cache_dir = self.hparams.cache_dir if self.hparams.cache_dir else None
+ if config is None:
+ self.config = AutoConfig.from_pretrained(
+ self.hparams.config_name if self.hparams.config_name else self.hparams.model_name_or_path,
+ **({"num_labels": num_labels} if num_labels is not None else {}),
+ cache_dir=cache_dir,
+ **config_kwargs,
+ )
+ else:
+ self.config: PretrainedConfig = config
+
+ extra_model_params = ("encoder_layerdrop", "decoder_layerdrop", "dropout", "attention_dropout")
+ for p in extra_model_params:
+ if getattr(self.hparams, p, None):
+ assert hasattr(self.config, p), f"model config doesn't have a `{p}` attribute"
+ setattr(self.config, p, getattr(self.hparams, p))
+
+ if tokenizer is None:
+ self.tokenizer = AutoTokenizer.from_pretrained(
+ self.hparams.tokenizer_name if self.hparams.tokenizer_name else self.hparams.model_name_or_path,
+ cache_dir=cache_dir,
+ )
+ else:
+ self.tokenizer: PreTrainedTokenizer = tokenizer
+ self.model_type = MODEL_MODES[mode]
+ if model is None:
+ self.model = self.model_type.from_pretrained(
+ self.hparams.model_name_or_path,
+ from_tf=bool(".ckpt" in self.hparams.model_name_or_path),
+ config=self.config,
+ cache_dir=cache_dir,
+ )
+ else:
+ self.model = model
+
+ def load_hf_checkpoint(self, *args, **kwargs):
+ self.model = self.model_type.from_pretrained(*args, **kwargs)
+
+ def get_lr_scheduler(self):
+ get_schedule_func = arg_to_scheduler[self.hparams.lr_scheduler]
+ scheduler = get_schedule_func(
+ self.opt, num_warmup_steps=self.hparams.warmup_steps, num_training_steps=self.total_steps()
+ )
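+        # wrap in Lightning's scheduler-config dict so the schedule is stepped after every optimizer step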
+ scheduler = {"scheduler": scheduler, "interval": "step", "frequency": 1}
+ return scheduler
+
+ def configure_optimizers(self):
+ """Prepare optimizer and schedule (linear warmup and decay)"""
+ model = self.model
+ no_decay = ["bias", "LayerNorm.weight"]
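+        # parameters whose names contain "bias" or "LayerNorm.weight" go into the second group below with weight_decay=0.0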
+ optimizer_grouped_parameters = [
+ {
+ "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
+ "weight_decay": self.hparams.weight_decay,
+ },
+ {
+ "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
+ "weight_decay": 0.0,
+ },
+ ]
+ if self.hparams.adafactor:
+ optimizer = Adafactor(
+ optimizer_grouped_parameters, lr=self.hparams.learning_rate, scale_parameter=False, relative_step=False
+ )
+
+ else:
+ optimizer = AdamW(
+ optimizer_grouped_parameters, lr=self.hparams.learning_rate, eps=self.hparams.adam_epsilon
+ )
+ self.opt = optimizer
+
+ scheduler = self.get_lr_scheduler()
+
+ return [optimizer], [scheduler]
+
+ def test_step(self, batch, batch_nb):
+ return self.validation_step(batch, batch_nb)
+
+ def test_epoch_end(self, outputs):
+ return self.validation_end(outputs)
+
+ def total_steps(self) -> int:
+ """The number of total training steps that will be run. Used for lr scheduler purposes."""
+ num_devices = max(1, self.hparams.gpus) # TODO: consider num_tpu_cores
+ effective_batch_size = self.hparams.train_batch_size * self.hparams.accumulate_grad_batches * num_devices
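+        # e.g. dataset_size=1000, train_batch_size=8, accumulate_grad_batches=2 and 2 gpus give an
+        # effective_batch_size of 32, so 3 epochs correspond to roughly 1000 / 32 * 3 = 93.75 scheduler steps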
+ return (self.dataset_size / effective_batch_size) * self.hparams.max_epochs
+
+ def setup(self, mode):
+ if mode == "test":
+ self.dataset_size = len(self.test_dataloader().dataset)
+ else:
+ self.train_loader = self.get_dataloader("train", self.hparams.train_batch_size, shuffle=True)
+ self.dataset_size = len(self.train_dataloader().dataset)
+
+ def get_dataloader(self, type_path: str, batch_size: int, shuffle: bool = False):
+ raise NotImplementedError("You must implement this for your task")
+
+ def train_dataloader(self):
+ return self.train_loader
+
+ def val_dataloader(self):
+ return self.get_dataloader("dev", self.hparams.eval_batch_size, shuffle=False)
+
+ def test_dataloader(self):
+ return self.get_dataloader("test", self.hparams.eval_batch_size, shuffle=False)
+
+ def _feature_file(self, mode):
+ return os.path.join(
+ self.hparams.data_dir,
+ "cached_{}_{}_{}".format(
+ mode,
+ list(filter(None, self.hparams.model_name_or_path.split("/"))).pop(),
+ str(self.hparams.max_seq_length),
+ ),
+ )
+
+ @pl.utilities.rank_zero_only
+ def on_save_checkpoint(self, checkpoint: Dict[str, Any]) -> None:
+ save_path = self.output_dir.joinpath("best_tfmr")
+ self.model.config.save_step = self.step_count
+ self.model.save_pretrained(save_path)
+ self.tokenizer.save_pretrained(save_path)
+
+ @staticmethod
+ def add_model_specific_args(parser, root_dir):
+ parser.add_argument(
+ "--model_name_or_path",
+ default=None,
+ type=str,
+ required=True,
+ help="Path to pretrained model or model identifier from huggingface.co/models",
+ )
+ parser.add_argument(
+ "--config_name", default="", type=str, help="Pretrained config name or path if not the same as model_name"
+ )
+ parser.add_argument(
+ "--tokenizer_name",
+ default=None,
+ type=str,
+ help="Pretrained tokenizer name or path if not the same as model_name",
+ )
+ parser.add_argument(
+ "--cache_dir",
+ default="",
+ type=str,
+ help="Where do you want to store the pre-trained models downloaded from huggingface.co",
+ )
+ parser.add_argument(
+ "--encoder_layerdrop",
+ type=float,
+ help="Encoder layer dropout probability (Optional). Goes into model.config",
+ )
+ parser.add_argument(
+ "--decoder_layerdrop",
+ type=float,
+ help="Decoder layer dropout probability (Optional). Goes into model.config",
+ )
+ parser.add_argument(
+ "--dropout",
+ type=float,
+ help="Dropout probability (Optional). Goes into model.config",
+ )
+ parser.add_argument(
+ "--attention_dropout",
+ type=float,
+ help="Attention dropout probability (Optional). Goes into model.config",
+ )
+ parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.")
+ parser.add_argument(
+ "--lr_scheduler",
+ default="linear",
+ choices=arg_to_scheduler_choices,
+ metavar=arg_to_scheduler_metavar,
+ type=str,
+ help="Learning rate scheduler",
+ )
+ parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.")
+ parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.")
+ parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.")
+ parser.add_argument("--num_workers", default=4, type=int, help="kwarg passed to DataLoader")
+ parser.add_argument("--num_train_epochs", dest="max_epochs", default=3, type=int)
+ parser.add_argument("--train_batch_size", default=32, type=int)
+ parser.add_argument("--eval_batch_size", default=32, type=int)
+ parser.add_argument("--adafactor", action="store_true")
+
+
+class LoggingCallback(pl.Callback):
+ def on_batch_end(self, trainer, pl_module):
+ lr_scheduler = trainer.lr_schedulers[0]["scheduler"]
+ lrs = {f"lr_group_{i}": lr for i, lr in enumerate(lr_scheduler.get_lr())}
+ pl_module.logger.log_metrics(lrs)
+
+ def on_validation_end(self, trainer: pl.Trainer, pl_module: pl.LightningModule):
+ rank_zero_info("***** Validation results *****")
+ metrics = trainer.callback_metrics
+ # Log results
+ for key in sorted(metrics):
+ if key not in ["log", "progress_bar"]:
+ rank_zero_info("{} = {}\n".format(key, str(metrics[key])))
+
+ def on_test_end(self, trainer: pl.Trainer, pl_module: pl.LightningModule):
+ rank_zero_info("***** Test results *****")
+ metrics = trainer.callback_metrics
+ # Log and save results to file
+ output_test_results_file = os.path.join(pl_module.hparams.output_dir, "test_results.txt")
+ with open(output_test_results_file, "w") as writer:
+ for key in sorted(metrics):
+ if key not in ["log", "progress_bar"]:
+ rank_zero_info("{} = {}\n".format(key, str(metrics[key])))
+ writer.write("{} = {}\n".format(key, str(metrics[key])))
+
+
+def add_generic_args(parser, root_dir) -> None:
+ # To allow all pl args uncomment the following line
+ # parser = pl.Trainer.add_argparse_args(parser)
+ parser.add_argument(
+ "--output_dir",
+ default=None,
+ type=str,
+ required=True,
+ help="The output directory where the model predictions and checkpoints will be written.",
+ )
+ parser.add_argument(
+ "--fp16",
+ action="store_true",
+ help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit",
+ )
+
+ parser.add_argument(
+ "--fp16_opt_level",
+ type=str,
+ default="O2",
+        help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. "
+        "See details at https://nvidia.github.io/apex/amp.html",
+ )
+ parser.add_argument("--n_tpu_cores", dest="tpu_cores", type=int)
+ parser.add_argument("--max_grad_norm", dest="gradient_clip_val", default=1.0, type=float, help="Max gradient norm")
+ parser.add_argument("--do_train", action="store_true", help="Whether to run training.")
+ parser.add_argument("--do_predict", action="store_true", help="Whether to run predictions on the test set.")
+ parser.add_argument(
+ "--gradient_accumulation_steps",
+ dest="accumulate_grad_batches",
+ type=int,
+ default=1,
+ help="Number of updates steps to accumulate before performing a backward/update pass.",
+ )
+ parser.add_argument("--seed", type=int, default=42, help="random seed for initialization")
+ parser.add_argument(
+ "--data_dir",
+ default=None,
+ type=str,
+ required=True,
+        help="The input data dir. Should contain the .source and .target files for the task.",
+ )
+
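+# Typical wiring in a training script (illustrative sketch; `MySeq2SeqModule` is a hypothetical BaseTransformer subclass):
+#     parser = argparse.ArgumentParser()
+#     add_generic_args(parser, os.getcwd())
+#     MySeq2SeqModule.add_model_specific_args(parser, os.getcwd())
+#     args = parser.parse_args()
+#     trainer = generic_train(MySeq2SeqModule(args), args)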
+
+def generic_train(
+ model: BaseTransformer,
+ args: argparse.Namespace,
+ early_stopping_callback=None,
+ logger=True, # can pass WandbLogger() here
+    extra_callbacks=None,
+ checkpoint_callback=None,
+ logging_callback=None,
+ **extra_train_kwargs
+):
+    pl.seed_everything(args.seed)
+    # avoid mutating a shared default list across calls
+    extra_callbacks = list(extra_callbacks) if extra_callbacks else []
+
+ # init model
+ odir = Path(model.hparams.output_dir)
+ odir.mkdir(exist_ok=True)
+
+ # add custom checkpoints
+ if checkpoint_callback is None:
+ checkpoint_callback = pl.callbacks.ModelCheckpoint(
+ filepath=args.output_dir, prefix="checkpoint", monitor="val_loss", mode="min", save_top_k=1
+ )
+ if early_stopping_callback:
+ extra_callbacks.append(early_stopping_callback)
+ if logging_callback is None:
+ logging_callback = LoggingCallback()
+
+ train_params = {}
+
+ # TODO: remove with PyTorch 1.6 since pl uses native amp
+ if args.fp16:
+ train_params["precision"] = 16
+ train_params["amp_level"] = args.fp16_opt_level
+
+ if args.gpus > 1:
+ train_params["distributed_backend"] = "ddp"
+
+ train_params["accumulate_grad_batches"] = args.accumulate_grad_batches
+ train_params["accelerator"] = extra_train_kwargs.get("accelerator", None)
+ train_params["profiler"] = extra_train_kwargs.get("profiler", None)
+
+ trainer = pl.Trainer.from_argparse_args(
+ args,
+ weights_summary=None,
+ callbacks=[logging_callback] + extra_callbacks,
+ logger=logger,
+ checkpoint_callback=checkpoint_callback,
+ **train_params,
+ )
+
+ if args.do_train:
+ trainer.fit(model)
+
+ return trainer
diff --git a/examples/seq2seq/make_student.py b/examples/research_projects/seq2seq-distillation/make_student.py
similarity index 100%
rename from examples/seq2seq/make_student.py
rename to examples/research_projects/seq2seq-distillation/make_student.py
diff --git a/examples/seq2seq/precomputed_pseudo_labels.md b/examples/research_projects/seq2seq-distillation/precomputed_pseudo_labels.md
similarity index 100%
rename from examples/seq2seq/precomputed_pseudo_labels.md
rename to examples/research_projects/seq2seq-distillation/precomputed_pseudo_labels.md
diff --git a/examples/research_projects/seq2seq-distillation/requirements.txt b/examples/research_projects/seq2seq-distillation/requirements.txt
new file mode 100644
index 000000000..0cd973d4d
--- /dev/null
+++ b/examples/research_projects/seq2seq-distillation/requirements.txt
@@ -0,0 +1,20 @@
+tensorboard
+scikit-learn
+psutil
+sacrebleu
+rouge-score
+tensorflow_datasets
+pytorch-lightning==1.0.4
+matplotlib
+git-python==1.0.3
+faiss-cpu
+streamlit
+elasticsearch
+nltk
+pandas
+datasets >= 1.1.3
+fire
+pytest
+conllu
+sentencepiece != 0.1.92
+protobuf
diff --git a/examples/research_projects/seq2seq-distillation/run_eval.py b/examples/research_projects/seq2seq-distillation/run_eval.py
new file mode 100755
index 000000000..910d430bd
--- /dev/null
+++ b/examples/research_projects/seq2seq-distillation/run_eval.py
@@ -0,0 +1,163 @@
+#!/usr/bin/env python
+
+import argparse
+import datetime
+import json
+import time
+import warnings
+from logging import getLogger
+from pathlib import Path
+from typing import Dict, List
+
+import torch
+from tqdm import tqdm
+
+from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+from utils import calculate_bleu, calculate_rouge, chunks, parse_numeric_n_bool_cl_kwargs, use_task_specific_params
+
+
+logger = getLogger(__name__)
+
+
+DEFAULT_DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+
+
+def generate_summaries_or_translations(
+ examples: List[str],
+ out_file: str,
+ model_name: str,
+ batch_size: int = 8,
+ device: str = DEFAULT_DEVICE,
+ fp16=False,
+ task="summarization",
+ prefix=None,
+ **generate_kwargs,
+) -> Dict:
+    """Save model.generate results to `out_file`, and return how long it took."""
+ fout = Path(out_file).open("w", encoding="utf-8")
+ model_name = str(model_name)
+ model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
+ if fp16:
+ model = model.half()
+
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ logger.info(f"Inferred tokenizer type: {tokenizer.__class__}") # if this is wrong, check config.model_type.
+
+ start_time = time.time()
+ # update config with task specific params
+ use_task_specific_params(model, task)
+    if prefix is None:
+        prefix = getattr(model.config, "prefix", "") or ""
+ for examples_chunk in tqdm(list(chunks(examples, batch_size))):
+ examples_chunk = [prefix + text for text in examples_chunk]
+ batch = tokenizer(examples_chunk, return_tensors="pt", truncation=True, padding="longest").to(device)
+ summaries = model.generate(
+ input_ids=batch.input_ids,
+ attention_mask=batch.attention_mask,
+ **generate_kwargs,
+ )
+ dec = tokenizer.batch_decode(summaries, skip_special_tokens=True, clean_up_tokenization_spaces=False)
+ for hypothesis in dec:
+ fout.write(hypothesis + "\n")
+ fout.flush()
+ fout.close()
+ runtime = int(time.time() - start_time) # seconds
+ n_obs = len(examples)
+ return dict(n_obs=n_obs, runtime=runtime, seconds_per_sample=round(runtime / n_obs, 4))
+
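+# illustrative programmatic use (model name and paths are placeholders; extra kwargs go to model.generate):
+#     generate_summaries_or_translations(["Some article text."], "preds.txt", "sshleifer/bart-tiny-random", num_beams=2)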
+
+def datetime_now():
+ return datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
+
+
+def run_generate(verbose=True):
+ """
+
+ Takes input text, generates output, and then using reference calculates the BLEU scores.
+
+ The results are saved to a file and returned to the caller, and printed out unless ``verbose=False`` is passed.
+
+ Args:
+ verbose (:obj:`bool`, `optional`, defaults to :obj:`True`): print results to stdout
+
+    Returns:
+        a dict of scores, e.g. ``{'bleu': 39.6501, 'n_obs': 2000, 'runtime': 186, 'seconds_per_sample': 0.093}``
+        (an empty dict if no ``--reference_path`` is given); with ``--dump-args`` the custom generate params,
+        e.g. ``{'num_beams': 5, 'length_penalty': 0.8}``, are merged into it
+ """
+
+ parser = argparse.ArgumentParser()
+ parser.add_argument("model_name", type=str, help="like facebook/bart-large-cnn,t5-base, etc.")
+ parser.add_argument("input_path", type=str, help="like cnn_dm/test.source")
+ parser.add_argument("save_path", type=str, help="where to save summaries")
+ parser.add_argument("--reference_path", type=str, required=False, help="like cnn_dm/test.target")
+ parser.add_argument("--score_path", type=str, required=False, default="metrics.json", help="where to save metrics")
+ parser.add_argument("--device", type=str, required=False, default=DEFAULT_DEVICE, help="cuda, cuda:1, cpu etc.")
+ parser.add_argument(
+        "--prefix", type=str, required=False, default=None, help="will be added to the beginning of src examples"
+ )
+ parser.add_argument("--task", type=str, default="summarization", help="used for task_specific_params + metrics")
+ parser.add_argument("--bs", type=int, default=8, required=False, help="batch size")
+ parser.add_argument(
+ "--n_obs", type=int, default=-1, required=False, help="How many observations. Defaults to all."
+ )
+ parser.add_argument("--fp16", action="store_true")
+ parser.add_argument("--dump-args", action="store_true", help="print the custom hparams with the results")
+ parser.add_argument(
+ "--info",
+ nargs="?",
+ type=str,
+ const=datetime_now(),
+ help="use in conjunction w/ --dump-args to print with the results whatever other info you'd like, e.g. lang=en-ru. If no value is passed, the current datetime string will be used.",
+ )
+ # Unspecified args like --num_beams=2 --decoder_start_token_id=4 are passed to model.generate
+ args, rest = parser.parse_known_args()
+ parsed_args = parse_numeric_n_bool_cl_kwargs(rest)
+ if parsed_args and verbose:
+ print(f"parsed the following generate kwargs: {parsed_args}")
+ examples = [" " + x.rstrip() if "t5" in args.model_name else x.rstrip() for x in open(args.input_path).readlines()]
+ if args.n_obs > 0:
+ examples = examples[: args.n_obs]
+ Path(args.save_path).parent.mkdir(exist_ok=True)
+ if args.reference_path is None and Path(args.score_path).exists():
+ warnings.warn(f"score_path {args.score_path} will be overwritten unless you type ctrl-c.")
+ runtime_metrics = generate_summaries_or_translations(
+ examples,
+ args.save_path,
+ args.model_name,
+ batch_size=args.bs,
+ device=args.device,
+ fp16=args.fp16,
+ task=args.task,
+ prefix=args.prefix,
+ **parsed_args,
+ )
+
+ if args.reference_path is None:
+ return {}
+
+ # Compute scores
+ score_fn = calculate_bleu if "translation" in args.task else calculate_rouge
+ output_lns = [x.rstrip() for x in open(args.save_path).readlines()]
+ reference_lns = [x.rstrip() for x in open(args.reference_path).readlines()][: len(output_lns)]
+ scores: dict = score_fn(output_lns, reference_lns)
+ scores.update(runtime_metrics)
+
+ if args.dump_args:
+ scores.update(parsed_args)
+ if args.info:
+ scores["info"] = args.info
+
+ if verbose:
+ print(scores)
+
+ if args.score_path is not None:
+ json.dump(scores, open(args.score_path, "w"))
+
+ return scores
+
+
+if __name__ == "__main__":
+ # Usage for MT:
+ # python run_eval.py MODEL_NAME $DATA_DIR/test.source $save_dir/test_translations.txt --reference_path $DATA_DIR/test.target --score_path $save_dir/test_bleu.json --task translation $@
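+    # Usage for summarization (illustrative; a distilbart checkpoint as an example):
+    # python run_eval.py sshleifer/distilbart-cnn-12-6 $DATA_DIR/test.source dbart_preds/test_generations.txt --reference_path $DATA_DIR/test.target --score_path dbart_preds/test_rouge.json --task summarization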
+ run_generate(verbose=True)
diff --git a/examples/research_projects/seq2seq-distillation/sentence_splitter.py b/examples/research_projects/seq2seq-distillation/sentence_splitter.py
new file mode 100644
index 000000000..c5acec739
--- /dev/null
+++ b/examples/research_projects/seq2seq-distillation/sentence_splitter.py
@@ -0,0 +1,22 @@
+import re
+
+from filelock import FileLock
+
+
+try:
+ import nltk
+
+ NLTK_AVAILABLE = True
+except (ImportError, ModuleNotFoundError):
+ NLTK_AVAILABLE = False
+
+if NLTK_AVAILABLE:
+ with FileLock(".lock") as lock:
+ nltk.download("punkt", quiet=True)
+
+
+def add_newline_to_end_of_each_sentence(x: str) -> str:
+ """This was added to get rougeLsum scores matching published rougeL scores for BART and PEGASUS."""
+    x = re.sub("<n>", "", x)  # remove the pegasus newline token "<n>"
+ assert NLTK_AVAILABLE, "nltk must be installed to separate newlines between sentences. (pip install nltk)"
+ return "\n".join(nltk.sent_tokenize(x))
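+# e.g. add_newline_to_end_of_each_sentence("Hello there. How are you?") -> "Hello there.\nHow are you?"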
diff --git a/examples/research_projects/seq2seq-distillation/train_distilbart_cnn.sh b/examples/research_projects/seq2seq-distillation/train_distilbart_cnn.sh
new file mode 100755
index 000000000..6a1bafbdc
--- /dev/null
+++ b/examples/research_projects/seq2seq-distillation/train_distilbart_cnn.sh
@@ -0,0 +1,24 @@
+#!/usr/bin/env bash
+export PYTHONPATH="../":"${PYTHONPATH}"
+
+export BS=32
+export GAS=1
+
+python finetune.py \
+ --learning_rate=3e-5 \
+ --fp16 \
+ --gpus 1 \
+ --do_train \
+ --do_predict \
+ --val_check_interval 0.25 \
+ --n_val 500 \
+ --num_train_epochs 2 \
+ --freeze_encoder --freeze_embeds --data_dir cnn_dm \
+ --max_target_length 142 --val_max_target_length=142 \
+ --train_batch_size=$BS --eval_batch_size=$BS --gradient_accumulation_steps=$GAS \
+ --model_name_or_path sshleifer/student_cnn_12_6 \
+ --tokenizer_name facebook/bart-large \
+ --warmup_steps 500 \
+ --output_dir distilbart-cnn-12-6 \
+ "$@"
+
diff --git a/examples/seq2seq/train_distilbart_xsum.sh b/examples/research_projects/seq2seq-distillation/train_distilbart_xsum.sh
similarity index 100%
rename from examples/seq2seq/train_distilbart_xsum.sh
rename to examples/research_projects/seq2seq-distillation/train_distilbart_xsum.sh
diff --git a/examples/research_projects/seq2seq-distillation/train_mbart_cc25_enro.sh b/examples/research_projects/seq2seq-distillation/train_mbart_cc25_enro.sh
new file mode 100755
index 000000000..54e7935ff
--- /dev/null
+++ b/examples/research_projects/seq2seq-distillation/train_mbart_cc25_enro.sh
@@ -0,0 +1,18 @@
+#!/usr/bin/env bash
+export PYTHONPATH="../":"${PYTHONPATH}"
+
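+# expects ENRO_DIR, MAX_LEN and BS to be exported in the environment before running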
+python finetune.py \
+ --learning_rate=3e-5 \
+ --fp16 \
+ --do_train \
+ --val_check_interval=0.25 \
+ --adam_eps 1e-06 \
+ --num_train_epochs 6 --src_lang en_XX --tgt_lang ro_RO \
+ --data_dir $ENRO_DIR \
+ --max_source_length $MAX_LEN --max_target_length $MAX_LEN --val_max_target_length $MAX_LEN --test_max_target_length $MAX_LEN \
+ --train_batch_size=$BS --eval_batch_size=$BS \
+ --task translation \
+ --warmup_steps 500 \
+ --freeze_embeds \
+ --model_name_or_path=facebook/mbart-large-cc25 \
+ "$@"
diff --git a/examples/research_projects/seq2seq-distillation/utils.py b/examples/research_projects/seq2seq-distillation/utils.py
new file mode 100644
index 000000000..b6994a183
--- /dev/null
+++ b/examples/research_projects/seq2seq-distillation/utils.py
@@ -0,0 +1,645 @@
+import itertools
+import json
+import linecache
+import math
+import os
+import pickle
+import socket
+from logging import getLogger
+from pathlib import Path
+from typing import Callable, Dict, Iterable, List, Tuple, Union
+
+import git
+import numpy as np
+import torch
+import torch.distributed as dist
+from rouge_score import rouge_scorer, scoring
+from sacrebleu import corpus_bleu
+from torch import nn
+from torch.utils.data import Dataset, Sampler
+
+from sentence_splitter import add_newline_to_end_of_each_sentence
+from transformers import BartTokenizer, EvalPrediction, PreTrainedTokenizer, T5Tokenizer
+from transformers.file_utils import cached_property
+from transformers.models.bart.modeling_bart import shift_tokens_right
+
+
+try:
+ from fairseq.data.data_utils import batch_by_size
+
+ FAIRSEQ_AVAILABLE = True
+except (ImportError, ModuleNotFoundError):
+ FAIRSEQ_AVAILABLE = False
+
+
+def label_smoothed_nll_loss(lprobs, target, epsilon, ignore_index=-100):
+ """From fairseq"""
+ if target.dim() == lprobs.dim() - 1:
+ target = target.unsqueeze(-1)
+ nll_loss = -lprobs.gather(dim=-1, index=target)
+ smooth_loss = -lprobs.sum(dim=-1, keepdim=True)
+ if ignore_index is not None:
+ pad_mask = target.eq(ignore_index)
+ nll_loss.masked_fill_(pad_mask, 0.0)
+ smooth_loss.masked_fill_(pad_mask, 0.0)
+ else:
+ nll_loss = nll_loss.squeeze(-1)
+ smooth_loss = smooth_loss.squeeze(-1)
+
+ nll_loss = nll_loss.sum() # mean()? Scared to break other math.
+ smooth_loss = smooth_loss.sum()
+ eps_i = epsilon / lprobs.size(-1)
+ loss = (1.0 - epsilon) * nll_loss + eps_i * smooth_loss
+ return loss, nll_loss
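+# note: losses are summed over tokens; with epsilon=0 the smoothing term vanishes and this reduces to plain NLL loss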
+
+
+def lmap(f: Callable, x: Iterable) -> List:
+ """list(map(f, x))"""
+ return list(map(f, x))
+
+
+def calculate_bleu(output_lns, refs_lns, **kwargs) -> dict:
+ """Uses sacrebleu's corpus_bleu implementation."""
+ return {"bleu": round(corpus_bleu(output_lns, [refs_lns], **kwargs).score, 4)}
+
+
+def build_compute_metrics_fn(task_name: str, tokenizer: PreTrainedTokenizer) -> Callable[[EvalPrediction], Dict]:
+ def non_pad_len(tokens: np.ndarray) -> int:
+ return np.count_nonzero(tokens != tokenizer.pad_token_id)
+
+ def decode_pred(pred: EvalPrediction) -> Tuple[List[str], List[str]]:
+ pred_str = tokenizer.batch_decode(pred.predictions, skip_special_tokens=True)
+ label_str = tokenizer.batch_decode(pred.label_ids, skip_special_tokens=True)
+ pred_str = lmap(str.strip, pred_str)
+ label_str = lmap(str.strip, label_str)
+ return pred_str, label_str
+
+ def summarization_metrics(pred: EvalPrediction) -> Dict:
+ pred_str, label_str = decode_pred(pred)
+ rouge: Dict = calculate_rouge(pred_str, label_str)
+ summ_len = np.round(np.mean(lmap(non_pad_len, pred.predictions)), 1)
+ rouge.update({"gen_len": summ_len})
+ return rouge
+
+ def translation_metrics(pred: EvalPrediction) -> Dict:
+ pred_str, label_str = decode_pred(pred)
+ bleu: Dict = calculate_bleu(pred_str, label_str)
+ gen_len = np.round(np.mean(lmap(non_pad_len, pred.predictions)), 1)
+ bleu.update({"gen_len": gen_len})
+ return bleu
+
+ compute_metrics_fn = summarization_metrics if "summarization" in task_name else translation_metrics
+ return compute_metrics_fn
+
+
+def trim_batch(
+ input_ids,
+ pad_token_id,
+ attention_mask=None,
+):
+ """Remove columns that are populated exclusively by pad_token_id"""
+ keep_column_mask = input_ids.ne(pad_token_id).any(dim=0)
+ if attention_mask is None:
+ return input_ids[:, keep_column_mask]
+ else:
+ return (input_ids[:, keep_column_mask], attention_mask[:, keep_column_mask])
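+# e.g. with pad_token_id=1, input_ids [[5, 6, 1], [7, 1, 1]] keeps only the first two columns -> [[5, 6], [7, 1]]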
+
+
+class AbstractSeq2SeqDataset(Dataset):
+ def __init__(
+ self,
+ tokenizer,
+ data_dir,
+ max_source_length,
+ max_target_length,
+ type_path="train",
+ n_obs=None,
+ prefix="",
+ **dataset_kwargs
+ ):
+ super().__init__()
+ self.src_file = Path(data_dir).joinpath(type_path + ".source")
+ self.tgt_file = Path(data_dir).joinpath(type_path + ".target")
+ self.len_file = Path(data_dir).joinpath(type_path + ".len")
+ if os.path.exists(self.len_file):
+ self.src_lens = pickle_load(self.len_file)
+ self.used_char_len = False
+ else:
+ self.src_lens = self.get_char_lens(self.src_file)
+ self.used_char_len = True
+ self.max_source_length = max_source_length
+ self.max_target_length = max_target_length
+ assert min(self.src_lens) > 0, f"found empty line in {self.src_file}"
+ self.tokenizer = tokenizer
+ self.prefix = prefix if prefix is not None else ""
+
+ if n_obs is not None:
+ self.src_lens = self.src_lens[:n_obs]
+ self.pad_token_id = self.tokenizer.pad_token_id
+ self.dataset_kwargs = dataset_kwargs
+ dataset_kwargs.update({"add_prefix_space": True} if isinstance(self.tokenizer, BartTokenizer) else {})
+
+ def __len__(self):
+ return len(self.src_lens)
+
+ @staticmethod
+ def get_char_lens(data_file):
+ return [len(x) for x in Path(data_file).open().readlines()]
+
+ @cached_property
+ def tgt_lens(self):
+ """Length in characters of target documents"""
+ return self.get_char_lens(self.tgt_file)
+
+ def make_sortish_sampler(self, batch_size, distributed=False, shuffle=True, **kwargs):
+ if distributed:
+ return DistributedSortishSampler(self, batch_size, shuffle=shuffle, **kwargs)
+ else:
+ return SortishSampler(self.src_lens, batch_size, shuffle=shuffle)
+
+ def make_dynamic_sampler(self, max_tokens_per_batch=1024, **kwargs):
+ assert FAIRSEQ_AVAILABLE, "Dynamic batch size requires `pip install fairseq`"
+ assert not self.used_char_len, "You must call python make_len_file.py before calling make_dynamic_sampler"
+ sorted_indices = list(self.make_sortish_sampler(1024, shuffle=False))
+
+ def num_tokens_in_example(i):
+ return min(self.src_lens[i], self.max_target_length)
+
+ # call fairseq cython function
+ batch_sampler: List[List[int]] = batch_by_size(
+ sorted_indices,
+ num_tokens_fn=num_tokens_in_example,
+ max_tokens=max_tokens_per_batch,
+ required_batch_size_multiple=64,
+ )
+ shuffled_batches = [batch_sampler[i] for i in np.random.permutation(range(len(batch_sampler)))]
+ # move the largest batch to the front to OOM quickly (uses an approximation for padding)
+ approximate_toks_per_batch = [max(self.src_lens[i] for i in batch) * len(batch) for batch in shuffled_batches]
+ largest_batch_idx = np.argmax(approximate_toks_per_batch)
+ shuffled_batches[0], shuffled_batches[largest_batch_idx] = (
+ shuffled_batches[largest_batch_idx],
+ shuffled_batches[0],
+ )
+ return shuffled_batches
+
+ def __getitem__(self, item):
+ raise NotImplementedError("You must implement this")
+
+ def collate_fn(self, batch):
+ raise NotImplementedError("You must implement this")
+
+
+class LegacySeq2SeqDataset(AbstractSeq2SeqDataset):
+ def __getitem__(self, index) -> Dict[str, torch.Tensor]:
+ """Call tokenizer on src and tgt_lines"""
+ index = index + 1 # linecache starts at 1
+ source_line = self.prefix + linecache.getline(str(self.src_file), index).rstrip("\n")
+ tgt_line = linecache.getline(str(self.tgt_file), index).rstrip("\n")
+ assert source_line, f"empty source line for index {index}"
+ assert tgt_line, f"empty tgt line for index {index}"
+ source_inputs = self.encode_line(self.tokenizer, source_line, self.max_source_length)
+ target_inputs = self.encode_line(self.tokenizer, tgt_line, self.max_target_length)
+
+ source_ids = source_inputs["input_ids"].squeeze()
+ target_ids = target_inputs["input_ids"].squeeze()
+ src_mask = source_inputs["attention_mask"].squeeze()
+ return {
+ "input_ids": source_ids,
+ "attention_mask": src_mask,
+ "labels": target_ids,
+ }
+
+ def encode_line(self, tokenizer, line, max_length, pad_to_max_length=True, return_tensors="pt"):
+ """Only used by LegacyDataset"""
+ return tokenizer(
+ [line],
+ max_length=max_length,
+ padding="max_length" if pad_to_max_length else None,
+ truncation=True,
+ return_tensors=return_tensors,
+ **self.dataset_kwargs,
+ )
+
+ def collate_fn(self, batch) -> Dict[str, torch.Tensor]:
+ input_ids = torch.stack([x["input_ids"] for x in batch])
+ masks = torch.stack([x["attention_mask"] for x in batch])
+ target_ids = torch.stack([x["labels"] for x in batch])
+ pad_token_id = self.pad_token_id
+ y = trim_batch(target_ids, pad_token_id)
+ source_ids, source_mask = trim_batch(input_ids, pad_token_id, attention_mask=masks)
+ batch = {
+ "input_ids": source_ids,
+ "attention_mask": source_mask,
+ "labels": y,
+ }
+ return batch
+
+
+class Seq2SeqDataset(AbstractSeq2SeqDataset):
+ """A dataset that calls prepare_seq2seq_batch."""
+
+ def __getitem__(self, index) -> Dict[str, str]:
+ index = index + 1 # linecache starts at 1
+ source_line = self.prefix + linecache.getline(str(self.src_file), index).rstrip("\n")
+ tgt_line = linecache.getline(str(self.tgt_file), index).rstrip("\n")
+ assert source_line, f"empty source line for index {index}"
+ assert tgt_line, f"empty tgt line for index {index}"
+ return {"tgt_texts": tgt_line, "src_texts": source_line, "id": index - 1}
+
+ def collate_fn(self, batch) -> Dict[str, torch.Tensor]:
+ """Call prepare_seq2seq_batch."""
+ batch_encoding: Dict[str, torch.Tensor] = self.tokenizer.prepare_seq2seq_batch(
+ [x["src_texts"] for x in batch],
+ tgt_texts=[x["tgt_texts"] for x in batch],
+ max_length=self.max_source_length,
+ max_target_length=self.max_target_length,
+ return_tensors="pt",
+ **self.dataset_kwargs,
+ ).data
+ batch_encoding["ids"] = torch.tensor([x["id"] for x in batch])
+ return batch_encoding
+
+
+class Seq2SeqDataCollator:
+ def __init__(self, tokenizer, data_args, tpu_num_cores=None):
+ self.tokenizer = tokenizer
+ self.pad_token_id = tokenizer.pad_token_id
+ assert (
+ self.pad_token_id is not None
+ ), f"pad_token_id is not defined for ({self.tokenizer.__class__.__name__}), it must be defined."
+ self.data_args = data_args
+ self.tpu_num_cores = tpu_num_cores
+ self.dataset_kwargs = {"add_prefix_space": True} if isinstance(tokenizer, BartTokenizer) else {}
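+        # BART uses a GPT-2 style byte-level BPE; add_prefix_space makes words at the start
+        # of a line tokenize the same way as words preceded by a space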
+ if data_args.src_lang is not None:
+ self.dataset_kwargs["src_lang"] = data_args.src_lang
+ if data_args.tgt_lang is not None:
+ self.dataset_kwargs["tgt_lang"] = data_args.tgt_lang
+
+ def __call__(self, batch) -> Dict[str, torch.Tensor]:
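+        # Seq2SeqDataset yields raw text dicts for tokenizers with prepare_seq2seq_batch;
+        # LegacySeq2SeqDataset yields pre-tokenized tensors, which are simply stacked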
+ if hasattr(self.tokenizer, "prepare_seq2seq_batch"):
+ batch = self._encode(batch)
+ input_ids, attention_mask, labels = (
+ batch["input_ids"],
+ batch["attention_mask"],
+ batch["labels"],
+ )
+ else:
+ input_ids = torch.stack([x["input_ids"] for x in batch])
+ attention_mask = torch.stack([x["attention_mask"] for x in batch])
+ labels = torch.stack([x["labels"] for x in batch])
+
+ labels = trim_batch(labels, self.pad_token_id)
+ input_ids, attention_mask = trim_batch(input_ids, self.pad_token_id, attention_mask=attention_mask)
+
+ if isinstance(self.tokenizer, T5Tokenizer):
+ decoder_input_ids = self._shift_right_t5(labels)
+ else:
+ decoder_input_ids = shift_tokens_right(labels, self.pad_token_id)
+
+ batch = {
+ "input_ids": input_ids,
+ "attention_mask": attention_mask,
+ "decoder_input_ids": decoder_input_ids,
+ "labels": labels,
+ }
+ return batch
+
+ def _shift_right_t5(self, input_ids):
+ # shift inputs to the right
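+        # e.g. [x1, x2, x3, x4] -> [pad_token_id, x1, x2, x3]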
+ shifted_input_ids = input_ids.new_zeros(input_ids.shape)
+ shifted_input_ids[..., 1:] = input_ids[..., :-1].clone()
+ shifted_input_ids[..., 0] = self.pad_token_id
+ return shifted_input_ids
+
+ def _encode(self, batch) -> Dict[str, torch.Tensor]:
+ batch_encoding = self.tokenizer.prepare_seq2seq_batch(
+ [x["src_texts"] for x in batch],
+ tgt_texts=[x["tgt_texts"] for x in batch],
+ max_length=self.data_args.max_source_length,
+ max_target_length=self.data_args.max_target_length,
+ padding="max_length" if self.tpu_num_cores is not None else "longest", # TPU hack
+ return_tensors="pt",
+ **self.dataset_kwargs,
+ )
+ return batch_encoding.data
+
+
+class SortishSampler(Sampler):
+ "Go through the text data by order of src length with a bit of randomness. From fastai repo."
+
+ def __init__(self, data, batch_size, shuffle=True):
+ self.data, self.bs, self.shuffle = data, batch_size, shuffle
+
+ def __len__(self) -> int:
+ return len(self.data)
+
+ def __iter__(self):
+ return iter(sortish_sampler_indices(self.data, self.bs, shuffle=self.shuffle))
+
+
+def sortish_sampler_indices(data: List, bs: int, shuffle=True) -> np.array:
+ "Go through the text data by order of src length with a bit of randomness. From fastai repo."
+ if not shuffle:
+ return np.argsort(np.array(data) * -1)
+
+ def key_fn(i):
+ return data[i]
+
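+    # shuffle once globally, then sort by length inside "mega-chunks" of 50 * bs examples
+    # so each batch contains similarly sized items while keeping some randomness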
+ idxs = np.random.permutation(len(data))
+ sz = bs * 50
+ ck_idx = [idxs[i : i + sz] for i in range(0, len(idxs), sz)]
+ sort_idx = np.concatenate([sorted(s, key=key_fn, reverse=True) for s in ck_idx])
+ sz = bs
+ ck_idx = [sort_idx[i : i + sz] for i in range(0, len(sort_idx), sz)]
+ max_ck = np.argmax([key_fn(ck[0]) for ck in ck_idx]) # find the chunk with the largest key,
+ ck_idx[0], ck_idx[max_ck] = ck_idx[max_ck], ck_idx[0] # then make sure it goes first.
+    sort_idx = np.concatenate(np.random.permutation(ck_idx[1:])) if len(ck_idx) > 1 else np.array([], dtype=int)
+ sort_idx = np.concatenate((ck_idx[0], sort_idx))
+ return sort_idx
+
+
+class DistributedSortishSampler(Sampler):
+ """Copied from torch DistributedSampler"""
+
+ def __init__(self, dataset, batch_size, num_replicas=None, rank=None, add_extra_examples=True, shuffle=True):
+ if num_replicas is None:
+ if not dist.is_available():
+ raise RuntimeError("Requires distributed package to be available")
+ num_replicas = dist.get_world_size()
+ if rank is None:
+ if not dist.is_available():
+ raise RuntimeError("Requires distributed package to be available")
+ rank = dist.get_rank()
+ self.dataset = dataset
+ self.num_replicas = num_replicas
+ self.rank = rank
+ self.epoch = 0
+ if add_extra_examples:
+ self.num_samples = int(math.ceil(len(self.dataset) * 1.0 / self.num_replicas))
+ self.total_size = self.num_samples * self.num_replicas
+ else:
+ self.total_size = len(dataset)
+ self.num_samples = len(self.available_indices)
+ self.batch_size = batch_size
+ self.add_extra_examples = add_extra_examples
+ self.shuffle = shuffle
+
+ def __iter__(self) -> Iterable:
+ g = torch.Generator()
+ g.manual_seed(self.epoch)
+
+ sortish_data = [self.dataset.src_lens[i] for i in self.available_indices]
+ sortish_indices = sortish_sampler_indices(sortish_data, self.batch_size, shuffle=self.shuffle)
+ indices = [self.available_indices[i] for i in sortish_indices]
+ assert len(indices) == self.num_samples
+ return iter(indices)
+
+ @cached_property
+ def available_indices(self) -> np.array:
+ indices = list(range(len(self.dataset)))
+ # add extra samples to make it evenly divisible
+ indices += indices[: (self.total_size - len(indices))]
+ assert len(indices) == self.total_size
+ # subsample
+ available_indices = indices[self.rank : self.total_size : self.num_replicas]
+ return available_indices
+
+ def __len__(self):
+ return self.num_samples
+
+ def set_epoch(self, epoch):
+ self.epoch = epoch
+
+
+logger = getLogger(__name__)
+
+
+def use_task_specific_params(model, task):
+ """Update config with summarization specific params."""
+ task_specific_params = model.config.task_specific_params
+
+ if task_specific_params is not None:
+ pars = task_specific_params.get(task, {})
+ logger.info(f"using task specific params for {task}: {pars}")
+ model.config.update(pars)
+
+
+def pickle_load(path):
+ """pickle.load(path)"""
+ with open(path, "rb") as f:
+ return pickle.load(f)
+
+
+def pickle_save(obj, path):
+ """pickle.dump(obj, path)"""
+ with open(path, "wb") as f:
+ return pickle.dump(obj, f)
+
+
+def flatten_list(summary_ids: List[List]):
+ return [x for x in itertools.chain.from_iterable(summary_ids)]
+
+
+def save_git_info(folder_path: str) -> None:
+ """Save git information to output_dir/git_log.json"""
+ repo_infos = get_git_info()
+ save_json(repo_infos, os.path.join(folder_path, "git_log.json"))
+
+
+def save_json(content, path, indent=4, **json_dump_kwargs):
+ with open(path, "w") as f:
+ json.dump(content, f, indent=indent, **json_dump_kwargs)
+
+
+def load_json(path):
+ with open(path) as f:
+ return json.load(f)
+
+
+def get_git_info():
+ try:
+ repo = git.Repo(search_parent_directories=True)
+ repo_infos = {
+ "repo_id": str(repo),
+ "repo_sha": str(repo.head.object.hexsha),
+ "repo_branch": str(repo.active_branch),
+ "hostname": str(socket.gethostname()),
+ }
+ return repo_infos
+ except TypeError:
+ return {
+ "repo_id": None,
+ "repo_sha": None,
+ "repo_branch": None,
+ "hostname": None,
+ }
+
+
+ROUGE_KEYS = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
+
+
+def extract_rouge_mid_statistics(dct):
+ new_dict = {}
+ for k1, v1 in dct.items():
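+        # rouge_scorer's BootstrapAggregator reports low/mid/high confidence bounds; "mid" is the point estimate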
+ mid = v1.mid
+ new_dict[k1] = {stat: round(getattr(mid, stat), 4) for stat in ["precision", "recall", "fmeasure"]}
+ return new_dict
+
+
+def calculate_rouge(
+ pred_lns: List[str],
+ tgt_lns: List[str],
+ use_stemmer=True,
+ rouge_keys=ROUGE_KEYS,
+ return_precision_and_recall=False,
+ bootstrap_aggregation=True,
+ newline_sep=True,
+) -> Dict:
+ """Calculate rouge using rouge_scorer package.
+
+ Args:
+ pred_lns: list of summaries generated by model
+ tgt_lns: list of groundtruth summaries (e.g. contents of val.target)
+ use_stemmer: Bool indicating whether Porter stemmer should be used to
+ strip word suffixes to improve matching.
+ rouge_keys: which metrics to compute, defaults to rouge1, rouge2, rougeL, rougeLsum
+ return_precision_and_recall: (False) whether to also return precision and recall.
+        bootstrap_aggregation: whether to do the typical bootstrap resampling of scores. Defaults to True; if False
+            this function returns a ``collections.defaultdict`` mapping each metric to the list of per-observation scores.
+        newline_sep: (default=True) whether to add a newline between sentences. This is essential for calculating
+            rougeLsum on multi-sentence summaries (CNN/DM dataset).
+
+    Returns:
+        Dict[score: value] if bootstrap_aggregation else defaultdict(list) keyed by rouge_keys
+
+ """
+ scorer = rouge_scorer.RougeScorer(rouge_keys, use_stemmer=use_stemmer)
+ aggregator = scoring.BootstrapAggregator()
+    for tgt, pred in zip(tgt_lns, pred_lns):
+        # rougeLsum expects "\n" separated sentences within a summary
+        if newline_sep:
+            pred = add_newline_to_end_of_each_sentence(pred)
+            tgt = add_newline_to_end_of_each_sentence(tgt)
+        scores = scorer.score(tgt, pred)  # rouge_scorer expects (target, prediction)
+ aggregator.add_scores(scores)
+
+ if bootstrap_aggregation:
+ result = aggregator.aggregate()
+ if return_precision_and_recall:
+ return extract_rouge_mid_statistics(result) # here we return dict
+ else:
+ return {k: round(v.mid.fmeasure * 100, 4) for k, v in result.items()}
+
+ else:
+ return aggregator._scores # here we return defaultdict(list)
+
+
+# Utilities for freezing parameters and checking whether they are frozen
+
+
+def freeze_params(model: nn.Module):
+ """Set requires_grad=False for each of model.parameters()"""
+ for par in model.parameters():
+ par.requires_grad = False
+
+
+def freeze_embeds(model):
+ """Freeze token embeddings and positional embeddings for bart, just token embeddings for t5."""
+ model_type = model.config.model_type
+
+ if model_type == "t5":
+ freeze_params(model.shared)
+ for d in [model.encoder, model.decoder]:
+ freeze_params(d.embed_tokens)
+ elif model_type == "fsmt":
+ for d in [model.model.encoder, model.model.decoder]:
+ freeze_params(d.embed_positions)
+ freeze_params(d.embed_tokens)
+ else:
+ freeze_params(model.model.shared)
+ for d in [model.model.encoder, model.model.decoder]:
+ freeze_params(d.embed_positions)
+ freeze_params(d.embed_tokens)
+
+
+def grad_status(model: nn.Module) -> Iterable:
+ return (par.requires_grad for par in model.parameters())
+
+
+def any_requires_grad(model: nn.Module) -> bool:
+ return any(grad_status(model))
+
+
+def assert_all_frozen(model):
+ model_grads: List[bool] = list(grad_status(model))
+ n_require_grad = sum(lmap(int, model_grads))
+ npars = len(model_grads)
+ assert not any(model_grads), f"{n_require_grad/npars:.1%} of {npars} weights require grad"
+
+
+def assert_not_all_frozen(model):
+ model_grads: List[bool] = list(grad_status(model))
+ npars = len(model_grads)
+ assert any(model_grads), f"none of {npars} weights require grad"
+
+
+def parse_numeric_n_bool_cl_kwargs(unparsed_args: List[str]) -> Dict[str, Union[int, float, bool]]:
+ """
+ Parse an argv list of unspecified command line args to a dict.
+ Assumes all values are either numeric or boolean in the form of true/false.
+ """
+ result = {}
+ assert len(unparsed_args) % 2 == 0, f"got odd number of unparsed args: {unparsed_args}"
+ num_pairs = len(unparsed_args) // 2
+ for pair_num in range(num_pairs):
+ i = 2 * pair_num
+ assert unparsed_args[i].startswith("--")
+ if unparsed_args[i + 1].lower() == "true":
+ value = True
+ elif unparsed_args[i + 1].lower() == "false":
+ value = False
+ else:
+ try:
+ value = int(unparsed_args[i + 1])
+ except ValueError:
+ value = float(unparsed_args[i + 1]) # this can raise another informative ValueError
+
+ result[unparsed_args[i][2:]] = value
+ return result
+
+
+def write_txt_file(ordered_tgt, path):
+    with Path(path).open("w") as f:
+        for ln in ordered_tgt:
+            f.write(ln + "\n")
+
+
+def chunks(lst, n):
+ """Yield successive n-sized chunks from lst."""
+ for i in range(0, len(lst), n):
+ yield lst[i : i + n]
+
+
+def check_output_dir(args, expected_items=0):
+ """
+ Checks whether to bail out if output_dir already exists and has more than expected_items in it
+
+    `args`: needs to have the following attributes:
+ - output_dir
+ - do_train
+ - overwrite_output_dir
+
+ `expected_items`: normally 0 (default) - i.e. empty dir, but in some cases a few files are expected (e.g. recovery from OOM)
+ """
+ if (
+ os.path.exists(args.output_dir)
+ and len(os.listdir(args.output_dir)) > expected_items
+ and args.do_train
+ and not args.overwrite_output_dir
+ ):
+ raise ValueError(
+ f"Output directory ({args.output_dir}) already exists and "
+ f"has {len(os.listdir(args.output_dir))} items in it (expected {expected_items} items). "
+ "Use --overwrite_output_dir to overcome."
+ )
diff --git a/examples/seq2seq/README.md b/examples/seq2seq/README.md
index 6ac3cf8d7..c2860ae1a 100644
--- a/examples/seq2seq/README.md
+++ b/examples/seq2seq/README.md
@@ -1,3 +1,19 @@
+
+
## Sequence to Sequence Training and Evaluation
This directory contains examples for finetuning and evaluating transformers on summarization and translation tasks.
@@ -112,101 +128,6 @@ Datasets: `LegacySeq2SeqDataset` will be used for all tokenizers without a `prep
Future work/help wanted: A new dataset to support multilingual tasks.
-### Finetuning Scripts
-All finetuning bash scripts call finetune.py (or distillation.py) with reasonable command line arguments. They usually require extra command line arguments to work.
-
-To see all the possible command line options, run:
-
-```bash
-./finetune.py --help
-```
-
-### Finetuning Training Params
-
-To override the pretrained model's training params, you can pass them to `./finetune.sh`:
-
-```bash
-./finetune.sh \
- [...]
- --encoder_layerdrop 0.1 \
- --decoder_layerdrop 0.1 \
- --dropout 0.1 \
- --attention_dropout 0.1 \
-```
-
-### Summarization Finetuning
-Run/modify `finetune.sh`
-
-The following command should work on a 16GB GPU:
-```bash
-./finetune.sh \
- --data_dir $XSUM_DIR \
- --train_batch_size=1 \
- --eval_batch_size=1 \
- --output_dir=xsum_results \
- --num_train_epochs 6 \
- --model_name_or_path facebook/bart-large
-```
-
-There is a starter finetuning script for pegasus at `finetune_pegasus_xsum.sh`.
-
-### Translation Finetuning
-
-First, follow the wmt_en_ro download instructions.
-Then you can finetune mbart_cc25 on english-romanian with the following command.
-**Recommendation:** Read and potentially modify the fairly opinionated defaults in `train_mbart_cc25_enro.sh` script before running it.
-
-Best performing command:
-```bash
-# optionally
-export ENRO_DIR='wmt_en_ro' # Download instructions above
-# export WANDB_PROJECT="MT" # optional
-export MAX_LEN=128
-export BS=4
-./train_mbart_cc25_enro.sh --output_dir enro_finetune_baseline --label_smoothing 0.1 --fp16_opt_level=O1 --logger_name wandb --sortish_sampler
-```
-This should take < 6h/epoch on a 16GB v100 and achieve test BLEU above 26
-To get results in line with fairseq, you need to do some postprocessing. (see `romanian_postprocessing.md`)
-
-MultiGPU command
-(using 8 GPUS as an example)
-```bash
-export ENRO_DIR='wmt_en_ro' # Download instructions above
- # export WANDB_PROJECT="MT" # optional
-export MAX_LEN=128
-export BS=4
-./train_mbart_cc25_enro.sh --output_dir enro_finetune_baseline --gpus 8 --logger_name wandb
-```
-### Finetuning Outputs
-As you train, `output_dir` will be filled with files, that look kind of like this (comments are mine).
-Some of them are metrics, some of them are checkpoints, some of them are metadata. Here is a quick tour:
-
-```bash
-output_dir
-├── best_tfmr # this is a huggingface checkpoint generated by save_pretrained. It is the same model as the PL .ckpt file below
-│  ├── config.json
-│  ├── merges.txt
-│  ├── pytorch_model.bin
-│  ├── special_tokens_map.json
-│  ├── tokenizer_config.json
-│  └── vocab.json
-├── git_log.json # repo, branch, and commit hash
-├── val_avg_rouge2=0.1984-step_count=11.ckpt # this is a pytorch lightning checkpoint associated with the best val score. (it will be called BLEU for MT)
-├── metrics.json # new validation metrics will continually be appended to this
-├── student # this is a huggingface checkpoint generated by SummarizationDistiller. It is the student before it gets finetuned.
-│  ├── config.json
-│  └── pytorch_model.bin
-├── test_generations.txt
-# ^^ are the summaries or translations produced by your best checkpoint on the test data. Populated when training is done
-├── test_results.txt # a convenience file with the test set metrics. This data is also in metrics.json['test']
-├── hparams.pkl # the command line args passed after some light preprocessing. Should be saved fairly quickly.
-```
-After training, you can recover the best checkpoint by running
-```python
-from transformers import AutoModelForSeq2SeqLM
-model = AutoModelForSeq2SeqLM.from_pretrained(f'{output_dir}/best_tfmr')
-```
-
### Fine-tuning using Seq2SeqTrainer
To use `Seq2SeqTrainer` for fine-tuning you should use the `finetune_trainer.py` script. It subclasses `Trainer` to extend it for seq2seq training. Except the `Trainer`-related `TrainingArguments`, it shares the same argument names as that of `finetune.py` file. One notable difference is that calculating generative metrics (BLEU, ROUGE) is optional and is controlled using the `--predict_with_generate` argument.
@@ -242,190 +163,6 @@ The following command fine-tunes `sshleifer/student_marian_en_ro_6_3` on TPU V3-
./builtin_trainer/train_distil_marian_enro_tpu.sh
```
-# DistilBART
-
-This section describes all code and artifacts from our [Paper](http://arxiv.org/abs/2010.13002)
-
-![DBART](https://huggingface.co/front/thumbnails/distilbart_large.png)
-
-+ For the CNN/DailyMail dataset, (relatively longer, more extractive summaries), we found a simple technique that works, which we call "Shrink and Fine-tune", or SFT.
-you just copy alternating layers from `facebook/bart-large-cnn` and fine-tune more on the cnn/dm data. `sshleifer/distill-pegasus-cnn-16-4`, `sshleifer/distilbart-cnn-12-6` and all other checkpoints under `sshleifer` that start with `distilbart-cnn` were trained this way.
-+ For the XSUM dataset, training on pseudo-labels worked best for Pegasus (`sshleifer/distill-pegasus-16-4`), while training with KD worked best for `distilbart-xsum-12-6`
-+ For `sshleifer/dbart-xsum-12-3`
-+ We ran 100s experiments, and didn't want to document 100s of commands. If you want a command to replicate a figure from the paper that is not documented below, feel free to ask on the [forums](https://discuss.huggingface.co/t/seq2seq-distillation-methodology-questions/1270) and tag `@sshleifer`.
-+ You can see the performance tradeoffs of model sizes [here](https://docs.google.com/spreadsheets/d/1EkhDMwVO02m8jCD1cG3RoFPLicpcL1GQHTQjfvDYgIM/edit#gid=0).
-and more granular timing results [here](https://docs.google.com/spreadsheets/d/1EkhDMwVO02m8jCD1cG3RoFPLicpcL1GQHTQjfvDYgIM/edit#gid=1753259047&range=B2:I23).
-
-### Evaluation
-
-use [run_distributed_eval](./run_distributed_eval.py), with the following convenient alias
-```bash
-deval () {
- proc=$1
- m=$2
- dd=$3
- sd=$4
- shift
- shift
- shift
- shift
- python -m torch.distributed.launch --nproc_per_node=$proc run_distributed_eval.py \
- --model_name $m --save_dir $sd --data_dir $dd $@
-}
-```
-On a 1 GPU system, here are four commands (that assume `xsum`, `cnn_dm` are downloaded, cmd-F for those links in this file).
-
-`distilBART`:
-```bash
-deval 1 sshleifer/distilbart-xsum-12-3 xsum dbart_12_3_xsum_eval --fp16 # --help for more choices.
-deval 1 sshleifer/distilbart-cnn_dm-12-6 cnn_dm dbart_12_6_cnn_eval --fp16
-```
-
-`distill-pegasus`:
-```bash
-deval 1 sshleifer/distill-pegasus-cnn-16-4 cnn_dm dpx_cnn_eval
-deval 1 sshleifer/distill-pegasus-xsum-16-4 xsum dpx_xsum_eval
-```
-
-### Distillation
-+ For all of the following commands, you can get roughly equivalent result and faster run times by passing `--num_beams=4`. That's not what we did for the paper.
-+ Besides the KD section, you can also run commands with the built-in transformers trainer. See, for example, [builtin_trainer/train_distilbart_cnn.sh](./builtin_trainer/train_distilbart_cnn.sh).
-+ Large performance deviations (> 5X slower or more than 0.5 Rouge-2 worse), should be reported.
-+ Multi-gpu (controlled with `--gpus` should work, but might require more epochs).
-
-#### Recommended Workflow
-+ Get your dataset in the right format. (see 6 files above).
-+ Find a teacher model [Pegasus](https://huggingface.co/models?search=pegasus) (slower, better ROUGE) or `facebook/bart-large-xsum`/`facebook/bart-large-cnn` (faster, slightly lower.).
-Choose the checkpoint where the corresponding dataset is most similar (or identical to) your dataset.
-+ Follow the sections in order below. You can stop after SFT if you are satisfied, or move on to pseudo-labeling if you want more performance.
-+ student size: If you want a close to free 50% speedup, cut the decoder in half. If you want a larger speedup, cut it in 4.
-+ If your SFT run starts at a validation ROUGE-2 that is more than 10 pts below the teacher's validation ROUGE-2, you have a bug. Switching to a more expensive technique will not help. Try setting a breakpoint and looking at generation and truncation defaults/hyper-parameters, and share your experience on the forums!
-
-
-#### Initialization
-We use [make_student.py](./make_student.py) to copy alternating layers from the teacher, and save the resulting model to disk
-```bash
-python make_student.py facebook/bart-large-xsum --save_path dbart_xsum_12_3 -e 12 -d 3
-```
-or for `pegasus-xsum`
-```bash
-python make_student.py google/pegasus-xsum --save_path dpx_xsum_16_4 --e 16 --d 4
-```
-we now have an initialized student saved to `dbart_xsum_12_3`, which we will use for the following commands.
-+ Extension: To replicate more complicated initialize experiments in section 6.1, or try your own. Use the `create_student_by_copying_alternating_layers` function.
-
-#### Pegasus
-+ The following commands are written for BART and will require, at minimum, the following modifications
-+ reduce batch size, and increase gradient accumulation steps so that the product `gpus * batch size * gradient_accumulation_steps = 256`. We used `--learning-rate` = 1e-4 * gradient accumulation steps.
-+ don't use fp16
-+ `--tokenizer_name google/pegasus-large`
-
-### SFT (No Teacher Distillation)
-You don't need `distillation.py`, you can just run:
-
-```bash
-python finetune.py \
- --data_dir xsum \
- --freeze_encoder --freeze_embeds \
- --learning_rate=3e-4 \
- --do_train \
- --do_predict \
- --fp16 --fp16_opt_level=O1 \
- --val_check_interval 0.1 --n_val 1000 --eval_beams 2 --length_penalty=0.5 \
- --max_target_length=60 --val_max_target_length=60 --test_max_target_length=100 \
- --model_name_or_path dbart_xsum_12_3 \
- --train_batch_size=64 --eval_batch_size=64 \
- --sortish_sampler \
- --num_train_epochs=6 \
- --warmup_steps 500 \
- --output_dir distilbart_xsum_sft_12_3 --gpus 1
-```
-
-+ Note: The command that produced `sshleifer/distilbart-cnn-12-6` is at [train_distilbart_cnn.sh](./[train_distilbart_cnn.sh)
-
-```bash
-./train_distilbart_cnn.sh
-```
-
-+ Tip: You can get the same simple distillation logic by using `distillation.py --no_teacher ` followed by identical arguments as the ones in `train_distilbart_cnn.sh`.
-If you are using `wandb` and comparing the two distillation methods, using this entry point will make your logs consistent,
-because you will have the same hyper-parameters logged in every run.
-
-### Pseudo-Labeling
-+ You don't need `distillation.py`.
-+ Instructions to generate pseudo-labels and use pre-computed pseudo-labels can be found [here](./precomputed_pseudo_labels.md).
-Simply run `finetune.py` with one of those pseudo-label datasets as `--data_dir` (`DATA`, below).
-
-```bash
-python finetune.py \
- --teacher facebook/bart-large-xsum --data_dir DATA \
- --freeze_encoder --freeze_embeds \
- --learning_rate=3e-4 \
- --do_train \
- --do_predict \
- --fp16 --fp16_opt_level=O1 \
- --val_check_interval 0.1 --n_val 1000 --eval_beams 2 --length_penalty=0.5 \
- --max_target_length=60 --val_max_target_length=60 --test_max_target_length=100 \
- --model_name_or_path dbart_xsum_12_3 \
- --train_batch_size=32 --eval_batch_size=32 \
- --sortish_sampler \
- --num_train_epochs=5 \
- --warmup_steps 500 \
- --output_dir dbart_xsum_12_3_PL --gpus 1 --logger_name wandb
-```
-
-
-
-To combine datasets, as in Section 6.2, try something like:
-```bash
-curl -S https://cdn-datasets.huggingface.co/pseudo/xsum/bart_xsum_pl.tgz | tar -xvz -C .
-curl -S https://cdn-datasets.huggingface.co/pseudo/xsum/pegasus_xsum.tgz | tar -xvz -C .
-curl -S https://cdn-datasets.huggingface.co/summarization/xsum.tar.gz | tar -xvz -C .
-mkdir all_pl
-cat bart_xsum_pl/train.source pegasus_xsum/train.source xsum/train.source > all_pl/train.source
-cat bart_xsum_pl/train.target pegasus_xsum/train.target xsum/train.target > all_pl/train.target
-cp xsum/val* all_pl
-cp xsum/test* all_pl
-```
-then use `all_pl` as DATA in the command above.
-
-#### Direct Knowledge Distillation (KD)
-+ In this method, we use try to enforce that the student and teacher produce similar encoder_outputs, logits, and hidden_states using `SummarizationDistiller`.
-+ This method was used for `sshleifer/distilbart-xsum-12-6`, `6-6`, and `9-6` checkpoints were produced.
-+ You must use [`distillation.py`](./distillation.py). Note that this command initializes the student for you.
-
-The command that produced `sshleifer/distilbart-xsum-12-6` is at [./train_distilbart_xsum.sh](train_distilbart_xsum.sh)
-```bash
-./train_distilbart_xsum.sh --logger_name wandb --gpus 1
-```
-
-+ Expected ROUGE-2 between 21.3 and 21.6, run time ~13H.
-+ direct KD + Pegasus is VERY slow and works best with `--supervise_forward --normalize_hidden`.
-
-
-
-### Citation
-
-```bibtex
-@misc{shleifer2020pretrained,
- title={Pre-trained Summarization Distillation},
- author={Sam Shleifer and Alexander M. Rush},
- year={2020},
- eprint={2010.13002},
- archivePrefix={arXiv},
- primaryClass={cs.CL}
-}
-@article{Wolf2019HuggingFacesTS,
- title={HuggingFace's Transformers: State-of-the-art Natural Language Processing},
- author={Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin Lhoest and Alexander M. Rush},
- journal={ArXiv},
- year={2019},
- volume={abs/1910.03771}
-}
-```
-
-This is the end of the distillation section, the rest of this doc pertains to general seq2seq commands.
-
## Evaluation Commands
To create summaries for each article in dataset, we use `run_eval.py`, here are a few commands that run eval for different tasks and models.
diff --git a/examples/seq2seq/builtin_trainer/finetune.sh b/examples/seq2seq/builtin_trainer/finetune.sh
deleted file mode 100644
index 8c2d13d5a..000000000
--- a/examples/seq2seq/builtin_trainer/finetune.sh
+++ /dev/null
@@ -1,10 +0,0 @@
-# the proper usage is documented in the README, you need to specify data_dir, output_dir and model_name_or_path
-# run ./builtin_trainer/finetune.sh --help to see all the possible options
-python finetune_trainer.py \
- --learning_rate=3e-5 \
- --fp16 \
- --do_train --do_eval --do_predict \
- --evaluation_strategy steps \
- --predict_with_generate \
- --n_val 1000 \
- "$@"
diff --git a/examples/seq2seq/builtin_trainer/finetune_tpu.sh b/examples/seq2seq/builtin_trainer/finetune_tpu.sh
deleted file mode 100644
index 577f99fc7..000000000
--- a/examples/seq2seq/builtin_trainer/finetune_tpu.sh
+++ /dev/null
@@ -1,12 +0,0 @@
-export TPU_NUM_CORES=8
-
-# the proper usage is documented in the README, you need to specify data_dir, output_dir and model_name_or_path
-# run ./builtin_trainer/finetune_tpu.sh --help to see all the possible options
-python xla_spawn.py --num_cores $TPU_NUM_CORES \
- finetune_trainer.py \
- --learning_rate=3e-5 \
- --do_train --do_eval \
- --evaluation_strategy steps \
- --prediction_loss_only \
- --n_val 1000 \
- "$@"
diff --git a/examples/seq2seq/builtin_trainer/train_distilbart_cnn.sh b/examples/seq2seq/builtin_trainer/train_distilbart_cnn.sh
deleted file mode 100644
index d29f6b803..000000000
--- a/examples/seq2seq/builtin_trainer/train_distilbart_cnn.sh
+++ /dev/null
@@ -1,25 +0,0 @@
-export WANDB_PROJECT=distilbart-trainer
-export BS=32
-export m=sshleifer/student_cnn_12_6
-export tok=facebook/bart-large
-export MAX_TGT_LEN=142
-
-python finetune_trainer.py \
- --model_name_or_path $m --tokenizer_name $tok \
- --data_dir cnn_dm \
- --output_dir distilbart-cnn-12-6 --overwrite_output_dir \
- --learning_rate=3e-5 \
- --warmup_steps 500 --sortish_sampler \
- --fp16 \
- --n_val 500 \
- --gradient_accumulation_steps=1 \
- --per_device_train_batch_size=$BS --per_device_eval_batch_size=$BS \
- --freeze_encoder --freeze_embeds \
- --num_train_epochs=2 \
- --save_steps 3000 --eval_steps 3000 \
- --logging_first_step \
- --max_target_length 56 --val_max_target_length $MAX_TGT_LEN --test_max_target_length $MAX_TGT_LEN \
- --do_train --do_eval --do_predict \
- --evaluation_strategy steps \
- --predict_with_generate --sortish_sampler \
- "$@"
diff --git a/examples/seq2seq/builtin_trainer/train_mbart_cc25_enro.sh b/examples/seq2seq/builtin_trainer/train_mbart_cc25_enro.sh
deleted file mode 100644
index 3dc711f20..000000000
--- a/examples/seq2seq/builtin_trainer/train_mbart_cc25_enro.sh
+++ /dev/null
@@ -1,22 +0,0 @@
-python finetune_trainer.py \
- --model_name_or_path=facebook/mbart-large-cc25 \
- --data_dir $ENRO_DIR \
- --output_dir mbart_cc25_enro --overwrite_output_dir \
- --learning_rate=3e-5 \
- --warmup_steps 500 \
- --fp16 \
- --label_smoothing 0.1 \
- --adam_eps 1e-06 \
- --src_lang en_XX --tgt_lang ro_RO \
- --freeze_embeds \
- --per_device_train_batch_size=4 --per_device_eval_batch_size=4 \
- --max_source_length 128 --max_target_length 128 \
- --val_max_target_length 128 --test_max_target_length 128 \
- --sortish_sampler \
- --num_train_epochs 6 \
- --save_steps 25000 --eval_steps 25000 --logging_steps 1000 \
- --do_train --do_eval --do_predict \
- --evaluation_strategy steps \
- --predict_with_generate --logging_first_step \
- --task translation \
- "$@"
diff --git a/examples/seq2seq/convert_model_to_fp16.py b/examples/seq2seq/convert_model_to_fp16.py
index e853d0393..7fffbde79 100755
--- a/examples/seq2seq/convert_model_to_fp16.py
+++ b/examples/seq2seq/convert_model_to_fp16.py
@@ -1,4 +1,17 @@
#!/usr/bin/env python
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
from typing import Union
diff --git a/examples/seq2seq/download_wmt.py b/examples/seq2seq/download_wmt.py
index bef04726c..c52c0c7b4 100755
--- a/examples/seq2seq/download_wmt.py
+++ b/examples/seq2seq/download_wmt.py
@@ -1,4 +1,17 @@
#!/usr/bin/env python
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
from pathlib import Path
diff --git a/examples/seq2seq/finetune.sh b/examples/seq2seq/finetune.sh
old mode 100755
new mode 100644
index 683c2d775..1f518835d
--- a/examples/seq2seq/finetune.sh
+++ b/examples/seq2seq/finetune.sh
@@ -1,11 +1,24 @@
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
# the proper usage is documented in the README, you need to specify data_dir, output_dir and model_name_or_path
# run ./finetune.sh --help to see all the possible options
-python finetune.py \
+python finetune_trainer.py \
--learning_rate=3e-5 \
--fp16 \
- --gpus 1 \
- --do_train \
- --do_predict \
+ --do_train --do_eval --do_predict \
+ --evaluation_strategy steps \
+ --predict_with_generate \
--n_val 1000 \
- --val_check_interval 0.1 \
"$@"
diff --git a/examples/seq2seq/finetune_tpu.sh b/examples/seq2seq/finetune_tpu.sh
new file mode 100644
index 000000000..68cf0d773
--- /dev/null
+++ b/examples/seq2seq/finetune_tpu.sh
@@ -0,0 +1,26 @@
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+export TPU_NUM_CORES=8
+
+# the proper usage is documented in the README, you need to specify data_dir, output_dir and model_name_or_path
+# run ./finetune_tpu.sh --help to see all the possible options
+python xla_spawn.py --num_cores $TPU_NUM_CORES \
+ finetune_trainer.py \
+ --learning_rate=3e-5 \
+ --do_train --do_eval \
+ --evaluation_strategy steps \
+ --prediction_loss_only \
+ --n_val 1000 \
+ "$@"
diff --git a/examples/seq2seq/finetune_trainer.py b/examples/seq2seq/finetune_trainer.py
index 473de9273..22ec2d7ae 100755
--- a/examples/seq2seq/finetune_trainer.py
+++ b/examples/seq2seq/finetune_trainer.py
@@ -1,4 +1,17 @@
#!/usr/bin/env python
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
import logging
import os
diff --git a/examples/seq2seq/minify_dataset.py b/examples/seq2seq/minify_dataset.py
index c441db565..8fd03196a 100755
--- a/examples/seq2seq/minify_dataset.py
+++ b/examples/seq2seq/minify_dataset.py
@@ -1,4 +1,17 @@
#!/usr/bin/env python
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
from pathlib import Path
diff --git a/examples/seq2seq/pack_dataset.py b/examples/seq2seq/pack_dataset.py
index 11351b75a..6f226de2c 100755
--- a/examples/seq2seq/pack_dataset.py
+++ b/examples/seq2seq/pack_dataset.py
@@ -1,5 +1,17 @@
#!/usr/bin/env python
-
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
"""Fill examples with bitext up to max_tokens without breaking up examples.
[['I went', 'yo fui'],
['to the store', 'a la tienda']
diff --git a/examples/seq2seq/requirements.txt b/examples/seq2seq/requirements.txt
new file mode 100644
index 000000000..e40aef179
--- /dev/null
+++ b/examples/seq2seq/requirements.txt
@@ -0,0 +1,20 @@
+tensorboard
+scikit-learn
+seqeval
+psutil
+sacrebleu
+rouge-score
+tensorflow_datasets
+matplotlib
+git-python==1.0.3
+faiss-cpu
+streamlit
+elasticsearch
+nltk
+pandas
+datasets >= 1.1.3
+fire
+pytest
+conllu
+sentencepiece != 0.1.92
+protobuf
diff --git a/examples/seq2seq/rouge_cli.py b/examples/seq2seq/rouge_cli.py
index 6a54a72eb..cd636bbcd 100644
--- a/examples/seq2seq/rouge_cli.py
+++ b/examples/seq2seq/rouge_cli.py
@@ -1,3 +1,17 @@
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
import fire
from utils import calculate_rouge, save_json
diff --git a/examples/seq2seq/run_distributed_eval.py b/examples/seq2seq/run_distributed_eval.py
index 5b9f66fd9..90a348078 100755
--- a/examples/seq2seq/run_distributed_eval.py
+++ b/examples/seq2seq/run_distributed_eval.py
@@ -1,4 +1,17 @@
#!/usr/bin/env python
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
import argparse
import shutil
diff --git a/examples/seq2seq/run_eval.py b/examples/seq2seq/run_eval.py
index 910d430bd..739efc531 100755
--- a/examples/seq2seq/run_eval.py
+++ b/examples/seq2seq/run_eval.py
@@ -1,4 +1,17 @@
#!/usr/bin/env python
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
import argparse
import datetime
diff --git a/examples/seq2seq/run_eval_search.py b/examples/seq2seq/run_eval_search.py
index 8052b921d..f7b3bda0f 100755
--- a/examples/seq2seq/run_eval_search.py
+++ b/examples/seq2seq/run_eval_search.py
@@ -1,4 +1,17 @@
#!/usr/bin/env python
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
import argparse
import itertools
diff --git a/examples/seq2seq/save_len_file.py b/examples/seq2seq/save_len_file.py
index 15413cab1..9e73b59e7 100755
--- a/examples/seq2seq/save_len_file.py
+++ b/examples/seq2seq/save_len_file.py
@@ -1,4 +1,17 @@
#!/usr/bin/env python
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
import fire
from torch.utils.data import DataLoader
diff --git a/examples/seq2seq/save_randomly_initialized_model.py b/examples/seq2seq/save_randomly_initialized_model.py
index c4a18afb7..1b7b17fde 100755
--- a/examples/seq2seq/save_randomly_initialized_model.py
+++ b/examples/seq2seq/save_randomly_initialized_model.py
@@ -1,4 +1,17 @@
#!/usr/bin/env python
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
import fire
diff --git a/examples/seq2seq/sentence_splitter.py b/examples/seq2seq/sentence_splitter.py
index c5acec739..54a07967e 100644
--- a/examples/seq2seq/sentence_splitter.py
+++ b/examples/seq2seq/sentence_splitter.py
@@ -1,3 +1,16 @@
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
import re
from filelock import FileLock
diff --git a/examples/seq2seq/seq2seq_trainer.py b/examples/seq2seq/seq2seq_trainer.py
index 684407a6b..468669258 100644
--- a/examples/seq2seq/seq2seq_trainer.py
+++ b/examples/seq2seq/seq2seq_trainer.py
@@ -1,3 +1,17 @@
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
from typing import Any, Dict, List, Optional, Tuple, Union
import torch
diff --git a/examples/seq2seq/seq2seq_training_args.py b/examples/seq2seq/seq2seq_training_args.py
index 0bd486026..6ec220181 100644
--- a/examples/seq2seq/seq2seq_training_args.py
+++ b/examples/seq2seq/seq2seq_training_args.py
@@ -1,3 +1,17 @@
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
import logging
from dataclasses import dataclass, field
from typing import Optional
diff --git a/examples/seq2seq/test_calculate_rouge.py b/examples/seq2seq/test_calculate_rouge.py
index bfa35adf1..bd1dd57a2 100644
--- a/examples/seq2seq/test_calculate_rouge.py
+++ b/examples/seq2seq/test_calculate_rouge.py
@@ -1,3 +1,17 @@
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
from collections import defaultdict
from pathlib import Path
diff --git a/examples/seq2seq/test_data/test_data b/examples/seq2seq/test_data/test_data
new file mode 120000
index 000000000..9eee112ad
--- /dev/null
+++ b/examples/seq2seq/test_data/test_data
@@ -0,0 +1 @@
+seq2seq/test_data
\ No newline at end of file
diff --git a/examples/seq2seq/test_data/wmt_en_ro/train.len b/examples/seq2seq/test_data/wmt_en_ro/train.len
index 2632a33e8..33ce003c8 100644
Binary files a/examples/seq2seq/test_data/wmt_en_ro/train.len and b/examples/seq2seq/test_data/wmt_en_ro/train.len differ
diff --git a/examples/seq2seq/test_data/wmt_en_ro/val.len b/examples/seq2seq/test_data/wmt_en_ro/val.len
index fdf8fa353..897314a96 100644
Binary files a/examples/seq2seq/test_data/wmt_en_ro/val.len and b/examples/seq2seq/test_data/wmt_en_ro/val.len differ
diff --git a/examples/seq2seq/test_datasets.py b/examples/seq2seq/test_datasets.py
index 61e5d7aa5..7ef962b9c 100644
--- a/examples/seq2seq/test_datasets.py
+++ b/examples/seq2seq/test_datasets.py
@@ -1,3 +1,17 @@
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
import os
from pathlib import Path
@@ -8,7 +22,6 @@ from torch.utils.data import DataLoader
from pack_dataset import pack_data_dir
from parameterized import parameterized
from save_len_file import save_len_file
-from test_seq2seq_examples import ARTICLES, BART_TINY, MARIAN_TINY, MBART_TINY, SUMMARIES, T5_TINY, make_test_data_dir
from transformers import AutoTokenizer
from transformers.models.bart.modeling_bart import shift_tokens_right
from transformers.testing_utils import TestCasePlus, require_torch_non_multi_gpu_but_fix_me, slow
@@ -17,6 +30,24 @@ from utils import FAIRSEQ_AVAILABLE, DistributedSortishSampler, LegacySeq2SeqDat
BERT_BASE_CASED = "bert-base-cased"
PEGASUS_XSUM = "google/pegasus-xsum"
+ARTICLES = [" Sam ate lunch today.", "Sams lunch ingredients."]
+SUMMARIES = ["A very interesting story about what I ate for lunch.", "Avocado, celery, turkey, coffee"]
+T5_TINY = "patrickvonplaten/t5-tiny-random"
+BART_TINY = "sshleifer/bart-tiny-random"
+MBART_TINY = "sshleifer/tiny-mbart"
+MARIAN_TINY = "sshleifer/tiny-marian-en-de"
+
+
+def _dump_articles(path: Path, articles: list):
+ content = "\n".join(articles)
+ Path(path).open("w").writelines(content)
+
+
+def make_test_data_dir(tmp_dir):
+ for split in ["train", "val", "test"]:
+ _dump_articles(os.path.join(tmp_dir, f"{split}.source"), ARTICLES)
+ _dump_articles(os.path.join(tmp_dir, f"{split}.target"), SUMMARIES)
+ return tmp_dir
class TestAll(TestCasePlus):
diff --git a/examples/seq2seq/test_finetune_trainer.py b/examples/seq2seq/test_finetune_trainer.py
index a57809558..92bad878a 100644
--- a/examples/seq2seq/test_finetune_trainer.py
+++ b/examples/seq2seq/test_finetune_trainer.py
@@ -1,3 +1,17 @@
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
import os
import sys
from unittest.mock import patch
@@ -17,11 +31,11 @@ from transformers.trainer_utils import set_seed
from .finetune_trainer import Seq2SeqTrainingArguments, main
from .seq2seq_trainer import Seq2SeqTrainer
-from .test_seq2seq_examples import MBART_TINY
set_seed(42)
MARIAN_MODEL = "sshleifer/student_marian_en_ro_6_1"
+MBART_TINY = "sshleifer/tiny-mbart"
class TestFinetuneTrainer(TestCasePlus):
diff --git a/examples/seq2seq/test_seq2seq_examples.py b/examples/seq2seq/test_seq2seq_examples.py
index 4793aeba7..ecc0524c3 100644
--- a/examples/seq2seq/test_seq2seq_examples.py
+++ b/examples/seq2seq/test_seq2seq_examples.py
@@ -1,96 +1,32 @@
-import argparse
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
import logging
import os
import sys
-import tempfile
from pathlib import Path
from unittest.mock import patch
-import pytest
-import pytorch_lightning as pl
-import torch
-
-import lightning_base
-from convert_pl_checkpoint_to_hf import convert_pl_to_hf
-from distillation import distill_main
-from finetune import SummarizationModule, main
from parameterized import parameterized
-from run_eval import generate_summaries_or_translations, run_generate
+from run_eval import run_generate
from run_eval_search import run_search
-from transformers import AutoConfig, AutoModelForSeq2SeqLM
-from transformers.hf_api import HfApi
-from transformers.testing_utils import CaptureStderr, CaptureStdout, TestCasePlus, require_torch_gpu, slow
-from utils import ROUGE_KEYS, label_smoothed_nll_loss, lmap, load_json
+from transformers.testing_utils import CaptureStdout, TestCasePlus, slow
+from utils import ROUGE_KEYS
logging.basicConfig(level=logging.DEBUG)
-
logger = logging.getLogger()
-CUDA_AVAILABLE = torch.cuda.is_available()
-CHEAP_ARGS = {
- "max_tokens_per_batch": None,
- "supervise_forward": True,
- "normalize_hidden": True,
- "label_smoothing": 0.2,
- "eval_max_gen_length": None,
- "eval_beams": 1,
- "val_metric": "loss",
- "save_top_k": 1,
- "adafactor": True,
- "early_stopping_patience": 2,
- "logger_name": "default",
- "length_penalty": 0.5,
- "cache_dir": "",
- "task": "summarization",
- "num_workers": 2,
- "alpha_hid": 0,
- "freeze_embeds": True,
- "enc_only": False,
- "tgt_suffix": "",
- "resume_from_checkpoint": None,
- "sortish_sampler": True,
- "student_decoder_layers": 1,
- "val_check_interval": 1.0,
- "output_dir": "",
- "fp16": False, # TODO(SS): set this to CUDA_AVAILABLE if ci installs apex or start using native amp
- "no_teacher": False,
- "fp16_opt_level": "O1",
- "gpus": 1 if CUDA_AVAILABLE else 0,
- "n_tpu_cores": 0,
- "max_grad_norm": 1.0,
- "do_train": True,
- "do_predict": True,
- "accumulate_grad_batches": 1,
- "server_ip": "",
- "server_port": "",
- "seed": 42,
- "model_name_or_path": "sshleifer/bart-tiny-random",
- "config_name": "",
- "tokenizer_name": "facebook/bart-large",
- "do_lower_case": False,
- "learning_rate": 0.3,
- "lr_scheduler": "linear",
- "weight_decay": 0.0,
- "adam_epsilon": 1e-08,
- "warmup_steps": 0,
- "max_epochs": 1,
- "train_batch_size": 2,
- "eval_batch_size": 2,
- "max_source_length": 12,
- "max_target_length": 12,
- "val_max_target_length": 12,
- "test_max_target_length": 12,
- "fast_dev_run": False,
- "no_cache": False,
- "n_train": -1,
- "n_val": -1,
- "n_test": -1,
- "student_encoder_layers": 1,
- "freeze_encoder": False,
- "auto_scale_batch_size": False,
- "overwrite_output_dir": False,
- "student": None,
-}
def _dump_articles(path: Path, articles: list):
@@ -98,187 +34,15 @@ def _dump_articles(path: Path, articles: list):
Path(path).open("w").writelines(content)
-ARTICLES = [" Sam ate lunch today.", "Sams lunch ingredients."]
-SUMMARIES = ["A very interesting story about what I ate for lunch.", "Avocado, celery, turkey, coffee"]
T5_TINY = "patrickvonplaten/t5-tiny-random"
-T5_TINIER = "sshleifer/t5-tinier-random"
BART_TINY = "sshleifer/bart-tiny-random"
MBART_TINY = "sshleifer/tiny-mbart"
-MARIAN_TINY = "sshleifer/tiny-marian-en-de"
-FSMT_TINY = "stas/tiny-wmt19-en-de"
-
stream_handler = logging.StreamHandler(sys.stdout)
logger.addHandler(stream_handler)
logging.disable(logging.CRITICAL) # remove noisy download output from tracebacks
-def make_test_data_dir(tmp_dir):
- for split in ["train", "val", "test"]:
- _dump_articles(os.path.join(tmp_dir, f"{split}.source"), ARTICLES)
- _dump_articles(os.path.join(tmp_dir, f"{split}.target"), SUMMARIES)
- return tmp_dir
-
-
-class TestSummarizationDistiller(TestCasePlus):
- @classmethod
- def setUpClass(cls):
- logging.disable(logging.CRITICAL) # remove noisy download output from tracebacks
- return cls
-
- @slow
- @require_torch_gpu
- def test_hub_configs(self):
- """I put require_torch_gpu cause I only want this to run with self-scheduled."""
-
- model_list = HfApi().model_list()
- org = "sshleifer"
- model_ids = [x.modelId for x in model_list if x.modelId.startswith(org)]
- allowed_to_be_broken = ["sshleifer/blenderbot-3B", "sshleifer/blenderbot-90M"]
- failures = []
- for m in model_ids:
- if m in allowed_to_be_broken:
- continue
- try:
- AutoConfig.from_pretrained(m)
- except Exception:
- failures.append(m)
- assert not failures, f"The following models could not be loaded through AutoConfig: {failures}"
-
- def test_distill_no_teacher(self):
- updates = dict(student_encoder_layers=2, student_decoder_layers=1, no_teacher=True)
- self._test_distiller_cli(updates)
-
- def test_distill_checkpointing_with_teacher(self):
- updates = dict(
- student_encoder_layers=2,
- student_decoder_layers=1,
- max_epochs=4,
- val_check_interval=0.25,
- alpha_hid=2.0,
- model_name_or_path="IGNORE_THIS_IT_DOESNT_GET_USED",
- )
- model = self._test_distiller_cli(updates, check_contents=False)
-
- ckpts = list(Path(model.output_dir).glob("*.ckpt"))
- self.assertEqual(1, len(ckpts))
- transformer_ckpts = list(Path(model.output_dir).glob("**/*.bin"))
- self.assertEqual(len(transformer_ckpts), 2)
- examples = lmap(str.strip, Path(model.hparams.data_dir).joinpath("test.source").open().readlines())
- out_path = tempfile.mktemp() # XXX: not being cleaned up
- generate_summaries_or_translations(examples, out_path, str(model.output_dir / "best_tfmr"))
- self.assertTrue(Path(out_path).exists())
-
- out_path_new = self.get_auto_remove_tmp_dir()
- convert_pl_to_hf(ckpts[0], transformer_ckpts[0].parent, out_path_new)
- assert os.path.exists(os.path.join(out_path_new, "pytorch_model.bin"))
-
- def test_loss_fn(self):
- model = AutoModelForSeq2SeqLM.from_pretrained(BART_TINY)
- input_ids, mask = model.dummy_inputs["input_ids"], model.dummy_inputs["attention_mask"]
- target_ids = torch.tensor([[0, 4, 8, 2], [0, 8, 2, 1]], dtype=torch.long, device=model.device)
- decoder_input_ids = target_ids[:, :-1].contiguous() # Why this line?
- lm_labels = target_ids[:, 1:].clone() # why clone?
- model_computed_loss = model(
- input_ids, attention_mask=mask, decoder_input_ids=decoder_input_ids, labels=lm_labels, use_cache=False
- ).loss
-
- logits = model(input_ids, attention_mask=mask, decoder_input_ids=decoder_input_ids, use_cache=False).logits
-
- lprobs = torch.nn.functional.log_softmax(logits, dim=-1)
- smoothed_loss, nll_loss = label_smoothed_nll_loss(
- lprobs, lm_labels, 0.1, ignore_index=model.config.pad_token_id
- )
- with self.assertRaises(AssertionError):
- # TODO: understand why this breaks
- self.assertEqual(nll_loss, model_computed_loss)
-
- def test_distill_mbart(self):
- updates = dict(
- student_encoder_layers=2,
- student_decoder_layers=1,
- num_train_epochs=4,
- val_check_interval=0.25,
- alpha_hid=2.0,
- task="translation",
- model_name_or_path="IGNORE_THIS_IT_DOESNT_GET_USED",
- tokenizer_name=MBART_TINY,
- teacher=MBART_TINY,
- src_lang="en_XX",
- tgt_lang="ro_RO",
- )
- model = self._test_distiller_cli(updates, check_contents=False)
- assert model.model.config.model_type == "mbart"
-
- ckpts = list(Path(model.output_dir).glob("*.ckpt"))
- self.assertEqual(1, len(ckpts))
- transformer_ckpts = list(Path(model.output_dir).glob("**/*.bin"))
- all_files = list(Path(model.output_dir).glob("best_tfmr/*"))
- assert len(all_files) > 2
- self.assertEqual(len(transformer_ckpts), 2)
-
- def test_distill_t5(self):
- updates = dict(
- student_encoder_layers=1,
- student_decoder_layers=1,
- alpha_hid=2.0,
- teacher=T5_TINY,
- model_name_or_path=T5_TINY,
- tokenizer_name=T5_TINY,
- )
- self._test_distiller_cli(updates)
-
- def test_distill_different_base_models(self):
- updates = dict(
- teacher=T5_TINY,
- student=T5_TINIER,
- model_name_or_path=T5_TINIER,
- tokenizer_name=T5_TINIER,
- )
- self._test_distiller_cli(updates)
-
- def _test_distiller_cli(self, updates, check_contents=True):
- default_updates = dict(
- label_smoothing=0.0,
- early_stopping_patience=-1,
- train_batch_size=1,
- eval_batch_size=2,
- max_epochs=2,
- alpha_mlm=0.2,
- alpha_ce=0.8,
- do_predict=True,
- model_name_or_path="sshleifer/tinier_bart",
- teacher=CHEAP_ARGS["model_name_or_path"],
- val_check_interval=0.5,
- )
- default_updates.update(updates)
- args_d: dict = CHEAP_ARGS.copy()
- tmp_dir = make_test_data_dir(tmp_dir=self.get_auto_remove_tmp_dir())
- output_dir = self.get_auto_remove_tmp_dir()
-
- args_d.update(data_dir=tmp_dir, output_dir=output_dir, **default_updates)
- model = distill_main(argparse.Namespace(**args_d))
- if not check_contents:
- return model
- contents = os.listdir(output_dir)
- contents = {os.path.basename(p) for p in contents}
- ckpt_files = [p for p in contents if p.endswith("ckpt")]
- assert len(ckpt_files) > 0
-
- self.assertIn("test_generations.txt", contents)
- self.assertIn("test_results.txt", contents)
-
- metrics = load_json(model.metrics_save_path)
- last_step_stats = metrics["val"][-1]
- self.assertGreaterEqual(last_step_stats["val_avg_gen_time"], 0.01)
- self.assertGreaterEqual(1.0, last_step_stats["val_avg_gen_time"])
- self.assertIsInstance(last_step_stats[f"val_avg_{model.val_metric}"], float)
- desired_n_evals = int(args_d["max_epochs"] * (1 / args_d["val_check_interval"]) + 1)
- self.assertEqual(len(metrics["val"]), desired_n_evals)
- self.assertEqual(len(metrics["test"]), 1)
- return model
-
-
class TestTheRest(TestCasePlus):
def run_eval_tester(self, model):
input_file_name = Path(self.get_auto_remove_tmp_dir()) / "utest_input.source"
@@ -365,167 +129,3 @@ class TestTheRest(TestCasePlus):
assert w not in cs.out
assert Path(output_file_name).exists()
os.remove(Path(output_file_name))
-
- @parameterized.expand(
- [T5_TINY, BART_TINY, MBART_TINY, MARIAN_TINY, FSMT_TINY],
- )
- def test_finetune(self, model):
- args_d: dict = CHEAP_ARGS.copy()
- task = "translation" if model in [MBART_TINY, MARIAN_TINY, FSMT_TINY] else "summarization"
- args_d["label_smoothing"] = 0.1 if task == "translation" else 0
-
- tmp_dir = make_test_data_dir(tmp_dir=self.get_auto_remove_tmp_dir())
- output_dir = self.get_auto_remove_tmp_dir()
- args_d.update(
- data_dir=tmp_dir,
- model_name_or_path=model,
- tokenizer_name=None,
- train_batch_size=2,
- eval_batch_size=2,
- output_dir=output_dir,
- do_predict=True,
- task=task,
- src_lang="en_XX",
- tgt_lang="ro_RO",
- freeze_encoder=True,
- freeze_embeds=True,
- )
- assert "n_train" in args_d
- args = argparse.Namespace(**args_d)
- module = main(args)
-
- input_embeds = module.model.get_input_embeddings()
- assert not input_embeds.weight.requires_grad
- if model == T5_TINY:
- lm_head = module.model.lm_head
- assert not lm_head.weight.requires_grad
- assert (lm_head.weight == input_embeds.weight).all().item()
- elif model == FSMT_TINY:
- fsmt = module.model.model
- embed_pos = fsmt.decoder.embed_positions
- assert not embed_pos.weight.requires_grad
- assert not fsmt.decoder.embed_tokens.weight.requires_grad
- # check that embeds are not the same
- assert fsmt.decoder.embed_tokens != fsmt.encoder.embed_tokens
- else:
- bart = module.model.model
- embed_pos = bart.decoder.embed_positions
- assert not embed_pos.weight.requires_grad
- assert not bart.shared.weight.requires_grad
- # check that embeds are the same
- assert bart.decoder.embed_tokens == bart.encoder.embed_tokens
- assert bart.decoder.embed_tokens == bart.shared
-
- example_batch = load_json(module.output_dir / "text_batch.json")
- assert isinstance(example_batch, dict)
- assert len(example_batch) >= 4
-
- def test_finetune_extra_model_args(self):
- args_d: dict = CHEAP_ARGS.copy()
-
- task = "summarization"
- tmp_dir = make_test_data_dir(tmp_dir=self.get_auto_remove_tmp_dir())
-
- args_d.update(
- data_dir=tmp_dir,
- tokenizer_name=None,
- train_batch_size=2,
- eval_batch_size=2,
- do_predict=False,
- task=task,
- src_lang="en_XX",
- tgt_lang="ro_RO",
- freeze_encoder=True,
- freeze_embeds=True,
- )
-
- # test models whose config includes the extra_model_args
- model = BART_TINY
- output_dir = self.get_auto_remove_tmp_dir()
- args_d1 = args_d.copy()
- args_d1.update(
- model_name_or_path=model,
- output_dir=output_dir,
- )
- extra_model_params = ("encoder_layerdrop", "decoder_layerdrop", "dropout", "attention_dropout")
- for p in extra_model_params:
- args_d1[p] = 0.5
- args = argparse.Namespace(**args_d1)
- model = main(args)
- for p in extra_model_params:
- assert getattr(model.config, p) == 0.5, f"failed to override the model config for param {p}"
-
- # test models whose config doesn't include the extra_model_args
- model = T5_TINY
- output_dir = self.get_auto_remove_tmp_dir()
- args_d2 = args_d.copy()
- args_d2.update(
- model_name_or_path=model,
- output_dir=output_dir,
- )
- unsupported_param = "encoder_layerdrop"
- args_d2[unsupported_param] = 0.5
- args = argparse.Namespace(**args_d2)
- with pytest.raises(Exception) as excinfo:
- model = main(args)
- assert str(excinfo.value) == f"model config doesn't have a `{unsupported_param}` attribute"
-
- def test_finetune_lr_schedulers(self):
- args_d: dict = CHEAP_ARGS.copy()
-
- task = "summarization"
- tmp_dir = make_test_data_dir(tmp_dir=self.get_auto_remove_tmp_dir())
-
- model = BART_TINY
- output_dir = self.get_auto_remove_tmp_dir()
-
- args_d.update(
- data_dir=tmp_dir,
- model_name_or_path=model,
- output_dir=output_dir,
- tokenizer_name=None,
- train_batch_size=2,
- eval_batch_size=2,
- do_predict=False,
- task=task,
- src_lang="en_XX",
- tgt_lang="ro_RO",
- freeze_encoder=True,
- freeze_embeds=True,
- )
-
- # emulate finetune.py
- parser = argparse.ArgumentParser()
- parser = pl.Trainer.add_argparse_args(parser)
- parser = SummarizationModule.add_model_specific_args(parser, os.getcwd())
- args = {"--help": True}
-
- # --help test
- with pytest.raises(SystemExit) as excinfo:
- with CaptureStdout() as cs:
- args = parser.parse_args(args)
- assert False, "--help is expected to sys.exit"
- assert excinfo.type == SystemExit
- expected = lightning_base.arg_to_scheduler_metavar
- assert expected in cs.out, "--help is expected to list the supported schedulers"
-
- # --lr_scheduler=non_existing_scheduler test
- unsupported_param = "non_existing_scheduler"
- args = {f"--lr_scheduler={unsupported_param}"}
- with pytest.raises(SystemExit) as excinfo:
- with CaptureStderr() as cs:
- args = parser.parse_args(args)
- assert False, "invalid argument is expected to sys.exit"
- assert excinfo.type == SystemExit
- expected = f"invalid choice: '{unsupported_param}'"
- assert expected in cs.err, f"should have bailed on invalid choice of scheduler {unsupported_param}"
-
- # --lr_scheduler=existing_scheduler test
- supported_param = "cosine"
- args_d1 = args_d.copy()
- args_d1["lr_scheduler"] = supported_param
- args = argparse.Namespace(**args_d1)
- model = main(args)
- assert (
- getattr(model.hparams, "lr_scheduler") == supported_param
- ), f"lr_scheduler={supported_param} shouldn't fail"
diff --git a/examples/seq2seq/test_seq2seq_examples_multi_gpu.py b/examples/seq2seq/test_seq2seq_examples_multi_gpu.py
index eafa7e37f..6625f061b 100644
--- a/examples/seq2seq/test_seq2seq_examples_multi_gpu.py
+++ b/examples/seq2seq/test_seq2seq_examples_multi_gpu.py
@@ -1,18 +1,24 @@
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
# Because multi-GPU tests are complex and could impact other tests, and to aid debugging, we keep them in a separate module.
import os
import sys
-from transformers.testing_utils import (
- TestCasePlus,
- execute_subprocess_async,
- get_gpu_count,
- require_torch_gpu,
- require_torch_multi_gpu,
- slow,
-)
+from transformers.testing_utils import TestCasePlus, execute_subprocess_async, get_gpu_count, require_torch_gpu, slow
-from .test_seq2seq_examples import CHEAP_ARGS, make_test_data_dir
from .utils import load_json
@@ -21,73 +27,6 @@ class TestSummarizationDistillerMultiGPU(TestCasePlus):
def setUpClass(cls):
return cls
- @require_torch_multi_gpu
- def test_multi_gpu(self):
-
- updates = dict(
- no_teacher=True,
- freeze_encoder=True,
- gpus=2,
- overwrite_output_dir=True,
- sortish_sampler=True,
- )
- self._test_distiller_cli_fork(updates, check_contents=False)
-
- def _test_distiller_cli_fork(self, updates, check_contents=True):
- default_updates = dict(
- label_smoothing=0.0,
- early_stopping_patience=-1,
- train_batch_size=1,
- eval_batch_size=2,
- max_epochs=2,
- alpha_mlm=0.2,
- alpha_ce=0.8,
- do_predict=True,
- model_name_or_path="sshleifer/tinier_bart",
- teacher=CHEAP_ARGS["model_name_or_path"],
- val_check_interval=0.5,
- )
- default_updates.update(updates)
- args_d: dict = CHEAP_ARGS.copy()
- tmp_dir = make_test_data_dir(tmp_dir=self.get_auto_remove_tmp_dir())
- output_dir = self.get_auto_remove_tmp_dir()
- args_d.update(data_dir=tmp_dir, output_dir=output_dir, **default_updates)
-
- def convert(k, v):
- if k in ["tgt_suffix", "server_ip", "server_port", "out", "n_tpu_cores"]:
- return ""
- if v is False or v is None:
- return ""
- if v is True: # or len(str(v))==0:
- return f"--{k}"
- return f"--{k}={v}"
-
- cli_args = [x for x in (convert(k, v) for k, v in args_d.items()) if len(x)]
- cmd = [sys.executable, f"{self.test_file_dir}/distillation.py"] + cli_args
- execute_subprocess_async(cmd, env=self.get_env())
-
- contents = os.listdir(output_dir)
- contents = {os.path.basename(p) for p in contents}
- ckpt_files = [p for p in contents if p.endswith("ckpt")]
- assert len(ckpt_files) > 0
-
- self.assertIn("test_generations.txt", contents)
- self.assertIn("test_results.txt", contents)
-
- # get the following from the module, (we don't have access to `model` here)
- metrics_save_path = os.path.join(output_dir, "metrics.json")
- val_metric = "rouge2"
-
- metrics = load_json(metrics_save_path)
- # {'test': [{'test_avg_loss': 10.63731575012207, 'test_avg_rouge1': 0.0, 'test_avg_rouge2': 0.0, 'test_avg_rougeL': 0.0, 'test_avg_gen_time': 0.1822289228439331, 'test_avg_gen_len': 142.0, 'step_count': 1}]}
- print(metrics)
- last_step_stats = metrics["val"][-1]
- self.assertGreaterEqual(last_step_stats["val_avg_gen_time"], 0.01)
- self.assertIsInstance(last_step_stats[f"val_avg_{val_metric}"], float)
- self.assertEqual(len(metrics["test"]), 1)
- desired_n_evals = int(args_d["max_epochs"] * (1 / args_d["val_check_interval"]) / 2 + 1)
- self.assertEqual(len(metrics["val"]), desired_n_evals)
-
@slow
@require_torch_gpu
def test_distributed_eval(self):
diff --git a/examples/seq2seq/test_tatoeba_conversion.py b/examples/seq2seq/test_tatoeba_conversion.py
index 065aed287..5747811bd 100644
--- a/examples/seq2seq/test_tatoeba_conversion.py
+++ b/examples/seq2seq/test_tatoeba_conversion.py
@@ -1,3 +1,17 @@
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
import os
import tempfile
import unittest
diff --git a/examples/seq2seq/builtin_trainer/train_distil_marian_enro.sh b/examples/seq2seq/train_distil_marian_enro.sh
similarity index 59%
rename from examples/seq2seq/builtin_trainer/train_distil_marian_enro.sh
rename to examples/seq2seq/train_distil_marian_enro.sh
index 10c809b0e..f09fd875e 100644
--- a/examples/seq2seq/builtin_trainer/train_distil_marian_enro.sh
+++ b/examples/seq2seq/train_distil_marian_enro.sh
@@ -1,3 +1,17 @@
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
export WANDB_PROJECT=distil-marian
export BS=64
export GAS=1
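A note on the `WANDB_PROJECT` export above: the Trainer's Weights & Biases callback reads the project name from this variable, and logging can be redirected or skipped via the environment. A sketch, assuming the `WANDB_DISABLED` variable honored by the wandb integration of this release:

```bash
# log to a different W&B project (the name is illustrative)
export WANDB_PROJECT=my-distil-marian
# or skip wandb logging entirely for a quick debugging run (assumes WANDB_DISABLED is honored)
WANDB_DISABLED=true bash train_distil_marian_enro.sh
```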
diff --git a/examples/seq2seq/builtin_trainer/train_distil_marian_enro_tpu.sh b/examples/seq2seq/train_distil_marian_enro_tpu.sh
similarity index 59%
rename from examples/seq2seq/builtin_trainer/train_distil_marian_enro_tpu.sh
rename to examples/seq2seq/train_distil_marian_enro_tpu.sh
index 098425d65..271a8cf3e 100644
--- a/examples/seq2seq/builtin_trainer/train_distil_marian_enro_tpu.sh
+++ b/examples/seq2seq/train_distil_marian_enro_tpu.sh
@@ -1,3 +1,17 @@
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
export WANDB_PROJECT=distil-marian
export BS=64
export m=sshleifer/student_marian_en_ro_6_3
diff --git a/examples/seq2seq/train_distilbart_cnn.sh b/examples/seq2seq/train_distilbart_cnn.sh
old mode 100755
new mode 100644
index 6a1bafbdc..e89394adc
--- a/examples/seq2seq/train_distilbart_cnn.sh
+++ b/examples/seq2seq/train_distilbart_cnn.sh
@@ -1,24 +1,39 @@
-#!/usr/bin/env bash
-export PYTHONPATH="../":"${PYTHONPATH}"
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+export WANDB_PROJECT=distilbart-trainer
export BS=32
-export GAS=1
+export m=sshleifer/student_cnn_12_6
+export tok=facebook/bart-large
+export MAX_TGT_LEN=142
-python finetune.py \
+python finetune_trainer.py \
+ --model_name_or_path $m --tokenizer_name $tok \
+ --data_dir cnn_dm \
+ --output_dir distilbart-cnn-12-6 --overwrite_output_dir \
--learning_rate=3e-5 \
+ --warmup_steps 500 --sortish_sampler \
--fp16 \
- --gpus 1 \
- --do_train \
- --do_predict \
- --val_check_interval 0.25 \
--n_val 500 \
- --num_train_epochs 2 \
- --freeze_encoder --freeze_embeds --data_dir cnn_dm \
- --max_target_length 142 --val_max_target_length=142 \
- --train_batch_size=$BS --eval_batch_size=$BS --gradient_accumulation_steps=$GAS \
- --model_name_or_path sshleifer/student_cnn_12_6 \
- --tokenizer_name facebook/bart-large \
- --warmup_steps 500 \
- --output_dir distilbart-cnn-12-6 \
+ --gradient_accumulation_steps=1 \
+ --per_device_train_batch_size=$BS --per_device_eval_batch_size=$BS \
+ --freeze_encoder --freeze_embeds \
+ --num_train_epochs=2 \
+ --save_steps 3000 --eval_steps 3000 \
+ --logging_first_step \
+ --max_target_length 56 --val_max_target_length $MAX_TGT_LEN --test_max_target_length $MAX_TGT_LEN \
+ --do_train --do_eval --do_predict \
+ --evaluation_strategy steps \
+ --predict_with_generate --sortish_sampler \
"$@"
-
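A hedged usage sketch for the rewritten script: it assumes a local `cnn_dm/` folder in the seq2seq data format (`{train,val,test}.source` / `.target` files), and anything appended on the command line is forwarded to `finetune_trainer.py` via `"$@"`, so later flags generally override the defaults baked into the script:

```bash
# illustrative invocation; cnn_dm/ must already contain the preprocessed CNN/DailyMail files
bash train_distilbart_cnn.sh --output_dir my-distilbart-cnn --num_train_epochs 1
```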
diff --git a/examples/seq2seq/train_mbart_cc25_enro.sh b/examples/seq2seq/train_mbart_cc25_enro.sh
old mode 100755
new mode 100644
index 54e7935ff..cccae914c
--- a/examples/seq2seq/train_mbart_cc25_enro.sh
+++ b/examples/seq2seq/train_mbart_cc25_enro.sh
@@ -1,18 +1,36 @@
-#!/usr/bin/env bash
-export PYTHONPATH="../":"${PYTHONPATH}"
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
-python finetune.py \
- --learning_rate=3e-5 \
- --fp16 \
- --do_train \
- --val_check_interval=0.25 \
- --adam_eps 1e-06 \
- --num_train_epochs 6 --src_lang en_XX --tgt_lang ro_RO \
- --data_dir $ENRO_DIR \
- --max_source_length $MAX_LEN --max_target_length $MAX_LEN --val_max_target_length $MAX_LEN --test_max_target_length $MAX_LEN \
- --train_batch_size=$BS --eval_batch_size=$BS \
- --task translation \
- --warmup_steps 500 \
- --freeze_embeds \
+python finetune_trainer.py \
--model_name_or_path=facebook/mbart-large-cc25 \
+ --data_dir $ENRO_DIR \
+ --output_dir mbart_cc25_enro --overwrite_output_dir \
+ --learning_rate=3e-5 \
+ --warmup_steps 500 \
+ --fp16 \
+ --label_smoothing 0.1 \
+ --adam_eps 1e-06 \
+ --src_lang en_XX --tgt_lang ro_RO \
+ --freeze_embeds \
+ --per_device_train_batch_size=4 --per_device_eval_batch_size=4 \
+ --max_source_length 128 --max_target_length 128 \
+ --val_max_target_length 128 --test_max_target_length 128 \
+ --sortish_sampler \
+ --num_train_epochs 6 \
+ --save_steps 25000 --eval_steps 25000 --logging_steps 1000 \
+ --do_train --do_eval --do_predict \
+ --evaluation_strategy steps \
+ --predict_with_generate --logging_first_step \
+ --task translation \
"$@"
diff --git a/examples/seq2seq/utils.py b/examples/seq2seq/utils.py
index b6994a183..70ef5f07b 100644
--- a/examples/seq2seq/utils.py
+++ b/examples/seq2seq/utils.py
@@ -1,3 +1,17 @@
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
import itertools
import json
import linecache
diff --git a/examples/seq2seq/xla_spawn.py b/examples/seq2seq/xla_spawn.py
index 0889e57af..d84b41994 100644
--- a/examples/seq2seq/xla_spawn.py
+++ b/examples/seq2seq/xla_spawn.py
@@ -1,3 +1,16 @@
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
"""
A simple launcher script for TPU training
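For reference, the launcher wraps `torch_xla`'s multiprocessing spawn; a hedged sketch of using it with the Seq2Seq trainer (core count, data path, and flags are illustrative):

```bash
# run finetune_trainer.py on 8 TPU cores via the launcher
python xla_spawn.py --num_cores 8 finetune_trainer.py \
    --model_name_or_path sshleifer/student_marian_en_ro_6_1 \
    --data_dir $ENRO_DIR \
    --output_dir marian_enro_tpu \
    --do_train --num_train_epochs 1
```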
diff --git a/examples/test_examples.py b/examples/test_examples.py
index b840e267d..1b5811255 100644
--- a/examples/test_examples.py
+++ b/examples/test_examples.py
@@ -45,7 +45,6 @@ if SRC_DIRS is not None:
import run_glue
import run_mlm
import run_ner
- import run_pl_glue
import run_qa as run_squad
@@ -100,44 +99,6 @@ class ExamplesTests(TestCasePlus):
for value in result.values():
self.assertGreaterEqual(value, 0.75)
- @require_torch_non_multi_gpu_but_fix_me
- def test_run_pl_glue(self):
- stream_handler = logging.StreamHandler(sys.stdout)
- logger.addHandler(stream_handler)
-
- tmp_dir = self.get_auto_remove_tmp_dir()
- testargs = f"""
- run_pl_glue.py
- --model_name_or_path bert-base-cased
- --data_dir ./tests/fixtures/tests_samples/MRPC/
- --output_dir {tmp_dir}
- --task mrpc
- --do_train
- --do_predict
- --train_batch_size=32
- --learning_rate=1e-4
- --num_train_epochs=1
- --seed=42
- --max_seq_length=128
- """.split()
- if torch.cuda.is_available():
- testargs += ["--gpus=1"]
- if is_cuda_and_apex_available():
- testargs.append("--fp16")
-
- with patch.object(sys, "argv", testargs):
- result = run_pl_glue.main()[0]
- # for now just testing that the script can run to completion
- self.assertGreater(result["acc"], 0.25)
- #
- # TODO: this fails on CI - doesn't get acc/f1>=0.75:
- #
- # # remove all the various *loss* attributes
- # result = {k: v for k, v in result.items() if "loss" not in k}
- # for k, v in result.items():
- # self.assertGreaterEqual(v, 0.75, f"({k})")
- #
-
@require_torch_non_multi_gpu_but_fix_me
def test_run_clm(self):
stream_handler = logging.StreamHandler(sys.stdout)
diff --git a/examples/text-classification/README.md b/examples/text-classification/README.md
index 3994dde49..999613990 100644
--- a/examples/text-classification/README.md
+++ b/examples/text-classification/README.md
@@ -1,3 +1,19 @@
+<!---
+Copyright 2020 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
## GLUE Benchmark
# Run TensorFlow 2.0 version
diff --git a/examples/text-classification/requirements.txt b/examples/text-classification/requirements.txt
new file mode 100644
index 000000000..0f5c38bd4
--- /dev/null
+++ b/examples/text-classification/requirements.txt
@@ -0,0 +1,3 @@
+datasets >= 1.1.3
+sentencepiece != 0.1.92
+protobuf
diff --git a/examples/text-classification/run_tf_glue.py b/examples/text-classification/run_tf_glue.py
index 343439343..be49399b3 100644
--- a/examples/text-classification/run_tf_glue.py
+++ b/examples/text-classification/run_tf_glue.py
@@ -1,4 +1,17 @@
# coding=utf-8
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
""" Fine-tuning the library models for sequence classification."""
diff --git a/examples/text-classification/run_tf_text_classification.py b/examples/text-classification/run_tf_text_classification.py
index 880f0f2aa..058b164b8 100644
--- a/examples/text-classification/run_tf_text_classification.py
+++ b/examples/text-classification/run_tf_text_classification.py
@@ -1,4 +1,17 @@
# coding=utf-8
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
""" Fine-tuning the library models for sequence classification."""
diff --git a/examples/text-generation/README.md b/examples/text-generation/README.md
index d16499348..4e68b126e 100644
--- a/examples/text-generation/README.md
+++ b/examples/text-generation/README.md
@@ -1,3 +1,19 @@
+<!---
+Copyright 2020 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
## Language generation
Based on the script [`run_generation.py`](https://github.com/huggingface/transformers/blob/master/examples/text-generation/run_generation.py).
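As a quick orientation, a minimal invocation of the script looks roughly like this (the model, prompt, and length are just examples):

```bash
# generate a continuation of the prompt with GPT-2
python run_generation.py \
    --model_type gpt2 \
    --model_name_or_path gpt2 \
    --length 50 \
    --prompt "Once upon a time"
```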
diff --git a/examples/text-generation/requirements.txt b/examples/text-generation/requirements.txt
new file mode 100644
index 000000000..013c579bc
--- /dev/null
+++ b/examples/text-generation/requirements.txt
@@ -0,0 +1,2 @@
+sentencepiece != 0.1.92
+protobuf
diff --git a/examples/token-classification/README.md b/examples/token-classification/README.md
index 7c9e16065..f4e50fed2 100644
--- a/examples/token-classification/README.md
+++ b/examples/token-classification/README.md
@@ -1,3 +1,19 @@
+<!---
+Copyright 2020 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
## Token classification
Fine-tuning the library models for token classification task such as Named Entity Recognition (NER) or Parts-of-speech
@@ -32,10 +48,16 @@ python run_ner.py \
--do_eval
```
+**Note:** This script only works with models that have a fast tokenizer (backed by the 🤗 Tokenizers library) as it
+uses special features of those tokenizers. You can check if your favorite model has a fast tokenizer in
+[this table](https://huggingface.co/transformers/index.html#bigtable). If it doesn't, you can still use the old version
+of the script.
+
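If you prefer to check programmatically rather than via the table, a quick one-liner (the model name is just an example):

```bash
# prints True when a Rust-backed "fast" tokenizer is available for the model
python -c "from transformers import AutoTokenizer; print(AutoTokenizer.from_pretrained('bert-base-cased').is_fast)"
```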
## Old version of the script
-Based on the scripts [`run_ner_old.py`](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner_old.py) for Pytorch and
-[`run_tf_ner.py`](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_tf_ner.py) for Tensorflow 2.
+You can find the old version of the PyTorch script [here](https://github.com/huggingface/transformers/blob/master/examples/contrib/legacy/token-classification/run_ner_old.py).
+
+### TensorFlow version
The following examples are covered in this section:
@@ -98,79 +120,6 @@ export SAVE_STEPS=750
export SEED=1
```
-#### Run the Pytorch version
-
-To start training, just run:
-
-```bash
-python3 run_ner_old.py --data_dir ./ \
---labels ./labels.txt \
---model_name_or_path $BERT_MODEL \
---output_dir $OUTPUT_DIR \
---max_seq_length $MAX_LENGTH \
---num_train_epochs $NUM_EPOCHS \
---per_device_train_batch_size $BATCH_SIZE \
---save_steps $SAVE_STEPS \
---seed $SEED \
---do_train \
---do_eval \
---do_predict
-```
-
-If your GPU supports half-precision training, just add the `--fp16` flag. After training, the model will be both evaluated on development and test datasets.
-
-#### JSON-based configuration file
-
-Instead of passing all parameters via commandline arguments, the `run_ner_old.py` script also supports reading parameters from a json-based configuration file:
-
-```json
-{
- "data_dir": ".",
- "labels": "./labels.txt",
- "model_name_or_path": "bert-base-multilingual-cased",
- "output_dir": "germeval-model",
- "max_seq_length": 128,
- "num_train_epochs": 3,
- "per_device_train_batch_size": 32,
- "save_steps": 750,
- "seed": 1,
- "do_train": true,
- "do_eval": true,
- "do_predict": true
-}
-```
-
-It must be saved with a `.json` extension and can be used by running `python3 run_ner_old.py config.json`.
-
-#### Evaluation
-
-Evaluation on development dataset outputs the following for our example:
-
-```bash
-10/04/2019 00:42:06 - INFO - __main__ - ***** Eval results *****
-10/04/2019 00:42:06 - INFO - __main__ - f1 = 0.8623348017621146
-10/04/2019 00:42:06 - INFO - __main__ - loss = 0.07183869666975543
-10/04/2019 00:42:06 - INFO - __main__ - precision = 0.8467916366258111
-10/04/2019 00:42:06 - INFO - __main__ - recall = 0.8784592370979806
-```
-
-On the test dataset the following results could be achieved:
-
-```bash
-10/04/2019 00:42:42 - INFO - __main__ - ***** Eval results *****
-10/04/2019 00:42:42 - INFO - __main__ - f1 = 0.8614389652384803
-10/04/2019 00:42:42 - INFO - __main__ - loss = 0.07064602487454782
-10/04/2019 00:42:42 - INFO - __main__ - precision = 0.8604651162790697
-10/04/2019 00:42:42 - INFO - __main__ - recall = 0.8624150210424085
-```
-
-#### Run PyTorch version using PyTorch-Lightning
-
-Run `bash run_pl.sh` from the `ner` directory. This would also install `pytorch-lightning` and the `examples/requirements.txt`. It is a shell pipeline which would automatically download, pre-process the data and run the models in `germeval-model` directory. Logs are saved in `lightning_logs` directory.
-
-Pass `--gpus` flag to change the number of GPUs. Default uses 1. At the end, the expected results are: `TEST RESULTS {'val_loss': tensor(0.0707), 'precision': 0.852427800698191, 'recall': 0.869537067011978, 'f1': 0.8608974358974358}`
-
-
#### Run the Tensorflow 2 version
To start training, just run:
@@ -235,102 +184,3 @@ On the test dataset the following results could be achieved:
micro avg 0.8722 0.8774 0.8748 13869
macro avg 0.8712 0.8774 0.8740 13869
```
-
-### Emerging and Rare Entities task: WNUT’17 (English NER) dataset
-
-Description of the WNUT’17 task from the [shared task website](http://noisy-text.github.io/2017/index.html):
-
-> The WNUT’17 shared task focuses on identifying unusual, previously-unseen entities in the context of emerging discussions.
-> Named entities form the basis of many modern approaches to other tasks (like event clustering and summarization), but recall on
-> them is a real problem in noisy text - even among annotators. This drop tends to be due to novel entities and surface forms.
-
-Six labels are available in the dataset. An overview can be found on this [page](http://noisy-text.github.io/2017/files/).
-
-#### Data (Download and pre-processing steps)
-
-The dataset can be downloaded from the [official GitHub](https://github.com/leondz/emerging_entities_17) repository.
-
-The following commands show how to prepare the dataset for fine-tuning:
-
-```bash
-mkdir -p data_wnut_17
-
-curl -L 'https://github.com/leondz/emerging_entities_17/raw/master/wnut17train.conll' | tr '\t' ' ' > data_wnut_17/train.txt.tmp
-curl -L 'https://github.com/leondz/emerging_entities_17/raw/master/emerging.dev.conll' | tr '\t' ' ' > data_wnut_17/dev.txt.tmp
-curl -L 'https://raw.githubusercontent.com/leondz/emerging_entities_17/master/emerging.test.annotated' | tr '\t' ' ' > data_wnut_17/test.txt.tmp
-```
-
-Let's define some variables that we need for further pre-processing steps:
-
-```bash
-export MAX_LENGTH=128
-export BERT_MODEL=bert-large-cased
-```
-
-Here we use the English BERT large model for fine-tuning.
-The `preprocess.py` scripts splits longer sentences into smaller ones (once the max. subtoken length is reached):
-
-```bash
-python3 scripts/preprocess.py data_wnut_17/train.txt.tmp $BERT_MODEL $MAX_LENGTH > data_wnut_17/train.txt
-python3 scripts/preprocess.py data_wnut_17/dev.txt.tmp $BERT_MODEL $MAX_LENGTH > data_wnut_17/dev.txt
-python3 scripts/preprocess.py data_wnut_17/test.txt.tmp $BERT_MODEL $MAX_LENGTH > data_wnut_17/test.txt
-```
-
-In the last pre-processing step, the `labels.txt` file needs to be generated. This file contains all available labels:
-
-```bash
-cat data_wnut_17/train.txt data_wnut_17/dev.txt data_wnut_17/test.txt | cut -d " " -f 2 | grep -v "^$"| sort | uniq > data_wnut_17/labels.txt
-```
-
-#### Run the Pytorch version
-
-Fine-tuning with the PyTorch version can be started using the `run_ner_old.py` script. In this example we use a JSON-based configuration file.
-
-This configuration file looks like:
-
-```json
-{
- "data_dir": "./data_wnut_17",
- "labels": "./data_wnut_17/labels.txt",
- "model_name_or_path": "bert-large-cased",
- "output_dir": "wnut-17-model-1",
- "max_seq_length": 128,
- "num_train_epochs": 3,
- "per_device_train_batch_size": 32,
- "save_steps": 425,
- "seed": 1,
- "do_train": true,
- "do_eval": true,
- "do_predict": true,
- "fp16": false
-}
-```
-
-If your GPU supports half-precision training, please set `fp16` to `true`.
-
-Save this JSON-based configuration under `wnut_17.json`. The fine-tuning can be started with `python3 run_ner_old.py wnut_17.json`.
-
-#### Evaluation
-
-Evaluation on development dataset outputs the following:
-
-```bash
-05/29/2020 23:33:44 - INFO - __main__ - ***** Eval results *****
-05/29/2020 23:33:44 - INFO - __main__ - eval_loss = 0.26505235286212275
-05/29/2020 23:33:44 - INFO - __main__ - eval_precision = 0.7008264462809918
-05/29/2020 23:33:44 - INFO - __main__ - eval_recall = 0.507177033492823
-05/29/2020 23:33:44 - INFO - __main__ - eval_f1 = 0.5884802220680084
-05/29/2020 23:33:44 - INFO - __main__ - epoch = 3.0
-```
-
-On the test dataset the following results could be achieved:
-
-```bash
-05/29/2020 23:33:44 - INFO - transformers.trainer - ***** Running Prediction *****
-05/29/2020 23:34:02 - INFO - __main__ - eval_loss = 0.30948806500973547
-05/29/2020 23:34:02 - INFO - __main__ - eval_precision = 0.5840108401084011
-05/29/2020 23:34:02 - INFO - __main__ - eval_recall = 0.3994439295644115
-05/29/2020 23:34:02 - INFO - __main__ - eval_f1 = 0.47440836543753434
-```
-
-WNUT’17 is a very difficult task. Current state-of-the-art results on this dataset can be found [here](http://nlpprogress.com/english/named_entity_recognition.html).
diff --git a/examples/token-classification/requirements.txt b/examples/token-classification/requirements.txt
new file mode 100644
index 000000000..b03c28ecd
--- /dev/null
+++ b/examples/token-classification/requirements.txt
@@ -0,0 +1,2 @@
+seqeval
+datasets >= 1.1.3
diff --git a/examples/token-classification/run.sh b/examples/token-classification/run.sh
index 6c46a8139..2dd49117d 100755
--- a/examples/token-classification/run.sh
+++ b/examples/token-classification/run.sh
@@ -1,3 +1,17 @@
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
python3 run_ner.py \
--model_name_or_path bert-base-uncased \
--dataset_name conll2003 \
diff --git a/examples/token-classification/test_ner_examples.py b/examples/token-classification/test_ner_examples.py
deleted file mode 100644
index 4a9e176f3..000000000
--- a/examples/token-classification/test_ner_examples.py
+++ /dev/null
@@ -1,57 +0,0 @@
-import logging
-import sys
-import unittest
-from unittest.mock import patch
-
-import run_ner_old as run_ner
-from transformers.testing_utils import require_torch_non_multi_gpu_but_fix_me, slow
-
-
-logging.basicConfig(level=logging.INFO)
-
-logger = logging.getLogger()
-
-
-class ExamplesTests(unittest.TestCase):
- @slow
- @require_torch_non_multi_gpu_but_fix_me
- def test_run_ner(self):
- stream_handler = logging.StreamHandler(sys.stdout)
- logger.addHandler(stream_handler)
-
- testargs = """
- --model_name distilbert-base-german-cased
- --output_dir ./tests/fixtures/tests_samples/temp_dir
- --overwrite_output_dir
- --data_dir ./tests/fixtures/tests_samples/GermEval
- --labels ./tests/fixtures/tests_samples/GermEval/labels.txt
- --max_seq_length 128
- --num_train_epochs 6
- --logging_steps 1
- --do_train
- --do_eval
- """.split()
- with patch.object(sys, "argv", ["run.py"] + testargs):
- result = run_ner.main()
- self.assertLess(result["eval_loss"], 1.5)
-
- @require_torch_non_multi_gpu_but_fix_me
- def test_run_ner_pl(self):
- stream_handler = logging.StreamHandler(sys.stdout)
- logger.addHandler(stream_handler)
-
- testargs = """
- --model_name distilbert-base-german-cased
- --output_dir ./tests/fixtures/tests_samples/temp_dir
- --overwrite_output_dir
- --data_dir ./tests/fixtures/tests_samples/GermEval
- --labels ./tests/fixtures/tests_samples/GermEval/labels.txt
- --max_seq_length 128
- --num_train_epochs 6
- --logging_steps 1
- --do_train
- --do_eval
- """.split()
- with patch.object(sys, "argv", ["run.py"] + testargs):
- result = run_ner.main()
- self.assertLess(result["eval_loss"], 1.5)
diff --git a/examples/xla_spawn.py b/examples/xla_spawn.py
index 0889e57af..d84b41994 100644
--- a/examples/xla_spawn.py
+++ b/examples/xla_spawn.py
@@ -1,3 +1,16 @@
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
"""
A simple launcher script for TPU training