Staging to master to add the latest fixes (#503)

* update mlflow version to match the other azureml versions
* Update generate_conda_file.py
* added temporary
* doc: update github url references
* docs: update nlp recipes references
* Minor bug fix for text classification of multi languages notebook
* remove bert and xlnet notebooks
* remove obsolete tests and links
* Add missing tmp directories.
* fix import error and max_nodes for the cluster
* Minor edits.
* Attempt to fix test device error.
* Temporarily pin transformers version
* Remove gpu tags temporarily
* Test whether device error also occurs for SequenceClassifier.
* Revert temporary changes.
* Revert temporary changes.
2019-11-30 00:53:57 +00:00 · 2019-11-30 00:53:57 +00:00 · ed04438ae1
--- a/README.md
+++ b/README.md
@ -85,6 +85,8 @@ The following is a list of related repositories that we like and think are usefu
 |[AzureML-BERT](https://github.com/Microsoft/AzureML-BERT)|End-to-end recipes for pre-training and fine-tuning BERT using Azure Machine Learning service.|
 |[MASS](https://github.com/microsoft/MASS)|MASS: Masked Sequence to Sequence Pre-training for Language Generation.|
 |[MT-DNN](https://github.com/namisan/mt-dnn)|Multi-Task Deep Neural Networks for Natural Language Understanding.|
+|[UniLM](https://github.com/microsoft/unilm)|Unified Language Model Pre-training.|
+


 ## Build Status
--- a/SETUP.md
+++ b/SETUP.md
@ -47,9 +47,9 @@ You can learn how to create a Notebook VM [here](https://docs.microsoft.com/en-u
 We provide a script, [generate_conda_file.py](tools/generate_conda_file.py), to generate a conda-environment yaml file
 which you can use to create the target environment using the Python version 3.6 with all the correct dependencies.

-Assuming the repo is cloned as `nlp` in the system, to install **a default (Python CPU) environment**:
+Assuming the repo is cloned as `nlp-recipes` in the system, to install **a default (Python CPU) environment**:

-    cd nlp
+    cd nlp-recipes
    python tools/generate_conda_file.py
    conda env create -f nlp_cpu.yaml

@ -62,7 +62,7 @@ Click on the following menus to see how to install the Python GPU environment:

 Assuming that you have a GPU machine, to install the Python GPU environment, which by default installs the CPU environment:

-    cd nlp
+    cd nlp-recipes
    python tools/generate_conda_file.py --gpu
    conda env create -n nlp_gpu -f nlp_gpu.yaml

@ -79,7 +79,7 @@ Assuming that you have an Azure GPU DSVM machine, here are the steps to setup th

 2. Install the GPU environment.

-        cd nlp
+        cd nlp-recipes
        python tools/generate_conda_file.py --gpu
        conda env create -n nlp_gpu -f nlp_gpu.yaml

@ -110,7 +110,7 @@ Running the command tells pip to install the `utils_nlp` package from source in

 > It is also possible to install directly from Github, which is the best way to utilize the `utils_nlp` package in external projects (while still reflecting updates to the source as it's installed as an editable `'-e'` package). 

->   `pip install -e  git+git@github.com:microsoft/nlp.git@master#egg=utils_nlp`  
+>   `pip install -e  git+git@github.com:microsoft/nlp-recipes.git@master#egg=utils_nlp`  

 Either command, from above, makes `utils_nlp` available in your conda virtual environment. You can verify it was properly installed by running:  

--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@ -34,7 +34,7 @@ version = ".".join(VERSION.split(".")[:2])
 # The full version, including alpha/beta/rc tags
 release = VERSION

-prefix = "NLP"
+prefix = "NLPRecipes"

 # -- General configuration ---------------------------------------------------

--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@ -2,9 +2,9 @@
 NLP Utilities
 ===================================================

-The `NLP repository <https://github.com/Microsoft/NLP>`_ provides examples and best practices for building NLP systems, provided as Jupyter notebooks. 
+The `NLP repository <https://github.com/microsoft/nlp-recipes>`_ provides examples and best practices for building NLP systems, provided as Jupyter notebooks. 

-The module `utils_nlp <https://github.com/microsoft/nlp/tree/master/utils_nlp>`_ contains functions to simplify common tasks used when developing and 
+The module `utils_nlp <https://github.com/microsoft/nlp-recipes/tree/master/utils_nlp>`_ contains functions to simplify common tasks used when developing and 
 evaluating NLP systems. 

 .. toctree::
--- a/examples/entailment/entailment_xnli_bert_azureml.ipynb
+++ b/examples/entailment/entailment_xnli_bert_azureml.ipynb
@ -45,7 +45,7 @@
    "from azureml.core.runconfig import MpiConfiguration\n",
    "from azureml.core import Experiment\n",
    "from azureml.widgets import RunDetails\n",
-    "from azureml.core.compute import ComputeTarget\n",
+    "from azureml.core.compute import ComputeTarget, AmlCompute\n",
    "from azureml.exceptions import ComputeTargetException\n",
    "from utils_nlp.azureml.azureml_utils import get_or_create_workspace, get_output_files"
   ]
@ -169,7 +169,7 @@
    "except ComputeTargetException:\n",
    "    print(\"Creating new compute target: {}\".format(cluster_name))\n",
    "    compute_config = AmlCompute.provisioning_configuration(\n",
-    "        vm_size=\"STANDARD_NC6\", max_nodes=1\n",
+    "        vm_size=\"STANDARD_NC6\", max_nodes=NODE_COUNT\n",
    "    )\n",
    "    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)\n",
    "    compute_target.wait_for_completion(show_output=True)\n",
@ -524,9 +524,9 @@
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
-   "display_name": "Python 3",
+   "display_name": "Python (nlp_gpu_transformer_bug_bash)",
   "language": "python",
-   "name": "python3"
+   "name": "nlp_gpu_transformer_bug_bash"
  },
  "language_info": {
   "codemirror_mode": {
--- a/examples/question_answering/question_answering_system_bidaf_quickstart.ipynb
+++ b/examples/question_answering/question_answering_system_bidaf_quickstart.ipynb
@ -175,7 +175,7 @@
   "metadata": {},
   "source": [
    "This step downloads the pre-trained [AllenNLP](https://allennlp.org/models) pretrained model and registers the model in our Workspace. The pre-trained AllenNLP model we use is called Bidirectional Attention Flow for Machine Comprehension ([BiDAF](https://www.semanticscholar.org/paper/Bidirectional-Attention-Flow-for-Machine-Seo-Kembhavi/007ab5528b3bd310a80d553cccad4b78dc496b02\n",
-    ")) It achieved state-of-the-art performance on the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset in 2017 and is a well-respected, performant baseline for QA. AllenNLP's pre-trained BIDAF model is trained on the SQuAD training set and achieves an EM score of 68.3 on the SQuAD development set. See the [BIDAF deep dive notebook](https://github.com/microsoft/nlp/examples/question_answering/bidaf_deep_dive.ipynb\n",
+    ")) It achieved state-of-the-art performance on the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset in 2017 and is a well-respected, performant baseline for QA. AllenNLP's pre-trained BIDAF model is trained on the SQuAD training set and achieves an EM score of 68.3 on the SQuAD development set. See the [BIDAF deep dive notebook](https://github.com/microsoft/nlp-recipes/examples/question_answering/bidaf_deep_dive.ipynb\n",
    ") for more information on this algorithm and AllenNLP implementation."
   ]
  },
--- a/examples/text_classification/README.md
+++ b/examples/text_classification/README.md
@ -19,8 +19,5 @@ The following summarizes each notebook for Text Classification. Each notebook pr
 |Notebook|Environment|Description|Dataset|
 |---|---|---|---|
 |[BERT for text classification on AzureML](tc_bert_azureml.ipynb) |Azure ML|A notebook which walks through fine-tuning and evaluating pre-trained BERT model on a distributed setup with AzureML. |[MultiNLI](https://www.nyu.edu/projects/bowman/multinli/)|
-|[XLNet for text classification with MNLI](tc_mnli_xlnet.ipynb)|Local| A notebook which walks through fine-tuning and evaluating a pre-trained XLNet model on a subset of the MultiNLI dataset|[MultiNLI](https://www.nyu.edu/projects/bowman/multinli/)|
-|[BERT for text classification of Hindi BBC News](tc_bbc_bert_hi.ipynb)|Local| A notebook which walks through fine-tuning and evaluating a pre-trained BERT model on Hindi BBC news data|[BBC Hindi News](https://github.com/NirantK/hindi2vec/releases/tag/bbc-hindi-v0.1)|
-|[BERT for text classification of Arabic News](tc_dac_bert_ar.ipynb)|Local| A notebook which walks through fine-tuning and evaluating a pre-trained BERT model on Arabic news articles|[DAC](https://data.mendeley.com/datasets/v524p5dhpj/2)|
 |[Text Classification of MultiNLI Sentences using Multiple Transformer Models](tc_mnli_transformers.ipynb)|Local| A notebook which walks through fine-tuning and evaluating a number of pre-trained transformer models|[MultiNLI](https://www.nyu.edu/projects/bowman/multinli/)|
 |[Text Classification of Multi Language Datasets using Transformer Model](tc_multi_languages_transformers.ipynb)|Local|A notebook which walks through fine-tuning and evaluating a pre-trained transformer model for multiple datasets in different language|[MultiNLI](https://www.nyu.edu/projects/bowman/multinli/) <br> [BBC Hindi News](https://github.com/NirantK/hindi2vec/releases/tag/bbc-hindi-v0.1) <br> [DAC](https://data.mendeley.com/datasets/v524p5dhpj/2)
--- a/examples/text_classification/tc_bbc_bert_hi.ipynb
+++ b/examples/text_classification/tc_bbc_bert_hi.ipynb
--- a/examples/text_classification/tc_dac_bert_ar.ipynb
+++ b/examples/text_classification/tc_dac_bert_ar.ipynb
@ -1,821 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "*Copyright (c) Microsoft Corporation. All rights reserved.*\n",
-    "\n",
-    "*Licensed under the MIT License.*\n",
-    "\n",
-    "# Classification of Arabic News Articles using BERT"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 1,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import json\n",
-    "import os\n",
-    "import sys\n",
-    "\n",
-    "import numpy as np\n",
-    "import pandas as pd\n",
-    "import scrapbook as sb\n",
-    "import torch\n",
-    "import torch.nn as nn\n",
-    "from sklearn.metrics import accuracy_score, classification_report\n",
-    "from sklearn.model_selection import train_test_split\n",
-    "\n",
-    "sys.path.append(\"../../\")\n",
-    "from utils_nlp.common.timer import Timer\n",
-    "from utils_nlp.dataset.dac import load_pandas_df\n",
-    "from utils_nlp.models.bert.common import Language, Tokenizer\n",
-    "from utils_nlp.models.bert.sequence_classification import BERTSequenceClassifier"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Introduction\n",
-    "In this notebook, we fine-tune and evaluate a pretrained [BERT](https://arxiv.org/abs/1810.04805) model on an Arabic dataset of news articles. The [dataset](https://data.mendeley.com/datasets/v524p5dhpj/2) includes articles from 3 different newspapers, and the articles are categorized into 5 classes: *sports, politics, culture, economy and diverse*. The data is described in more detail in this [paper](http://article.nadiapub.com/IJGDC/vol11_no9/9.pdf).\n",
-    "\n",
-    "We use a [sequence classifier](../../utils_nlp/bert/sequence_classification.py) that wraps [Hugging Face's PyTorch implementation](https://github.com/huggingface/pytorch-pretrained-BERT) of Google's [BERT](https://github.com/google-research/bert). The classifier loads a pretrained [multilingual BERT model](https://github.com/google-research/bert/blob/master/multilingual.md) that was trained on 104 languages, including Arabic."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 2,
-   "metadata": {
-    "tags": [
-     "parameters"
-    ]
-   },
-   "outputs": [],
-   "source": [
-    "DATA_FOLDER = \"./temp\"\n",
-    "BERT_CACHE_DIR = \"./temp\"\n",
-    "LANGUAGE = Language.MULTILINGUAL\n",
-    "MAX_LEN = 200\n",
-    "BATCH_SIZE = 32\n",
-    "NUM_GPUS = 2\n",
-    "NUM_EPOCHS = 1\n",
-    "TRAIN_SIZE = 0.8\n",
-    "NUM_ROWS = 15000\n",
-    "RANDOM_STATE = 0"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Read Dataset\n",
-    "We start by loading the data. The following line also downloads the file if it doesn't exist, and extracts the csv file into the specified data folder. We retain a subset, of size *NUM_ROWS*, of the data for quicker model training."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 3,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "df = load_pandas_df(DATA_FOLDER).sample(NUM_ROWS, random_state=RANDOM_STATE)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 4,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/html": [
-       "<div>\n",
-       "<style scoped>\n",
-       "    .dataframe tbody tr th:only-of-type {\n",
-       "        vertical-align: middle;\n",
-       "    }\n",
-       "\n",
-       "    .dataframe tbody tr th {\n",
-       "        vertical-align: top;\n",
-       "    }\n",
-       "\n",
-       "    .dataframe thead th {\n",
-       "        text-align: right;\n",
-       "    }\n",
-       "</style>\n",
-       "<table border=\"1\" class=\"dataframe\">\n",
-       "  <thead>\n",
-       "    <tr style=\"text-align: right;\">\n",
-       "      <th></th>\n",
-       "      <th>text</th>\n",
-       "      <th>targe</th>\n",
-       "    </tr>\n",
-       "  </thead>\n",
-       "  <tbody>\n",
-       "    <tr>\n",
-       "      <th>80414</th>\n",
-       "      <td>فاز فريق الدفاع الحسني الجديدي على مضيفه الكوك...</td>\n",
-       "      <td>4</td>\n",
-       "    </tr>\n",
-       "    <tr>\n",
-       "      <th>6649</th>\n",
-       "      <td>أمام آلاف مشاهد من لبنان ومصر والمغرب والإمارا...</td>\n",
-       "      <td>0</td>\n",
-       "    </tr>\n",
-       "    <tr>\n",
-       "      <th>3722</th>\n",
-       "      <td>أخبارنا المغربية بعد أن أصدرت المحكمة الإبتداي...</td>\n",
-       "      <td>0</td>\n",
-       "    </tr>\n",
-       "    <tr>\n",
-       "      <th>82317</th>\n",
-       "      <td>الفريق طبق قانونا قبل المصادقة عليه وجدل حول ه...</td>\n",
-       "      <td>4</td>\n",
-       "    </tr>\n",
-       "    <tr>\n",
-       "      <th>5219</th>\n",
-       "      <td>المطرب المصري يخوض حملة إعلامية لترويج ألبومه ...</td>\n",
-       "      <td>0</td>\n",
-       "    </tr>\n",
-       "  </tbody>\n",
-       "</table>\n",
-       "</div>"
-      ],
-      "text/plain": [
-       "                                                    text  targe\n",
-       "80414  فاز فريق الدفاع الحسني الجديدي على مضيفه الكوك...      4\n",
-       "6649   أمام آلاف مشاهد من لبنان ومصر والمغرب والإمارا...      0\n",
-       "3722   أخبارنا المغربية بعد أن أصدرت المحكمة الإبتداي...      0\n",
-       "82317  الفريق طبق قانونا قبل المصادقة عليه وجدل حول ه...      4\n",
-       "5219   المطرب المصري يخوض حملة إعلامية لترويج ألبومه ...      0"
-      ]
-     },
-     "execution_count": 4,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "df.head()"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 5,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# set the text and label columns\n",
-    "text_col = df.columns[0]\n",
-    "label_col = df.columns[1]"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 6,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# remove empty documents\n",
-    "df = df[df[text_col].isna() == False]"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Inspect the distribution of labels:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 7,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "4    5844\n",
-       "3    2796\n",
-       "1    2139\n",
-       "0    1917\n",
-       "2    1900\n",
-       "Name: targe, dtype: int64"
-      ]
-     },
-     "execution_count": 7,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "df[label_col].value_counts()"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "We compare the counts with those presented in the author's [paper](http://article.nadiapub.com/IJGDC/vol11_no9/9.pdf), and infer the following label mapping:\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 8,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/html": [
-       "<div>\n",
-       "<style scoped>\n",
-       "    .dataframe tbody tr th:only-of-type {\n",
-       "        vertical-align: middle;\n",
-       "    }\n",
-       "\n",
-       "    .dataframe tbody tr th {\n",
-       "        vertical-align: top;\n",
-       "    }\n",
-       "\n",
-       "    .dataframe thead th {\n",
-       "        text-align: right;\n",
-       "    }\n",
-       "</style>\n",
-       "<table border=\"1\" class=\"dataframe\">\n",
-       "  <thead>\n",
-       "    <tr style=\"text-align: right;\">\n",
-       "      <th></th>\n",
-       "      <th>label</th>\n",
-       "    </tr>\n",
-       "  </thead>\n",
-       "  <tbody>\n",
-       "    <tr>\n",
-       "      <th>0</th>\n",
-       "      <td>culture</td>\n",
-       "    </tr>\n",
-       "    <tr>\n",
-       "      <th>1</th>\n",
-       "      <td>diverse</td>\n",
-       "    </tr>\n",
-       "    <tr>\n",
-       "      <th>2</th>\n",
-       "      <td>economy</td>\n",
-       "    </tr>\n",
-       "    <tr>\n",
-       "      <th>3</th>\n",
-       "      <td>politics</td>\n",
-       "    </tr>\n",
-       "    <tr>\n",
-       "      <th>4</th>\n",
-       "      <td>sports</td>\n",
-       "    </tr>\n",
-       "  </tbody>\n",
-       "</table>\n",
-       "</div>"
-      ],
-      "text/plain": [
-       "      label\n",
-       "0   culture\n",
-       "1   diverse\n",
-       "2   economy\n",
-       "3  politics\n",
-       "4    sports"
-      ]
-     },
-     "execution_count": 8,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "# ordered list of labels\n",
-    "labels = [\"culture\", \"diverse\", \"economy\", \"politics\", \"sports\"]\n",
-    "num_labels = len(labels)\n",
-    "pd.DataFrame({\"label\": labels})"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Next, we split the data for training and testing:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 9,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Number of training examples: 11676\n",
-      "Number of testing examples: 2920\n"
-     ]
-    },
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "/media/bleik2/miniconda3/envs/nlp_gpu/lib/python3.6/site-packages/sklearn/model_selection/_split.py:2179: FutureWarning: From version 0.21, test_size will always complement train_size unless both are specified.\n",
-      "  FutureWarning)\n"
-     ]
-    }
-   ],
-   "source": [
-    "df_train, df_test = train_test_split(df, train_size = TRAIN_SIZE, random_state=RANDOM_STATE)\n",
-    "print(\"Number of training examples: {}\".format(df_train.shape[0]))\n",
-    "print(\"Number of testing examples: {}\".format(df_test.shape[0]))"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Tokenize and Preprocess"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Before training, we tokenize the text documents and convert them to lists of tokens. The following steps instantiate a BERT tokenizer given the language, and tokenize the text of the training and testing sets."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 10,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "100%|██████████| 11676/11676 [00:59<00:00, 196.42it/s]\n",
-      "100%|██████████| 2920/2920 [00:14<00:00, 197.99it/s]\n"
-     ]
-    }
-   ],
-   "source": [
-    "tokenizer = Tokenizer(LANGUAGE, cache_dir=BERT_CACHE_DIR)\n",
-    "tokens_train = tokenizer.tokenize(list(df_train[text_col].astype(str)))\n",
-    "tokens_test = tokenizer.tokenize(list(df_test[text_col].astype(str)))"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "In addition, we perform the following preprocessing steps in the cell below:\n",
-    "- Convert the tokens into token indices corresponding to the BERT tokenizer's vocabulary\n",
-    "- Add the special tokens [CLS] and [SEP] to mark the beginning and end of a sentence\n",
-    "- Pad or truncate the token lists to the specified max length\n",
-    "- Return mask lists that indicate paddings' positions\n",
-    "\n",
-    "*See the original [implementation](https://github.com/google-research/bert/blob/master/run_classifier.py) for more information on BERT's input format.*"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 11,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "tokens_train, mask_train, _ = tokenizer.preprocess_classification_tokens(\n",
-    "    tokens_train, MAX_LEN\n",
-    ")\n",
-    "tokens_test, mask_test, _ = tokenizer.preprocess_classification_tokens(\n",
-    "    tokens_test, MAX_LEN\n",
-    ")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Create Model\n",
-    "Next, we create a sequence classifier that loads a pre-trained BERT model, given the language and number of labels."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 12,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "classifier = BERTSequenceClassifier(\n",
-    "    language=LANGUAGE, num_labels=num_labels, cache_dir=BERT_CACHE_DIR\n",
-    ")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Train\n",
-    "We train the classifier using the training examples. This involves fine-tuning the BERT Transformer and learning a linear classification layer on top of that:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 13,
-   "metadata": {
-    "scrolled": true
-   },
-   "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "t_total value of -1 results in schedule not being applied\n",
-      "Iteration:   0%|          | 1/365 [00:03<21:12,  3.49s/it]"
-     ]
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "epoch:1/1; batch:1->37/365; average training loss:1.591262\n"
-     ]
-    },
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "Iteration:  10%|█         | 38/365 [01:02<08:45,  1.61s/it]"
-     ]
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "epoch:1/1; batch:38->74/365; average training loss:0.745935\n"
-     ]
-    },
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "Iteration:  21%|██        | 75/365 [02:02<07:52,  1.63s/it]"
-     ]
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "epoch:1/1; batch:75->111/365; average training loss:0.593934\n"
-     ]
-    },
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "Iteration:  31%|███       | 112/365 [03:03<06:56,  1.65s/it]"
-     ]
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "epoch:1/1; batch:112->148/365; average training loss:0.530150\n"
-     ]
-    },
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "Iteration:  41%|████      | 149/365 [04:03<05:54,  1.64s/it]"
-     ]
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "epoch:1/1; batch:149->185/365; average training loss:0.481620\n"
-     ]
-    },
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "Iteration:  51%|█████     | 186/365 [05:05<05:02,  1.69s/it]"
-     ]
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "epoch:1/1; batch:186->222/365; average training loss:0.455032\n"
-     ]
-    },
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "Iteration:  61%|██████    | 223/365 [06:06<03:59,  1.69s/it]"
-     ]
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "epoch:1/1; batch:223->259/365; average training loss:0.421702\n"
-     ]
-    },
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "Iteration:  71%|███████   | 260/365 [07:08<02:56,  1.68s/it]"
-     ]
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "epoch:1/1; batch:260->296/365; average training loss:0.401165\n"
-     ]
-    },
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "Iteration:  81%|████████▏ | 297/365 [08:09<01:52,  1.65s/it]"
-     ]
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "epoch:1/1; batch:297->333/365; average training loss:0.382719\n"
-     ]
-    },
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "Iteration:  92%|█████████▏| 334/365 [09:12<00:52,  1.71s/it]"
-     ]
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "epoch:1/1; batch:334->365/365; average training loss:0.372204\n"
-     ]
-    },
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "Iteration: 100%|██████████| 365/365 [10:04<00:00,  1.63s/it]"
-     ]
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "[Training time: 0.169 hrs]\n"
-     ]
-    },
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "\n"
-     ]
-    }
-   ],
-   "source": [
-    "with Timer() as t:\n",
-    "    classifier.fit(\n",
-    "        token_ids=tokens_train,\n",
-    "        input_mask=mask_train,\n",
-    "        labels=list(df_train[label_col]),    \n",
-    "        num_gpus=NUM_GPUS,        \n",
-    "        num_epochs=NUM_EPOCHS,\n",
-    "        batch_size=BATCH_SIZE,    \n",
-    "        verbose=True,\n",
-    "    )    \n",
-    "print(\"[Training time: {:.3f} hrs]\".format(t.interval / 3600))"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Score\n",
-    "We score the test set using the trained classifier:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 14,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "Iteration: 100%|██████████| 92/92 [00:48<00:00,  2.25it/s]\n"
-     ]
-    }
-   ],
-   "source": [
-    "preds = classifier.predict(\n",
-    "    token_ids=tokens_test, input_mask=mask_test, num_gpus=NUM_GPUS, batch_size=BATCH_SIZE\n",
-    ")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Evaluate Results\n",
-    "Finally, we compute the accuracy, precision, recall, and F1 metrics of the evaluation on the test set."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 15,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "accuracy: 0.9277397260273973\n",
-      "{\n",
-      "    \"culture\": {\n",
-      "        \"f1-score\": 0.9081761006289307,\n",
-      "        \"precision\": 0.8848039215686274,\n",
-      "        \"recall\": 0.9328165374677002,\n",
-      "        \"support\": 387\n",
-      "    },\n",
-      "    \"diverse\": {\n",
-      "        \"f1-score\": 0.9237983587338804,\n",
-      "        \"precision\": 0.9471153846153846,\n",
-      "        \"recall\": 0.9016018306636155,\n",
-      "        \"support\": 437\n",
-      "    },\n",
-      "    \"economy\": {\n",
-      "        \"f1-score\": 0.8547418967587034,\n",
-      "        \"precision\": 0.8221709006928406,\n",
-      "        \"recall\": 0.89,\n",
-      "        \"support\": 400\n",
-      "    },\n",
-      "    \"macro avg\": {\n",
-      "        \"f1-score\": 0.9099850933798536,\n",
-      "        \"precision\": 0.9087524907040864,\n",
-      "        \"recall\": 0.9125256551533433,\n",
-      "        \"support\": 2920\n",
-      "    },\n",
-      "    \"micro avg\": {\n",
-      "        \"f1-score\": 0.9277397260273973,\n",
-      "        \"precision\": 0.9277397260273973,\n",
-      "        \"recall\": 0.9277397260273973,\n",
-      "        \"support\": 2920\n",
-      "    },\n",
-      "    \"politics\": {\n",
-      "        \"f1-score\": 0.8734177215189873,\n",
-      "        \"precision\": 0.8994413407821229,\n",
-      "        \"recall\": 0.8488576449912126,\n",
-      "        \"support\": 569\n",
-      "    },\n",
-      "    \"sports\": {\n",
-      "        \"f1-score\": 0.9897913892587662,\n",
-      "        \"precision\": 0.9902309058614565,\n",
-      "        \"recall\": 0.9893522626441881,\n",
-      "        \"support\": 1127\n",
-      "    },\n",
-      "    \"weighted avg\": {\n",
-      "        \"f1-score\": 0.9279213601549715,\n",
-      "        \"precision\": 0.9290922105520572,\n",
-      "        \"recall\": 0.9277397260273973,\n",
-      "        \"support\": 2920\n",
-      "    }\n",
-      "}\n"
-     ]
-    }
-   ],
-   "source": [
-    "report = classification_report(df_test[label_col], preds, target_names=labels, output_dict=True) \n",
-    "accuracy = accuracy_score(df_test[label_col], preds )\n",
-    "print(\"accuracy: {}\".format(accuracy))\n",
-    "print(json.dumps(report, indent=4, sort_keys=True))"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 16,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "application/scrapbook.scrap.json+json": {
-       "data": 0.9277397260273973,
-       "encoder": "json",
-       "name": "accuracy",
-       "version": 1
-      }
-     },
-     "metadata": {
-      "scrapbook": {
-       "data": true,
-       "display": false,
-       "name": "accuracy"
-      }
-     },
-     "output_type": "display_data"
-    },
-    {
-     "data": {
-      "application/scrapbook.scrap.json+json": {
-       "data": 0.9087524907040864,
-       "encoder": "json",
-       "name": "precision",
-       "version": 1
-      }
-     },
-     "metadata": {
-      "scrapbook": {
-       "data": true,
-       "display": false,
-       "name": "precision"
-      }
-     },
-     "output_type": "display_data"
-    },
-    {
-     "data": {
-      "application/scrapbook.scrap.json+json": {
-       "data": 0.9125256551533433,
-       "encoder": "json",
-       "name": "recall",
-       "version": 1
-      }
-     },
-     "metadata": {
-      "scrapbook": {
-       "data": true,
-       "display": false,
-       "name": "recall"
-      }
-     },
-     "output_type": "display_data"
-    },
-    {
-     "data": {
-      "application/scrapbook.scrap.json+json": {
-       "data": 0.9099850933798536,
-       "encoder": "json",
-       "name": "f1",
-       "version": 1
-      }
-     },
-     "metadata": {
-      "scrapbook": {
-       "data": true,
-       "display": false,
-       "name": "f1"
-      }
-     },
-     "output_type": "display_data"
-    }
-   ],
-   "source": [
-    "# for testing\n",
-    "sb.glue(\"accuracy\", accuracy)\n",
-    "sb.glue(\"precision\", report[\"macro avg\"][\"precision\"])\n",
-    "sb.glue(\"recall\", report[\"macro avg\"][\"recall\"])\n",
-    "sb.glue(\"f1\", report[\"macro avg\"][\"f1-score\"])"
-   ]
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "nlp_gpu",
-   "language": "python",
-   "name": "nlp_gpu"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.6.8"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
--- a/examples/text_classification/tc_mnli_xlnet.ipynb
+++ b/examples/text_classification/tc_mnli_xlnet.ipynb
--- a/examples/text_classification/tc_multi_languages_transformers.ipynb
+++ b/examples/text_classification/tc_multi_languages_transformers.ipynb
@ -440,7 +440,7 @@
    "    test_labels, \n",
    "    preds,\n",
    "    digits=2,\n",
-    "    labels=test_labels.unique(),\n",
+    "    labels=np.unique(test_labels),\n",
    "    target_names=label_encoder.classes_\n",
    ")\n",
    "\n",
--- a/setup.py
+++ b/setup.py
@ -29,7 +29,7 @@ setup(
    ),
    author=AUTHOR,
    author_email="teamsharat@microsoft.com",
-    url="https://github.com/microsoft/nlp",
+    url="https://github.com/microsoft/nlp-recipes",
    packages=["utils_nlp"],
    include_package_data=True,
    zip_safe=True,
@ -56,10 +56,16 @@ setup(
        "Intended Audience :: Telecommunications Industry",
    ],
    project_urls={
-        "Documentation": "https://github.com/microsoft/nlp/",
-        "Issue Tracker": "https://github.com/microsoft/nlp/issues",
+        "Documentation": "https://github.com/microsoft/nlp-recipes/",
+        "Issue Tracker": "https://github.com/microsoft/nlp-recipes/issues",
    },
-    keywords=["Microsoft NLP", "Natural Language Processing", "Text Processing", "Word Embedding"],
+    keywords=[
+        "Microsoft NLP",
+        "NLP Recipes",
+        "Natural Language Processing",
+        "Text Processing",
+        "Word Embedding",
+    ],
    python_requires=">=3.6",
    install_requires=[],
    dependency_links=[],
--- a/tests/conftest.py
+++ b/tests/conftest.py
@ -76,12 +76,6 @@ def notebooks():
        "tc_mnli_transformers": os.path.join(
            folder_notebooks, "text_classification", "tc_mnli_transformers.ipynb"
        ),
-        "tc_dac_bert_ar": os.path.join(
-            folder_notebooks, "text_classification", "tc_dac_bert_ar.ipynb"
-        ),
-        "tc_bbc_bert_hi": os.path.join(
-            folder_notebooks, "text_classification", "tc_bbc_bert_hi.ipynb"
-        ),
        "tc_multi_languages_transformers": os.path.join(
            folder_notebooks, "text_classification", "tc_multi_languages_transformers.ipynb"
        ),
@ -104,6 +98,15 @@ def tmp(tmp_path_factory):
        td.cleanup()


+@pytest.fixture(scope="module")
+def tmp_module(tmp_path_factory):
+    td = TemporaryDirectory(dir=tmp_path_factory.getbasetemp())
+    try:
+        yield td.name
+    finally:
+        td.cleanup()
+
+
@pytest.fixture(scope="module")
 def ner_test_data():
    UNIQUE_LABELS = ["O", "I-LOC", "I-MISC", "I-PER", "I-ORG", "X"]
--- a/tests/integration/test_notebooks_text_classification.py
+++ b/tests/integration/test_notebooks_text_classification.py
@ -37,50 +37,6 @@ def test_tc_mnli_transformers(notebooks, tmp):
    assert pytest.approx(result["f1"], 0.89, abs=ABS_TOL)


-@pytest.mark.gpu
-@pytest.mark.integration
-def test_tc_dac_bert_ar(notebooks, tmp):
-    notebook_path = notebooks["tc_dac_bert_ar"]
-    pm.execute_notebook(
-        notebook_path,
-        OUTPUT_NOTEBOOK,
-        kernel_name=KERNEL_NAME,
-        parameters=dict(
-            NUM_GPUS=1,
-            DATA_FOLDER=tmp,
-            BERT_CACHE_DIR=tmp,
-            MAX_LEN=175,
-            BATCH_SIZE=16,
-            NUM_EPOCHS=1,
-            TRAIN_SIZE=0.8,
-            NUM_ROWS=8000,
-            RANDOM_STATE=0,
-        ),
-    )
-    result = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict
-    assert pytest.approx(result["accuracy"], 0.871, abs=ABS_TOL)
-    assert pytest.approx(result["precision"], 0.865, abs=ABS_TOL)
-    assert pytest.approx(result["recall"], 0.852, abs=ABS_TOL)
-    assert pytest.approx(result["f1"], 0.845, abs=ABS_TOL)
-
-
-@pytest.mark.gpu
-@pytest.mark.integration
-def test_tc_bbc_bert_hi(notebooks, tmp):
-    notebook_path = notebooks["tc_bbc_bert_hi"]
-    pm.execute_notebook(
-        notebook_path,
-        OUTPUT_NOTEBOOK,
-        kernel_name=KERNEL_NAME,
-        parameters=dict(NUM_GPUS=1, DATA_FOLDER=tmp, BERT_CACHE_DIR=tmp, NUM_EPOCHS=1),
-    )
-    result = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict
-    assert pytest.approx(result["accuracy"], 0.71, abs=ABS_TOL)
-    assert pytest.approx(result["precision"], 0.25, abs=ABS_TOL)
-    assert pytest.approx(result["recall"], 0.28, abs=ABS_TOL)
-    assert pytest.approx(result["f1"], 0.26, abs=ABS_TOL)
-
-
@pytest.mark.integration
@pytest.mark.azureml
@pytest.mark.gpu
@ -118,6 +74,7 @@ def test_tc_bert_azureml(
    if os.path.exists("outputs"):
        shutil.rmtree("outputs")

+
@pytest.mark.gpu
@pytest.mark.integration
 def test_multi_languages_transformer(notebooks, tmp):
@ -126,10 +83,7 @@ def test_multi_languages_transformer(notebooks, tmp):
        notebook_path,
        OUTPUT_NOTEBOOK,
        kernel_name=KERNEL_NAME,
-        parameters={
-            "QUICK_RUN": True,
-            "USE_DATASET": "dac"
-        },
+        parameters={"QUICK_RUN": True, "USE_DATASET": "dac"},
    )
    result = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict
    assert pytest.approx(result["precision"], 0.94, abs=ABS_TOL)
--- a/tests/unit/test_bert_sentence_encoding.py
+++ b/tests/unit/test_bert_sentence_encoding.py
@ -18,8 +18,7 @@ def data():
    ]


-@pytest.mark.cpu
-def test_sentence_encoding(data):
+def test_sentence_encoding(tmp, data):
    se = BERTSentenceEncoder(
        language=Language.ENGLISH,
        num_gpus=0,
@ -27,6 +26,7 @@ def test_sentence_encoding(data):
        max_len=128,
        layer_index=-2,
        pooling_strategy=PoolingStrategy.MEAN,
+        cache_dir=tmp,
    )

    result = se.encode(data, as_numpy=False)
--- a/tests/unit/test_models_transformers_question_answering.py
+++ b/tests/unit/test_models_transformers_question_answering.py
@ -17,8 +17,8 @@ NUM_GPUS = max(1, torch.cuda.device_count())
 BATCH_SIZE = 8


-@pytest.fixture()
-def qa_test_data(qa_test_df, tmp):
+@pytest.fixture(scope="module")
+def qa_test_data(qa_test_df, tmp_module):

    train_dataset = QADataset(
        df=qa_test_df["test_df"],
@ -63,7 +63,7 @@ def qa_test_data(qa_test_df, tmp):
        qa_id_col=qa_test_df["qa_id_col"],
    )

-    qa_processor_bert = QAProcessor()
+    qa_processor_bert = QAProcessor(cache_dir=tmp_module)
    train_features_bert = qa_processor_bert.preprocess(
        train_dataset,
        batch_size=BATCH_SIZE,
@ -72,7 +72,7 @@ def qa_test_data(qa_test_df, tmp):
        max_question_length=16,
        max_seq_length=64,
        doc_stride=32,
-        feature_cache_dir=tmp,
+        feature_cache_dir=tmp_module,
    )

    test_features_bert = qa_processor_bert.preprocess(
@ -83,10 +83,10 @@ def qa_test_data(qa_test_df, tmp):
        max_question_length=16,
        max_seq_length=64,
        doc_stride=32,
-        feature_cache_dir=tmp,
+        feature_cache_dir=tmp_module,
    )

-    qa_processor_xlnet = QAProcessor(model_name="xlnet-base-cased")
+    qa_processor_xlnet = QAProcessor(model_name="xlnet-base-cased", cache_dir=tmp_module)
    train_features_xlnet = qa_processor_xlnet.preprocess(
        train_dataset,
        batch_size=BATCH_SIZE,
@ -95,7 +95,7 @@ def qa_test_data(qa_test_df, tmp):
        max_question_length=16,
        max_seq_length=64,
        doc_stride=32,
-        feature_cache_dir=tmp,
+        feature_cache_dir=tmp_module,
    )

    test_features_xlnet = qa_processor_xlnet.preprocess(
@ -106,10 +106,12 @@ def qa_test_data(qa_test_df, tmp):
        max_question_length=16,
        max_seq_length=64,
        doc_stride=32,
-        feature_cache_dir=tmp,
+        feature_cache_dir=tmp_module,
    )

-    qa_processor_distilbert = QAProcessor(model_name="distilbert-base-uncased")
+    qa_processor_distilbert = QAProcessor(
+        model_name="distilbert-base-uncased", cache_dir=tmp_module
+    )
    train_features_distilbert = qa_processor_distilbert.preprocess(
        train_dataset,
        batch_size=BATCH_SIZE,
@ -118,7 +120,7 @@ def qa_test_data(qa_test_df, tmp):
        max_question_length=16,
        max_seq_length=64,
        doc_stride=32,
-        feature_cache_dir=tmp,
+        feature_cache_dir=tmp_module,
    )

    test_features_distilbert = qa_processor_distilbert.preprocess(
@ -129,7 +131,7 @@ def qa_test_data(qa_test_df, tmp):
        max_question_length=16,
        max_seq_length=64,
        doc_stride=32,
-        feature_cache_dir=tmp,
+        feature_cache_dir=tmp_module,
    )

    return {
@ -147,103 +149,136 @@ def qa_test_data(qa_test_df, tmp):
    }


-def test_QAProcessor(qa_test_data, tmp):
+@pytest.mark.gpu
+def test_QAProcessor(qa_test_data, tmp_module):
    for model_name in ["bert-base-cased", "xlnet-base-cased", "distilbert-base-uncased"]:
-        qa_processor = QAProcessor(model_name=model_name)
-        qa_processor.preprocess(qa_test_data["train_dataset"], is_training=True)
-        qa_processor.preprocess(qa_test_data["train_dataset_list"], is_training=True)
-        qa_processor.preprocess(qa_test_data["test_dataset"], is_training=False)
+        qa_processor = QAProcessor(model_name=model_name, cache_dir=tmp_module)
+        qa_processor.preprocess(
+            qa_test_data["train_dataset"], is_training=True, feature_cache_dir=tmp_module
+        )
+        qa_processor.preprocess(
+            qa_test_data["train_dataset_list"], is_training=True, feature_cache_dir=tmp_module
+        )
+        qa_processor.preprocess(
+            qa_test_data["test_dataset"], is_training=False, feature_cache_dir=tmp_module
+        )

    # test unsupported model type
    with pytest.raises(ValueError):
-        qa_processor = QAProcessor(model_name="abc")
+        qa_processor = QAProcessor(model_name="abc", cache_dir=tmp_module)

    # test training data has no ground truth exception
    with pytest.raises(Exception):
-        qa_processor.preprocess(qa_test_data["test_dataset"], is_training=True)
+        qa_processor.preprocess(
+            qa_test_data["test_dataset"], is_training=True, feature_cache_dir=tmp_module
+        )

    # test when answer start is a list, but answer text is not
    with pytest.raises(Exception):
-        qa_processor.preprocess(qa_test_data["train_dataset_start_text_mismatch"], is_training=True)
+        qa_processor.preprocess(
+            qa_test_data["train_dataset_start_text_mismatch"],
+            is_training=True,
+            feature_cache_dir=tmp_module,
+        )

    # test when training data has multiple answers
    with pytest.raises(Exception):
-        qa_processor.preprocess(qa_test_data["train_dataset_multi_answers"], is_training=True)
+        qa_processor.preprocess(
+            qa_test_data["train_dataset_multi_answers"],
+            is_training=True,
+            feature_cache_dir=tmp_module,
+        )


-def test_AnswerExtractor(qa_test_data, tmp):
+def test_AnswerExtractor(qa_test_data, tmp_module):
    # test bert
-    qa_extractor_bert = AnswerExtractor(cache_dir=tmp)
+    qa_extractor_bert = AnswerExtractor(cache_dir=tmp_module)
    qa_extractor_bert.fit(qa_test_data["train_features_bert"], cache_model=True)

    # test saving fine-tuned model
-    model_output_dir = os.path.join(tmp, "fine_tuned")
+    model_output_dir = os.path.join(tmp_module, "fine_tuned")
    assert os.path.exists(os.path.join(model_output_dir, "pytorch_model.bin"))
    assert os.path.exists(os.path.join(model_output_dir, "config.json"))

-    qa_extractor_from_cache = AnswerExtractor(cache_dir=tmp, load_model_from_dir=model_output_dir)
+    qa_extractor_from_cache = AnswerExtractor(
+        cache_dir=tmp_module, load_model_from_dir=model_output_dir
+    )
    qa_extractor_from_cache.predict(qa_test_data["test_features_bert"])

-    qa_extractor_xlnet = AnswerExtractor(model_name="xlnet-base-cased", cache_dir=tmp)
+    qa_extractor_xlnet = AnswerExtractor(model_name="xlnet-base-cased", cache_dir=tmp_module)
    qa_extractor_xlnet.fit(qa_test_data["train_features_xlnet"], cache_model=False)
    qa_extractor_xlnet.predict(qa_test_data["test_features_xlnet"])

-    qa_extractor_distilbert = AnswerExtractor(model_name="distilbert-base-uncased", cache_dir=tmp)
+    qa_extractor_distilbert = AnswerExtractor(
+        model_name="distilbert-base-uncased", cache_dir=tmp_module
+    )
    qa_extractor_distilbert.fit(qa_test_data["train_features_distilbert"], cache_model=False)
    qa_extractor_distilbert.predict(qa_test_data["test_features_distilbert"])


-def test_postprocess_bert_answer(qa_test_data, tmp):
-    qa_processor = QAProcessor()
+def test_postprocess_bert_answer(qa_test_data, tmp_module):
+    qa_processor = QAProcessor(cache_dir=tmp_module)
    test_features = qa_processor.preprocess(
        qa_test_data["test_dataset"],
        is_training=False,
        max_question_length=16,
        max_seq_length=64,
        doc_stride=32,
-        feature_cache_dir=tmp,
+        feature_cache_dir=tmp_module,
    )
-    qa_extractor = AnswerExtractor(cache_dir=tmp)
+    qa_extractor = AnswerExtractor(cache_dir=tmp_module)
    predictions = qa_extractor.predict(test_features)

    qa_processor.postprocess(
        results=predictions,
-        examples_file=os.path.join(tmp, CACHED_EXAMPLES_TEST_FILE),
-        features_file=os.path.join(tmp, CACHED_FEATURES_TEST_FILE),
+        examples_file=os.path.join(tmp_module, CACHED_EXAMPLES_TEST_FILE),
+        features_file=os.path.join(tmp_module, CACHED_FEATURES_TEST_FILE),
+        output_prediction_file=os.path.join(tmp_module, "qa_predictions.json"),
+        output_nbest_file=os.path.join(tmp_module, "nbest_predictions.json"),
+        output_null_log_odds_file=os.path.join(tmp_module, "null_odds.json"),
    )

    qa_processor.postprocess(
        results=predictions,
-        examples_file=os.path.join(tmp, CACHED_EXAMPLES_TEST_FILE),
-        features_file=os.path.join(tmp, CACHED_FEATURES_TEST_FILE),
+        examples_file=os.path.join(tmp_module, CACHED_EXAMPLES_TEST_FILE),
+        features_file=os.path.join(tmp_module, CACHED_FEATURES_TEST_FILE),
        unanswerable_exists=True,
        verbose_logging=True,
+        output_prediction_file=os.path.join(tmp_module, "qa_predictions.json"),
+        output_nbest_file=os.path.join(tmp_module, "nbest_predictions.json"),
+        output_null_log_odds_file=os.path.join(tmp_module, "null_odds.json"),
    )


-def test_postprocess_xlnet_answer(qa_test_data, tmp):
-    qa_processor = QAProcessor(model_name="xlnet-base-cased")
+def test_postprocess_xlnet_answer(qa_test_data, tmp_module):
+    qa_processor = QAProcessor(model_name="xlnet-base-cased", cache_dir=tmp_module)
    test_features = qa_processor.preprocess(
        qa_test_data["test_dataset"],
        is_training=False,
        max_question_length=16,
        max_seq_length=64,
        doc_stride=32,
-        feature_cache_dir=tmp,
+        feature_cache_dir=tmp_module,
    )
-    qa_extractor = AnswerExtractor(model_name="xlnet-base-cased", cache_dir=tmp)
+    qa_extractor = AnswerExtractor(model_name="xlnet-base-cased", cache_dir=tmp_module)
    predictions = qa_extractor.predict(test_features)

    qa_processor.postprocess(
        results=predictions,
-        examples_file=os.path.join(tmp, CACHED_EXAMPLES_TEST_FILE),
-        features_file=os.path.join(tmp, CACHED_FEATURES_TEST_FILE),
+        examples_file=os.path.join(tmp_module, CACHED_EXAMPLES_TEST_FILE),
+        features_file=os.path.join(tmp_module, CACHED_FEATURES_TEST_FILE),
+        output_prediction_file=os.path.join(tmp_module, "qa_predictions.json"),
+        output_nbest_file=os.path.join(tmp_module, "nbest_predictions.json"),
+        output_null_log_odds_file=os.path.join(tmp_module, "null_odds.json"),
    )

    qa_processor.postprocess(
        results=predictions,
-        examples_file=os.path.join(tmp, CACHED_EXAMPLES_TEST_FILE),
-        features_file=os.path.join(tmp, CACHED_FEATURES_TEST_FILE),
+        examples_file=os.path.join(tmp_module, CACHED_EXAMPLES_TEST_FILE),
+        features_file=os.path.join(tmp_module, CACHED_FEATURES_TEST_FILE),
        unanswerable_exists=True,
        verbose_logging=True,
+        output_prediction_file=os.path.join(tmp_module, "qa_predictions.json"),
+        output_nbest_file=os.path.join(tmp_module, "nbest_predictions.json"),
+        output_null_log_odds_file=os.path.join(tmp_module, "null_odds.json"),
    )
--- a/tests/unit/test_notebooks_cpu.py
+++ b/tests/unit/test_notebooks_cpu.py
@ -9,17 +9,13 @@ from utils_nlp.models.bert.common import Language


@pytest.mark.notebooks
-def test_bert_encoder(notebooks):
+def test_bert_encoder(notebooks, tmp):
    notebook_path = notebooks["bert_encoder"]
    pm.execute_notebook(
        notebook_path,
        OUTPUT_NOTEBOOK,
        kernel_name=KERNEL_NAME,
        parameters=dict(
-            NUM_GPUS=0,
-            LANGUAGE=Language.ENGLISH,
-            TO_LOWER=True,
-            MAX_SEQ_LENGTH=128,
-            CACHE_DIR="./temp",
+            NUM_GPUS=0, LANGUAGE=Language.ENGLISH, TO_LOWER=True, MAX_SEQ_LENGTH=128, CACHE_DIR=tmp
        ),
    )
--- a/tests/unit/test_notebooks_gpu.py
+++ b/tests/unit/test_notebooks_gpu.py
@ -10,17 +10,13 @@ from utils_nlp.models.bert.common import Language

@pytest.mark.notebooks
@pytest.mark.gpu
-def test_bert_encoder(notebooks):
+def test_bert_encoder(notebooks, tmp):
    notebook_path = notebooks["bert_encoder"]
    pm.execute_notebook(
        notebook_path,
        OUTPUT_NOTEBOOK,
        kernel_name=KERNEL_NAME,
        parameters=dict(
-            NUM_GPUS=1,
-            LANGUAGE=Language.ENGLISH,
-            TO_LOWER=True,
-            MAX_SEQ_LENGTH=128,
-            CACHE_DIR="./temp",
+            NUM_GPUS=1, LANGUAGE=Language.ENGLISH, TO_LOWER=True, MAX_SEQ_LENGTH=128, CACHE_DIR=tmp
        ),
    )
--- a/tests/unit/test_xlnet_common.py
+++ b/tests/unit/test_xlnet_common.py
@ -1,27 +0,0 @@
-# Copyright (c) Microsoft Corporation. All rights reserved.
-# Licensed under the MIT License.
-
-import pytest
-
-def test_preprocess_classification_tokens(xlnet_english_tokenizer):
-    text = ["Hello World.",
-            "How you doing?",
-            "greatttt",
-            "The quick, brown fox jumps over a lazy dog.",
-            " DJs flock by when MTV ax quiz prog",
-            "Quick wafting zephyrs vex bold Jim",
-            "Quick, Baz, get my woven flax jodhpurs!"            
-           ]
-    seq_length = 5
-    input_ids, input_mask, segment_ids = xlnet_english_tokenizer.preprocess_classification_tokens(text, seq_length)
-    
-    assert len(input_ids) == len(text)
-    assert len(input_mask) == len(text)
-    assert len(segment_ids) == len(text)
-    
-    
-    for sentence in range(len(text)):
-        assert len(input_ids[sentence]) == seq_length
-        assert len(input_mask[sentence]) == seq_length
-        assert len(segment_ids[sentence]) == seq_length
-    
--- a/tests/unit/test_xlnet_sequence_classification.py
+++ b/tests/unit/test_xlnet_sequence_classification.py
@ -1,44 +0,0 @@
-# Copyright (c) Microsoft Corporation. All rights reserved.
-# Licensed under the MIT License.
-
-import pytest
-
-from utils_nlp.models.xlnet.common import Language
-from utils_nlp.models.xlnet.sequence_classification import XLNetSequenceClassifier
-
-
-@pytest.fixture()
-def data():
-    return (
-        ["hi", "hello", "what's wrong with us", "can I leave?"],
-        [0, 0, 1, 2],
-        ["hey", "i will", "be working from", "home today"],
-        [2, 1, 1, 0],
-    )
-
-
-def test_classifier(xlnet_english_tokenizer, data):
-    token_ids, input_mask, segment_ids = xlnet_english_tokenizer.preprocess_classification_tokens(
-        data[0], max_seq_length=10
-    )
-
-    val_data = xlnet_english_tokenizer.preprocess_classification_tokens(data[2], max_seq_length=10)
-
-    val_token_ids, val_input_mask, val_segment_ids = val_data
-
-    classifier = XLNetSequenceClassifier(language=Language.ENGLISHCASED, num_labels=3)
-    classifier.fit(
-        token_ids=token_ids,
-        input_mask=input_mask,
-        token_type_ids=segment_ids,
-        labels=data[1],
-        val_token_ids=val_token_ids,
-        val_input_mask=val_input_mask,
-        val_labels=data[3],
-        val_token_type_ids=val_segment_ids,
-    )
-
-    preds = classifier.predict(
-        token_ids=token_ids, input_mask=input_mask, token_type_ids=segment_ids
-    )
-    assert len(preds) == len(data[1])
--- a/tools/generate_conda_file.py
+++ b/tools/generate_conda_file.py
@ -29,6 +29,7 @@ $ python -m ipykernel install --user --name {conda_env} \
 --display-name "Python ({conda_env})"
 """

+
 CHANNELS = ["defaults", "conda-forge", "pytorch"]

 CONDA_BASE = {
@ -63,14 +64,14 @@ PIP_BASE = {
    "azureml-train-automl": "azureml-train-automl==1.0.57",
    "azureml-dataprep": "azureml-dataprep==1.1.8",
    "azureml-widgets": "azureml-widgets==1.0.57",
-    "azureml-mlflow": "azureml-mlflow>=1.0.43.1",
+    "azureml-mlflow": "azureml-mlflow==1.0.57",
    "black": "black>=18.6b4",
    "cached-property": "cached-property==1.5.1",
    "jsonlines": "jsonlines>=1.2.0",
    "nteract-scrapbook": "nteract-scrapbook>=0.2.1",
    "pydocumentdb": "pydocumentdb>=2.3.3",
    "pytorch-pretrained-bert": "pytorch-pretrained-bert>=0.6",
-    "tqdm": "tqdm==4.31.1",
+    "tqdm": "tqdm==4.32.2",
    "pyemd": "pyemd==0.5.1",
    "ipywebrtc": "ipywebrtc==0.4.3",
    "pre-commit": "pre-commit>=1.14.4",
@ -82,7 +83,7 @@ PIP_BASE = {
        "https://github.com/explosion/spacy-models/releases/download/"
        "en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz"
    ),
-    "transformers": "transformers>=2.0.0",
+    "transformers": "transformers==2.1.1",
    "gensim": "gensim>=3.7.0",
    "nltk": "nltk>=3.4",
    "seqeval": "seqeval>=0.0.12",
--- a/utils_nlp/README.md
+++ b/utils_nlp/README.md
@ -26,7 +26,7 @@ ws = get_or_create_workspace(
 This submodule contains high-level utilities that are commonly used in multiple algorithms as well as helper functions for managing frameworks like pytorch.

 ### [Dataset](dataset)
-This submodule includes helper functions for interacting with well-known datasets,  utility functions to process datasets for different NLP tasks, as well as utilities for splitting data for training/testing. For example, the [snli module](snli.py) will allow you to load a dataframe in pandas from the  Stanford Natural Language Inference (SNLI) Corpus dataset, with the option to set the number of rows to load in order to test algorithms and evaluate performance benchmarks. Information on the datasets used in the repo can be found [here](https://github.com/microsoft/nlp/tree/staging/utils_nlp/dataset#datasets).
+This submodule includes helper functions for interacting with well-known datasets,  utility functions to process datasets for different NLP tasks, as well as utilities for splitting data for training/testing. For example, the [snli module](snli.py) will allow you to load a dataframe in pandas from the  Stanford Natural Language Inference (SNLI) Corpus dataset, with the option to set the number of rows to load in order to test algorithms and evaluate performance benchmarks. Information on the datasets used in the repo can be found [here](https://github.com/microsoft/nlp-recipes/tree/staging/utils_nlp/dataset#datasets).

 Most datasets may be split into `train`, `dev`, and `test`.