Merge pull request #63 from microsoft/bleik

Bleik to staging
2019-06-06 15:59:33 -04:00 · 2019-06-06 15:59:33 -04:00 · 3e47deee70
--- a/README.md
+++ b/README.md
@@ -1,14 +1,22 @@

-| Branch | Status |     | Branch | Status | 
-|  ---   |  ---   | --- |  ---   |  ---   |
-| master | [![Build Status](https://dev.azure.com/best-practices/nlp/_apis/build/status/unit-test-master?branchName=master)](https://dev.azure.com/best-practices/nlp/_build/latest?definitionId=22&branchName=master) |  | staging | [![Build Status](https://dev.azure.com/best-practices/nlp/_apis/build/status/unit-test-staging?branchName=staging)](https://dev.azure.com/best-practices/nlp/_build/latest?definitionId=21&branchName=staging) |
+| Branch | Status                                                                                                                                                                                                      |     | Branch  | Status                                                                                                                                                                                                         |
+| ------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --- | ------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| master | [![Build Status](https://dev.azure.com/best-practices/nlp/_apis/build/status/unit-test-master?branchName=master)](https://dev.azure.com/best-practices/nlp/_build/latest?definitionId=22&branchName=master) |     | staging | [![Build Status](https://dev.azure.com/best-practices/nlp/_apis/build/status/unit-test-staging?branchName=staging)](https://dev.azure.com/best-practices/nlp/_build/latest?definitionId=21&branchName=staging) |


 # NLP Best Practices

-This repository will provide examples and best practices for building NLP systems, provided as Jupyter notebooks.
+This repository will provide examples and best practices for building NLP systems, in the form of Jupyter notebooks and utility functions.


+## Scenarios
+
+| Scenario                 | Applications                                 | Languages | Models         |
+| ------------------------ | -------------------------------------------- | --------- | -------------- |
+| Text Classification      | Sentiment Analysis <br> Topic Classification | English   | BERT, fastText |
+| Named Entity Recognition |                                              | English   | BERT           |
+| Sentence Encoding        | Sentence Similarity                          | English   |                |
+

 ## Planning etc documents

--- a/scenarios/text_classification/tc_mnli_bert.ipynb
+++ b/scenarios/text_classification/tc_mnli_bert.ipynb
@@ -0,0 +1,530 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "*Copyright (c) Microsoft Corporation. All rights reserved.*\n",
+    "\n",
+    "*Licensed under the MIT License.*\n",
+    "\n",
+    "# Text Classification of MultiNLI Sentences using BERT"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys\n",
+    "sys.path.append(\"../../\")\n",
+    "import os\n",
+    "import pandas as pd\n",
+    "from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score\n",
+    "from sklearn.preprocessing import LabelEncoder\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "from utils_nlp.dataset.multinli import load_pandas_df\n",
+    "from utils_nlp.eval.classification import eval_classification\n",
+    "from utils_nlp.bert.sequence_classification import SequenceClassifier\n",
+    "from utils_nlp.bert.common import Language, Tokenizer\n",
+    "from utils_nlp.common.timer import Timer\n",
+    "import torch\n",
+    "import torch.nn as nn\n",
+    "import numpy as np"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Introduction\n",
+    "In this notebook, we fine-tune and evaluate a pretrained [BERT](https://arxiv.org/abs/1810.04805) model on a subset of the [MultiNLI](https://www.nyu.edu/projects/bowman/multinli/) dataset.\n",
+    "\n",
+    "We use a [sequence classifier](../../utils_nlp/bert/sequence_classification.py) that wraps [Hugging Face's PyTorch implementation](https://github.com/huggingface/pytorch-pretrained-BERT) of Google's [BERT](https://github.com/google-research/bert)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "DATA_FOLDER = \"./temp\"\n",
+    "BERT_CACHE_DIR = \"./temp\"\n",
+    "LANGUAGE = Language.ENGLISH\n",
+    "TO_LOWER = True\n",
+    "MAX_LEN = 150\n",
+    "BATCH_SIZE = 32\n",
+    "NUM_GPUS = 2\n",
+    "NUM_EPOCHS = 1\n",
+    "TRAIN_SIZE = 0.6\n",
+    "LABEL_COL = \"genre\"\n",
+    "TEXT_COL = \"sentence1\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Read Dataset\n",
+    "We start by loading a subset of the data. The following function also downloads and extracts the files, if they don't exist in the data folder.\n",
+    "\n",
+    "The MultiNLI dataset is mainly used for natural language inference (NLI) tasks, where the inputs are sentence pairs and the labels are entailment indicators. The sentence pairs are also classified into *genres* that allow for more coverage and better evaluation of NLI models.\n",
+    "\n",
+    "For our classification task, we use only the first sentence as the text input and the corresponding genre as the label. Because the same first sentence can appear in several sentence pairs, we keep only the examples with a single entailment label (*neutral* in this case) to avoid duplicate rows."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df = load_pandas_df(DATA_FOLDER, \"train\")\n",
+    "df = df[df[\"gold_label\"]==\"neutral\"]  # get unique sentences"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>genre</th>\n",
+       "      <th>sentence1</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>government</td>\n",
+       "      <td>Conceptually cream skimming has two basic dime...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>telephone</td>\n",
+       "      <td>yeah i tell you what though if you go price so...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>6</th>\n",
+       "      <td>travel</td>\n",
+       "      <td>But a few Christian mosaics survive above the ...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>12</th>\n",
+       "      <td>slate</td>\n",
+       "      <td>It's not that the questions they asked weren't...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>13</th>\n",
+       "      <td>travel</td>\n",
+       "      <td>Thebes held onto power until the 12th Dynasty,...</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "         genre                                          sentence1\n",
+       "0   government  Conceptually cream skimming has two basic dime...\n",
+       "4    telephone  yeah i tell you what though if you go price so...\n",
+       "6       travel  But a few Christian mosaics survive above the ...\n",
+       "12       slate  It's not that the questions they asked weren't...\n",
+       "13      travel  Thebes held onto power until the 12th Dynasty,..."
+      ]
+     },
+     "execution_count": 8,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "df[[LABEL_COL, TEXT_COL]].head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The examples in the dataset are grouped into 5 genres:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "telephone     27783\n",
+       "government    25784\n",
+       "travel        25783\n",
+       "fiction       25782\n",
+       "slate         25768\n",
+       "Name: genre, dtype: int64"
+      ]
+     },
+     "execution_count": 11,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "df[LABEL_COL].value_counts()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We split the data for training and testing, and encode the class labels:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# split\n",
+    "df_train, df_test = train_test_split(df, train_size=TRAIN_SIZE, random_state=0)\n",
+    "\n",
+    "# encode labels\n",
+    "label_encoder = LabelEncoder()\n",
+    "labels_train = label_encoder.fit_transform(df_train[LABEL_COL])\n",
+    "labels_test = label_encoder.transform(df_test[LABEL_COL])\n",
+    "\n",
+    "num_labels = len(np.unique(labels_train))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 20,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Number of unique labels: 5\n",
+      "Number of training examples: 78540\n",
+      "Number of testing examples: 52360\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(\"Number of unique labels: {}\".format(num_labels))\n",
+    "print(\"Number of training examples: {}\".format(df_train.shape[0]))\n",
+    "print(\"Number of testing examples: {}\".format(df_test.shape[0]))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Tokenize and Preprocess"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Before training, we tokenize the text documents and convert them to lists of tokens. The following steps instantiate a BERT tokenizer given the language, and tokenize the text of the training and testing sets."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "tokenizer = Tokenizer(LANGUAGE, to_lower=TO_LOWER, cache_dir=BERT_CACHE_DIR)\n",
+    "\n",
+    "tokens_train = tokenizer.tokenize(df_train[TEXT_COL])\n",
+    "tokens_test = tokenizer.tokenize(df_test[TEXT_COL])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In addition, we perform the following preprocessing steps in the cell below:\n",
+    "- Convert the tokens into token indices corresponding to the BERT tokenizer's vocabulary\n",
+    "- Add sentence markers\n",
+    "- Pad or truncate the token lists to the specified max length\n",
+    "- Return mask lists that mark which positions hold tokens and which hold padding\n",
+    "\n",
+    "*See the original [implementation](https://github.com/google-research/bert/blob/master/run_classifier.py) for more information on BERT's input format.*"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "tokens_train, mask_train = tokenizer.preprocess_classification_tokens(\n",
+    "    tokens_train, MAX_LEN\n",
+    ")\n",
+    "tokens_test, mask_test = tokenizer.preprocess_classification_tokens(\n",
+    "    tokens_test, MAX_LEN\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Create Model\n",
+    "Next, we create a sequence classifier that loads a pre-trained BERT model, given the language and number of labels."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "classifier = SequenceClassifier(\n",
+    "    language=LANGUAGE, num_labels=num_labels, cache_dir=BERT_CACHE_DIR\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Train\n",
+    "We train the classifier using the training examples. This involves fine-tuning the BERT Transformer and learning a linear classification layer on top of that:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "t_total value of -1 results in schedule not being applied\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "epoch:1/1; batch:1->246/2454; loss:1.631739\n",
+      "epoch:1/1; batch:247->492/2454; loss:0.427608\n",
+      "epoch:1/1; batch:493->738/2454; loss:0.255493\n",
+      "epoch:1/1; batch:739->984/2454; loss:0.286230\n",
+      "epoch:1/1; batch:985->1230/2454; loss:0.375268\n",
+      "epoch:1/1; batch:1231->1476/2454; loss:0.146290\n",
+      "epoch:1/1; batch:1477->1722/2454; loss:0.092100\n",
+      "epoch:1/1; batch:1723->1968/2454; loss:0.009405\n",
+      "epoch:1/1; batch:1969->2214/2454; loss:0.038235\n",
+      "epoch:1/1; batch:2215->2460/2454; loss:0.098216\n",
+      "[Training time: 0.981 hrs]\n"
+     ]
+    }
+   ],
+   "source": [
+    "with Timer() as t:\n",
+    "    classifier.fit(\n",
+    "        token_ids=tokens_train,\n",
+    "        input_mask=mask_train,\n",
+    "        labels=labels_train,\n",
+    "        num_gpus=NUM_GPUS,\n",
+    "        num_epochs=NUM_EPOCHS,\n",
+    "        batch_size=BATCH_SIZE,\n",
+    "        verbose=True,\n",
+    "    )\n",
+    "print(\"[Training time: {:.3f} hrs]\".format(t.interval / 3600))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Score\n",
+    "We score the test set using the trained classifier:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "52384it [11:54, 88.50it/s]                           \n"
+     ]
+    }
+   ],
+   "source": [
+    "preds = classifier.predict(\n",
+    "    token_ids=tokens_test, input_mask=mask_test, num_gpus=NUM_GPUS, batch_size=BATCH_SIZE\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Evaluate Results\n",
+    "Finally, we compute accuracy and the per-class precision, recall, and F1 score on the test set."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 21,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
+      " accuracy: 0.938273\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>label</th>\n",
+       "      <th>precision</th>\n",
+       "      <th>recall</th>\n",
+       "      <th>f1</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>fiction</td>\n",
+       "      <td>0.917004</td>\n",
+       "      <td>0.925839</td>\n",
+       "      <td>0.921401</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>government</td>\n",
+       "      <td>0.961477</td>\n",
+       "      <td>0.928780</td>\n",
+       "      <td>0.944845</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>slate</td>\n",
+       "      <td>0.875161</td>\n",
+       "      <td>0.861535</td>\n",
+       "      <td>0.868295</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>telephone</td>\n",
+       "      <td>0.989105</td>\n",
+       "      <td>0.996609</td>\n",
+       "      <td>0.992843</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>travel</td>\n",
+       "      <td>0.943405</td>\n",
+       "      <td>0.973232</td>\n",
+       "      <td>0.958087</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "        label  precision    recall        f1\n",
+       "0     fiction   0.917004  0.925839  0.921401\n",
+       "1  government   0.961477  0.928780  0.944845\n",
+       "2       slate   0.875161  0.861535  0.868295\n",
+       "3   telephone   0.989105  0.996609  0.992843\n",
+       "4      travel   0.943405  0.973232  0.958087"
+      ]
+     },
+     "execution_count": 21,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "accuracy = accuracy_score(labels_test, preds)\n",
+    "precision = precision_score(labels_test, preds, average=None)\n",
+    "recall = recall_score(labels_test, preds, average=None)\n",
+    "f1 = f1_score(labels_test, preds, average=None)\n",
+    "\n",
+    "print(\"\\n accuracy: {:.6f}\".format(accuracy))\n",
+    "pd.DataFrame({\"label\": label_encoder.classes_, \"precision\": precision, \"recall\": recall, \"f1\": f1})"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.8"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
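
The preprocessing steps listed in the "Tokenize and Preprocess" section of the notebook above (convert tokens to ids, add sentence markers, pad or truncate, build masks) can be illustrated with a minimal sketch. It follows the inline logic that this same pull request removes from the Yahoo Answers notebook (truncate to `MAX_LEN - 2`, wrap in `[CLS]`/`[SEP]`, pad with `0`), but builds the mask from the sequence length rather than from the token ids. It is not the actual body of `Tokenizer.preprocess_classification_tokens`; the names `TOY_VOCAB` and `preprocess_tokens` and the toy vocabulary ids are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Toy vocabulary for illustration only (hypothetical ids);
# a real BERT vocabulary is loaded by the tokenizer.
TOY_VOCAB: Dict[str, int] = {
    "[PAD]": 0,
    "[UNK]": 1,
    "[CLS]": 2,
    "[SEP]": 3,
    "hello": 4,
    "world": 5,
}


def preprocess_tokens(
    token_lists: List[List[str]],
    max_len: int,
    vocab: Dict[str, int] = TOY_VOCAB,
) -> Tuple[List[List[int]], List[List[int]]]:
    """Hypothetical helper: turn token lists into fixed-length ids and masks."""
    all_ids, all_masks = [], []
    for tokens in token_lists:
        # Truncate, leaving room for the two sentence markers.
        tokens = ["[CLS]"] + tokens[: max_len - 2] + ["[SEP]"]
        # Map tokens to vocabulary indices; unknown tokens map to [UNK].
        ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
        # Mask is 1 for real tokens and 0 for padding positions.
        num_pad = max_len - len(ids)
        mask = [1] * len(ids) + [0] * num_pad
        ids = ids + [vocab["[PAD]"]] * num_pad
        all_ids.append(ids)
        all_masks.append(mask)
    return all_ids, all_masks


if __name__ == "__main__":
    ids, masks = preprocess_tokens([["hello", "world"], ["hello"] * 10], max_len=5)
    print(ids)
    print(masks)
```

With `max_len=5`, the two-token example is padded by one position and the ten-token example is truncated to three tokens plus the two markers, so every row has the same length and can be batched directly.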
--- a/scenarios/text_classification/tc_yahoo_answers_bert.ipynb
+++ b/scenarios/text_classification/tc_yahoo_answers_bert.ipynb
@@ -6,7 +6,6 @@
   "source": [
    "*Copyright (c) Microsoft Corporation. All rights reserved.*\n",
    "\n",
-    "\n",
    "*Licensed under the MIT License.*\n",
    "\n",
    "# Text Classification of Yahoo Answers using BERT\n"
@@ -14,7 +13,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -22,32 +21,33 @@
    "sys.path.append(\"../../\")\n",
    "import os\n",
    "import pandas as pd\n",
+    "from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score\n",
    "import utils_nlp.dataset.yahoo_answers as ya_dataset\n",
    "from utils_nlp.eval.classification import eval_classification\n",
+    "from utils_nlp.bert.sequence_classification import SequenceClassifier\n",
+    "from utils_nlp.bert.common import Language, Tokenizer\n",
+    "from utils_nlp.common.timer import Timer\n",
    "import torch\n",
    "import torch.nn as nn\n",
-    "from pytorch_pretrained_bert.modeling import BertForSequenceClassification\n",
-    "from pytorch_pretrained_bert.optimization import BertAdam\n",
-    "from pytorch_pretrained_bert.tokenization import BertTokenizer\n",
-    "import numpy as np\n",
-    "from sklearn.metrics import f1_score"
+    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
-    "DATA_FOLDER = \"../../../temp\"\n",
+    "DATA_FOLDER = \"./temp\"\n",
    "TRAIN_FILE = \"yahoo_answers_csv/train.csv\"\n",
    "TEST_FILE = \"yahoo_answers_csv/test.csv\"\n",
-    "BERT_CACHE_DIR = \"../../../temp\"\n",
-    "MAX_LEN = 100\n",
-    "BATCH_SIZE = 32\n",
-    "UPDATE_EMBEDDINGS = False\n",
+    "BERT_CACHE_DIR = \"./temp\"\n",
+    "MAX_LEN = 250\n",
+    "BATCH_SIZE = 16\n",
+    "NUM_GPUS = 2\n",
    "NUM_EPOCHS = 1\n",
-    "NUM_ROWS_TRAIN = 100000  # number of training examples to read"
+    "NUM_ROWS_TRAIN = 50000  # number of training examples to read\n",
+    "NUM_ROWS_TEST = 20000  # number of test examples to read"
   ]
  },
  {
@@ -59,85 +59,98 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
-    "if (not os.path.isfile(os.path.join(DATA_FOLDER, TRAIN_FILE))) or (\n",
-    "    not os.path.isfile(os.path.join(DATA_FOLDER, TEST_FILE))\n",
-    "):\n",
-    "    ya_dataset.download(DATA_FOLDER)"
+    "# create the data folder (and any missing parents) if needed\n",
+    "os.makedirs(DATA_FOLDER, exist_ok=True)\n",
+    "ya_dataset.download(DATA_FOLDER)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## Read and Preprocess Dataset"
+    "## Read Dataset"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# read data\n",
-    "df_train = ya_dataset.read_data(os.path.join(DATA_FOLDER, TRAIN_FILE), nrows=NUM_ROWS_TRAIN)\n",
-    "df_test = ya_dataset.read_data(os.path.join(DATA_FOLDER, TEST_FILE), nrows=None)\n",
-    "y_train = ya_dataset.get_labels(df_train)\n",
-    "y_test = ya_dataset.get_labels(df_test)\n",
-    "\n",
-    "num_train_examples = df_train.shape[0]\n",
-    "num_test_examples = df_test.shape[0]\n",
-    "num_labels = len(np.unique(y_train))\n",
-    "\n",
-    "# clean/get text\n",
-    "text_train = ya_dataset.clean_data(df_train)\n",
-    "text_test = ya_dataset.clean_data(df_test)\n",
-    "\n",
-    "# get tokenizer\n",
-    "tokenizer = BertTokenizer.from_pretrained(\n",
-    "    \"bert-base-uncased\", do_lower_case=True, cache_dir=BERT_CACHE_DIR\n",
+    "df_train = ya_dataset.read_data(\n",
+    "    os.path.join(DATA_FOLDER, TRAIN_FILE), nrows=NUM_ROWS_TRAIN\n",
+    ")\n",
+    "df_test = ya_dataset.read_data(\n",
+    "    os.path.join(DATA_FOLDER, TEST_FILE), nrows=NUM_ROWS_TEST\n",
    ")\n",
-    "# tokenize and truncate\n",
-    "tokens_train = [tokenizer.tokenize(x)[0 : MAX_LEN - 2] for x in text_train]\n",
-    "tokens_test = [tokenizer.tokenize(x)[0 : MAX_LEN - 2] for x in text_test]\n",
    "\n",
-    "# BERT format\n",
-    "tokens_train = [[\"[CLS]\"] + x + [\"[SEP]\"] for x in tokens_train]\n",
-    "tokens_test = [[\"[CLS]\"] + x + [\"[SEP]\"] for x in tokens_test]\n",
+    "# get labels\n",
+    "labels_train = ya_dataset.get_labels(df_train)\n",
+    "labels_test = ya_dataset.get_labels(df_test)\n",
    "\n",
-    "# convert tokens to ids\n",
-    "tokens_train = [tokenizer.convert_tokens_to_ids(x) for x in tokens_train]\n",
-    "tokens_test = [tokenizer.convert_tokens_to_ids(x) for x in tokens_test]\n",
+    "num_labels = len(np.unique(labels_train))\n",
    "\n",
-    "# pad\n",
-    "tokens_train = [x + [0] * (MAX_LEN - len(x)) for x in tokens_train]\n",
-    "tokens_test = [x + [0] * (MAX_LEN - len(x)) for x in tokens_test]\n",
-    "\n",
-    "# create input mask\n",
-    "input_mask_train = [[min(1, x) for x in y] for y in tokens_train]\n",
-    "input_mask_test = [[min(1, x) for x in y] for y in tokens_test]"
+    "# get text\n",
+    "text_train = ya_dataset.get_text(df_train)\n",
+    "text_test = ya_dataset.get_text(df_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## Set Device"
+    "## Tokenize and Preprocess"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Before training, we tokenize the text documents and convert them to lists of tokens. The following steps instantiate a BERT tokenizer given the language, and tokenize the text of the training and test sets."
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
-    "# set device\n",
-    "device_str = \"cuda:0\" if torch.cuda.is_available() else \"cpu\"\n",
-    "device = torch.device(device_str)\n",
-    "print(\"using {} ...\".format(device_str))"
+    "tokenizer = Tokenizer(Language.ENGLISH, to_lower=True, cache_dir=BERT_CACHE_DIR)\n",
+    "\n",
+    "# tokenize\n",
+    "tokens_train = tokenizer.tokenize(text_train)\n",
+    "tokens_test = tokenizer.tokenize(text_test)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In addition, we perform the following preprocessing steps in the cell below:\n",
+    "- Convert the tokens into token indices corresponding to the BERT tokenizer's vocabulary\n",
+    "- Add sentence markers\n",
+    "- Pad or truncate the token lists to the specified max length\n",
+    "\n",
+    "*See the original [implementation](https://github.com/google-research/bert/blob/master/run_classifier.py) for more information on BERT's input format.*"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "tokens_train, mask_train = tokenizer.preprocess_classification_tokens(\n",
+    "    tokens_train, MAX_LEN\n",
+    ")\n",
+    "tokens_test, mask_test = tokenizer.preprocess_classification_tokens(\n",
+    "    tokens_test, MAX_LEN\n",
+    ")"
   ]
  },
  {
@@ -149,95 +162,67 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
-    "# define model\n",
-    "model = BertForSequenceClassification.from_pretrained(\n",
-    "    \"bert-base-uncased\", cache_dir=BERT_CACHE_DIR, num_labels=num_labels\n",
-    ").to(device)\n",
-    "\n",
-    "# define loss function\n",
-    "loss_func = nn.CrossEntropyLoss().to(device)\n",
-    "\n",
-    "# define optimizer and model parameters\n",
-    "param_optimizer = list(model.named_parameters())\n",
-    "no_decay = [\"bias\", \"LayerNorm.bias\", \"LayerNorm.weight\"]\n",
-    "optimizer_grouped_parameters = [\n",
-    "    {\n",
-    "        \"params\": [\n",
-    "            p for n, p in param_optimizer if not any(nd in n for nd in no_decay)\n",
-    "        ],\n",
-    "        \"weight_decay\": 0.01,\n",
-    "    },\n",
-    "    {\n",
-    "        \"params\": [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],\n",
-    "        \"weight_decay\": 0.0,\n",
-    "    },\n",
-    "]\n",
-    "opt = BertAdam(optimizer_grouped_parameters, lr=2e-5)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# check whether embedding layer is trainable\n",
-    "if not UPDATE_EMBEDDINGS:\n",
-    "    for p in model.bert.embeddings.parameters():\n",
-    "        p.requires_grad = False"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# use multiple GPUs if available\n",
-    "if torch.cuda.device_count() > 1:\n",
-    "    model = nn.DataParallel(model)\n",
-    "print(\"using {} GPUs...\".format(torch.cuda.device_count()))"
+    "classifier = SequenceClassifier(\n",
+    "    language=Language.ENGLISH, num_labels=num_labels, cache_dir=BERT_CACHE_DIR\n",
+    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## Train Model"
+    "## Train"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
+   "execution_count": 8,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "t_total value of -1 results in schedule not being applied\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "epoch:1/1; batch:1->313/3125; loss:2.469508\n",
+      "epoch:1/1; batch:314->626/3125; loss:1.179081\n",
+      "epoch:1/1; batch:627->939/3125; loss:0.677443\n",
+      "epoch:1/1; batch:940->1252/3125; loss:1.689727\n",
+      "epoch:1/1; batch:1253->1565/3125; loss:0.781167\n",
+      "epoch:1/1; batch:1566->1878/3125; loss:1.036024\n",
+      "epoch:1/1; batch:1879->2191/3125; loss:0.909294\n",
+      "epoch:1/1; batch:2192->2504/3125; loss:0.441344\n",
+      "epoch:1/1; batch:2505->2817/3125; loss:0.823389\n",
+      "epoch:1/1; batch:2818->3130/3125; loss:1.036229\n",
+      "[Training time: 1.132 hrs]\n"
+     ]
+    }
+   ],
   "source": [
    "# train\n",
-    "model.train()\n",
-    "num_batches = int(num_train_examples / BATCH_SIZE)\n",
-    "for epoch in range(NUM_EPOCHS):\n",
-    "    for i in range(num_batches):\n",
-    "        X_batch, mask_batch, y_batch = ya_dataset.get_batch_rnd(\n",
-    "            tokens_train, input_mask_train, y_train, num_train_examples, BATCH_SIZE\n",
-    "        )\n",
-    "        X_batch = torch.tensor(X_batch, dtype=torch.long, device=device)\n",
-    "        y_batch = torch.tensor(y_batch, dtype=torch.long, device=device)\n",
-    "        mask_batch = torch.tensor(mask_batch, dtype=torch.long, device=device)\n",
-    "        opt.zero_grad()\n",
-    "        y_h = model(X_batch, None, mask_batch, labels=None)\n",
-    "        loss = loss_func(y_h, y_batch)\n",
-    "        loss.backward()\n",
-    "        opt.step()\n",
-    "        if i % int(0.01 * num_batches) == 0:\n",
-    "            print(\n",
-    "                \"epoch:{}/{}; batch:{}/{}; loss:{}\".format(\n",
-    "                    epoch + 1, NUM_EPOCHS, i + 1, num_batches, loss.data\n",
-    "                )\n",
-    "            )"
+    "with Timer() as t:\n",
+    "    classifier.fit(\n",
+    "        token_ids=tokens_train,\n",
+    "        input_mask=mask_train,\n",
+    "        labels=labels_train,\n",
+    "        num_gpus=NUM_GPUS,\n",
+    "        num_epochs=NUM_EPOCHS,\n",
+    "        batch_size=BATCH_SIZE,\n",
+    "        verbose=True,\n",
+    "    )\n",
+    "print(\"[Training time: {:.3f} hrs]\".format(t.interval / 3600))"
   ]
  },
  {
@@ -249,33 +234,162 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 9,
   "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "100%|██████████| 20000/20000 [08:00<00:00, 41.85it/s]\n"
+     ]
+    }
+   ],
   "source": [
-    "# score\n",
-    "model.eval()\n",
-    "preds = []\n",
-    "for i in range(0, num_test_examples, BATCH_SIZE):\n",
-    "    X_batch, mask_batch, y_batch = ya_dataset.get_batch_by_idx(\n",
-    "        tokens_test, input_mask_test, y_test, i, BATCH_SIZE\n",
-    "    )\n",
-    "    X_batch = torch.tensor(X_batch, dtype=torch.long, device=device)\n",
-    "    y_batch = torch.tensor(y_batch, dtype=torch.long, device=device)\n",
-    "    mask_batch = torch.tensor(mask_batch, dtype=torch.long, device=device)\n",
-    "    with torch.no_grad():\n",
-    "        p_batch = model(X_batch, None, mask_batch, labels=None)\n",
-    "    preds.append(p_batch.cpu().data.numpy())\n",
-    "\n",
-    "preds = [x.argmax(1) for x in preds]\n",
-    "preds = np.concatenate(preds)"
+    "preds = classifier.predict(\n",
+    "    token_ids=tokens_test, input_mask=mask_test, num_gpus=NUM_GPUS, batch_size=BATCH_SIZE\n",
+    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## Evaluate Results"
+    "## Evaluate Results\n",
+    "Finally, we compute the accuracy on the test set, along with per-class precision, recall, and F1."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
+      " accuracy: 0.6564\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>precision</th>\n",
+       "      <th>recall</th>\n",
+       "      <th>f1</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>0.592506</td>\n",
+       "      <td>0.497053</td>\n",
+       "      <td>0.540598</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>0.749070</td>\n",
+       "      <td>0.673518</td>\n",
+       "      <td>0.709288</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>0.789308</td>\n",
+       "      <td>0.680955</td>\n",
+       "      <td>0.731139</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>0.561592</td>\n",
+       "      <td>0.440535</td>\n",
+       "      <td>0.493752</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>0.854772</td>\n",
+       "      <td>0.789272</td>\n",
+       "      <td>0.820717</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>5</th>\n",
+       "      <td>0.885998</td>\n",
+       "      <td>0.847659</td>\n",
+       "      <td>0.866404</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>6</th>\n",
+       "      <td>0.425440</td>\n",
+       "      <td>0.687416</td>\n",
+       "      <td>0.525592</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>7</th>\n",
+       "      <td>0.756364</td>\n",
+       "      <td>0.700337</td>\n",
+       "      <td>0.727273</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>8</th>\n",
+       "      <td>0.826006</td>\n",
+       "      <td>0.485432</td>\n",
+       "      <td>0.611496</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>9</th>\n",
+       "      <td>0.756186</td>\n",
+       "      <td>0.731039</td>\n",
+       "      <td>0.743400</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "   precision    recall        f1\n",
+       "0   0.592506  0.497053  0.540598\n",
+       "1   0.749070  0.673518  0.709288\n",
+       "2   0.789308  0.680955  0.731139\n",
+       "3   0.561592  0.440535  0.493752\n",
+       "4   0.854772  0.789272  0.820717\n",
+       "5   0.885998  0.847659  0.866404\n",
+       "6   0.425440  0.687416  0.525592\n",
+       "7   0.756364  0.700337  0.727273\n",
+       "8   0.826006  0.485432  0.611496\n",
+       "9   0.756186  0.731039  0.743400"
+      ]
+     },
+     "execution_count": 10,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "accuracy = accuracy_score(labels_test, preds)\n",
+    "precision = precision_score(labels_test, preds, average=None)\n",
+    "recall = recall_score(labels_test, preds, average=None)\n",
+    "f1 = f1_score(labels_test, preds, average=None)\n",
+    "\n",
+    "print(\"\\n accuracy: {}\".format(accuracy))\n",
+    "pd.DataFrame({\"precision\": precision, \"recall\": recall, \"f1\": f1})"
   ]
  },
  {
@ -283,20 +397,14 @@
   "execution_count": null,
   "metadata": {},
   "outputs": [],
-   "source": [
-    "# eval\n",
-    "eval_results = eval_classification(y_test, preds)\n",
-    "print(\"accuracy: {}\".format(eval_results[\"accuracy\"]))\n",
-    "print(\"precision: {}\".format(eval_results[\"precision\"]))\n",
-    "print(\"recall: {}\".format(eval_results[\"recall\"]))"
-   ]
+   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
-   "display_name": "Python [conda env:pytorch] *",
+   "display_name": "Python 3",
   "language": "python",
-   "name": "conda-env-pytorch-py"
+   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
@ -308,7 +416,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.7.3"
+   "version": "3.6.8"
  }
 },
 "nbformat": 4,
--- a/utils_nlp/bert/common.py
+++ b/utils_nlp/bert/common.py
@ -0,0 +1,80 @@
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+
+from pytorch_pretrained_bert.tokenization import BertTokenizer
+from enum import Enum
+
+# Max supported sequence length
+BERT_MAX_LEN = 512
+
+
+class Language(Enum):
+    """An enumeration of the supported languages."""
+
+    ENGLISH = "bert-base-uncased"
+    ENGLISHCASED = "bert-base-cased"
+    ENGLISHLARGE = "bert-large-uncased"
+    ENGLISHLARGECASED = "bert-large-cased"
+    CHINESE = "bert-base-chinese"
+    MULTILINGUAL = "bert-base-multilingual-cased"
+
+
+class Tokenizer:
+    def __init__(
+        self, language=Language.ENGLISH, to_lower=False, cache_dir="."
+    ):
+        """Initializes the underlying pretrained BERT tokenizer.
+        Args:
+            language (Language, optional): The pretrained model's language. Defaults to Language.ENGLISH.
+            to_lower (bool, optional): Whether to lowercase the input text. Defaults to False.
+            cache_dir (str, optional): Location of BERT's cache directory. Defaults to ".".
+        """
+        self.tokenizer = BertTokenizer.from_pretrained(
+            language.value, do_lower_case=to_lower, cache_dir=cache_dir
+        )
+        self.language = language
+
+    def tokenize(self, text):
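+        """Tokenizes a list of documents using the underlying BERT tokenizer.
+
+        Args:
+            text (list): List of strings (documents) to tokenize.
+
+        Returns:
+            list: List of token lists, one per input document.
+        """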
+        tokens = [self.tokenizer.tokenize(x) for x in text]
+        return tokens
+
+    def preprocess_classification_tokens(self, tokens, max_len=BERT_MAX_LEN):
+        """Preprocessing of input tokens:
+            - add BERT sentence markers ([CLS] and [SEP])
+            - map tokens to indices
+            - pad and truncate sequences
+            - create an input_mask
+        Args:
+            tokens (list): List of token lists to preprocess.
+            max_len (int, optional): Maximum number of tokens
+                            (documents will be truncated or padded).
+                            Defaults to 512.
+        Returns:
+            list of preprocessed token id lists
+            list of input mask lists
+        """
+        if max_len > BERT_MAX_LEN:
+            print(
+                "setting max_len to max allowed tokens: {}".format(
+                    BERT_MAX_LEN
+                )
+            )
+            max_len = BERT_MAX_LEN
+
+        # truncate and add BERT sentence markers
+        tokens = [["[CLS]"] + x[0 : max_len - 2] + ["[SEP]"] for x in tokens]
+        # convert tokens to indices
+        tokens = [self.tokenizer.convert_tokens_to_ids(x) for x in tokens]
+        # pad sequence
+        tokens = [x + [0] * (max_len - len(x)) for x in tokens]
+        # create input mask
+        input_mask = [[min(1, x) for x in y] for y in tokens]
+        return tokens, input_mask
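
The four preprocessing steps in `preprocess_classification_tokens` can be traced on a small input. The following is a minimal sketch, not the module's code: `TOY_VOCAB` and `preprocess` are hypothetical names, and the dictionary lookup stands in for `convert_tokens_to_ids`. The ids are made up, except that 0 is used for padding as in the diff above.

```python
# Toy vocabulary standing in for BERT's WordPiece vocab (hypothetical ids;
# 0 is the padding id, matching the `[0] * (...)` padding in the diff).
TOY_VOCAB = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "hello": 3, "world": 4, "again": 5}


def preprocess(tokens, max_len):
    """Mirror the four steps: markers, ids, padding, input mask."""
    # truncate to leave room for the two markers, then add them
    tokens = [["[CLS]"] + x[0 : max_len - 2] + ["[SEP]"] for x in tokens]
    # map tokens to indices (the real code calls convert_tokens_to_ids)
    ids = [[TOY_VOCAB[t] for t in x] for x in tokens]
    # right-pad with 0 up to max_len
    ids = [x + [0] * (max_len - len(x)) for x in ids]
    # mask is 1 for real tokens and 0 for padding
    mask = [[min(1, i) for i in x] for x in ids]
    return ids, mask


ids, mask = preprocess([["hello", "world", "again"], ["hello"]], max_len=4)
print(ids)   # [[1, 3, 4, 2], [1, 3, 2, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The first document loses its third token because two of the four slots are reserved for `[CLS]` and `[SEP]`. The `min(1, x)` mask only works because the padding id is 0 and every real token id is positive.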
--- a/utils_nlp/bert/sequence_classification.py
+++ b/utils_nlp/bert/sequence_classification.py
@ -0,0 +1,186 @@
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+
+import random
+import numpy as np
+import torch
+import torch.nn as nn
+from pytorch_pretrained_bert.modeling import BertForSequenceClassification
+from pytorch_pretrained_bert.optimization import BertAdam
+from tqdm import tqdm
+from utils_nlp.bert.common import BERT_MAX_LEN, Language
+from utils_nlp.pytorch.device_utils import get_device, move_to_device
+
+
+class SequenceClassifier:
+    """BERT-based sequence classifier"""
+
+    def __init__(self, language=Language.ENGLISH, num_labels=2, cache_dir="."):
+        """Initializes the classifier and the underlying pretrained model.
+        Args:
+            language (Language, optional): The pretrained model's language.
+                                           Defaults to Language.ENGLISH.
+            num_labels (int, optional): The number of unique labels in the training data.
+                                        Defaults to 2.
+            cache_dir (str, optional): Location of BERT's cache directory. Defaults to ".".
+        """
+        if num_labels < 2:
+            raise Exception("Number of labels should be at least 2.")
+
+        self.language = language
+        self.num_labels = num_labels
+        self.cache_dir = cache_dir
+
+        # create classifier
+        self.model = BertForSequenceClassification.from_pretrained(
+            language.value, cache_dir=cache_dir, num_labels=num_labels
+        )
+
+    def fit(
+        self,
+        token_ids,
+        input_mask,
+        labels,
+        num_gpus=None,
+        num_epochs=1,
+        batch_size=32,
+        lr=2e-5,
+        verbose=True,
+    ):
+        """Fine-tunes the BERT classifier using the given training data.
+        Args:
+            token_ids (list): List of training token id lists.
+            input_mask (list): List of input mask lists.
+            labels (list): List of training labels.
+            num_gpus (int, optional): The number of gpus to use.
+                                      If 0, the model is trained on the CPU.
+                                      If None is specified, all available GPUs
+                                      will be used.
+                                      Defaults to None.
+            num_epochs (int, optional): Number of training epochs. Defaults to 1.
+            batch_size (int, optional): Training batch size. Defaults to 32.
+            lr (float): Learning rate of the Adam optimizer. Defaults to 2e-5.
+            verbose (bool, optional): If True, shows the training progress and loss values.
+                                      Defaults to True.
+        """
+
+        device = get_device("cpu" if num_gpus == 0 else "gpu")
+        self.model = move_to_device(self.model, device, num_gpus)
+
+        # define optimizer and model parameters
+        param_optimizer = list(self.model.named_parameters())
+        no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
+        optimizer_grouped_parameters = [
+            {
+                "params": [
+                    p
+                    for n, p in param_optimizer
+                    if not any(nd in n for nd in no_decay)
+                ],
+                "weight_decay": 0.01,
+            },
+            {
+                "params": [
+                    p for n, p in param_optimizer
+                    if any(nd in n for nd in no_decay)
+                ],
+                "weight_decay": 0.0,  # explicit: the optimizer default would apply otherwise
+            },
+        ]
+
+        opt = BertAdam(optimizer_grouped_parameters, lr=lr)
+
+        # define loss function
+        loss_func = nn.CrossEntropyLoss().to(device)
+
+        # train
+        self.model.train()  # training mode
+        num_examples = len(token_ids)
+        num_batches = int(num_examples / batch_size)
+
+        for epoch in range(num_epochs):
+            for i in range(num_batches):
+
+                # get random batch
+                start = int(random.random() * num_examples)
+                end = start + batch_size
+                x_batch = torch.tensor(
+                    token_ids[start:end], dtype=torch.long, device=device
+                )
+                y_batch = torch.tensor(
+                    labels[start:end], dtype=torch.long, device=device
+                )
+                mask_batch = torch.tensor(
+                    input_mask[start:end], dtype=torch.long, device=device
+                )
+
+                opt.zero_grad()
+
+                y_h = self.model(
+                    input_ids=x_batch,
+                    token_type_ids=None,
+                    attention_mask=mask_batch,
+                    labels=None,
+                )
+
+                loss = loss_func(y_h, y_batch).mean()
+                loss.backward()
+                opt.step()
+                if verbose:
+                    if i % ((num_batches // 10) + 1) == 0:
+                        print(
+                            "epoch:{}/{}; batch:{}->{}/{}; loss:{:.6f}".format(
+                                epoch + 1,
+                                num_epochs,
+                                i + 1,
+                                i + 1 + (num_batches // 10),
+                                num_batches,
+                                loss.data,
+                            )
+                        )
+        # empty cache
+        del [x_batch, y_batch, mask_batch]
+        torch.cuda.empty_cache()
+
+    def predict(self, token_ids, input_mask, num_gpus=1, batch_size=32):
+        """Scores the given dataset and returns the predicted classes.
+        Args:
+            token_ids (list): List of token id lists to score.
+            input_mask (list): List of input mask lists.
+            num_gpus (int, optional): The number of gpus to use. 
+                                      If None is specified, all available GPUs will be used.
+                                      Defaults to 1.
+            batch_size (int, optional): Scoring batch size. Defaults to 32.
+        Returns:
+            [ndarray]: Predicted classes.
+        """
+
+        device = get_device("cpu" if num_gpus == 0 else "gpu")
+        self.model = move_to_device(self.model, device, num_gpus)
+
+        # score
+        self.model.eval()
+        preds = []
+        with tqdm(total=len(token_ids)) as pbar:
+            for i in range(0, len(token_ids), batch_size):
+                x_batch = token_ids[i : i + batch_size]
+                mask_batch = input_mask[i : i + batch_size]
+                x_batch = torch.tensor(
+                    x_batch, dtype=torch.long, device=device
+                )
+                mask_batch = torch.tensor(
+                    mask_batch, dtype=torch.long, device=device
+                )
+                with torch.no_grad():
+                    p_batch = self.model(
+                        input_ids=x_batch,
+                        token_type_ids=None,
+                        attention_mask=mask_batch,
+                        labels=None,
+                    )
+                preds.append(p_batch.cpu().data.numpy())
+                # advance the progress bar by the number of examples scored
+                pbar.update(len(p_batch))
+        preds = [x.argmax(1) for x in preds]
+        preds = np.concatenate(preds)
+        return preds
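
The optimizer setup in `fit` splits parameters into two groups by substring match on the parameter name. Below is a minimal sketch of that split with no PyTorch dependency. `group_parameters` and the parameter names are hypothetical, and strings stand in for tensors. It sets `weight_decay` to 0.0 explicitly on the second group, on the assumption that those parameters are meant to be exempt from decay.

```python
def group_parameters(named_params, no_decay, weight_decay=0.01):
    """Split (name, param) pairs into a decayed and a non-decayed group."""
    decayed = [p for n, p in named_params if not any(nd in n for nd in no_decay)]
    exempt = [p for n, p in named_params if any(nd in n for nd in no_decay)]
    return [
        {"params": decayed, "weight_decay": weight_decay},
        # assumption: biases and LayerNorm parameters should not be decayed
        {"params": exempt, "weight_decay": 0.0},
    ]


# Hypothetical parameter names; the strings stand in for tensors.
named = [
    ("encoder.layer.0.attention.self.query.weight", "W_q"),
    ("encoder.layer.0.attention.self.query.bias", "b_q"),
    ("encoder.layer.0.attention.output.LayerNorm.weight", "ln_w"),
    ("classifier.weight", "W_cls"),
]
groups = group_parameters(named, ["bias", "LayerNorm.bias", "LayerNorm.weight"])
print(groups[0]["params"])  # ['W_q', 'W_cls']
print(groups[1]["params"])  # ['b_q', 'ln_w']
```

Because the match is a substring test, any parameter whose name contains `bias` lands in the second group, whichever layer it belongs to.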
--- a/utils_nlp/bert/token_classification.py
+++ b/utils_nlp/bert/token_classification.py
--- a/utils_nlp/dataset/multinli.py
+++ b/utils_nlp/dataset/multinli.py
@ -0,0 +1,45 @@
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+
+"""MultiNLI dataset utils
+https://www.nyu.edu/projects/bowman/multinli/
+"""
+
+import os
+import pandas as pd
+import requests
+from utils_nlp.dataset.url_utils import extract_zip, maybe_download
+
+URL = "http://www.nyu.edu/projects/bowman/multinli/multinli_1.0.zip"
+DATA_FILES = {
+    "train": "multinli_1.0/multinli_1.0_train.jsonl",
+    "dev_matched": "multinli_1.0/multinli_1.0_dev_matched.jsonl",
+    "dev_mismatched": "multinli_1.0/multinli_1.0_dev_mismatched.jsonl",
+}
+
+
+def load_pandas_df(local_cache_path=None, file_split="train"):
+    """Downloads and extracts the dataset files (if needed) and loads the given split.
+    Args:
+        local_cache_path (str): Existing local directory used to cache the downloaded
+                                and extracted files. The default of None is not handled.
+        file_split (str, optional): The subset to load. Defaults to "train".
+                                    One of: {"train", "dev_matched", "dev_mismatched"}
+    Returns:
+        pd.DataFrame: pandas DataFrame containing the specified MultiNLI subset.
+    """
+
+    file_name = URL.split("/")[-1]
+    if not os.path.exists(os.path.join(local_cache_path, file_name)):
+        response = requests.get(URL)
+        with open(os.path.join(local_cache_path, file_name), "wb") as f:
+            f.write(response.content)
+    if not os.path.exists(
+        os.path.join(local_cache_path, DATA_FILES[file_split])
+    ):
+        extract_zip(
+            os.path.join(local_cache_path, file_name), local_cache_path
+        )
+    return pd.read_json(
+        os.path.join(local_cache_path, DATA_FILES[file_split]), lines=True
+    )
--- a/utils_nlp/dataset/url_utils.py
+++ b/utils_nlp/dataset/url_utils.py
@ -56,6 +56,34 @@ def maybe_download(
    return filepath


+def extract_tar(file_path, dest_path="."):
+    """Extracts all contents of a tar archive file.
+    Args:
+        file_path (str): Path of file to extract.
+        dest_path (str, optional): Destination directory. Defaults to ".".
+    """
+    if not os.path.exists(file_path):
+        raise IOError("File doesn't exist")
+    if not os.path.exists(dest_path):
+        raise IOError("Destination directory doesn't exist")
+    with tarfile.open(file_path) as t:
+        t.extractall(path=dest_path)
+
+
+def extract_zip(file_path, dest_path="."):
+    """Extracts all contents of a zip archive file.
+    Args:
+        file_path (str): Path of file to extract.
+        dest_path (str, optional): Destination directory. Defaults to ".".
+    """
+    if not os.path.exists(file_path):
+        raise IOError("File doesn't exist")
+    if not os.path.exists(dest_path):
+        raise IOError("Destination directory doesn't exist")
+    with ZipFile(file_path) as z:
+        z.extractall(path=dest_path)
+
+
@contextmanager
 def download_path(path):
    tmp_dir = TemporaryDirectory()
--- a/utils_nlp/dataset/xnli.py
+++ b/utils_nlp/dataset/xnli.py
@ -0,0 +1,44 @@
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+
+"""XNLI dataset utils
+https://www.nyu.edu/projects/bowman/xnli/
+"""
+
+import os
+import pandas as pd
+import requests
+from utils_nlp.dataset.url_utils import extract_zip, maybe_download
+
+
+URL = "https://www.nyu.edu/projects/bowman/xnli/XNLI-1.0.zip"
+
+DATA_FILES = {"dev": "XNLI-1.0/xnli.dev.jsonl", "test": "XNLI-1.0/xnli.test.jsonl"}
+
+
+def load_pandas_df(local_cache_path=None, file_split="dev"):
+    """Downloads and extracts the dataset files (if needed) and loads the given split.
+    Args:
+        local_cache_path (str): Existing local directory used to cache the downloaded
+                                and extracted files. The default of None is not handled.
+        file_split (str, optional): The subset to load. One of: {"dev", "test"}.
+                                    Defaults to "dev".
+    Returns:
+        pd.DataFrame: pandas DataFrame containing the specified XNLI subset.
+    """
+
+    file_name = URL.split("/")[-1]
+    if not os.path.exists(os.path.join(local_cache_path, file_name)):
+        response = requests.get(URL)
+        with open(os.path.join(local_cache_path, file_name), "wb") as f:
+            f.write(response.content)
+    if not os.path.exists(
+        os.path.join(local_cache_path, DATA_FILES[file_split])
+    ):
+        extract_zip(
+            os.path.join(local_cache_path, file_name), local_cache_path
+        )
+    return pd.read_json(
+        os.path.join(local_cache_path, DATA_FILES[file_split]), lines=True
+    )
+
--- a/utils_nlp/dataset/yahoo_answers.py
+++ b/utils_nlp/dataset/yahoo_answers.py
@ -1,28 +1,28 @@
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+
+"""Yahoo! Answers dataset utils"""
+
+import os
 import pandas as pd
-import numpy as np
-import random
-import pickle
-import urllib3
-import tarfile
-import io
+from utils_nlp.dataset.url_utils import maybe_download, extract_tar


 URL = "https://s3.amazonaws.com/fast-ai-nlp/yahoo_answers_csv.tgz"


 def download(dir_path):
-    con = urllib3.PoolManager()
-    resp = con.request("GET", URL)
-    tar = tarfile.open(fileobj=io.BytesIO(resp.data))
-    tar.extractall(path=dir_path)
-    tar.close()
+    """Downloads and extracts the dataset files"""
+    file_name = URL.split("/")[-1]
+    maybe_download(URL, file_name, dir_path)
+    extract_tar(os.path.join(dir_path, file_name), dir_path)


 def read_data(data_file, nrows=None):
    return pd.read_csv(data_file, header=None, nrows=nrows)


-def clean_data(df):
+def get_text(df):
    df.fillna("", inplace=True)
    text = df.iloc[:, 1] + " " + df.iloc[:, 2] + " " + df.iloc[:, 3]
    text = text.str.replace(r"[^A-Za-z ]", "").str.lower()
@ -33,19 +33,3 @@ def clean_data(df):

 def get_labels(df):
    return list(df[0] - 1)
-
-
-def get_batch_rnd(X, input_mask, y, n, batch_size):
-    i = int(random.random() * n)
-    X_b = X[i : i + batch_size]
-    y_b = y[i : i + batch_size]
-    m_b = input_mask[i : i + batch_size]
-    return X_b, m_b, y_b
-
-
-def get_batch_by_idx(X, input_mask, y, i, batch_size):
-    X_b = X[i : i + batch_size]
-    y_b = y[i : i + batch_size]
-    m_b = input_mask[i : i + batch_size]    
-    return X_b, m_b, y_b
-
--- a/utils_nlp/eval/classification.py
+++ b/utils_nlp/eval/classification.py
@ -1,7 +1,23 @@
-from sklearn.metrics import accuracy_score, precision_score, recall_score
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+
+from sklearn.metrics import (
+    accuracy_score,
+    precision_score,
+    recall_score,
+    f1_score,
+)


 def eval_classification(actual, predicted, round_decimals=4):
+    """Returns common classification evaluation metrics.
+    Args:
+        actual (1d array-like): Array of actual values.
+        predicted (1d array-like): Array of predicted values.
+        round_decimals (int, optional): Number of decimal places. Defaults to 4.
+    Returns:
+        dict: A dictionary of evaluation metrics.
+    """
    return {
        "accuracy": accuracy_score(actual, predicted).round(round_decimals),
        "precision": list(
@ -12,4 +28,7 @@ def eval_classification(actual, predicted, round_decimals=4):
        "recall": list(
            recall_score(actual, predicted, average=None).round(round_decimals)
        ),
+        "f1": list(
+            f1_score(actual, predicted, average=None).round(round_decimals)
+        ),
    }
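
With `average=None`, the scikit-learn scorers used in `eval_classification` return one value per class rather than a single aggregate. The following is a minimal pure-Python sketch of what those per-class numbers mean. `per_class_metrics` is a hypothetical helper, not part of the module, and it returns 0.0 when a denominator is zero.

```python
def per_class_metrics(actual, predicted, round_decimals=4):
    """Compute precision, recall and F1 separately for each class label."""
    results = {}
    for c in sorted(set(actual) | set(predicted)):
        pairs = list(zip(actual, predicted))
        tp = sum(1 for a, p in pairs if a == c and p == c)  # true positives
        fp = sum(1 for a, p in pairs if a != c and p == c)  # false positives
        fn = sum(1 for a, p in pairs if a == c and p != c)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results[c] = {
            "precision": round(precision, round_decimals),
            "recall": round(recall, round_decimals),
            "f1": round(f1, round_decimals),
        }
    return results


metrics = per_class_metrics([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
for label, row in metrics.items():
    print(label, row)
```

In this toy input, class 1 has perfect recall but a precision of 2/3, because one class-0 example was predicted as class 1.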
--- a/utils_nlp/pytorch/device_utils.py
+++ b/utils_nlp/pytorch/device_utils.py
@ -0,0 +1,77 @@
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+"""PyTorch device utils."""
+
+import torch
+import torch.nn as nn
+
+
+def get_device(device="gpu"):
+    """Gets a PyTorch device.
+    Args:
+        device (str, optional): Device string: "cpu" or "gpu". Defaults to "gpu".
+    Returns:
+        torch.device: A PyTorch device (cpu or gpu).
+    """
+    if device == "gpu":
+        if torch.cuda.is_available():
+            return torch.device("cuda:0")
+        raise Exception("CUDA device not available")
+    elif device == "cpu":
+        return torch.device("cpu")
+    else:
+        raise Exception("Only 'cpu' and 'gpu' devices are supported.")
+
+
+def move_to_device(model, device, num_gpus=1):
+    """Moves a model to the specified device (cpu or gpu/s)
+       and implements data parallelism when multiple gpus are specified.
+    Args:
+        model (Module): A PyTorch model
+        device (torch.device): A PyTorch device
+        num_gpus (int): The number of GPUs to be used. If None, all available GPUs are used. Defaults to 1.
+    Returns:
+        Module, DataParallel: A PyTorch Module or a DataParallel wrapper (when multiple gpus are used).
+    """
+    if isinstance(model, nn.DataParallel):
+        model = model.module
+
+    # cpu
+    if num_gpus == 0:
+        if device.type == "cpu":
+            return model.to(device)
+        else:
+            raise Exception("Device type should be 'cpu' when num_gpus==0.")
+
+    # gpu
+    if device.type == "cuda":
+        model.to(device)  # inplace
+        if num_gpus == 1:
+            return model
+        else:
+            # parallelize
+            num_cuda_devices = torch.cuda.device_count()
+            if num_cuda_devices < 1:
+                raise Exception("CUDA devices are not available.")
+            elif num_cuda_devices < 2:
+                print(
+                    "Warning: Only 1 CUDA device is available. Data parallelism is not possible."
+                )
+                return model
+            else:
+                if num_gpus is None:
+                    # use all available devices
+                    return nn.DataParallel(model, device_ids=None)
+                elif num_gpus > num_cuda_devices:
+                    print(
+                        "Warning: Only {0} devices are available. Setting the number of gpus to {0}".format(
+                            num_cuda_devices
+                        )
+                    )
+                    return nn.DataParallel(model, device_ids=None)
+                else:
+                    return nn.DataParallel(
+                        model, device_ids=list(range(num_gpus))
+                    )
+    else:
+        raise Exception("Device type should be 'gpu' when num_gpus!=0.")
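
The branching in `move_to_device` is easier to follow as a pure function of the requested and available GPU counts. The sketch below is a simplified model of that decision, not the module's code: `resolve_device_ids` and its return labels are hypothetical, no PyTorch is involved, and the CPU/GPU device-type mismatch checks are left out.

```python
def resolve_device_ids(num_gpus, num_cuda_devices):
    """Return (mode, device_ids) for a requested and an available GPU count.

    mode is "cpu", "single" (plain model on one GPU) or "parallel"
    (the case where the diff wraps the model in nn.DataParallel).
    """
    if num_gpus == 0:
        return "cpu", []
    if num_cuda_devices < 1:
        raise RuntimeError("CUDA devices are not available.")
    if num_gpus == 1 or num_cuda_devices < 2:
        # one GPU requested, or only one present: no data parallelism
        return "single", [0]
    if num_gpus is None or num_gpus > num_cuda_devices:
        # None means "use everything"; over-requests fall back to all devices
        return "parallel", list(range(num_cuda_devices))
    return "parallel", list(range(num_gpus))


print(resolve_device_ids(0, 4))     # ('cpu', [])
print(resolve_device_ids(None, 4))  # ('parallel', [0, 1, 2, 3])
print(resolve_device_ids(8, 4))     # ('parallel', [0, 1, 2, 3])
print(resolve_device_ids(2, 4))     # ('parallel', [0, 1])
```

The `None` check comes before the `>` comparison because comparing `None` with an integer raises a `TypeError` in Python 3.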