Merge pull request #405 from microsoft/staging

Staging to Master
This commit is contained in:
Said Bleik 2019-09-12 13:26:46 -04:00 committed by GitHub
Parent b78a18e5ed 9ba76e01ba
Commit fec3deeca4
No key was found matching this signature
GPG key ID: 4AEE18F83AFDEB23
26 changed files: 4117 additions and 43 deletions

View file

@ -16,18 +16,36 @@ In an era of transfer learning, transformers, and deep architectures, we believe
>
>
## Focus areas
The repository aims to expand NLP capabilities along three separate dimensions:
### Scenarios
We aim to have end-to-end examples of common tasks and scenarios such as text classification, named entity recognition, etc.
### Algorithms
We aim to support multiple models for each of the supported scenarios. Currently, BERT-based models are supported across most scenarios. We are working to integrate [pytorch-transformers](https://github.com/huggingface/pytorch-transformers) to allow use of many more models.
### Languages
We strongly subscribe to the multi-language principles laid down by [Emily Bender](http://faculty.washington.edu/ebender/papers/Bender-SDSS-2019.pdf):
* "Natural language is not a synonym for English"
* "English isn't generic for language, despite what NLP papers might lead you to believe"
* "Always name the language you are working on" ([Bender rule](https://www.aclweb.org/anthology/Q18-1041/))
The repository aims to support non-English languages across all the scenarios. Pre-trained models used in the repository, such as BERT and FastText, support 100+ languages out of the box. Our goal is to provide end-to-end examples in as many languages as possible. We encourage community contributions in this area.
## Content
The following is a summary of the commonly used NLP scenarios covered in the repository. Each scenario is demonstrated in one or more [Jupyter notebook examples](examples) that make use of the core code base of models and repository utilities.
| Scenario | Models | Description|
|-------------------------| ------------------- |-------|
|Text Classification |BERT| Text classification is a supervised learning task of predicting the category or class of a document given its text content. |
|Named Entity Recognition |BERT| Named entity recognition (NER) is the task of classifying words or key phrases of a text into predefined entities of interest. |
|Entailment |BERT| Textual entailment is the task of classifying the binary relation between two natural-language texts, the text and the hypothesis, to determine whether the `text` agrees with the `hypothesis` or not. |
|Question Answering |BiDAF <br> BERT| Question answering (QA) is the task of retrieving or generating a valid answer for a given query in natural language, provided with a passage related to the query. |
|Sentence Similarity |Representation: TF-IDF, Word Embeddings, Doc Embeddings<br>Metrics: Cosine Similarity, Word Mover's Distance<br>Models: BERT, GenSen| Sentence similarity is the process of computing a similarity score given a pair of text documents. |
|Embeddings| Word2Vec<br>fastText<br>GloVe| Embedding is the process of converting a word or a piece of text into a continuous vector space of real numbers, usually in a low dimension.|
| Scenario | Models | Description|Languages|
|-------------------------| ------------------- |-------|---|
|Text Classification |BERT <br> XLNet| Text classification is a supervised learning task of predicting the category or class of a document given its text content. |English, Hindi, Arabic|
|Named Entity Recognition |BERT| Named entity recognition (NER) is the task of classifying words or key phrases of a text into predefined entities of interest. |English|
|Entailment |BERT| Textual entailment is the task of classifying the binary relation between two natural-language texts, the text and the hypothesis, to determine whether the `text` agrees with the `hypothesis` or not. |English|
|Question Answering |BiDAF <br> BERT| Question answering (QA) is the task of retrieving or generating a valid answer for a given query in natural language, provided with a passage related to the query. |English|
|Sentence Similarity |Representation: TF-IDF, Word Embeddings, Doc Embeddings<br>Metrics: Cosine Similarity, Word Mover's Distance<br>Models: BERT, GenSen| Sentence similarity is the process of computing a similarity score given a pair of text documents. |English|
|Embeddings| Word2Vec<br>fastText<br>GloVe| Embedding is the process of converting a word or a piece of text into a continuous vector space of real numbers, usually in a low dimension.|English|
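
As a quick, self-contained illustration of the sentence similarity scenario above (this sketch uses scikit-learn only and is not part of the repository's utilities), a TF-IDF representation can be paired with a cosine similarity metric:

```python
# Hypothetical sketch: score two sentences with TF-IDF + cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "Natural language is not a synonym for English.",
    "English is not generic for language.",
]
vectors = TfidfVectorizer().fit_transform(texts)  # 2 x vocabulary sparse matrix
score = cosine_similarity(vectors[0], vectors[1])[0, 0]
print("similarity: {:.3f}".format(score))
```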
## Getting Started
While solving NLP problems, it is always good to start with the prebuilt [Cognitive Services](https://azure.microsoft.com/en-us/services/cognitive-services/directory/lang/). When your needs go beyond what the prebuilt cognitive services offer and you want custom machine learning methods, you will find this repository very useful. To get started, navigate to the [Setup Guide](SETUP.md), which lists instructions on how to set up your environment and dependencies.

19
docs/Makefile Normal file
View file

@ -0,0 +1,19 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
SOURCEDIR = source
BUILDDIR = build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

View file

@ -0,0 +1,13 @@
# Documentation
To set up the documentation, first install the dependencies of the CPU environment by following the [SETUP.md](../SETUP.md) instructions. Then run:
conda activate nlp_cpu
pip install sphinx_rtd_theme
To build the documentation as HTML:
cd docs
make html

21
docs/source/azureml.rst Normal file
View file

@ -0,0 +1,21 @@
.. _azureml:
AzureML module
**************************
AzureML module from NLP utilities.
AzureML utils
===============================
.. automodule:: utils_nlp.azureml.azureml_utils
:members:
AzureML utils for BERT
===============================
.. automodule:: utils_nlp.azureml.azureml_bert_util
:members:

238
docs/source/conf.py Normal file
View file

@ -0,0 +1,238 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
# -*- coding: utf-8 -*-
#
# Configuration file for the Sphinx documentation builder.
#
# This file does only contain a selection of the most common options. For a
# full list see the documentation:
# http://www.sphinx-doc.org/en/master/config
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import sys
sys.path.insert(0, os.path.abspath(os.path.join("..", "..")))
sys.setrecursionlimit(1500)
from utils_nlp import TITLE, VERSION, COPYRIGHT, AUTHOR
# -- Project information -----------------------------------------------------
project = TITLE
copyright = COPYRIGHT
author = AUTHOR
# The short X.Y version
version = ".".join(VERSION.split(".")[:2])
# The full version, including alpha/beta/rc tags
release = VERSION
prefix = "NLP"
# -- General configuration ---------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
#
# needs_sphinx = '1.0'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
"sphinx.ext.autodoc",
"sphinx.ext.doctest",
"sphinx.ext.intersphinx",
"sphinx.ext.ifconfig",
"sphinx.ext.viewcode", # Add links to highlighted source code
"sphinx.ext.napoleon", # to render Google format docstrings
]
# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
#
# source_suffix = ['.rst', '.md']
source_suffix = ".rst"
# The master toctree document.
master_doc = "index"
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ["Thumbs.db", ".DS_Store"]
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = None
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = "sphinx_rtd_theme"
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#
# html_theme_options = {}
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
# html_static_path = ["images"]
# Custom sidebar templates, must be a dictionary that maps document names
# to template names.
#
# The default sidebars (for documents that don't match any pattern) are
# defined by theme itself. Builtin themes are using these templates by
# default: ``['localtoc.html', 'relations.html', 'sourcelink.html',
# 'searchbox.html']``.
#
# html_sidebars = {}
# -- Options for HTMLHelp output ---------------------------------------------
# Output file base name for HTML help builder.
htmlhelp_basename = prefix + "doc"
# -- Options for LaTeX output ------------------------------------------------
latex_elements = {
"papersize": "letterpaper",
"pointsize": "10pt",
"figure_align": "htbp",
"preamble": r"""
%% Adding source listings https://en.wikibooks.org/wiki/LaTeX/Source_Code_Listings
\usepackage{listings}
\usepackage{color}
\definecolor{mygreen}{rgb}{0,0.6,0}
\definecolor{mygray}{rgb}{0.5,0.5,0.5}
\definecolor{mymauve}{rgb}{0.58,0,0.82}
\lstset{
backgroundcolor=\color{white}, % choose the background color; you must add \usepackage{color} or \usepackage{xcolor}; should come as last argument
basicstyle=\footnotesize, % the size of the fonts that are used for the code
breakatwhitespace=false, % sets if automatic breaks should only happen at whitespace
breaklines=true, % sets automatic line breaking
captionpos=b, % sets the caption-position to bottom
commentstyle=\color{mygreen}, % comment style
deletekeywords={...}, % if you want to delete keywords from the given language
escapeinside={\%*}{*)}, % if you want to add LaTeX within your code
extendedchars=true, % lets you use non-ASCII characters; for 8-bits encodings only, does not work with UTF-8
firstnumber=1000, % start line enumeration with line 1000
frame=single, % adds a frame around the code
keepspaces=true, % keeps spaces in text, useful for keeping indentation of code (possibly needs columns=flexible)
keywordstyle=\color{blue}, % keyword style
language=Python, % the language of the code
morekeywords={*,...}, % if you want to add more keywords to the set
numbers=left, % where to put the line-numbers; possible values are (none, left, right)
numbersep=5pt, % how far the line-numbers are from the code
numberstyle=\tiny\color{mygray}, % the style that is used for the line-numbers
rulecolor=\color{black}, % if not set, the frame-color may be changed on line-breaks within not-black text (e.g. comments (green here))
showspaces=false, % show spaces everywhere adding particular underscores; it overrides 'showstringspaces'
showstringspaces=false, % underline spaces within strings only
showtabs=false, % show tabs within strings adding particular underscores
stepnumber=2, % the step between two line-numbers. If it's 1, each line will be numbered
stringstyle=\color{mymauve}, % string literal style
tabsize=2, % sets default tabsize to 2 spaces
title=\lstname % show the filename of files included with \lstinputlisting; also try caption instead of title
}
""",
}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [(master_doc, prefix + ".tex", prefix + " Documentation", prefix, "manual")]
# -- Options for manual page output ------------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [(master_doc, prefix, prefix + " Documentation", [author], 1)]
# -- Options for Texinfo output ----------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
(
master_doc,
prefix,
prefix + " Documentation",
author,
prefix,
"One line description of project.",
"Miscellaneous",
)
]
# -- Options for Epub output -------------------------------------------------
# Bibliographic Dublin Core info.
epub_title = project
# The unique identifier of the text. This can be a ISBN number
# or the project homepage.
#
# epub_identifier = ''
# A unique identification for the text.
#
# epub_uid = ''
# A list of files that should not be packed into the epub file.
epub_exclude_files = ["search.html"]
# -- Extension configuration -------------------------------------------------
# -- Options for intersphinx extension ---------------------------------------
# Example configuration for intersphinx: refer to the Python standard library.
intersphinx_mapping = {"https://docs.python.org/": None}
##################################################
# Other options
# html_favicon = os.path.join(html_static_path[0], "favicon.ico")
# Ensure that __init__() is always documented
# source: https://stackoverflow.com/a/5599712
def skip(app, what, name, obj, would_skip, options):
if name == "__init__":
return False
return would_skip
def setup(app):
app.connect("autodoc-skip-member", skip)

27
docs/source/index.rst Normal file
View file

@ -0,0 +1,27 @@
NLP Utilities
===================================================
The `NLP repository <https://github.com/Microsoft/NLP>`_ provides examples and best practices for building NLP systems, delivered as Jupyter notebooks.
The module `utils_nlp <https://github.com/microsoft/nlp/tree/master/utils_nlp>`_ contains functions to simplify common tasks used when developing and
evaluating NLP systems.
.. toctree::
:maxdepth: 1
:caption: Contents:
AzureML <azureml>
Common <common>
Dataset <dataset>
Evaluation <eval>
NLP Algorithms <model>
NLP Interpretability <interpreter>
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

View file

@ -5,7 +5,7 @@ This folder contains examples and best practices, written in Jupyter notebooks,
|Category|Applications|Methods|Languages|
|---| ------------------------ | ------------------- |---|
|[Text Classification](text_classification)|Topic Classification|BERT|en|
|[Text Classification](text_classification)|Topic Classification|BERT, XLNet|en, hi, ar|
|[Named Entity Recognition](named_entity_recognition) |Wikipedia NER|BERT|en|
|[Entailment](entailment)|MultiNLI Natural Language Inference|BERT|en|
|[Question Answering](question_answering) |SQuAD|BiDAF, BERT|en|

View file

@ -3,8 +3,7 @@ This folder contains examples and best practices, written in Jupyter notebooks,
utility scripts in the [utils_nlp](../../utils_nlp) folder to speed up data preprocessing and model building for text classification.
The models can be used in a wide variety of applications, such as
sentiment analysis, document indexing in digital libraries, hate speech detection, and general-purpose categorization in medical, academic, legal, and many other domains.
Currently, we focus on fine-tuning pre-trained BERT
model. We plan to continue adding state-of-the-art models as they come up and welcome community
Currently, we focus on fine-tuning pre-trained BERT and XLNet models. We plan to continue adding state-of-the-art models as they come up and welcome community
contributions.
## What is Text Classification?
@ -21,3 +20,6 @@ The following summarizes each notebook for Text Classification. Each notebook pr
|---|---|---|---|
|[BERT for text classification with MNLI](tc_mnli_bert.ipynb)|Local| A notebook which walks through fine-tuning and evaluating a pre-trained BERT model on a subset of the MultiNLI dataset|[MultiNLI](https://www.nyu.edu/projects/bowman/multinli/)|
|[BERT for text classification on AzureML](tc_bert_azureml.ipynb) |Azure ML|A notebook which walks through fine-tuning and evaluating a pre-trained BERT model on a distributed setup with AzureML. |[MultiNLI](https://www.nyu.edu/projects/bowman/multinli/)|
|[XLNet for text classification with MNLI](tc_mnli_xlnet.ipynb)|Local| A notebook which walks through fine-tuning and evaluating a pre-trained XLNet model on a subset of the MultiNLI dataset|[MultiNLI](https://www.nyu.edu/projects/bowman/multinli/)|
|[BERT for text classification of Hindi BBC News](tc_bbc_bert_hi.ipynb)|Local| A notebook which walks through fine-tuning and evaluating a pre-trained BERT model on Hindi BBC news data|[BBC Hindi News](https://github.com/NirantK/hindi2vec/releases/tag/bbc-hindi-v0.1)|
|[BERT for text classification of Arabic News](tc_dac_bert_ar.ipynb)|Local| A notebook which walks through fine-tuning and evaluating a pre-trained BERT model on Arabic news articles|[DAC](https://data.mendeley.com/datasets/v524p5dhpj/2)|
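
The notebooks above are executed end-to-end by the integration tests in this PR using [papermill](https://github.com/nteract/papermill). A minimal sketch of running the Arabic news notebook locally with overridden parameters (parameter names come from the notebook's `parameters` cell; paths and values are illustrative):

```python
# Sketch: run tc_dac_bert_ar.ipynb with papermill, overriding its parameters cell.
import papermill as pm

pm.execute_notebook(
    "tc_dac_bert_ar.ipynb",           # input notebook (in this folder)
    "tc_dac_bert_ar.output.ipynb",    # executed copy with cell outputs
    kernel_name="nlp_gpu",            # kernel created by the repo's GPU environment
    parameters=dict(NUM_GPUS=1, NUM_EPOCHS=1, NUM_ROWS=15000, TRAIN_SIZE=0.8),
)
```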

File diff not shown because it is too large. Load diff

View file

@ -0,0 +1,821 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Copyright (c) Microsoft Corporation. All rights reserved.*\n",
"\n",
"*Licensed under the MIT License.*\n",
"\n",
"# Classification of Arabic News Articles using BERT"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import os\n",
"import sys\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"import scrapbook as sb\n",
"import torch\n",
"import torch.nn as nn\n",
"from sklearn.metrics import accuracy_score, classification_report\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"sys.path.append(\"../../\")\n",
"from utils_nlp.common.timer import Timer\n",
"from utils_nlp.dataset.dac import load_pandas_df\n",
"from utils_nlp.models.bert.common import Language, Tokenizer\n",
"from utils_nlp.models.bert.sequence_classification import BERTSequenceClassifier"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\n",
"In this notebook, we fine-tune and evaluate a pretrained [BERT](https://arxiv.org/abs/1810.04805) model on an Arabic dataset of news articles. The [dataset](https://data.mendeley.com/datasets/v524p5dhpj/2) includes articles from 3 different newspapers, and the articles are categorized into 5 classes: *sports, politics, culture, economy and diverse*. The data is described in more detail in this [paper](http://article.nadiapub.com/IJGDC/vol11_no9/9.pdf).\n",
"\n",
"We use a [sequence classifier](../../utils_nlp/bert/sequence_classification.py) that wraps [Hugging Face's PyTorch implementation](https://github.com/huggingface/pytorch-pretrained-BERT) of Google's [BERT](https://github.com/google-research/bert). The classifier loads a pretrained [multilingual BERT model](https://github.com/google-research/bert/blob/master/multilingual.md) that was trained on 104 languages, including Arabic."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"tags": [
"parameters"
]
},
"outputs": [],
"source": [
"DATA_FOLDER = \"./temp\"\n",
"BERT_CACHE_DIR = \"./temp\"\n",
"LANGUAGE = Language.MULTILINGUAL\n",
"MAX_LEN = 200\n",
"BATCH_SIZE = 32\n",
"NUM_GPUS = 2\n",
"NUM_EPOCHS = 1\n",
"TRAIN_SIZE = 0.8\n",
"NUM_ROWS = 15000\n",
"RANDOM_STATE = 0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read Dataset\n",
"We start by loading the data. The following line also downloads the file if it doesn't exist, and extracts the csv file into the specified data folder. We retain a subset, of size *NUM_ROWS*, of the data for quicker model training."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"df = load_pandas_df(DATA_FOLDER).sample(NUM_ROWS, random_state=RANDOM_STATE)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>text</th>\n",
" <th>targe</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>80414</th>\n",
" <td>فاز فريق الدفاع الحسني الجديدي على مضيفه الكوك...</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6649</th>\n",
" <td>أمام آلاف مشاهد من لبنان ومصر والمغرب والإمارا...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3722</th>\n",
" <td>أخبارنا المغربية بعد أن أصدرت المحكمة الإبتداي...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>82317</th>\n",
" <td>الفريق طبق قانونا قبل المصادقة عليه وجدل حول ه...</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5219</th>\n",
" <td>المطرب المصري يخوض حملة إعلامية لترويج ألبومه ...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" text targe\n",
"80414 فاز فريق الدفاع الحسني الجديدي على مضيفه الكوك... 4\n",
"6649 أمام آلاف مشاهد من لبنان ومصر والمغرب والإمارا... 0\n",
"3722 أخبارنا المغربية بعد أن أصدرت المحكمة الإبتداي... 0\n",
"82317 الفريق طبق قانونا قبل المصادقة عليه وجدل حول ه... 4\n",
"5219 المطرب المصري يخوض حملة إعلامية لترويج ألبومه ... 0"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# set the text and label columns\n",
"text_col = df.columns[0]\n",
"label_col = df.columns[1]"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# remove empty documents\n",
"df = df[df[text_col].isna() == False]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Inspect the distribution of labels:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4 5844\n",
"3 2796\n",
"1 2139\n",
"0 1917\n",
"2 1900\n",
"Name: targe, dtype: int64"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[label_col].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We compare the counts with those presented in the author's [paper](http://article.nadiapub.com/IJGDC/vol11_no9/9.pdf), and infer the following label mapping:\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>label</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>culture</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>diverse</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>economy</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>politics</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>sports</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" label\n",
"0 culture\n",
"1 diverse\n",
"2 economy\n",
"3 politics\n",
"4 sports"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# ordered list of labels\n",
"labels = [\"culture\", \"diverse\", \"economy\", \"politics\", \"sports\"]\n",
"num_labels = len(labels)\n",
"pd.DataFrame({\"label\": labels})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we split the data for training and testing:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of training examples: 11676\n",
"Number of testing examples: 2920\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/media/bleik2/miniconda3/envs/nlp_gpu/lib/python3.6/site-packages/sklearn/model_selection/_split.py:2179: FutureWarning: From version 0.21, test_size will always complement train_size unless both are specified.\n",
" FutureWarning)\n"
]
}
],
"source": [
"df_train, df_test = train_test_split(df, train_size = TRAIN_SIZE, random_state=RANDOM_STATE)\n",
"print(\"Number of training examples: {}\".format(df_train.shape[0]))\n",
"print(\"Number of testing examples: {}\".format(df_test.shape[0]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tokenize and Preprocess"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before training, we tokenize the text documents and convert them to lists of tokens. The following steps instantiate a BERT tokenizer given the language, and tokenize the text of the training and testing sets."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 11676/11676 [00:59<00:00, 196.42it/s]\n",
"100%|██████████| 2920/2920 [00:14<00:00, 197.99it/s]\n"
]
}
],
"source": [
"tokenizer = Tokenizer(LANGUAGE, cache_dir=BERT_CACHE_DIR)\n",
"tokens_train = tokenizer.tokenize(list(df_train[text_col].astype(str)))\n",
"tokens_test = tokenizer.tokenize(list(df_test[text_col].astype(str)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In addition, we perform the following preprocessing steps in the cell below:\n",
"- Convert the tokens into token indices corresponding to the BERT tokenizer's vocabulary\n",
"- Add the special tokens [CLS] and [SEP] to mark the beginning and end of a sentence\n",
"- Pad or truncate the token lists to the specified max length\n",
"- Return mask lists that indicate paddings' positions\n",
"\n",
"*See the original [implementation](https://github.com/google-research/bert/blob/master/run_classifier.py) for more information on BERT's input format.*"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"tokens_train, mask_train, _ = tokenizer.preprocess_classification_tokens(\n",
" tokens_train, MAX_LEN\n",
")\n",
"tokens_test, mask_test, _ = tokenizer.preprocess_classification_tokens(\n",
" tokens_test, MAX_LEN\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Model\n",
"Next, we create a sequence classifier that loads a pre-trained BERT model, given the language and number of labels."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"classifier = BERTSequenceClassifier(\n",
" language=LANGUAGE, num_labels=num_labels, cache_dir=BERT_CACHE_DIR\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train\n",
"We train the classifier using the training examples. This involves fine-tuning the BERT Transformer and learning a linear classification layer on top of that:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"t_total value of -1 results in schedule not being applied\n",
"Iteration: 0%| | 1/365 [00:03<21:12, 3.49s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/1; batch:1->37/365; average training loss:1.591262\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 10%|█ | 38/365 [01:02<08:45, 1.61s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/1; batch:38->74/365; average training loss:0.745935\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 21%|██ | 75/365 [02:02<07:52, 1.63s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/1; batch:75->111/365; average training loss:0.593934\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 31%|███ | 112/365 [03:03<06:56, 1.65s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/1; batch:112->148/365; average training loss:0.530150\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 41%|████ | 149/365 [04:03<05:54, 1.64s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/1; batch:149->185/365; average training loss:0.481620\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 51%|█████ | 186/365 [05:05<05:02, 1.69s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/1; batch:186->222/365; average training loss:0.455032\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 61%|██████ | 223/365 [06:06<03:59, 1.69s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/1; batch:223->259/365; average training loss:0.421702\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 71%|███████ | 260/365 [07:08<02:56, 1.68s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/1; batch:260->296/365; average training loss:0.401165\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 81%|████████▏ | 297/365 [08:09<01:52, 1.65s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/1; batch:297->333/365; average training loss:0.382719\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 92%|█████████▏| 334/365 [09:12<00:52, 1.71s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/1; batch:334->365/365; average training loss:0.372204\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 100%|██████████| 365/365 [10:04<00:00, 1.63s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Training time: 0.169 hrs]\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"with Timer() as t:\n",
" classifier.fit(\n",
" token_ids=tokens_train,\n",
" input_mask=mask_train,\n",
" labels=list(df_train[label_col]), \n",
" num_gpus=NUM_GPUS, \n",
" num_epochs=NUM_EPOCHS,\n",
" batch_size=BATCH_SIZE, \n",
" verbose=True,\n",
" ) \n",
"print(\"[Training time: {:.3f} hrs]\".format(t.interval / 3600))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Score\n",
"We score the test set using the trained classifier:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 100%|██████████| 92/92 [00:48<00:00, 2.25it/s]\n"
]
}
],
"source": [
"preds = classifier.predict(\n",
" token_ids=tokens_test, input_mask=mask_test, num_gpus=NUM_GPUS, batch_size=BATCH_SIZE\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluate Results\n",
"Finally, we compute the accuracy, precision, recall, and F1 metrics of the evaluation on the test set."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"accuracy: 0.9277397260273973\n",
"{\n",
" \"culture\": {\n",
" \"f1-score\": 0.9081761006289307,\n",
" \"precision\": 0.8848039215686274,\n",
" \"recall\": 0.9328165374677002,\n",
" \"support\": 387\n",
" },\n",
" \"diverse\": {\n",
" \"f1-score\": 0.9237983587338804,\n",
" \"precision\": 0.9471153846153846,\n",
" \"recall\": 0.9016018306636155,\n",
" \"support\": 437\n",
" },\n",
" \"economy\": {\n",
" \"f1-score\": 0.8547418967587034,\n",
" \"precision\": 0.8221709006928406,\n",
" \"recall\": 0.89,\n",
" \"support\": 400\n",
" },\n",
" \"macro avg\": {\n",
" \"f1-score\": 0.9099850933798536,\n",
" \"precision\": 0.9087524907040864,\n",
" \"recall\": 0.9125256551533433,\n",
" \"support\": 2920\n",
" },\n",
" \"micro avg\": {\n",
" \"f1-score\": 0.9277397260273973,\n",
" \"precision\": 0.9277397260273973,\n",
" \"recall\": 0.9277397260273973,\n",
" \"support\": 2920\n",
" },\n",
" \"politics\": {\n",
" \"f1-score\": 0.8734177215189873,\n",
" \"precision\": 0.8994413407821229,\n",
" \"recall\": 0.8488576449912126,\n",
" \"support\": 569\n",
" },\n",
" \"sports\": {\n",
" \"f1-score\": 0.9897913892587662,\n",
" \"precision\": 0.9902309058614565,\n",
" \"recall\": 0.9893522626441881,\n",
" \"support\": 1127\n",
" },\n",
" \"weighted avg\": {\n",
" \"f1-score\": 0.9279213601549715,\n",
" \"precision\": 0.9290922105520572,\n",
" \"recall\": 0.9277397260273973,\n",
" \"support\": 2920\n",
" }\n",
"}\n"
]
}
],
"source": [
"report = classification_report(df_test[label_col], preds, target_names=labels, output_dict=True) \n",
"accuracy = accuracy_score(df_test[label_col], preds )\n",
"print(\"accuracy: {}\".format(accuracy))\n",
"print(json.dumps(report, indent=4, sort_keys=True))"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"application/scrapbook.scrap.json+json": {
"data": 0.9277397260273973,
"encoder": "json",
"name": "accuracy",
"version": 1
}
},
"metadata": {
"scrapbook": {
"data": true,
"display": false,
"name": "accuracy"
}
},
"output_type": "display_data"
},
{
"data": {
"application/scrapbook.scrap.json+json": {
"data": 0.9087524907040864,
"encoder": "json",
"name": "precision",
"version": 1
}
},
"metadata": {
"scrapbook": {
"data": true,
"display": false,
"name": "precision"
}
},
"output_type": "display_data"
},
{
"data": {
"application/scrapbook.scrap.json+json": {
"data": 0.9125256551533433,
"encoder": "json",
"name": "recall",
"version": 1
}
},
"metadata": {
"scrapbook": {
"data": true,
"display": false,
"name": "recall"
}
},
"output_type": "display_data"
},
{
"data": {
"application/scrapbook.scrap.json+json": {
"data": 0.9099850933798536,
"encoder": "json",
"name": "f1",
"version": 1
}
},
"metadata": {
"scrapbook": {
"data": true,
"display": false,
"name": "f1"
}
},
"output_type": "display_data"
}
],
"source": [
"# for testing\n",
"sb.glue(\"accuracy\", accuracy)\n",
"sb.glue(\"precision\", report[\"macro avg\"][\"precision\"])\n",
"sb.glue(\"recall\", report[\"macro avg\"][\"recall\"])\n",
"sb.glue(\"f1\", report[\"macro avg\"][\"f1-score\"])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "nlp_gpu",
"language": "python",
"name": "nlp_gpu"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

Some file diffs are hidden because one or more lines are too long

View file

@ -2,18 +2,11 @@
# -*- encoding: utf-8 -*-
from __future__ import absolute_import
from __future__ import print_function
import io
import re
from os.path import dirname, join
from setuptools import setup
from setuptools_scm import get_version
# Determine semantic versioning automatically
# from git commits
__version__ = get_version()
from utils_nlp import VERSION, AUTHOR, TITLE, LICENSE
def read(*names, **kwargs):
@ -23,15 +16,15 @@ def read(*names, **kwargs):
setup(
name="utils_nlp",
version=__version__,
license="MIT License",
version=VERSION,
license=LICENSE,
description="NLP Utility functions that are used for best practices in building state-of-the-art NLP methods and scenarios. Developed by Microsoft AI CAT",
long_description="%s\n%s"
% (
re.compile("^.. start-badges.*^.. end-badges", re.M | re.S).sub("", read("README.md")),
re.sub(":[a-z]+:`~?(.*?)`", r"``\1``", read("CONTRIBUTING.md")),
),
author="AI CAT",
author=AUTHOR,
author_email="teamsharat@microsoft.com",
url="https://github.com/microsoft/nlp",
packages=["utils_nlp"],

View file

@ -15,8 +15,10 @@ from tempfile import TemporaryDirectory
import pytest
from tests.notebooks_common import path_notebooks
from utils_nlp.models.bert.common import Language
from utils_nlp.models.bert.common import Language as BERTLanguage
from utils_nlp.models.xlnet.common import Language as XLNetLanguage
from utils_nlp.models.bert.common import Tokenizer as BERTTokenizer
from utils_nlp.models.xlnet.common import Tokenizer as XLNetTokenizer
from utils_nlp.azureml import azureml_utils
from azureml.core.webservice import Webservice
@ -68,6 +70,12 @@ def notebooks():
folder_notebooks, "sentence_similarity", "bert_senteval.ipynb"
),
"tc_mnli_bert": os.path.join(folder_notebooks, "text_classification", "tc_mnli_bert.ipynb"),
"tc_dac_bert_ar": os.path.join(
folder_notebooks, "text_classification", "tc_dac_bert_ar.ipynb"
),
"tc_bbc_bert_hi": os.path.join(
folder_notebooks, "text_classification", "tc_bbc_bert_hi.ipynb"
),
"ner_wikigold_bert": os.path.join(
folder_notebooks, "named_entity_recognition", "ner_wikigold_bert.ipynb"
),
@ -190,7 +198,12 @@ def cluster_name(request):
@pytest.fixture()
def bert_english_tokenizer():
return BERTTokenizer(language=Language.ENGLISHCASED, to_lower=False)
return BERTTokenizer(language=BERTLanguage.ENGLISHCASED, to_lower=False)
@pytest.fixture()
def xlnet_english_tokenizer():
return XLNetTokenizer(language=XLNetLanguage.ENGLISHCASED)
@pytest.fixture(scope="module")

View file

@ -18,29 +18,79 @@ ABS_TOL = 0.1
def test_tc_mnli_bert(notebooks, tmp):
notebook_path = notebooks["tc_mnli_bert"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
kernel_name=KERNEL_NAME,
parameters=dict(NUM_GPUS=1,
DATA_FOLDER=tmp,
BERT_CACHE_DIR=tmp,
BATCH_SIZE=32,
BATCH_SIZE_PRED=512,
NUM_EPOCHS=1
)
notebook_path,
OUTPUT_NOTEBOOK,
kernel_name=KERNEL_NAME,
parameters=dict(
NUM_GPUS=1,
DATA_FOLDER=tmp,
BERT_CACHE_DIR=tmp,
BATCH_SIZE=32,
BATCH_SIZE_PRED=512,
NUM_EPOCHS=1,
),
)
result = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict
assert pytest.approx(result["accuracy"], 0.93, abs=ABS_TOL)
assert pytest.approx(result["precision"], 0.93, abs=ABS_TOL)
assert pytest.approx(result["recall"], 0.93, abs=ABS_TOL)
assert pytest.approx(result["f1"], 0.93, abs=ABS_TOL)
@pytest.mark.gpu
@pytest.mark.integration
def test_tc_dac_bert_ar(notebooks, tmp):
notebook_path = notebooks["tc_dac_bert_ar"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
kernel_name=KERNEL_NAME,
parameters=dict(
NUM_GPUS=1,
DATA_FOLDER=tmp,
BERT_CACHE_DIR=tmp,
BATCH_SIZE=32,
NUM_EPOCHS=1,
TRAIN_SIZE=0.8,
NUM_ROWS=15000,
RANDOM_STATE=0,
),
)
result = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict
assert pytest.approx(result["accuracy"], 0.93, abs=ABS_TOL)
assert pytest.approx(result["precision"], 0.91, abs=ABS_TOL)
assert pytest.approx(result["recall"], 0.91, abs=ABS_TOL)
assert pytest.approx(result["f1"], 0.91, abs=ABS_TOL)
@pytest.mark.gpu
@pytest.mark.integration
def test_tc_bbc_bert_hi(notebooks, tmp):
notebook_path = notebooks["tc_bbc_bert_hi"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
kernel_name=KERNEL_NAME,
parameters=dict(NUM_GPUS=1, DATA_FOLDER=tmp, BERT_CACHE_DIR=tmp, NUM_EPOCHS=1),
)
result = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict
assert pytest.approx(result["accuracy"], 0.71, abs=ABS_TOL)
assert pytest.approx(result["precision"], 0.25, abs=ABS_TOL)
assert pytest.approx(result["recall"], 0.28, abs=ABS_TOL)
assert pytest.approx(result["f1"], 0.26, abs=ABS_TOL)
@pytest.mark.integration
@pytest.mark.azureml
@pytest.mark.gpu
def test_tc_bert_azureml(
notebooks, subscription_id, resource_group, workspace_name, workspace_region, cluster_name, tmp
notebooks,
subscription_id,
resource_group,
workspace_name,
workspace_region,
cluster_name,
tmp,
):
notebook_path = notebooks["tc_bert_azureml"]
@ -68,7 +118,9 @@ def test_tc_bert_azureml(
with open("outputs/results.json", "r") as handle:
result_dict = json.load(handle)
assert result_dict["weighted avg"]["f1-score"] == pytest.approx(0.85, abs=ABS_TOL)
assert result_dict["weighted avg"]["f1-score"] == pytest.approx(
0.85, abs=ABS_TOL
)
if os.path.exists("outputs"):
shutil.rmtree("outputs")

View file

@ -0,0 +1,27 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
import pytest
def test_preprocess_classification_tokens(xlnet_english_tokenizer):
text = ["Hello World.",
"How you doing?",
"greatttt",
"The quick, brown fox jumps over a lazy dog.",
" DJs flock by when MTV ax quiz prog",
"Quick wafting zephyrs vex bold Jim",
"Quick, Baz, get my woven flax jodhpurs!"
]
seq_length = 5
input_ids, input_mask, segment_ids = xlnet_english_tokenizer.preprocess_classification_tokens(text, seq_length)
assert len(input_ids) == len(text)
assert len(input_mask) == len(text)
assert len(segment_ids) == len(text)
for sentence in range(len(text)):
assert len(input_ids[sentence]) == seq_length
assert len(input_mask[sentence]) == seq_length
assert len(segment_ids[sentence]) == seq_length

View file

@ -0,0 +1,44 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
import pytest
from utils_nlp.models.xlnet.common import Language
from utils_nlp.models.xlnet.sequence_classification import XLNetSequenceClassifier
@pytest.fixture()
def data():
return (
["hi", "hello", "what's wrong with us", "can I leave?"],
[0, 0, 1, 2],
["hey", "i will", "be working from", "home today"],
[2, 1, 1, 0],
)
def test_classifier(xlnet_english_tokenizer, data):
token_ids, input_mask, segment_ids = xlnet_english_tokenizer.preprocess_classification_tokens(
data[0], max_seq_length=10
)
val_data = xlnet_english_tokenizer.preprocess_classification_tokens(data[2], max_seq_length=10)
val_token_ids, val_input_mask, val_segment_ids = val_data
classifier = XLNetSequenceClassifier(language=Language.ENGLISHCASED, num_labels=3)
classifier.fit(
token_ids=token_ids,
input_mask=input_mask,
token_type_ids=segment_ids,
labels=data[1],
val_token_ids=val_token_ids,
val_input_mask=val_input_mask,
val_labels=data[3],
val_token_type_ids=val_segment_ids,
)
preds = classifier.predict(
token_ids=token_ids, input_mask=input_mask, token_type_ids=segment_ids
)
assert len(preds) == len(data[1])

View file

@ -68,11 +68,13 @@ PIP_BASE = {
"nteract-scrapbook": "nteract-scrapbook>=0.2.1",
"pydocumentdb": "pydocumentdb>=2.3.3",
"pytorch-pretrained-bert": "pytorch-pretrained-bert>=0.6",
"pytorch-transformers": "pytorch-transformers>=1.2.0",
"tqdm": "tqdm==4.31.1",
"pyemd": "pyemd==0.5.1",
"ipywebrtc": "ipywebrtc==0.4.3",
"pre-commit": "pre-commit>=1.14.4",
"scikit-learn": "scikit-learn>=0.19.0,<=0.20.3",
"seaborn": "seaborn>=0.9.0",
"setuptools_scm": "setuptools_scm==3.2.0",
"sklearn-crfsuite": "sklearn-crfsuite>=0.3.6",
"spacy": "spacy>=2.1.4",

View file

@ -45,7 +45,8 @@ The models submodule contains implementations of various algorithms that can be
A few highlights are
* BERT
* GenSen
* XLNet
### [Model Explainability](model_explainability)
The model_explainability submodule contains utils that help explain or diagnose models, such as interpreting layers of a neural network.
### [Model Explainability](interpreter)
The interpreter submodule contains utils that help explain or diagnose models, such as interpreting layers of a neural network.

View file

@ -0,0 +1,21 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
from setuptools_scm import get_version
__title__ = "Microsoft NLP"
__author__ = "AI CAT at Microsoft"
__license__ = "MIT"
__copyright__ = "Copyright 2018-present Microsoft Corporation"
# Synonyms
TITLE = __title__
AUTHOR = __author__
LICENSE = __license__
COPYRIGHT = __copyright__
# Determine semantic versioning automatically
# from git commits
__version__ = get_version()
VERSION = __version__

38
utils_nlp/dataset/dac.py Normal file
View file

@ -0,0 +1,38 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
"""Dataset for Arabic Classification utils
https://data.mendeley.com/datasets/v524p5dhpj/2
Mohamed, BINIZ (2018), DataSet for Arabic Classification, Mendeley Data, v2
paper link: ("https://www.mendeley.com/catalogue/
arabic-text-classification-using-deep-learning-technics/")
"""
import os
import pandas as pd
from utils_nlp.dataset.url_utils import extract_zip, maybe_download
URL = (
"https://data.mendeley.com/datasets/v524p5dhpj/2"
"/files/91cb8398-9451-43af-88fc-041a0956ae2d/"
"arabic_dataset_classifiction.csv.zip"
)
def load_pandas_df(local_cache_path=None, num_rows=None):
"""Downloads and extracts the dataset files
Args:
local_cache_path (str, optional): Path to a folder where the downloaded and extracted files are cached. Defaults to None.
num_rows (int): Number of rows to load. If None, all data is loaded.
Returns:
pd.DataFrame: pandas DataFrame containing the loaded dataset.
"""
zip_file = URL.split("/")[-1]
maybe_download(URL, zip_file, local_cache_path)
zip_file_path = os.path.join(local_cache_path, zip_file)
csv_file_path = os.path.join(local_cache_path, zip_file.replace(".zip", ""))
if not os.path.exists(csv_file_path):
extract_zip(file_path=zip_file_path, dest_path=local_cache_path)
return pd.read_csv(csv_file_path, nrows=num_rows)
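
A minimal usage sketch for `load_pandas_df` (the cache folder name is illustrative; create it first if the download utility does not):

```python
# Hypothetical usage sketch for utils_nlp.dataset.dac.load_pandas_df.
import os
from utils_nlp.dataset.dac import load_pandas_df

cache_dir = "./temp"
os.makedirs(cache_dir, exist_ok=True)  # illustrative cache folder
df = load_pandas_df(local_cache_path=cache_dir, num_rows=1000)
print(df.shape)
print(df.head())
```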

View file

@ -23,6 +23,7 @@ DATASET_DICT = {
def download_msrpc(download_dir):
"""Downloads Windows Installer for Microsoft Paraphrase Corpus.
Args:
download_dir (str): File path for the downloaded file

View file

@ -3,9 +3,18 @@
"""Utilities functions for computing general model evaluation metrics."""
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import (
accuracy_score,
precision_score,
recall_score,
f1_score,
confusion_matrix,
)
from numpy import corrcoef
from matplotlib import pyplot
import seaborn as sn
import numpy as np
import pandas as pd
@ -44,3 +53,36 @@ def compute_correlation_coefficients(x, y=None):
pd.DataFrame : A pandas dataframe from the correlation coefficient matrix of the variables.
"""
return pd.DataFrame(corrcoef(x, y))
def plot_confusion_matrix(
y_true,
y_pred,
labels,
normalize=False,
title="Confusion matrix",
plot_size=(8, 5),
font_scale=1.1,
):
"""Function that prints out a graphical representation of confusion matrix using Seaborn Heatmap
Args:
y_true (1d array-like): True labels from dataset
y_pred (1d array-like): Predicted labels from the models
labels: A list of labels
normalize (Bool, optional): Boolean to Set Row Normalization for Confusion Matrix
title (String, optional): String that is the title of the plot
plot_size (tuple, optional): Tuple of Plot Dimensions Default "(8, 5)"
font_scale (float, optional): float type scale factor for font within plot
"""
conf_matrix = np.array(confusion_matrix(y_true, y_pred))
if normalize:
conf_matrix = np.round(
conf_matrix.astype("float") / conf_matrix.sum(axis=1)[:, np.newaxis], 3
)
conf_dataframe = pd.DataFrame(conf_matrix, labels, labels)
fig, ax = pyplot.subplots(figsize=plot_size)
sn.set(font_scale=font_scale)
ax.set_title(title)
ax = sn.heatmap(conf_dataframe, cmap="Blues", annot=True, annot_kws={"size": 16}, fmt="g")
ax.set(xlabel="Predicted Labels", ylabel="True Labels")
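
A short usage sketch for the helper above (the `utils_nlp.eval.classification` import path is an assumption based on this diff; labels and predictions are illustrative):

```python
# Hypothetical usage sketch: plot a row-normalized confusion matrix.
from utils_nlp.eval.classification import plot_confusion_matrix  # assumed module path

y_true = ["sports", "economy", "sports", "politics", "economy"]
y_pred = ["sports", "politics", "sports", "politics", "economy"]
plot_confusion_matrix(
    y_true,
    y_pred,
    labels=["economy", "politics", "sports"],  # sorted to match sklearn's confusion_matrix order
    normalize=True,
    title="Confusion matrix (normalized)",
)
```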

View file

@ -7,7 +7,8 @@ The following table summarizes each submodule.
|Submodule|Description|
|---|---|
|[bert](./bert/README.md)| This submodule includes the BERT-based models for sequence classification, token classification, and sequence necoding.|
|[bert](./bert/README.md)| This submodule includes the BERT-based models for sequence classification, token classification, and sequence encoding.|
|[gensen](./gensen/README.md)| This submodule includes a distributed Pytorch implementation based on [Horovod](https://github.com/horovod/horovod) of [learning general purpose distributed sentence representations via large scale multi-task learning](https://arxiv.org/abs/1804.00079) by refactoring https://github.com/Maluuba/gensen|
|[pretrained embeddings](./pretrained_embeddings) | This submodule provides utilities to download and extract pretrained word embeddings trained with Word2Vec, GloVe, fastText methods.|
|[pytorch_modules](./pytorch_modules/README.md)| This submodule provides Pytorch modules like Gated Recurrent Unit with peepholes. |
|[xlnet](./xlnet/README.md)| This submodule includes the XLNet-based model for sequence classification.|

View file

@ -0,0 +1,13 @@
# XLNet-based Classes
This folder contains utility functions and classes based on the implementation of [PyTorch-Transformers](https://github.com/huggingface/pytorch-transformers).
## Summary
The following table summarizes each Python script.
|Script|Description|
|---|---|
|[common.py](common.py)| This script includes <ul><li>the languages supported by XLNet-based classes</li><li> tokenization for text classification</li> <li>utilities to load data, etc.</li></ul>|
|[sequence_classification.py](sequence_classification.py)| An implementation of sequence classification based on fine-tuning XLNet. It is commonly used for text classification. The module includes logging functionality using MLflow.|
|[utils.py](utils.py)| This script includes a function to visualize a confusion matrix.|
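
A minimal usage sketch, mirroring the unit test added in this PR (the pretrained XLNet weights are downloaded on first use; the tiny inputs are illustrative):

```python
# Sketch: tokenize a few texts and fine-tune/predict with the XLNet classifier.
from utils_nlp.models.xlnet.common import Language, Tokenizer
from utils_nlp.models.xlnet.sequence_classification import XLNetSequenceClassifier

texts = ["hi", "hello", "what's wrong with us", "can I leave?"]
labels = [0, 0, 1, 2]

tokenizer = Tokenizer(language=Language.ENGLISHCASED)
token_ids, input_mask, segment_ids = tokenizer.preprocess_classification_tokens(
    texts, max_seq_length=10
)

classifier = XLNetSequenceClassifier(language=Language.ENGLISHCASED, num_labels=3)
classifier.fit(
    token_ids=token_ids,
    input_mask=input_mask,
    token_type_ids=segment_ids,
    labels=labels,
    val_token_ids=token_ids,        # toy example: validate on the training data
    val_input_mask=input_mask,
    val_labels=labels,
    val_token_type_ids=segment_ids,
)
preds = classifier.predict(
    token_ids=token_ids, input_mask=input_mask, token_type_ids=segment_ids
)
print(preds)
```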

View file

@ -0,0 +1,124 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
# This script reuses some code from
# https://github.com/huggingface/pytorch-transformers/blob/master/examples/utils_glue.py
from enum import Enum
from pytorch_transformers import XLNetTokenizer
from mlflow import log_metric, log_param, log_artifact
class Language(Enum):
"""
An enumeration of the supported pretrained models and languages.
"""
ENGLISHCASED = "xlnet-base-cased" #: Base cased model for xlnet
ENGLISHLARGECASED = "xlnet-large-cased" #: Large cased model for xlnet
class Tokenizer:
def __init__(
self, language=Language.ENGLISHCASED, cache_dir="."
):
"""Initializes the underlying pretrained XLNet tokenizer.
Args:
language (Language, optional): The pretrained model's language.
Defaults to Language.ENGLISHCASED
"""
self.tokenizer = XLNetTokenizer.from_pretrained(language.value, cache_dir=cache_dir)
self.language = language
def preprocess_classification_tokens(self, examples, max_seq_length):
"""Preprocessing of example input tokens:
- add XLNet sentence markers ([CLS] and [SEP])
- pad and truncate sequences
- create an input_mask
- create token type ids, aka. segment ids
Args:
examples (list): List of input strings to preprocess.
max_seq_length (int): Maximum number of tokens
(documents will be truncated or padded to this length).
Returns:
(tuple): A tuple containing:
list of input ids
list of input mask
list of segment ids
"""
features = []
cls_token = self.tokenizer.cls_token
sep_token = self.tokenizer.sep_token
cls_token_segment_id=2
pad_on_left=True
pad_token_segment_id=4
sequence_a_segment_id=0
cls_token_at_end=True
mask_padding_with_zero=True
pad_token=0
list_input_ids = []
list_input_mask = []
list_segment_ids = []
for (ex_index, example) in enumerate(examples):
tokens_a = self.tokenizer.tokenize(example)
if len(tokens_a) > max_seq_length - 2:
tokens_a = tokens_a[:(max_seq_length - 2)]
tokens = tokens_a + [sep_token]
segment_ids = [sequence_a_segment_id] * len(tokens)
if cls_token_at_end:
tokens = tokens + [cls_token]
segment_ids = segment_ids + [cls_token_segment_id]
else:
tokens = [cls_token] + tokens
segment_ids = [cls_token_segment_id] + segment_ids
input_ids = self.tokenizer.convert_tokens_to_ids(tokens)
# The mask has 1 for real tokens and 0 for padding tokens. Only real
# tokens are attended to.
input_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)
# Zero-pad up to the sequence length.
padding_length = max_seq_length - len(input_ids)
if pad_on_left:
input_ids = ([pad_token] * padding_length) + input_ids
input_mask = ([0 if mask_padding_with_zero else 1] * padding_length) + input_mask
segment_ids = ([pad_token_segment_id] * padding_length) + segment_ids
else:
input_ids = input_ids + ([pad_token] * padding_length)
input_mask = input_mask + ([0 if mask_padding_with_zero else 1] * padding_length)
segment_ids = segment_ids + ([pad_token_segment_id] * padding_length)
assert len(input_ids) == max_seq_length
assert len(input_mask) == max_seq_length
assert len(segment_ids) == max_seq_length
list_input_ids.append(input_ids)
list_input_mask.append(input_mask)
list_segment_ids.append(segment_ids)
# features.append({"input_ids":input_ids,"input_mask":input_mask,"segment_ids":segment_ids,"label_id":label_id})
return (list_input_ids, list_input_mask, list_segment_ids)
def log_xlnet_params(local_dict):
"""wrapper that abstracts away logging of ipython notebook local training parameters described at definition
Args:
local_dict(dict): dict containing all local varaibles from notebook
"""
params = ["DATA_FOLDER","XLNET_CACHE_DIR","LANGUAGE","MAX_SEQ_LENGTH","BATCH_SIZE","NUM_GPUS",
"NUM_EPOCHS","TRAIN_SIZE","LABEL_COL","TEXT_COL","LEARNING_RATE","WEIGHT_DECAY",
"ADAM_EPSILON","WARMUP_STEPS","DEBUG"]
for i in params:
log_param(i,local_dict[i])
return

View file

@ -0,0 +1,371 @@
import numpy as np
from collections import namedtuple
import torch
import torch.nn as nn
from pytorch_transformers import (
XLNetConfig,
XLNetForSequenceClassification,
AdamW,
WarmupLinearSchedule
)
from tqdm import tqdm
from torch.utils.data import (
DataLoader,
RandomSampler,
TensorDataset,
)
from utils_nlp.common.pytorch_utils import get_device, move_to_device
from utils_nlp.models.xlnet.common import Language
import mlflow
import mlflow.pytorch
import os
class XLNetSequenceClassifier:
"""XLNet-based sequence classifier"""
def __init__(
self,
language=Language.ENGLISHCASED,
num_labels=5,
cache_dir=".",
num_gpus=None,
num_epochs=1,
batch_size=8,
lr=5e-5,
adam_eps=1e-8,
warmup_steps=0,
weight_decay=0.0,
max_grad_norm=1.0,
):
"""Initializes the classifier and the underlying pretrained model.
Args:
language (Language, optional): The pretrained model's language.
Defaults to 'xlnet-base-cased'.
num_labels (int, optional): The number of unique labels in the
training data. Defaults to 5.
cache_dir (str, optional): Location of XLNet's cache directory.
Defaults to ".".
num_gpus (int, optional): The number of gpus to use.
If None is specified, all available GPUs
will be used. Defaults to None.
num_epochs (int, optional): Number of training epochs.
Defaults to 1.
batch_size (int, optional): Training batch size. Defaults to 8.
lr (float): Learning rate of the Adam optimizer. Defaults to 5e-5.
adam_eps (float, optional): term added to the denominator to improve
numerical stability. Defaults to 1e-8.
warmup_steps (int, optional): Number of steps over which to increase the
learning rate linearly from 0 to the specified learning rate. Defaults to 0.
weight_decay (float, optional): Weight decay. Defaults to 0.
max_grad_norm (float, optional): Maximum norm for the gradients. Defaults to 1.0
"""
if num_labels < 2:
raise ValueError("Number of labels should be at least 2.")
self.language = language
self.num_labels = num_labels
self.cache_dir = cache_dir
self.num_gpus = num_gpus
self.num_epochs = num_epochs
self.batch_size = batch_size
self.lr = lr
self.adam_eps = adam_eps
self.warmup_steps = warmup_steps
self.weight_decay = weight_decay
self.max_grad_norm = max_grad_norm
# create classifier
self.config = XLNetConfig.from_pretrained(
self.language.value, num_labels=num_labels, cache_dir=cache_dir
)
self.model = XLNetForSequenceClassification(self.config)
def fit(
self,
token_ids,
input_mask,
labels,
val_token_ids,
val_input_mask,
val_labels,
token_type_ids=None,
val_token_type_ids=None,
verbose=True,
logging_steps=0,
save_steps=0,
val_steps=0,
):
"""Fine-tunes the XLNet classifier using the given training data.
Args:
token_ids (list): List of training token id lists.
input_mask (list): List of input mask lists.
labels (list): List of training labels.
val_token_ids (list): List of validation token id lists.
val_input_mask (list): List of validation input mask lists.
val_labels (list): List of validation labels.
token_type_ids (list, optional): List of lists. Each sublist
contains segment ids indicating if the token belongs to
the first sentence (0) or the second sentence (1). Only needed
for two-sentence tasks.
val_token_type_ids (list, optional): Validation counterpart of
token_type_ids. Only needed for two-sentence tasks.
verbose (bool, optional): If True, shows the training progress and
loss values. Defaults to True.
"""
device = get_device("cpu" if self.num_gpus == 0 or not torch.cuda.is_available() else "gpu")
self.model = move_to_device(self.model, device, self.num_gpus)
token_ids_tensor = torch.tensor(token_ids, dtype=torch.long)
input_mask_tensor = torch.tensor(input_mask, dtype=torch.long)
labels_tensor = torch.tensor(labels, dtype=torch.long)
val_token_ids_tensor = torch.tensor(val_token_ids, dtype=torch.long)
val_input_mask_tensor = torch.tensor(val_input_mask, dtype=torch.long)
val_labels_tensor = torch.tensor(val_labels, dtype=torch.long)
if token_type_ids:
token_type_ids_tensor = torch.tensor(token_type_ids, dtype=torch.long)
val_token_type_ids_tensor = torch.tensor(val_token_type_ids, dtype=torch.long)
train_dataset = TensorDataset(
token_ids_tensor, input_mask_tensor, token_type_ids_tensor, labels_tensor
)
val_dataset = TensorDataset(
val_token_ids_tensor,
val_input_mask_tensor,
val_token_type_ids_tensor,
val_labels_tensor,
)
else:
train_dataset = TensorDataset(token_ids_tensor, input_mask_tensor, labels_tensor)
val_dataset = TensorDataset(
val_token_ids_tensor, val_input_mask_tensor, val_labels_tensor
)
# define optimizer and model parameters
param_optimizer = list(self.model.named_parameters())
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
{
"params": [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
"weight_decay": self.weight_decay,
},
{
"params": [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
"weight_decay": 0.0,
},
]
val_sampler = RandomSampler(val_dataset)
val_dataloader = DataLoader(
val_dataset,
sampler=val_sampler,
batch_size=self.batch_size
)
num_examples = len(token_ids)
num_batches = int(np.ceil(num_examples/self.batch_size))
num_train_optimization_steps = num_batches * self.num_epochs
optimizer = AdamW(optimizer_grouped_parameters, lr=self.lr, eps=self.adam_eps)
scheduler = WarmupLinearSchedule(
optimizer, warmup_steps=self.warmup_steps, t_total=num_train_optimization_steps
)
global_step = 0
self.model.train()
optimizer.zero_grad()
for epoch in range(self.num_epochs):
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(
train_dataset, sampler=train_sampler, batch_size=self.batch_size
)
tr_loss = 0.0
logging_loss = 0.0
val_loss = 0.0
for i, batch in enumerate(tqdm(train_dataloader, desc="Iteration")):
if token_type_ids:
x_batch, mask_batch, token_type_ids_batch, y_batch = tuple(
t.to(device) for t in batch
)
else:
token_type_ids_batch = None
x_batch, mask_batch, y_batch = tuple(t.to(device) for t in batch)
outputs = self.model(
input_ids=x_batch,
token_type_ids=token_type_ids_batch,
attention_mask=mask_batch,
labels=y_batch,
)
loss = outputs[0] # model outputs are always tuple in pytorch-transformers
loss.sum().backward()
torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.max_grad_norm)
tr_loss += loss.sum().item()
optimizer.step()
# Update learning rate schedule
scheduler.step()
optimizer.zero_grad()
global_step += 1
# logging of learning rate and loss
if logging_steps > 0 and global_step % logging_steps == 0:
mlflow.log_metric("learning rate", scheduler.get_lr()[0], step=global_step)
mlflow.log_metric(
"training loss",
(tr_loss - logging_loss) / (logging_steps * self.batch_size),
step=global_step,
)
logging_loss = tr_loss
# model checkpointing
if save_steps > 0 and global_step % save_steps == 0:
checkpoint_dir = os.path.join(os.getcwd(), "checkpoints")
if not os.path.isdir(checkpoint_dir):
os.makedirs(checkpoint_dir)
checkpoint_path = checkpoint_dir + "/" + str(global_step) + ".pth"
torch.save(self.model.state_dict(), checkpoint_path)
mlflow.log_artifact(checkpoint_path)
# model validation
if val_steps > 0 and global_step % val_steps == 0:
# run model on validation set
self.model.eval()
val_loss = 0.0
for j, val_batch in enumerate(val_dataloader):
if token_type_ids:
val_x_batch, val_mask_batch, val_token_type_ids_batch, \
val_y_batch = tuple(
t.to(device) for t in val_batch
)
                        else:
                            val_token_type_ids_batch = None
                            val_x_batch, val_mask_batch, val_y_batch = tuple(
                                t.to(device) for t in val_batch
                            )
val_outputs = self.model(
input_ids=val_x_batch,
token_type_ids=val_token_type_ids_batch,
attention_mask=val_mask_batch,
labels=val_y_batch,
)
vloss = val_outputs[0]
val_loss += vloss.sum().item()
mlflow.log_metric(
"validation loss", val_loss / len(val_dataset), step=global_step
)
self.model.train()
if verbose:
if i % ((num_batches // 10) + 1) == 0:
if val_loss > 0:
print(
"epoch:{}/{}; batch:{}->{}/{}; average training loss:{:.6f};\
average val loss:{:.6f}".format(
epoch + 1,
self.num_epochs,
i + 1,
min(i + 1 + num_batches // 10, num_batches),
num_batches,
tr_loss / (i + 1),
val_loss / (j + 1),
),
)
else:
print(
"epoch:{}/{}; batch:{}->{}/{}; average train loss:{:.6f}".format(
epoch + 1,
self.num_epochs,
i + 1,
min(i + 1 + num_batches // 10, num_batches),
num_batches,
tr_loss / (i + 1),
)
)
checkpoint_dir = os.path.join(os.getcwd(), "checkpoints")
if not os.path.isdir(checkpoint_dir):
os.makedirs(checkpoint_dir)
checkpoint_path = checkpoint_dir + "/" + "final" + ".pth"
torch.save(self.model.state_dict(), checkpoint_path)
mlflow.log_artifact(checkpoint_path)
# empty cache
del [x_batch, y_batch, mask_batch, token_type_ids_batch]
if val_steps > 0:
del [val_x_batch, val_y_batch, val_mask_batch, val_token_type_ids_batch]
torch.cuda.empty_cache()
def predict(
self,
token_ids,
input_mask,
token_type_ids=None,
num_gpus=None,
batch_size=8,
probabilities=False,
):
"""Scores the given dataset and returns the predicted classes.
Args:
            token_ids (list): List of token id lists for the data to score.
input_mask (list): List of input mask lists.
token_type_ids (list, optional): List of lists. Each sublist
contains segment ids indicating if the token belongs to
the first sentence(0) or second sentence(1). Only needed
for two-sentence tasks.
num_gpus (int, optional): The number of gpus to use.
If None is specified, all available GPUs
will be used. Defaults to None.
batch_size (int, optional): Scoring batch size. Defaults to 8.
probabilities (bool, optional):
If True, the predicted probability distribution
is also returned. Defaults to False.
Returns:
1darray, namedtuple(1darray, ndarray): Predicted classes or
(classes, probabilities) if probabilities is True.
"""
device = get_device("cpu" if num_gpus == 0 or not torch.cuda.is_available() else "gpu")
self.model = move_to_device(self.model, device, num_gpus)
self.model.eval()
preds = []
with tqdm(total=len(token_ids)) as pbar:
for i in range(0, len(token_ids), batch_size):
start = i
end = start + batch_size
x_batch = torch.tensor(token_ids[start:end], dtype=torch.long, device=device)
mask_batch = torch.tensor(input_mask[start:end], dtype=torch.long, device=device)
                if token_type_ids is not None:
                    token_type_ids_batch = torch.tensor(
                        token_type_ids[start:end], dtype=torch.long, device=device
                    )
                else:
                    token_type_ids_batch = None
with torch.no_grad():
pred_batch = self.model(
input_ids=x_batch,
token_type_ids=token_type_ids_batch,
attention_mask=mask_batch,
labels=None,
)
preds.append(pred_batch[0].cpu())
                pbar.update(min(batch_size, len(token_ids) - i))
preds = np.concatenate(preds)
if probabilities:
return namedtuple("Predictions", "classes probabilities")(
preds.argmax(axis=1), nn.Softmax(dim=1)(torch.Tensor(preds)).numpy()
)
else:
return preds.argmax(axis=1)
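# ---------------------------------------------------------------------------
# Minimal usage sketch (illustrative only; the toy token ids, masks, and
# labels below are hypothetical placeholders rather than real XLNet tokenizer
# output). In practice the inputs come from the tokenization utilities in
# utils_nlp.models.xlnet.common.
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    # two toy "documents", already padded to a fixed length of 5 token ids;
    # a mask value of 1 marks a real token, 0 marks padding
    toy_token_ids = [[35, 64, 27, 0, 0], [12, 5, 89, 43, 0]]
    toy_input_mask = [[1, 1, 1, 0, 0], [1, 1, 1, 1, 0]]
    toy_labels = [0, 1]

    classifier = XLNetSequenceClassifier(
        language=Language.ENGLISHCASED,  # downloads xlnet-base-cased on first use
        num_labels=2,
        num_epochs=1,
        batch_size=2,
    )
    # the toy data doubles as the validation set purely for illustration
    classifier.fit(
        token_ids=toy_token_ids,
        input_mask=toy_input_mask,
        labels=toy_labels,
        val_token_ids=toy_token_ids,
        val_input_mask=toy_input_mask,
        val_labels=toy_labels,
    )
    predictions = classifier.predict(toy_token_ids, toy_input_mask, batch_size=2)
    print(predictions)  # 1-d array with one predicted class per example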