Merge pull request #405 from microsoft/staging

Staging to Master
This commit is contained in:
Said Bleik 2019-09-12 13:26:46 -04:00 committed by GitHub
Parent b78a18e5ed 9ba76e01ba
Commit fec3deeca4
No key was found matching this signature
GPG key ID: 4AEE18F83AFDEB23
26 changed files: 4117 additions and 43 deletions

View file

@ -16,18 +16,36 @@ In an era of transfer learning, transformers, and deep architectures, we believe
>
>
## Focus areas
The repository aims to expand NLP capabilities along three separate dimensions:
### Scenarios
We aim to have end-to-end examples of common tasks and scenarios such as text classification, named entity recognition, etc.
### Algorithms
We aim to support multiple models for each of the supported scenarios. Currently, BERT-based models are supported across most scenarios. We are working to integrate [pytorch-transformers](https://github.com/huggingface/pytorch-transformers) to allow use of many more models.
### Languages
We strongly subscribe to the multi-language principles laid down by [Emily Bender](http://faculty.washington.edu/ebender/papers/Bender-SDSS-2019.pdf):
* "Natural language is not a synonym for English"
* "English isn't generic for language, despite what NLP papers might lead you to believe"
* "Always name the language you are working on" ([Bender rule](https://www.aclweb.org/anthology/Q18-1041/))
The repository aims to support non-English languages across all the scenarios. Pre-trained models used in the repository, such as BERT and FastText, support 100+ languages out of the box. Our goal is to provide end-to-end examples in as many languages as possible. We encourage community contributions in this area.
## Content
The following is a summary of the commonly used NLP scenarios covered in the repository. Each scenario is demonstrated in one or more [Jupyter notebook examples](examples) that make use of the core code base of models and repository utilities.
| Scenario | Models | Description|
|-------------------------| ------------------- |-------|
|Text Classification |BERT| Text classification is a supervised learning task of predicting the category or class of a document given its text content. |
|Named Entity Recognition |BERT| Named entity recognition (NER) is the task of classifying words or key phrases of a text into predefined entities of interest. |
|Entailment |BERT| Textual entailment is the task of classifying the binary relation between two natural-language texts, the text and the hypothesis, to determine whether the `text` agrees with the `hypothesis` or not. |
|Question Answering |BiDAF <br> BERT| Question answering (QA) is the task of retrieving or generating a valid answer for a given query in natural language, provided with a passage related to the query. |
|Sentence Similarity |Representation: TF-IDF, Word Embeddings, Doc Embeddings<br>Metrics: Cosine Similarity, Word Mover's Distance<br>Models: BERT, GenSen| Sentence similarity is the process of computing a similarity score given a pair of text documents. |
|Embeddings| Word2Vec<br>fastText<br>GloVe| Embedding is the process of converting a word or a piece of text into a continuous vector space of real numbers, usually in a low dimension.|
| Scenario | Models | Description|Languages|
|-------------------------| ------------------- |-------|---|
|Text Classification |BERT <br> XLNet| Text classification is a supervised learning task of predicting the category or class of a document given its text content. |English, Hindi, Arabic|
|Named Entity Recognition |BERT| Named entity recognition (NER) is the task of classifying words or key phrases of a text into predefined entities of interest. |English|
|Entailment |BERT| Textual entailment is the task of classifying the binary relation between two natural-language texts, the text and the hypothesis, to determine whether the `text` agrees with the `hypothesis` or not. |English|
|Question Answering |BiDAF <br> BERT| Question answering (QA) is the task of retrieving or generating a valid answer for a given query in natural language, provided with a passage related to the query. |English|
|Sentence Similarity |Representation: TF-IDF, Word Embeddings, Doc Embeddings<br>Metrics: Cosine Similarity, Word Mover's Distance<br>Models: BERT, GenSen| Sentence similarity is the process of computing a similarity score given a pair of text documents. |English|
|Embeddings| Word2Vec<br>fastText<br>GloVe| Embedding is the process of converting a word or a piece of text into a continuous vector space of real numbers, usually in a low dimension.|English|
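
As a quick, self-contained illustration of the sentence similarity scenario above (this sketch uses scikit-learn only and is not part of the repository's utilities), a TF-IDF representation can be paired with a cosine similarity metric:

```python
# Hypothetical sketch: score two sentences with TF-IDF + cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "Natural language is not a synonym for English.",
    "English is not generic for language.",
]
vectors = TfidfVectorizer().fit_transform(texts)  # 2 x vocabulary sparse matrix
score = cosine_similarity(vectors[0], vectors[1])[0, 0]
print("similarity: {:.3f}".format(score))
```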
## Getting Started
While solving NLP problems, it is always good to start with the prebuilt [Cognitive Services](https://azure.microsoft.com/en-us/services/cognitive-services/directory/lang/). When your needs go beyond what the prebuilt cognitive services offer and you want custom machine learning methods, you will find this repository very useful. To get started, navigate to the [Setup Guide](SETUP.md), which lists instructions on how to set up your environment and dependencies.

19
docs/Makefile Normal file
View file

@ -0,0 +1,19 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
SOURCEDIR = source
BUILDDIR = build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

View file

@ -0,0 +1,13 @@
# Documentation
To set up the documentation, first install the dependencies of the CPU environment by following the [SETUP.md](../SETUP.md) instructions. Then run:
conda activate nlp_cpu
pip install sphinx_rtd_theme
To build the documentation as HTML:
cd docs
make html

21
docs/source/azureml.rst Normal file
View file

@ -0,0 +1,21 @@
.. _azureml:
AzureML module
**************************
AzureML module from NLP utilities.
AzureML utils
===============================
.. automodule:: utils_nlp.azureml.azureml_utils
:members:
AzureML utils for BERT
===============================
.. automodule:: utils_nlp.azureml.azureml_bert_util
:members:

238
docs/source/conf.py Normal file
View file

@ -0,0 +1,238 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
# -*- coding: utf-8 -*-
#
# Configuration file for the Sphinx documentation builder.
#
# This file does only contain a selection of the most common options. For a
# full list see the documentation:
# http://www.sphinx-doc.org/en/master/config
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import sys
sys.path.insert(0, os.path.abspath(os.path.join("..", "..")))
sys.setrecursionlimit(1500)
from utils_nlp import TITLE, VERSION, COPYRIGHT, AUTHOR
# -- Project information -----------------------------------------------------
project = TITLE
copyright = COPYRIGHT
author = AUTHOR
# The short X.Y version
version = ".".join(VERSION.split(".")[:2])
# The full version, including alpha/beta/rc tags
release = VERSION
prefix = "NLP"
# -- General configuration ---------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
#
# needs_sphinx = '1.0'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
"sphinx.ext.autodoc",
"sphinx.ext.doctest",
"sphinx.ext.intersphinx",
"sphinx.ext.ifconfig",
"sphinx.ext.viewcode", # Add links to highlighted source code
"sphinx.ext.napoleon", # to render Google format docstrings
]
# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
#
# source_suffix = ['.rst', '.md']
source_suffix = ".rst"
# The master toctree document.
master_doc = "index"
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ["Thumbs.db", ".DS_Store"]
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = None
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = "sphinx_rtd_theme"
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#
# html_theme_options = {}
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
# html_static_path = ["images"]
# Custom sidebar templates, must be a dictionary that maps document names
# to template names.
#
# The default sidebars (for documents that don't match any pattern) are
# defined by theme itself. Builtin themes are using these templates by
# default: ``['localtoc.html', 'relations.html', 'sourcelink.html',
# 'searchbox.html']``.
#
# html_sidebars = {}
# -- Options for HTMLHelp output ---------------------------------------------
# Output file base name for HTML help builder.
htmlhelp_basename = prefix + "doc"
# -- Options for LaTeX output ------------------------------------------------
latex_elements = {
"papersize": "letterpaper",
"pointsize": "10pt",
"figure_align": "htbp",
"preamble": r"""
%% Adding source listings https://en.wikibooks.org/wiki/LaTeX/Source_Code_Listings
\usepackage{listings}
\usepackage{color}
\definecolor{mygreen}{rgb}{0,0.6,0}
\definecolor{mygray}{rgb}{0.5,0.5,0.5}
\definecolor{mymauve}{rgb}{0.58,0,0.82}
\lstset{
backgroundcolor=\color{white}, % choose the background color; you must add \usepackage{color} or \usepackage{xcolor}; should come as last argument
basicstyle=\footnotesize, % the size of the fonts that are used for the code
breakatwhitespace=false, % sets if automatic breaks should only happen at whitespace
breaklines=true, % sets automatic line breaking
captionpos=b, % sets the caption-position to bottom
commentstyle=\color{mygreen}, % comment style
deletekeywords={...}, % if you want to delete keywords from the given language
escapeinside={\%*}{*)}, % if you want to add LaTeX within your code
extendedchars=true, % lets you use non-ASCII characters; for 8-bits encodings only, does not work with UTF-8
firstnumber=1000, % start line enumeration with line 1000
frame=single, % adds a frame around the code
keepspaces=true, % keeps spaces in text, useful for keeping indentation of code (possibly needs columns=flexible)
keywordstyle=\color{blue}, % keyword style
language=Python, % the language of the code
morekeywords={*,...}, % if you want to add more keywords to the set
numbers=left, % where to put the line-numbers; possible values are (none, left, right)
numbersep=5pt, % how far the line-numbers are from the code
numberstyle=\tiny\color{mygray}, % the style that is used for the line-numbers
rulecolor=\color{black}, % if not set, the frame-color may be changed on line-breaks within not-black text (e.g. comments (green here))
showspaces=false, % show spaces everywhere adding particular underscores; it overrides 'showstringspaces'
showstringspaces=false, % underline spaces within strings only
showtabs=false, % show tabs within strings adding particular underscores
stepnumber=2, % the step between two line-numbers. If it's 1, each line will be numbered
stringstyle=\color{mymauve}, % string literal style
tabsize=2, % sets default tabsize to 2 spaces
title=\lstname % show the filename of files included with \lstinputlisting; also try caption instead of title
}
""",
}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [(master_doc, prefix + ".tex", prefix + " Documentation", prefix, "manual")]
# -- Options for manual page output ------------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [(master_doc, prefix, prefix + " Documentation", [author], 1)]
# -- Options for Texinfo output ----------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
(
master_doc,
prefix,
prefix + " Documentation",
author,
prefix,
"One line description of project.",
"Miscellaneous",
)
]
# -- Options for Epub output -------------------------------------------------
# Bibliographic Dublin Core info.
epub_title = project
# The unique identifier of the text. This can be a ISBN number
# or the project homepage.
#
# epub_identifier = ''
# A unique identification for the text.
#
# epub_uid = ''
# A list of files that should not be packed into the epub file.
epub_exclude_files = ["search.html"]
# -- Extension configuration -------------------------------------------------
# -- Options for intersphinx extension ---------------------------------------
# Example configuration for intersphinx: refer to the Python standard library.
intersphinx_mapping = {"https://docs.python.org/": None}
##################################################
# Other options
# html_favicon = os.path.join(html_static_path[0], "favicon.ico")
# Ensure that __init__() is always documented
# source: https://stackoverflow.com/a/5599712
def skip(app, what, name, obj, would_skip, options):
if name == "__init__":
return False
return would_skip
def setup(app):
app.connect("autodoc-skip-member", skip)

27
docs/source/index.rst Normal file
View file

@ -0,0 +1,27 @@
NLP Utilities
===================================================
The `NLP repository <https://github.com/Microsoft/NLP>`_ provides examples and best practices for building NLP systems, delivered as Jupyter notebooks.
The module `utils_nlp <https://github.com/microsoft/nlp/tree/master/utils_nlp>`_ contains functions to simplify common tasks used when developing and
evaluating NLP systems.
.. toctree::
:maxdepth: 1
:caption: Contents:
AzureML <azureml>
Common <common>
Dataset <dataset>
Evaluation <eval>
NLP Algorithms <model>
NLP Interpretability <interpreter>
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

View file

@ -5,7 +5,7 @@ This folder contains examples and best practices, written in Jupyter notebooks,
|Category|Applications|Methods|Languages|
|---| ------------------------ | ------------------- |---|
|[Text Classification](text_classification)|Topic Classification|BERT|en|
|[Text Classification](text_classification)|Topic Classification|BERT, XLNet|en, hi, ar|
|[Named Entity Recognition](named_entity_recognition) |Wikipedia NER|BERT|en|
|[Entailment](entailment)|MultiNLI Natural Language Inference|BERT|en|
|[Question Answering](question_answering) |SQuAD|BiDAF, BERT|en|

View file

@ -3,8 +3,7 @@ This folder contains examples and best practices, written in Jupyter notebooks,
utility scripts in the [utils_nlp](../../utils_nlp) folder to speed up data preprocessing and model building for text classification.
The models can be used in a wide variety of applications, such as
sentiment analysis, document indexing in digital libraries, hate speech detection, and general-purpose categorization in medical, academic, legal, and many other domains.
Currently, we focus on fine-tuning pre-trained BERT
model. We plan to continue adding state-of-the-art models as they come up and welcome community
Currently, we focus on fine-tuning pre-trained BERT and XLNet models. We plan to continue adding state-of-the-art models as they come up and welcome community
contributions.
## What is Text Classification?
@ -21,3 +20,6 @@ The following summarizes each notebook for Text Classification. Each notebook pr
|---|---|---|---|
|[BERT for text classification with MNLI](tc_mnli_bert.ipynb)|Local| A notebook which walks through fine-tuning and evaluating a pre-trained BERT model on a subset of the MultiNLI dataset|[MultiNLI](https://www.nyu.edu/projects/bowman/multinli/)|
|[BERT for text classification on AzureML](tc_bert_azureml.ipynb) |Azure ML|A notebook which walks through fine-tuning and evaluating a pre-trained BERT model on a distributed setup with AzureML. |[MultiNLI](https://www.nyu.edu/projects/bowman/multinli/)|
|[XLNet for text classification with MNLI](tc_mnli_xlnet.ipynb)|Local| A notebook which walks through fine-tuning and evaluating a pre-trained XLNet model on a subset of the MultiNLI dataset|[MultiNLI](https://www.nyu.edu/projects/bowman/multinli/)|
|[BERT for text classification of Hindi BBC News](tc_bbc_bert_hi.ipynb)|Local| A notebook which walks through fine-tuning and evaluating a pre-trained BERT model on Hindi BBC news data|[BBC Hindi News](https://github.com/NirantK/hindi2vec/releases/tag/bbc-hindi-v0.1)|
|[BERT for text classification of Arabic News](tc_dac_bert_ar.ipynb)|Local| A notebook which walks through fine-tuning and evaluating a pre-trained BERT model on Arabic news articles|[DAC](https://data.mendeley.com/datasets/v524p5dhpj/2)|
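
The notebooks above are executed end-to-end by the integration tests in this PR using [papermill](https://github.com/nteract/papermill). A minimal sketch of running the Arabic news notebook locally with overridden parameters (parameter names come from the notebook's `parameters` cell; paths and values are illustrative):

```python
# Sketch: run tc_dac_bert_ar.ipynb with papermill, overriding its parameters cell.
import papermill as pm

pm.execute_notebook(
    "tc_dac_bert_ar.ipynb",           # input notebook (in this folder)
    "tc_dac_bert_ar.output.ipynb",    # executed copy with cell outputs
    kernel_name="nlp_gpu",            # kernel created by the repo's GPU environment
    parameters=dict(NUM_GPUS=1, NUM_EPOCHS=1, NUM_ROWS=15000, TRAIN_SIZE=0.8),
)
```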

File diff not shown because it is too large. Load diff

View file

@ -0,0 +1,821 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Copyright (c) Microsoft Corporation. All rights reserved.*\n",
"\n",
"*Licensed under the MIT License.*\n",
"\n",
"# Classification of Arabic News Articles using BERT"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import os\n",
"import sys\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"import scrapbook as sb\n",
"import torch\n",
"import torch.nn as nn\n",
"from sklearn.metrics import accuracy_score, classification_report\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"sys.path.append(\"../../\")\n",
"from utils_nlp.common.timer import Timer\n",
"from utils_nlp.dataset.dac import load_pandas_df\n",
"from utils_nlp.models.bert.common import Language, Tokenizer\n",
"from utils_nlp.models.bert.sequence_classification import BERTSequenceClassifier"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\n",
"In this notebook, we fine-tune and evaluate a pretrained [BERT](https://arxiv.org/abs/1810.04805) model on an Arabic dataset of news articles. The [dataset](https://data.mendeley.com/datasets/v524p5dhpj/2) includes articles from 3 different newspapers, and the articles are categorized into 5 classes: *sports, politics, culture, economy and diverse*. The data is described in more detail in this [paper](http://article.nadiapub.com/IJGDC/vol11_no9/9.pdf).\n",
"\n",
"We use a [sequence classifier](../../utils_nlp/bert/sequence_classification.py) that wraps [Hugging Face's PyTorch implementation](https://github.com/huggingface/pytorch-pretrained-BERT) of Google's [BERT](https://github.com/google-research/bert). The classifier loads a pretrained [multilingual BERT model](https://github.com/google-research/bert/blob/master/multilingual.md) that was trained on 104 languages, including Arabic."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"tags": [
"parameters"
]
},
"outputs": [],
"source": [
"DATA_FOLDER = \"./temp\"\n",
"BERT_CACHE_DIR = \"./temp\"\n",
"LANGUAGE = Language.MULTILINGUAL\n",
"MAX_LEN = 200\n",
"BATCH_SIZE = 32\n",
"NUM_GPUS = 2\n",
"NUM_EPOCHS = 1\n",
"TRAIN_SIZE = 0.8\n",
"NUM_ROWS = 15000\n",
"RANDOM_STATE = 0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read Dataset\n",
"We start by loading the data. The following line also downloads the file if it doesn't exist, and extracts the csv file into the specified data folder. We retain a subset, of size *NUM_ROWS*, of the data for quicker model training."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"df = load_pandas_df(DATA_FOLDER).sample(NUM_ROWS, random_state=RANDOM_STATE)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>text</th>\n",
" <th>targe</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>80414</th>\n",
" <td>فاز فريق الدفاع الحسني الجديدي على مضيفه الكوك...</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6649</th>\n",
" <td>أمام آلاف مشاهد من لبنان ومصر والمغرب والإمارا...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3722</th>\n",
" <td>أخبارنا المغربية بعد أن أصدرت المحكمة الإبتداي...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>82317</th>\n",
" <td>الفريق طبق قانونا قبل المصادقة عليه وجدل حول ه...</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5219</th>\n",
" <td>المطرب المصري يخوض حملة إعلامية لترويج ألبومه ...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" text targe\n",
"80414 فاز فريق الدفاع الحسني الجديدي على مضيفه الكوك... 4\n",
"6649 أمام آلاف مشاهد من لبنان ومصر والمغرب والإمارا... 0\n",
"3722 أخبارنا المغربية بعد أن أصدرت المحكمة الإبتداي... 0\n",
"82317 الفريق طبق قانونا قبل المصادقة عليه وجدل حول ه... 4\n",
"5219 المطرب المصري يخوض حملة إعلامية لترويج ألبومه ... 0"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# set the text and label columns\n",
"text_col = df.columns[0]\n",
"label_col = df.columns[1]"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# remove empty documents\n",
"df = df[df[text_col].isna() == False]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Inspect the distribution of labels:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4 5844\n",
"3 2796\n",
"1 2139\n",
"0 1917\n",
"2 1900\n",
"Name: targe, dtype: int64"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[label_col].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We compare the counts with those presented in the author's [paper](http://article.nadiapub.com/IJGDC/vol11_no9/9.pdf), and infer the following label mapping:\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>label</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>culture</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>diverse</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>economy</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>politics</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>sports</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" label\n",
"0 culture\n",
"1 diverse\n",
"2 economy\n",
"3 politics\n",
"4 sports"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# ordered list of labels\n",
"labels = [\"culture\", \"diverse\", \"economy\", \"politics\", \"sports\"]\n",
"num_labels = len(labels)\n",
"pd.DataFrame({\"label\": labels})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we split the data for training and testing:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of training examples: 11676\n",
"Number of testing examples: 2920\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/media/bleik2/miniconda3/envs/nlp_gpu/lib/python3.6/site-packages/sklearn/model_selection/_split.py:2179: FutureWarning: From version 0.21, test_size will always complement train_size unless both are specified.\n",
" FutureWarning)\n"
]
}
],
"source": [
"df_train, df_test = train_test_split(df, train_size = TRAIN_SIZE, random_state=RANDOM_STATE)\n",
"print(\"Number of training examples: {}\".format(df_train.shape[0]))\n",
"print(\"Number of testing examples: {}\".format(df_test.shape[0]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tokenize and Preprocess"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before training, we tokenize the text documents and convert them to lists of tokens. The following steps instantiate a BERT tokenizer given the language, and tokenize the text of the training and testing sets."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 11676/11676 [00:59<00:00, 196.42it/s]\n",
"100%|██████████| 2920/2920 [00:14<00:00, 197.99it/s]\n"
]
}
],
"source": [
"tokenizer = Tokenizer(LANGUAGE, cache_dir=BERT_CACHE_DIR)\n",
"tokens_train = tokenizer.tokenize(list(df_train[text_col].astype(str)))\n",
"tokens_test = tokenizer.tokenize(list(df_test[text_col].astype(str)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In addition, we perform the following preprocessing steps in the cell below:\n",
"- Convert the tokens into token indices corresponding to the BERT tokenizer's vocabulary\n",
"- Add the special tokens [CLS] and [SEP] to mark the beginning and end of a sentence\n",
"- Pad or truncate the token lists to the specified max length\n",
"- Return mask lists that indicate paddings' positions\n",
"\n",
"*See the original [implementation](https://github.com/google-research/bert/blob/master/run_classifier.py) for more information on BERT's input format.*"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"tokens_train, mask_train, _ = tokenizer.preprocess_classification_tokens(\n",
" tokens_train, MAX_LEN\n",
")\n",
"tokens_test, mask_test, _ = tokenizer.preprocess_classification_tokens(\n",
" tokens_test, MAX_LEN\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Model\n",
"Next, we create a sequence classifier that loads a pre-trained BERT model, given the language and number of labels."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"classifier = BERTSequenceClassifier(\n",
" language=LANGUAGE, num_labels=num_labels, cache_dir=BERT_CACHE_DIR\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train\n",
"We train the classifier using the training examples. This involves fine-tuning the BERT Transformer and learning a linear classification layer on top of that:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"t_total value of -1 results in schedule not being applied\n",
"Iteration: 0%| | 1/365 [00:03<21:12, 3.49s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/1; batch:1->37/365; average training loss:1.591262\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 10%|█ | 38/365 [01:02<08:45, 1.61s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/1; batch:38->74/365; average training loss:0.745935\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 21%|██ | 75/365 [02:02<07:52, 1.63s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/1; batch:75->111/365; average training loss:0.593934\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 31%|███ | 112/365 [03:03<06:56, 1.65s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/1; batch:112->148/365; average training loss:0.530150\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 41%|████ | 149/365 [04:03<05:54, 1.64s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/1; batch:149->185/365; average training loss:0.481620\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 51%|█████ | 186/365 [05:05<05:02, 1.69s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/1; batch:186->222/365; average training loss:0.455032\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 61%|██████ | 223/365 [06:06<03:59, 1.69s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/1; batch:223->259/365; average training loss:0.421702\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 71%|███████ | 260/365 [07:08<02:56, 1.68s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/1; batch:260->296/365; average training loss:0.401165\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 81%|████████▏ | 297/365 [08:09<01:52, 1.65s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/1; batch:297->333/365; average training loss:0.382719\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 92%|█████████▏| 334/365 [09:12<00:52, 1.71s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/1; batch:334->365/365; average training loss:0.372204\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 100%|██████████| 365/365 [10:04<00:00, 1.63s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Training time: 0.169 hrs]\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"with Timer() as t:\n",
" classifier.fit(\n",
" token_ids=tokens_train,\n",
" input_mask=mask_train,\n",
" labels=list(df_train[label_col]), \n",
" num_gpus=NUM_GPUS, \n",
" num_epochs=NUM_EPOCHS,\n",
" batch_size=BATCH_SIZE, \n",
" verbose=True,\n",
" ) \n",
"print(\"[Training time: {:.3f} hrs]\".format(t.interval / 3600))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Score\n",
"We score the test set using the trained classifier:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 100%|██████████| 92/92 [00:48<00:00, 2.25it/s]\n"
]
}
],
"source": [
"preds = classifier.predict(\n",
" token_ids=tokens_test, input_mask=mask_test, num_gpus=NUM_GPUS, batch_size=BATCH_SIZE\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluate Results\n",
"Finally, we compute the accuracy, precision, recall, and F1 metrics of the evaluation on the test set."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"accuracy: 0.9277397260273973\n",
"{\n",
" \"culture\": {\n",
" \"f1-score\": 0.9081761006289307,\n",
" \"precision\": 0.8848039215686274,\n",
" \"recall\": 0.9328165374677002,\n",
" \"support\": 387\n",
" },\n",
" \"diverse\": {\n",
" \"f1-score\": 0.9237983587338804,\n",
" \"precision\": 0.9471153846153846,\n",
" \"recall\": 0.9016018306636155,\n",
" \"support\": 437\n",
" },\n",
" \"economy\": {\n",
" \"f1-score\": 0.8547418967587034,\n",
" \"precision\": 0.8221709006928406,\n",
" \"recall\": 0.89,\n",
" \"support\": 400\n",
" },\n",
" \"macro avg\": {\n",
" \"f1-score\": 0.9099850933798536,\n",
" \"precision\": 0.9087524907040864,\n",
" \"recall\": 0.9125256551533433,\n",
" \"support\": 2920\n",
" },\n",
" \"micro avg\": {\n",
" \"f1-score\": 0.9277397260273973,\n",
" \"precision\": 0.9277397260273973,\n",
" \"recall\": 0.9277397260273973,\n",
" \"support\": 2920\n",
" },\n",
" \"politics\": {\n",
" \"f1-score\": 0.8734177215189873,\n",
" \"precision\": 0.8994413407821229,\n",
" \"recall\": 0.8488576449912126,\n",
" \"support\": 569\n",
" },\n",
" \"sports\": {\n",
" \"f1-score\": 0.9897913892587662,\n",
" \"precision\": 0.9902309058614565,\n",
" \"recall\": 0.9893522626441881,\n",
" \"support\": 1127\n",
" },\n",
" \"weighted avg\": {\n",
" \"f1-score\": 0.9279213601549715,\n",
" \"precision\": 0.9290922105520572,\n",
" \"recall\": 0.9277397260273973,\n",
" \"support\": 2920\n",
" }\n",
"}\n"
]
}
],
"source": [
"report = classification_report(df_test[label_col], preds, target_names=labels, output_dict=True) \n",
"accuracy = accuracy_score(df_test[label_col], preds )\n",
"print(\"accuracy: {}\".format(accuracy))\n",
"print(json.dumps(report, indent=4, sort_keys=True))"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"application/scrapbook.scrap.json+json": {
"data": 0.9277397260273973,
"encoder": "json",
"name": "accuracy",
"version": 1
}
},
"metadata": {
"scrapbook": {
"data": true,
"display": false,
"name": "accuracy"
}
},
"output_type": "display_data"
},
{
"data": {
"application/scrapbook.scrap.json+json": {
"data": 0.9087524907040864,
"encoder": "json",
"name": "precision",
"version": 1
}
},
"metadata": {
"scrapbook": {
"data": true,
"display": false,
"name": "precision"
}
},
"output_type": "display_data"
},
{
"data": {
"application/scrapbook.scrap.json+json": {
"data": 0.9125256551533433,
"encoder": "json",
"name": "recall",
"version": 1
}
},
"metadata": {
"scrapbook": {
"data": true,
"display": false,
"name": "recall"
}
},
"output_type": "display_data"
},
{
"data": {
"application/scrapbook.scrap.json+json": {
"data": 0.9099850933798536,
"encoder": "json",
"name": "f1",
"version": 1
}
},
"metadata": {
"scrapbook": {
"data": true,
"display": false,
"name": "f1"
}
},
"output_type": "display_data"
}
],
"source": [
"# for testing\n",
"sb.glue(\"accuracy\", accuracy)\n",
"sb.glue(\"precision\", report[\"macro avg\"][\"precision\"])\n",
"sb.glue(\"recall\", report[\"macro avg\"][\"recall\"])\n",
"sb.glue(\"f1\", report[\"macro avg\"][\"f1-score\"])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "nlp_gpu",
"language": "python",
"name": "nlp_gpu"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

Some file diffs are hidden because one or more lines are too long

View file

@ -2,18 +2,11 @@
# -*- encoding: utf-8 -*-
from __future__ import absolute_import
from __future__ import print_function
import io
import re
from os.path import dirname, join
from setuptools import setup
from setuptools_scm import get_version
# Determine semantic versioning automatically
# from git commits
__version__ = get_version()
from utils_nlp import VERSION, AUTHOR, TITLE, LICENSE
def read(*names, **kwargs):
@ -23,15 +16,15 @@ def read(*names, **kwargs):
setup(
name="utils_nlp",
version=__version__,
license="MIT License",
version=VERSION,
license=LICENSE,
description="NLP Utility functions that are used for best practices in building state-of-the-art NLP methods and scenarios. Developed by Microsoft AI CAT",
long_description="%s\n%s"
% (
re.compile("^.. start-badges.*^.. end-badges", re.M | re.S).sub("", read("README.md")),
re.sub(":[a-z]+:`~?(.*?)`", r"``\1``", read("CONTRIBUTING.md")),
),
author="AI CAT",
author=AUTHOR,
author_email="teamsharat@microsoft.com",
url="https://github.com/microsoft/nlp",
packages=["utils_nlp"],

View file

@ -15,8 +15,10 @@ from tempfile import TemporaryDirectory
import pytest
from tests.notebooks_common import path_notebooks
from utils_nlp.models.bert.common import Language
from utils_nlp.models.bert.common import Language as BERTLanguage
from utils_nlp.models.xlnet.common import Language as XLNetLanguage
from utils_nlp.models.bert.common import Tokenizer as BERTTokenizer
from utils_nlp.models.xlnet.common import Tokenizer as XLNetTokenizer
from utils_nlp.azureml import azureml_utils
from azureml.core.webservice import Webservice
@ -68,6 +70,12 @@ def notebooks():
folder_notebooks, "sentence_similarity", "bert_senteval.ipynb"
),
"tc_mnli_bert": os.path.join(folder_notebooks, "text_classification", "tc_mnli_bert.ipynb"),
"tc_dac_bert_ar": os.path.join(
folder_notebooks, "text_classification", "tc_dac_bert_ar.ipynb"
),
"tc_bbc_bert_hi": os.path.join(
folder_notebooks, "text_classification", "tc_bbc_bert_hi.ipynb"
),
"ner_wikigold_bert": os.path.join(
folder_notebooks, "named_entity_recognition", "ner_wikigold_bert.ipynb"
),
@ -190,7 +198,12 @@ def cluster_name(request):
@pytest.fixture()
def bert_english_tokenizer():
return BERTTokenizer(language=Language.ENGLISHCASED, to_lower=False)
return BERTTokenizer(language=BERTLanguage.ENGLISHCASED, to_lower=False)
@pytest.fixture()
def xlnet_english_tokenizer():
return XLNetTokenizer(language=XLNetLanguage.ENGLISHCASED)
@pytest.fixture(scope="module")

View file

@ -18,29 +18,79 @@ ABS_TOL = 0.1
def test_tc_mnli_bert(notebooks, tmp):
notebook_path = notebooks["tc_mnli_bert"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
kernel_name=KERNEL_NAME,
parameters=dict(NUM_GPUS=1,
DATA_FOLDER=tmp,
BERT_CACHE_DIR=tmp,
BATCH_SIZE=32,
BATCH_SIZE_PRED=512,
NUM_EPOCHS=1
)
notebook_path,
OUTPUT_NOTEBOOK,
kernel_name=KERNEL_NAME,
parameters=dict(
NUM_GPUS=1,
DATA_FOLDER=tmp,
BERT_CACHE_DIR=tmp,
BATCH_SIZE=32,
BATCH_SIZE_PRED=512,
NUM_EPOCHS=1,
),
)
result = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict
assert pytest.approx(result["accuracy"], 0.93, abs=ABS_TOL)
assert pytest.approx(result["precision"], 0.93, abs=ABS_TOL)
assert pytest.approx(result["recall"], 0.93, abs=ABS_TOL)
assert pytest.approx(result["f1"], 0.93, abs=ABS_TOL)
@pytest.mark.gpu
@pytest.mark.integration
def test_tc_dac_bert_ar(notebooks, tmp):
notebook_path = notebooks["tc_dac_bert_ar"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
kernel_name=KERNEL_NAME,
parameters=dict(
NUM_GPUS=1,
DATA_FOLDER=tmp,
BERT_CACHE_DIR=tmp,
BATCH_SIZE=32,
NUM_EPOCHS=1,
TRAIN_SIZE=0.8,
NUM_ROWS=15000,
RANDOM_STATE=0,
),
)
result = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict
assert pytest.approx(result["accuracy"], 0.93, abs=ABS_TOL)
assert pytest.approx(result["precision"], 0.91, abs=ABS_TOL)
assert pytest.approx(result["recall"], 0.91, abs=ABS_TOL)
assert pytest.approx(result["f1"], 0.91, abs=ABS_TOL)
@pytest.mark.gpu
@pytest.mark.integration
def test_tc_bbc_bert_hi(notebooks, tmp):
notebook_path = notebooks["tc_bbc_bert_hi"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
kernel_name=KERNEL_NAME,
parameters=dict(NUM_GPUS=1, DATA_FOLDER=tmp, BERT_CACHE_DIR=tmp, NUM_EPOCHS=1),
)
result = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict
assert pytest.approx(result["accuracy"], 0.71, abs=ABS_TOL)
assert pytest.approx(result["precision"], 0.25, abs=ABS_TOL)
assert pytest.approx(result["recall"], 0.28, abs=ABS_TOL)
assert pytest.approx(result["f1"], 0.26, abs=ABS_TOL)
@pytest.mark.integration
@pytest.mark.azureml
@pytest.mark.gpu
def test_tc_bert_azureml(
notebooks, subscription_id, resource_group, workspace_name, workspace_region, cluster_name, tmp
notebooks,
subscription_id,
resource_group,
workspace_name,
workspace_region,
cluster_name,
tmp,
):
notebook_path = notebooks["tc_bert_azureml"]
@ -68,7 +118,9 @@ def test_tc_bert_azureml(
with open("outputs/results.json", "r") as handle:
result_dict = json.load(handle)
assert result_dict["weighted avg"]["f1-score"] == pytest.approx(0.85, abs=ABS_TOL)
assert result_dict["weighted avg"]["f1-score"] == pytest.approx(
0.85, abs=ABS_TOL
)
if os.path.exists("outputs"):
shutil.rmtree("outputs")

View file

@ -0,0 +1,27 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
import pytest
def test_preprocess_classification_tokens(xlnet_english_tokenizer):
text = ["Hello World.",
"How you doing?",
"greatttt",
"The quick, brown fox jumps over a lazy dog.",
" DJs flock by when MTV ax quiz prog",
"Quick wafting zephyrs vex bold Jim",
"Quick, Baz, get my woven flax jodhpurs!"
]
seq_length = 5
input_ids, input_mask, segment_ids = xlnet_english_tokenizer.preprocess_classification_tokens(text, seq_length)
assert len(input_ids) == len(text)
assert len(input_mask) == len(text)
assert len(segment_ids) == len(text)
for sentence in range(len(text)):
assert len(input_ids[sentence]) == seq_length
assert len(input_mask[sentence]) == seq_length
assert len(segment_ids[sentence]) == seq_length

View file

@ -0,0 +1,44 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
import pytest
from utils_nlp.models.xlnet.common import Language
from utils_nlp.models.xlnet.sequence_classification import XLNetSequenceClassifier
@pytest.fixture()
def data():
return (
["hi", "hello", "what's wrong with us", "can I leave?"],
[0, 0, 1, 2],
["hey", "i will", "be working from", "home today"],
[2, 1, 1, 0],
)
def test_classifier(xlnet_english_tokenizer, data):
token_ids, input_mask, segment_ids = xlnet_english_tokenizer.preprocess_classification_tokens(
data[0], max_seq_length=10
)
val_data = xlnet_english_tokenizer.preprocess_classification_tokens(data[2], max_seq_length=10)
val_token_ids, val_input_mask, val_segment_ids = val_data
classifier = XLNetSequenceClassifier(language=Language.ENGLISHCASED, num_labels=3)
classifier.fit(
token_ids=token_ids,
input_mask=input_mask,
token_type_ids=segment_ids,
labels=data[1],
val_token_ids=val_token_ids,
val_input_mask=val_input_mask,
val_labels=data[3],
val_token_type_ids=val_segment_ids,
)
preds = classifier.predict(
token_ids=token_ids, input_mask=input_mask, token_type_ids=segment_ids
)
assert len(preds) == len(data[1])

View file

@ -68,11 +68,13 @@ PIP_BASE = {
"nteract-scrapbook": "nteract-scrapbook>=0.2.1",
"pydocumentdb": "pydocumentdb>=2.3.3",
"pytorch-pretrained-bert": "pytorch-pretrained-bert>=0.6",
"pytorch-transformers": "pytorch-transformers>=1.2.0",
"tqdm": "tqdm==4.31.1",
"pyemd": "pyemd==0.5.1",
"ipywebrtc": "ipywebrtc==0.4.3",
"pre-commit": "pre-commit>=1.14.4",
"scikit-learn": "scikit-learn>=0.19.0,<=0.20.3",
"seaborn": "seaborn>=0.9.0",
"setuptools_scm": "setuptools_scm==3.2.0",
"sklearn-crfsuite": "sklearn-crfsuite>=0.3.6",
"spacy": "spacy>=2.1.4",

View file

@ -45,7 +45,8 @@ The models submodule contains implementations of various algorithms that can be
A few highlights are
* BERT
* GenSen
* XLNet
### [Model Explainability](model_explainability)
The model_explainability submodule contains utils that help explain or diagnose models, such as interpreting layers of a neural network.
### [Model Explainability](interpreter)
The interpreter submodule contains utils that help explain or diagnose models, such as interpreting layers of a neural network.

View file

@ -0,0 +1,21 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
from setuptools_scm import get_version
__title__ = "Microsoft NLP"
__author__ = "AI CAT at Microsoft"
__license__ = "MIT"
__copyright__ = "Copyright 2018-present Microsoft Corporation"
# Synonyms
TITLE = __title__
AUTHOR = __author__
LICENSE = __license__
COPYRIGHT = __copyright__
# Determine semantic versioning automatically
# from git commits
__version__ = get_version()
VERSION = __version__

38
utils_nlp/dataset/dac.py Normal file
View file

@ -0,0 +1,38 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
"""Dataset for Arabic Classification utils
https://data.mendeley.com/datasets/v524p5dhpj/2
Mohamed, BINIZ (2018), DataSet for Arabic Classification, Mendeley Data, v2
paper link: ("https://www.mendeley.com/catalogue/
arabic-text-classification-using-deep-learning-technics/")
"""
import os
import pandas as pd
from utils_nlp.dataset.url_utils import extract_zip, maybe_download
URL = (
"https://data.mendeley.com/datasets/v524p5dhpj/2"
"/files/91cb8398-9451-43af-88fc-041a0956ae2d/"
"arabic_dataset_classifiction.csv.zip"
)
def load_pandas_df(local_cache_path=None, num_rows=None):
"""Downloads and extracts the dataset files
Args:
local_cache_path (str, optional): Path to a folder where the downloaded and extracted files are cached. Defaults to None.
num_rows (int): Number of rows to load. If None, all data is loaded.
Returns:
pd.DataFrame: pandas DataFrame containing the loaded dataset.
"""
zip_file = URL.split("/")[-1]
maybe_download(URL, zip_file, local_cache_path)
zip_file_path = os.path.join(local_cache_path, zip_file)
csv_file_path = os.path.join(local_cache_path, zip_file.replace(".zip", ""))
if not os.path.exists(csv_file_path):
extract_zip(file_path=zip_file_path, dest_path=local_cache_path)
return pd.read_csv(csv_file_path, nrows=num_rows)
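
A minimal usage sketch for `load_pandas_df` (the cache folder name is illustrative; create it first if the download utility does not):

```python
# Hypothetical usage sketch for utils_nlp.dataset.dac.load_pandas_df.
import os
from utils_nlp.dataset.dac import load_pandas_df

cache_dir = "./temp"
os.makedirs(cache_dir, exist_ok=True)  # illustrative cache folder
df = load_pandas_df(local_cache_path=cache_dir, num_rows=1000)
print(df.shape)
print(df.head())
```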

View file

@ -23,6 +23,7 @@ DATASET_DICT = {
def download_msrpc(download_dir):
"""Downloads Windows Installer for Microsoft Paraphrase Corpus.
Args:
download_dir (str): File path for the downloaded file

View file

@ -3,9 +3,18 @@
"""Utilities functions for computing general model evaluation metrics."""
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import (
accuracy_score,
precision_score,
recall_score,
f1_score,
confusion_matrix,
)
from numpy import corrcoef
from matplotlib import pyplot
import seaborn as sn
import numpy as np
import pandas as pd
@ -44,3 +53,36 @@ def compute_correlation_coefficients(x, y=None):
pd.DataFrame : A pandas dataframe from the correlation coefficient matrix of the variables.
"""
return pd.DataFrame(corrcoef(x, y))
def plot_confusion_matrix(
y_true,
y_pred,
labels,
normalize=False,
title="Confusion matrix",
plot_size=(8, 5),
font_scale=1.1,
):
"""Function that prints out a graphical representation of confusion matrix using Seaborn Heatmap
Args:
y_true (1d array-like): True labels from dataset
y_pred (1d array-like): Predicted labels from the models
labels: A list of labels
normalize (Bool, optional): Boolean to Set Row Normalization for Confusion Matrix
title (String, optional): String that is the title of the plot
plot_size (tuple, optional): Tuple of Plot Dimensions Default "(8, 5)"
font_scale (float, optional): float type scale factor for font within plot
"""
conf_matrix = np.array(confusion_matrix(y_true, y_pred))
if normalize:
conf_matrix = np.round(
conf_matrix.astype("float") / conf_matrix.sum(axis=1)[:, np.newaxis], 3
)
conf_dataframe = pd.DataFrame(conf_matrix, labels, labels)
fig, ax = pyplot.subplots(figsize=plot_size)
sn.set(font_scale=font_scale)
ax.set_title(title)
ax = sn.heatmap(conf_dataframe, cmap="Blues", annot=True, annot_kws={"size": 16}, fmt="g")
ax.set(xlabel="Predicted Labels", ylabel="True Labels")
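
A short usage sketch for the helper above (the `utils_nlp.eval.classification` import path is an assumption based on this diff; labels and predictions are illustrative):

```python
# Hypothetical usage sketch: plot a row-normalized confusion matrix.
from utils_nlp.eval.classification import plot_confusion_matrix  # assumed module path

y_true = ["sports", "economy", "sports", "politics", "economy"]
y_pred = ["sports", "politics", "sports", "politics", "economy"]
plot_confusion_matrix(
    y_true,
    y_pred,
    labels=["economy", "politics", "sports"],  # sorted to match sklearn's confusion_matrix order
    normalize=True,
    title="Confusion matrix (normalized)",
)
```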

View file

@ -7,7 +7,8 @@ The following table summarizes each submodule.
|Submodule|Description|
|---|---|
|[bert](./bert/README.md)| This submodule includes the BERT-based models for sequence classification, token classification, and sequence necoding.|
|[bert](./bert/README.md)| This submodule includes the BERT-based models for sequence classification, token classification, and sequence encoding.|
|[gensen](./gensen/README.md)| This submodule includes a distributed Pytorch implementation based on [Horovod](https://github.com/horovod/horovod) of [learning general purpose distributed sentence representations via large scale multi-task learning](https://arxiv.org/abs/1804.00079) by refactoring https://github.com/Maluuba/gensen|
|[pretrained embeddings](./pretrained_embeddings) | This submodule provides utilities to download and extract pretrained word embeddings trained with Word2Vec, GloVe, fastText methods.|
|[pytorch_modules](./pytorch_modules/README.md)| This submodule provides Pytorch modules like Gated Recurrent Unit with peepholes. |
|[xlnet](./xlnet/README.md)| This submodule includes the XLNet-based model for sequence classification.|

View file

@ -0,0 +1,13 @@
# XLNet-based Classes
This folder contains utility functions and classes based on the implementation of [PyTorch-Transformers](https://github.com/huggingface/pytorch-transformers).
## Summary
The following table summarizes each Python script.
|Script|Description|
|---|---|
|[common.py](common.py)| This script includes <ul><li>the languages supported by XLNet-based classes</li><li> tokenization for text classification</li> <li>utilities to load data, etc.</li></ul>|
|[sequence_classification.py](sequence_classification.py)| An implementation of sequence classification based on fine-tuning XLNet. It is commonly used for text classification. The module includes logging functionality using MLflow.|
|[utils.py](utils.py)| This script includes a function to visualize a confusion matrix.|
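
A minimal usage sketch, mirroring the unit test added in this PR (the pretrained XLNet weights are downloaded on first use; the tiny inputs are illustrative):

```python
# Sketch: tokenize a few texts and fine-tune/predict with the XLNet classifier.
from utils_nlp.models.xlnet.common import Language, Tokenizer
from utils_nlp.models.xlnet.sequence_classification import XLNetSequenceClassifier

texts = ["hi", "hello", "what's wrong with us", "can I leave?"]
labels = [0, 0, 1, 2]

tokenizer = Tokenizer(language=Language.ENGLISHCASED)
token_ids, input_mask, segment_ids = tokenizer.preprocess_classification_tokens(
    texts, max_seq_length=10
)

classifier = XLNetSequenceClassifier(language=Language.ENGLISHCASED, num_labels=3)
classifier.fit(
    token_ids=token_ids,
    input_mask=input_mask,
    token_type_ids=segment_ids,
    labels=labels,
    val_token_ids=token_ids,        # toy example: validate on the training data
    val_input_mask=input_mask,
    val_labels=labels,
    val_token_type_ids=segment_ids,
)
preds = classifier.predict(
    token_ids=token_ids, input_mask=input_mask, token_type_ids=segment_ids
)
print(preds)
```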

View file

@ -0,0 +1,124 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
# This script reuses some code from
# https://github.com/huggingface/pytorch-transformers/blob/master/examples/utils_glue.py
from enum import Enum
from pytorch_transformers import XLNetTokenizer
from mlflow import log_metric, log_param, log_artifact
class Language(Enum):
"""
An enumeration of the supported pretrained models and languages.
"""
ENGLISHCASED = "xlnet-base-cased" #: Base cased model for xlnet
ENGLISHLARGECASED = "xlnet-large-cased" #: Large cased model for xlnet
class Tokenizer:
def __init__(
self, language=Language.ENGLISHCASED, cache_dir="."
):
"""Initializes the underlying pretrained XLNet tokenizer.
Args:
language (Language, optional): The pretrained model's language.
Defaults to Language.ENGLISHCASED
"""
self.tokenizer = XLNetTokenizer.from_pretrained(language.value, cache_dir=cache_dir)
self.language = language
def preprocess_classification_tokens(self, examples, max_seq_length):
"""Preprocessing of example input tokens:
- add XLNet sentence markers ([CLS] and [SEP])
- pad and truncate sequences
- create an input_mask
- create token type ids, aka. segment ids
Args:
examples (list): List of input strings to preprocess.
max_seq_length (int): Maximum number of tokens
(documents will be truncated or padded to this length).
Returns:
(tuple): A tuple containing:
list of input ids
list of input mask
list of segment ids
"""
features = []
cls_token = self.tokenizer.cls_token
sep_token = self.tokenizer.sep_token
cls_token_segment_id=2
pad_on_left=True
pad_token_segment_id=4
sequence_a_segment_id=0
cls_token_at_end=True
mask_padding_with_zero=True
pad_token=0
list_input_ids = []
list_input_mask = []
list_segment_ids = []
for (ex_index, example) in enumerate(examples):
tokens_a = self.tokenizer.tokenize(example)
if len(tokens_a) > max_seq_length - 2:
tokens_a = tokens_a[:(max_seq_length - 2)]
tokens = tokens_a + [sep_token]
segment_ids = [sequence_a_segment_id] * len(tokens)
if cls_token_at_end:
tokens = tokens + [cls_token]
segment_ids = segment_ids + [cls_token_segment_id]
else:
tokens = [cls_token] + tokens
segment_ids = [cls_token_segment_id] + segment_ids
input_ids = self.tokenizer.convert_tokens_to_ids(tokens)
# The mask has 1 for real tokens and 0 for padding tokens. Only real
# tokens are attended to.
input_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)
# Zero-pad up to the sequence length.
padding_length = max_seq_length - len(input_ids)
if pad_on_left:
input_ids = ([pad_token] * padding_length) + input_ids
input_mask = ([0 if mask_padding_with_zero else 1] * padding_length) + input_mask
segment_ids = ([pad_token_segment_id] * padding_length) + segment_ids
else:
input_ids = input_ids + ([pad_token] * padding_length)
input_mask = input_mask + ([0 if mask_padding_with_zero else 1] * padding_length)
segment_ids = segment_ids + ([pad_token_segment_id] * padding_length)
assert len(input_ids) == max_seq_length
assert len(input_mask) == max_seq_length
assert len(segment_ids) == max_seq_length
list_input_ids.append(input_ids)
list_input_mask.append(input_mask)
list_segment_ids.append(segment_ids)
# features.append({"input_ids":input_ids,"input_mask":input_mask,"segment_ids":segment_ids,"label_id":label_id})
return (list_input_ids, list_input_mask, list_segment_ids)
def log_xlnet_params(local_dict):
"""wrapper that abstracts away logging of ipython notebook local training parameters described at definition
Args:
local_dict(dict): dict containing all local varaibles from notebook
"""
params = ["DATA_FOLDER","XLNET_CACHE_DIR","LANGUAGE","MAX_SEQ_LENGTH","BATCH_SIZE","NUM_GPUS",
"NUM_EPOCHS","TRAIN_SIZE","LABEL_COL","TEXT_COL","LEARNING_RATE","WEIGHT_DECAY",
"ADAM_EPSILON","WARMUP_STEPS","DEBUG"]
for i in params:
log_param(i,local_dict[i])
return

View file

@ -0,0 +1,371 @@
import numpy as np
from collections import namedtuple
import torch
import torch.nn as nn
from pytorch_transformers import (
XLNetConfig,
XLNetForSequenceClassification,
AdamW,
WarmupLinearSchedule
)
from tqdm import tqdm
from torch.utils.data import (
DataLoader,
RandomSampler,
TensorDataset,
)
from utils_nlp.common.pytorch_utils import get_device, move_to_device
from utils_nlp.models.xlnet.common import Language
import mlflow
import mlflow.pytorch
import os
class XLNetSequenceClassifier:
"""XLNet-based sequence classifier"""
def __init__(
self,
language=Language.ENGLISHCASED,
num_labels=5,
cache_dir=".",
num_gpus=None,
num_epochs=1,
batch_size=8,
lr=5e-5,
adam_eps=1e-8,
warmup_steps=0,
weight_decay=0.0,
max_grad_norm=1.0,
):
"""Initializes the classifier and the underlying pretrained model.
Args:
language (Language, optional): The pretrained model's language.
Defaults to 'xlnet-base-cased'.
num_labels (int, optional): The number of unique labels in the
training data. Defaults to 5.
cache_dir (str, optional): Location of XLNet's cache directory.
Defaults to ".".
num_gpus (int, optional): The number of gpus to use.
If None is specified, all available GPUs
will be used. Defaults to None.
num_epochs (int, optional): Number of training epochs.
Defaults to 1.
batch_size (int, optional): Training batch size. Defaults to 8.
lr (float): Learning rate of the Adam optimizer. Defaults to 5e-5.
adam_eps (float, optional): term added to the denominator to improve
numerical stability. Defaults to 1e-8.
warmup_steps (int, optional): Number of steps over which to increase the
learning rate linearly from 0 to the specified learning rate. Defaults to 0.
weight_decay (float, optional): Weight decay. Defaults to 0.
max_grad_norm (float, optional): Maximum norm for the gradients. Defaults to 1.0
"""
if num_labels < 2:
raise ValueError("Number of labels should be at least 2.")
self.language = language
self.num_labels = num_labels
self.cache_dir = cache_dir
self.num_gpus = num_gpus
self.num_epochs = num_epochs
self.batch_size = batch_size
self.lr = lr
self.adam_eps = adam_eps
self.warmup_steps = warmup_steps
self.weight_decay = weight_decay
self.max_grad_norm = max_grad_norm
# create classifier
self.config = XLNetConfig.from_pretrained(
self.language.value, num_labels=num_labels, cache_dir=cache_dir
)
self.model = XLNetForSequenceClassification(self.config)
def fit(
self,
token_ids,
input_mask,
labels,
val_token_ids,
val_input_mask,
val_labels,
token_type_ids=None,
val_token_type_ids=None,
verbose=True,
logging_steps=0,
save_steps=0,
val_steps=0,
):
"""Fine-tunes the XLNet classifier using the given training data.
Args:
token_ids (list): List of training token id lists.
input_mask (list): List of input mask lists.
labels (list): List of training labels.
val_token_ids (list): List of validation token id lists.
val_input_mask (list): List of validation input mask lists.
val_labels (list): List of validation labels.
token_type_ids (list, optional): List of lists. Each sublist
contains segment ids indicating if the token belongs to
the first sentence (0) or the second sentence (1). Only needed
for two-sentence tasks.
val_token_type_ids (list, optional): Validation counterpart of
token_type_ids. Only needed for two-sentence tasks.
verbose (bool, optional): If True, shows the training progress and
loss values. Defaults to True.
"""
device = get_device("cpu" if self.num_gpus == 0 or not torch.cuda.is_available() else "gpu")
self.model = move_to_device(self.model, device, self.num_gpus)
token_ids_tensor = torch.tensor(token_ids, dtype=torch.long)
input_mask_tensor = torch.tensor(input_mask, dtype=torch.long)
labels_tensor = torch.tensor(labels, dtype=torch.long)
val_token_ids_tensor = torch.tensor(val_token_ids, dtype=torch.long)
val_input_mask_tensor = torch.tensor(val_input_mask, dtype=torch.long)
val_labels_tensor = torch.tensor(val_labels, dtype=torch.long)
if token_type_ids:
token_type_ids_tensor = torch.tensor(token_type_ids, dtype=torch.long)
val_token_type_ids_tensor = torch.tensor(val_token_type_ids, dtype=torch.long)
train_dataset = TensorDataset(
token_ids_tensor, input_mask_tensor, token_type_ids_tensor, labels_tensor
)
val_dataset = TensorDataset(
val_token_ids_tensor,
val_input_mask_tensor,
val_token_type_ids_tensor,
val_labels_tensor,
)
else:
train_dataset = TensorDataset(token_ids_tensor, input_mask_tensor, labels_tensor)
val_dataset = TensorDataset(
val_token_ids_tensor, val_input_mask_tensor, val_labels_tensor
)
# define optimizer and model parameters
param_optimizer = list(self.model.named_parameters())
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
{
"params": [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
"weight_decay": self.weight_decay,
},
{
"params": [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
"weight_decay": 0.0,
},
]
val_sampler = RandomSampler(val_dataset)
val_dataloader = DataLoader(
val_dataset,
sampler=val_sampler,
batch_size=self.batch_size
)
num_examples = len(token_ids)
num_batches = int(np.ceil(num_examples/self.batch_size))
num_train_optimization_steps = num_batches * self.num_epochs
optimizer = AdamW(optimizer_grouped_parameters, lr=self.lr, eps=self.adam_eps)
scheduler = WarmupLinearSchedule(
optimizer, warmup_steps=self.warmup_steps, t_total=num_train_optimization_steps
)
global_step = 0
self.model.train()
optimizer.zero_grad()
for epoch in range(self.num_epochs):
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(
train_dataset, sampler=train_sampler, batch_size=self.batch_size
)
tr_loss = 0.0
logging_loss = 0.0
val_loss = 0.0
for i, batch in enumerate(tqdm(train_dataloader, desc="Iteration")):
if token_type_ids:
x_batch, mask_batch, token_type_ids_batch, y_batch = tuple(
t.to(device) for t in batch
)
else:
token_type_ids_batch = None
x_batch, mask_batch, y_batch = tuple(t.to(device) for t in batch)
outputs = self.model(
input_ids=x_batch,
token_type_ids=token_type_ids_batch,
attention_mask=mask_batch,
labels=y_batch,
)
loss = outputs[0] # model outputs are always tuple in pytorch-transformers
loss.sum().backward()
torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.max_grad_norm)
tr_loss += loss.sum().item()
optimizer.step()
# Update learning rate schedule
scheduler.step()
optimizer.zero_grad()
global_step += 1
# logging of learning rate and loss
if logging_steps > 0 and global_step % logging_steps == 0:
mlflow.log_metric("learning rate", scheduler.get_lr()[0], step=global_step)
mlflow.log_metric(
"training loss",
(tr_loss - logging_loss) / (logging_steps * self.batch_size),
step=global_step,
)
logging_loss = tr_loss
# model checkpointing
if save_steps > 0 and global_step % save_steps == 0:
checkpoint_dir = os.path.join(os.getcwd(), "checkpoints")
if not os.path.isdir(checkpoint_dir):
os.makedirs(checkpoint_dir)
checkpoint_path = checkpoint_dir + "/" + str(global_step) + ".pth"
torch.save(self.model.state_dict(), checkpoint_path)
mlflow.log_artifact(checkpoint_path)
# model validation
if val_steps > 0 and global_step % val_steps == 0:
# run model on validation set
self.model.eval()
val_loss = 0.0
for j, val_batch in enumerate(val_dataloader):
if token_type_ids:
val_x_batch, val_mask_batch, val_token_type_ids_batch, \
val_y_batch = tuple(
t.to(device) for t in val_batch
)
                        else:
                            val_token_type_ids_batch = None
                            val_x_batch, val_mask_batch, val_y_batch = tuple(
                                t.to(device) for t in val_batch
                            )
val_outputs = self.model(
input_ids=val_x_batch,
token_type_ids=val_token_type_ids_batch,
attention_mask=val_mask_batch,
labels=val_y_batch,
)
vloss = val_outputs[0]
val_loss += vloss.sum().item()
mlflow.log_metric(
"validation loss", val_loss / len(val_dataset), step=global_step
)
self.model.train()
if verbose:
if i % ((num_batches // 10) + 1) == 0:
if val_loss > 0:
print(
"epoch:{}/{}; batch:{}->{}/{}; average training loss:{:.6f};\
average val loss:{:.6f}".format(
epoch + 1,
self.num_epochs,
i + 1,
min(i + 1 + num_batches // 10, num_batches),
num_batches,
tr_loss / (i + 1),
val_loss / (j + 1),
),
)
else:
print(
"epoch:{}/{}; batch:{}->{}/{}; average train loss:{:.6f}".format(
epoch + 1,
self.num_epochs,
i + 1,
min(i + 1 + num_batches // 10, num_batches),
num_batches,
tr_loss / (i + 1),
)
)
checkpoint_dir = os.path.join(os.getcwd(), "checkpoints")
if not os.path.isdir(checkpoint_dir):
os.makedirs(checkpoint_dir)
checkpoint_path = checkpoint_dir + "/" + "final" + ".pth"
torch.save(self.model.state_dict(), checkpoint_path)
mlflow.log_artifact(checkpoint_path)
# empty cache
del [x_batch, y_batch, mask_batch, token_type_ids_batch]
if val_steps > 0:
del [val_x_batch, val_y_batch, val_mask_batch, val_token_type_ids_batch]
torch.cuda.empty_cache()
def predict(
self,
token_ids,
input_mask,
token_type_ids=None,
num_gpus=None,
batch_size=8,
probabilities=False,
):
"""Scores the given dataset and returns the predicted classes.
Args:
            token_ids (list): List of token id lists for the data to score.
input_mask (list): List of input mask lists.
token_type_ids (list, optional): List of lists. Each sublist
contains segment ids indicating if the token belongs to
the first sentence(0) or second sentence(1). Only needed
for two-sentence tasks.
num_gpus (int, optional): The number of gpus to use.
If None is specified, all available GPUs
will be used. Defaults to None.
batch_size (int, optional): Scoring batch size. Defaults to 8.
probabilities (bool, optional):
If True, the predicted probability distribution
is also returned. Defaults to False.
Returns:
1darray, namedtuple(1darray, ndarray): Predicted classes or
(classes, probabilities) if probabilities is True.
"""
device = get_device("cpu" if num_gpus == 0 or not torch.cuda.is_available() else "gpu")
self.model = move_to_device(self.model, device, num_gpus)
self.model.eval()
preds = []
with tqdm(total=len(token_ids)) as pbar:
for i in range(0, len(token_ids), batch_size):
start = i
end = start + batch_size
x_batch = torch.tensor(token_ids[start:end], dtype=torch.long, device=device)
mask_batch = torch.tensor(input_mask[start:end], dtype=torch.long, device=device)
                if token_type_ids is not None:
                    token_type_ids_batch = torch.tensor(
                        token_type_ids[start:end], dtype=torch.long, device=device
                    )
                else:
                    token_type_ids_batch = None
with torch.no_grad():
pred_batch = self.model(
input_ids=x_batch,
token_type_ids=token_type_ids_batch,
attention_mask=mask_batch,
labels=None,
)
preds.append(pred_batch[0].cpu())
                pbar.update(min(batch_size, len(token_ids) - i))
preds = np.concatenate(preds)
if probabilities:
return namedtuple("Predictions", "classes probabilities")(
preds.argmax(axis=1), nn.Softmax(dim=1)(torch.Tensor(preds)).numpy()
)
else:
return preds.argmax(axis=1)
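# ---------------------------------------------------------------------------
# Minimal usage sketch (illustrative only; the toy token ids, masks, and
# labels below are hypothetical placeholders rather than real XLNet tokenizer
# output). In practice the inputs come from the tokenization utilities in
# utils_nlp.models.xlnet.common.
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    # two toy "documents", already padded to a fixed length of 5 token ids;
    # a mask value of 1 marks a real token, 0 marks padding
    toy_token_ids = [[35, 64, 27, 0, 0], [12, 5, 89, 43, 0]]
    toy_input_mask = [[1, 1, 1, 0, 0], [1, 1, 1, 1, 0]]
    toy_labels = [0, 1]

    classifier = XLNetSequenceClassifier(
        language=Language.ENGLISHCASED,  # downloads xlnet-base-cased on first use
        num_labels=2,
        num_epochs=1,
        batch_size=2,
    )
    # the toy data doubles as the validation set purely for illustration
    classifier.fit(
        token_ids=toy_token_ids,
        input_mask=toy_input_mask,
        labels=toy_labels,
        val_token_ids=toy_token_ids,
        val_input_mask=toy_input_mask,
        val_labels=toy_labels,
    )
    predictions = classifier.predict(toy_token_ids, toy_input_mask, batch_size=2)
    print(predictions)  # 1-d array with one predicted class per example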