Staging to master to add the latest fixes (#503)

* update mlflow version to match the other azureml versions

* Update generate_conda_file.py

* added temporary

* doc: update github url references

* docs: update nlp recipes references

* Minor bug fix for the multi-language text classification notebook

* remove bert and xlnet notebooks

* remove obsolete tests and links

* Add missing tmp directories.

* fix import error and max_nodes for the cluster

* Minor edits.

* Attempt to fix test device error.

* Temporarily pin transformers version

* Remove gpu tags temporarily

* Test whether device error also occurs for SequenceClassifier.

* Revert temporary changes.

* Revert temporary changes.
Miguel González-Fierro 2019-11-30 00:53:57 +00:00 committed by Said Bleik
Parent 967abcd816
Commit ed04438ae1
22 changed files with 125 additions and 3199 deletions

View file

@ -85,6 +85,8 @@ The following is a list of related repositories that we like and think are usefu
|[AzureML-BERT](https://github.com/Microsoft/AzureML-BERT)|End-to-end recipes for pre-training and fine-tuning BERT using Azure Machine Learning service.|
|[MASS](https://github.com/microsoft/MASS)|MASS: Masked Sequence to Sequence Pre-training for Language Generation.|
|[MT-DNN](https://github.com/namisan/mt-dnn)|Multi-Task Deep Neural Networks for Natural Language Understanding.|
|[UniLM](https://github.com/microsoft/unilm)|Unified Language Model Pre-training.|
## Build Status

View file

@ -47,9 +47,9 @@ You can learn how to create a Notebook VM [here](https://docs.microsoft.com/en-u
We provide a script, [generate_conda_file.py](tools/generate_conda_file.py), to generate a conda-environment yaml file
which you can use to create the target environment using Python 3.6 with all the correct dependencies.
Assuming the repo is cloned as `nlp` in the system, to install **a default (Python CPU) environment**:
Assuming the repo is cloned as `nlp-recipes` in the system, to install **a default (Python CPU) environment**:
cd nlp
cd nlp-recipes
python tools/generate_conda_file.py
conda env create -f nlp_cpu.yaml
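Once the environment is created, it still needs to be activated and registered as a Jupyter kernel. The commands below mirror the snippet that generate_conda_file.py prints on completion (see the tools docstring later in this diff), shown here for the default nlp_cpu environment:

conda activate nlp_cpu
python -m ipykernel install --user --name nlp_cpu --display-name "Python (nlp_cpu)"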
@ -62,7 +62,7 @@ Click on the following menus to see how to install the Python GPU environment:
Assuming that you have a GPU machine, to install the Python GPU environment, which by default installs the CPU environment:
cd nlp
cd nlp-recipes
python tools/generate_conda_file.py --gpu
conda env create -n nlp_gpu -f nlp_gpu.yaml
@ -79,7 +79,7 @@ Assuming that you have an Azure GPU DSVM machine, here are the steps to setup th
2. Install the GPU environment.
cd nlp
cd nlp-recipes
python tools/generate_conda_file.py --gpu
conda env create -n nlp_gpu -f nlp_gpu.yaml
@ -110,7 +110,7 @@ Running the command tells pip to install the `utils_nlp` package from source in
> It is also possible to install directly from Github, which is the best way to utilize the `utils_nlp` package in external projects (while still reflecting updates to the source as it's installed as an editable `'-e'` package).
> `pip install -e git+git@github.com:microsoft/nlp.git@master#egg=utils_nlp`
> `pip install -e git+git@github.com:microsoft/nlp-recipes.git@master#egg=utils_nlp`
Either command from above makes `utils_nlp` available in your conda virtual environment. You can verify it was properly installed by running:
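The verification command itself falls outside this hunk. A quick equivalent check from Python (an illustration, not necessarily the command the README goes on to show):

import utils_nlp
print(utils_nlp.__file__)  # for an editable install, this path resolves inside the cloned repo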

View file

@ -34,7 +34,7 @@ version = ".".join(VERSION.split(".")[:2])
# The full version, including alpha/beta/rc tags
release = VERSION
prefix = "NLP"
prefix = "NLPRecipes"
# -- General configuration ---------------------------------------------------

View file

@ -2,9 +2,9 @@
NLP Utilities
===================================================
The `NLP repository <https://github.com/Microsoft/NLP>`_ provides examples and best practices for building NLP systems, provided as Jupyter notebooks.
The `NLP repository <https://github.com/microsoft/nlp-recipes>`_ provides examples and best practices for building NLP systems, provided as Jupyter notebooks.
The module `utils_nlp <https://github.com/microsoft/nlp/tree/master/utils_nlp>`_ contains functions to simplify common tasks used when developing and
The module `utils_nlp <https://github.com/microsoft/nlp-recipes/tree/master/utils_nlp>`_ contains functions to simplify common tasks used when developing and
evaluating NLP systems.
.. toctree::

View file

@ -45,7 +45,7 @@
"from azureml.core.runconfig import MpiConfiguration\n",
"from azureml.core import Experiment\n",
"from azureml.widgets import RunDetails\n",
"from azureml.core.compute import ComputeTarget\n",
"from azureml.core.compute import ComputeTarget, AmlCompute\n",
"from azureml.exceptions import ComputeTargetException\n",
"from utils_nlp.azureml.azureml_utils import get_or_create_workspace, get_output_files"
]
@ -169,7 +169,7 @@
"except ComputeTargetException:\n",
" print(\"Creating new compute target: {}\".format(cluster_name))\n",
" compute_config = AmlCompute.provisioning_configuration(\n",
" vm_size=\"STANDARD_NC6\", max_nodes=1\n",
" vm_size=\"STANDARD_NC6\", max_nodes=NODE_COUNT\n",
" )\n",
" compute_target = ComputeTarget.create(ws, cluster_name, compute_config)\n",
" compute_target.wait_for_completion(show_output=True)\n",
@ -524,9 +524,9 @@
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python (nlp_gpu_transformer_bug_bash)",
"language": "python",
"name": "python3"
"name": "nlp_gpu_transformer_bug_bash"
},
"language_info": {
"codemirror_mode": {

View file

@ -175,7 +175,7 @@
"metadata": {},
"source": [
"This step downloads the pre-trained [AllenNLP](https://allennlp.org/models) pretrained model and registers the model in our Workspace. The pre-trained AllenNLP model we use is called Bidirectional Attention Flow for Machine Comprehension ([BiDAF](https://www.semanticscholar.org/paper/Bidirectional-Attention-Flow-for-Machine-Seo-Kembhavi/007ab5528b3bd310a80d553cccad4b78dc496b02\n",
")) It achieved state-of-the-art performance on the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset in 2017 and is a well-respected, performant baseline for QA. AllenNLP's pre-trained BIDAF model is trained on the SQuAD training set and achieves an EM score of 68.3 on the SQuAD development set. See the [BIDAF deep dive notebook](https://github.com/microsoft/nlp/examples/question_answering/bidaf_deep_dive.ipynb\n",
")) It achieved state-of-the-art performance on the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset in 2017 and is a well-respected, performant baseline for QA. AllenNLP's pre-trained BIDAF model is trained on the SQuAD training set and achieves an EM score of 68.3 on the SQuAD development set. See the [BIDAF deep dive notebook](https://github.com/microsoft/nlp-recipes/examples/question_answering/bidaf_deep_dive.ipynb\n",
") for more information on this algorithm and AllenNLP implementation."
]
},

View file

@ -19,8 +19,5 @@ The following summarizes each notebook for Text Classification. Each notebook pr
|Notebook|Environment|Description|Dataset|
|---|---|---|---|
|[BERT for text classification on AzureML](tc_bert_azureml.ipynb) |Azure ML|A notebook which walks through fine-tuning and evaluating pre-trained BERT model on a distributed setup with AzureML. |[MultiNLI](https://www.nyu.edu/projects/bowman/multinli/)|
|[XLNet for text classification with MNLI](tc_mnli_xlnet.ipynb)|Local| A notebook which walks through fine-tuning and evaluating a pre-trained XLNet model on a subset of the MultiNLI dataset|[MultiNLI](https://www.nyu.edu/projects/bowman/multinli/)|
|[BERT for text classification of Hindi BBC News](tc_bbc_bert_hi.ipynb)|Local| A notebook which walks through fine-tuning and evaluating a pre-trained BERT model on Hindi BBC news data|[BBC Hindi News](https://github.com/NirantK/hindi2vec/releases/tag/bbc-hindi-v0.1)|
|[BERT for text classification of Arabic News](tc_dac_bert_ar.ipynb)|Local| A notebook which walks through fine-tuning and evaluating a pre-trained BERT model on Arabic news articles|[DAC](https://data.mendeley.com/datasets/v524p5dhpj/2)|
|[Text Classification of MultiNLI Sentences using Multiple Transformer Models](tc_mnli_transformers.ipynb)|Local| A notebook which walks through fine-tuning and evaluating a number of pre-trained transformer models|[MultiNLI](https://www.nyu.edu/projects/bowman/multinli/)|
|[Text Classification of Multi Language Datasets using Transformer Model](tc_multi_languages_transformers.ipynb)|Local|A notebook which walks through fine-tuning and evaluating a pre-trained transformer model for multiple datasets in different languages|[MultiNLI](https://www.nyu.edu/projects/bowman/multinli/) <br> [BBC Hindi News](https://github.com/NirantK/hindi2vec/releases/tag/bbc-hindi-v0.1) <br> [DAC](https://data.mendeley.com/datasets/v524p5dhpj/2)|

File diff suppressed because it is too large. Load diff

View file

@ -1,821 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Copyright (c) Microsoft Corporation. All rights reserved.*\n",
"\n",
"*Licensed under the MIT License.*\n",
"\n",
"# Classification of Arabic News Articles using BERT"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import os\n",
"import sys\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"import scrapbook as sb\n",
"import torch\n",
"import torch.nn as nn\n",
"from sklearn.metrics import accuracy_score, classification_report\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"sys.path.append(\"../../\")\n",
"from utils_nlp.common.timer import Timer\n",
"from utils_nlp.dataset.dac import load_pandas_df\n",
"from utils_nlp.models.bert.common import Language, Tokenizer\n",
"from utils_nlp.models.bert.sequence_classification import BERTSequenceClassifier"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\n",
"In this notebook, we fine-tune and evaluate a pretrained [BERT](https://arxiv.org/abs/1810.04805) model on an Arabic dataset of news articles. The [dataset](https://data.mendeley.com/datasets/v524p5dhpj/2) includes articles from 3 different newspapers, and the articles are categorized into 5 classes: *sports, politics, culture, economy and diverse*. The data is described in more detail in this [paper](http://article.nadiapub.com/IJGDC/vol11_no9/9.pdf).\n",
"\n",
"We use a [sequence classifier](../../utils_nlp/bert/sequence_classification.py) that wraps [Hugging Face's PyTorch implementation](https://github.com/huggingface/pytorch-pretrained-BERT) of Google's [BERT](https://github.com/google-research/bert). The classifier loads a pretrained [multilingual BERT model](https://github.com/google-research/bert/blob/master/multilingual.md) that was trained on 104 languages, including Arabic."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"tags": [
"parameters"
]
},
"outputs": [],
"source": [
"DATA_FOLDER = \"./temp\"\n",
"BERT_CACHE_DIR = \"./temp\"\n",
"LANGUAGE = Language.MULTILINGUAL\n",
"MAX_LEN = 200\n",
"BATCH_SIZE = 32\n",
"NUM_GPUS = 2\n",
"NUM_EPOCHS = 1\n",
"TRAIN_SIZE = 0.8\n",
"NUM_ROWS = 15000\n",
"RANDOM_STATE = 0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read Dataset\n",
"We start by loading the data. The following line also downloads the file if it doesn't exist, and extracts the csv file into the specified data folder. We retain a subset, of size *NUM_ROWS*, of the data for quicker model training."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"df = load_pandas_df(DATA_FOLDER).sample(NUM_ROWS, random_state=RANDOM_STATE)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>text</th>\n",
" <th>targe</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>80414</th>\n",
" <td>فاز فريق الدفاع الحسني الجديدي على مضيفه الكوك...</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6649</th>\n",
" <td>أمام آلاف مشاهد من لبنان ومصر والمغرب والإمارا...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3722</th>\n",
" <td>أخبارنا المغربية بعد أن أصدرت المحكمة الإبتداي...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>82317</th>\n",
" <td>الفريق طبق قانونا قبل المصادقة عليه وجدل حول ه...</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5219</th>\n",
" <td>المطرب المصري يخوض حملة إعلامية لترويج ألبومه ...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" text targe\n",
"80414 فاز فريق الدفاع الحسني الجديدي على مضيفه الكوك... 4\n",
"6649 أمام آلاف مشاهد من لبنان ومصر والمغرب والإمارا... 0\n",
"3722 أخبارنا المغربية بعد أن أصدرت المحكمة الإبتداي... 0\n",
"82317 الفريق طبق قانونا قبل المصادقة عليه وجدل حول ه... 4\n",
"5219 المطرب المصري يخوض حملة إعلامية لترويج ألبومه ... 0"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# set the text and label columns\n",
"text_col = df.columns[0]\n",
"label_col = df.columns[1]"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# remove empty documents\n",
"df = df[df[text_col].isna() == False]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Inspect the distribution of labels:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4 5844\n",
"3 2796\n",
"1 2139\n",
"0 1917\n",
"2 1900\n",
"Name: targe, dtype: int64"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[label_col].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We compare the counts with those presented in the author's [paper](http://article.nadiapub.com/IJGDC/vol11_no9/9.pdf), and infer the following label mapping:\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>label</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>culture</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>diverse</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>economy</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>politics</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>sports</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" label\n",
"0 culture\n",
"1 diverse\n",
"2 economy\n",
"3 politics\n",
"4 sports"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# ordered list of labels\n",
"labels = [\"culture\", \"diverse\", \"economy\", \"politics\", \"sports\"]\n",
"num_labels = len(labels)\n",
"pd.DataFrame({\"label\": labels})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we split the data for training and testing:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of training examples: 11676\n",
"Number of testing examples: 2920\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/media/bleik2/miniconda3/envs/nlp_gpu/lib/python3.6/site-packages/sklearn/model_selection/_split.py:2179: FutureWarning: From version 0.21, test_size will always complement train_size unless both are specified.\n",
" FutureWarning)\n"
]
}
],
"source": [
"df_train, df_test = train_test_split(df, train_size = TRAIN_SIZE, random_state=RANDOM_STATE)\n",
"print(\"Number of training examples: {}\".format(df_train.shape[0]))\n",
"print(\"Number of testing examples: {}\".format(df_test.shape[0]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tokenize and Preprocess"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before training, we tokenize the text documents and convert them to lists of tokens. The following steps instantiate a BERT tokenizer given the language, and tokenize the text of the training and testing sets."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 11676/11676 [00:59<00:00, 196.42it/s]\n",
"100%|██████████| 2920/2920 [00:14<00:00, 197.99it/s]\n"
]
}
],
"source": [
"tokenizer = Tokenizer(LANGUAGE, cache_dir=BERT_CACHE_DIR)\n",
"tokens_train = tokenizer.tokenize(list(df_train[text_col].astype(str)))\n",
"tokens_test = tokenizer.tokenize(list(df_test[text_col].astype(str)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In addition, we perform the following preprocessing steps in the cell below:\n",
"- Convert the tokens into token indices corresponding to the BERT tokenizer's vocabulary\n",
"- Add the special tokens [CLS] and [SEP] to mark the beginning and end of a sentence\n",
"- Pad or truncate the token lists to the specified max length\n",
"- Return mask lists that indicate paddings' positions\n",
"\n",
"*See the original [implementation](https://github.com/google-research/bert/blob/master/run_classifier.py) for more information on BERT's input format.*"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"tokens_train, mask_train, _ = tokenizer.preprocess_classification_tokens(\n",
" tokens_train, MAX_LEN\n",
")\n",
"tokens_test, mask_test, _ = tokenizer.preprocess_classification_tokens(\n",
" tokens_test, MAX_LEN\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Model\n",
"Next, we create a sequence classifier that loads a pre-trained BERT model, given the language and number of labels."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"classifier = BERTSequenceClassifier(\n",
" language=LANGUAGE, num_labels=num_labels, cache_dir=BERT_CACHE_DIR\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train\n",
"We train the classifier using the training examples. This involves fine-tuning the BERT Transformer and learning a linear classification layer on top of that:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"t_total value of -1 results in schedule not being applied\n",
"Iteration: 0%| | 1/365 [00:03<21:12, 3.49s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/1; batch:1->37/365; average training loss:1.591262\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 10%|█ | 38/365 [01:02<08:45, 1.61s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/1; batch:38->74/365; average training loss:0.745935\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 21%|██ | 75/365 [02:02<07:52, 1.63s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/1; batch:75->111/365; average training loss:0.593934\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 31%|███ | 112/365 [03:03<06:56, 1.65s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/1; batch:112->148/365; average training loss:0.530150\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 41%|████ | 149/365 [04:03<05:54, 1.64s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/1; batch:149->185/365; average training loss:0.481620\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 51%|█████ | 186/365 [05:05<05:02, 1.69s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/1; batch:186->222/365; average training loss:0.455032\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 61%|██████ | 223/365 [06:06<03:59, 1.69s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/1; batch:223->259/365; average training loss:0.421702\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 71%|███████ | 260/365 [07:08<02:56, 1.68s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/1; batch:260->296/365; average training loss:0.401165\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 81%|████████▏ | 297/365 [08:09<01:52, 1.65s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/1; batch:297->333/365; average training loss:0.382719\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 92%|█████████▏| 334/365 [09:12<00:52, 1.71s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/1; batch:334->365/365; average training loss:0.372204\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 100%|██████████| 365/365 [10:04<00:00, 1.63s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Training time: 0.169 hrs]\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"with Timer() as t:\n",
" classifier.fit(\n",
" token_ids=tokens_train,\n",
" input_mask=mask_train,\n",
" labels=list(df_train[label_col]), \n",
" num_gpus=NUM_GPUS, \n",
" num_epochs=NUM_EPOCHS,\n",
" batch_size=BATCH_SIZE, \n",
" verbose=True,\n",
" ) \n",
"print(\"[Training time: {:.3f} hrs]\".format(t.interval / 3600))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Score\n",
"We score the test set using the trained classifier:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 100%|██████████| 92/92 [00:48<00:00, 2.25it/s]\n"
]
}
],
"source": [
"preds = classifier.predict(\n",
" token_ids=tokens_test, input_mask=mask_test, num_gpus=NUM_GPUS, batch_size=BATCH_SIZE\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluate Results\n",
"Finally, we compute the accuracy, precision, recall, and F1 metrics of the evaluation on the test set."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"accuracy: 0.9277397260273973\n",
"{\n",
" \"culture\": {\n",
" \"f1-score\": 0.9081761006289307,\n",
" \"precision\": 0.8848039215686274,\n",
" \"recall\": 0.9328165374677002,\n",
" \"support\": 387\n",
" },\n",
" \"diverse\": {\n",
" \"f1-score\": 0.9237983587338804,\n",
" \"precision\": 0.9471153846153846,\n",
" \"recall\": 0.9016018306636155,\n",
" \"support\": 437\n",
" },\n",
" \"economy\": {\n",
" \"f1-score\": 0.8547418967587034,\n",
" \"precision\": 0.8221709006928406,\n",
" \"recall\": 0.89,\n",
" \"support\": 400\n",
" },\n",
" \"macro avg\": {\n",
" \"f1-score\": 0.9099850933798536,\n",
" \"precision\": 0.9087524907040864,\n",
" \"recall\": 0.9125256551533433,\n",
" \"support\": 2920\n",
" },\n",
" \"micro avg\": {\n",
" \"f1-score\": 0.9277397260273973,\n",
" \"precision\": 0.9277397260273973,\n",
" \"recall\": 0.9277397260273973,\n",
" \"support\": 2920\n",
" },\n",
" \"politics\": {\n",
" \"f1-score\": 0.8734177215189873,\n",
" \"precision\": 0.8994413407821229,\n",
" \"recall\": 0.8488576449912126,\n",
" \"support\": 569\n",
" },\n",
" \"sports\": {\n",
" \"f1-score\": 0.9897913892587662,\n",
" \"precision\": 0.9902309058614565,\n",
" \"recall\": 0.9893522626441881,\n",
" \"support\": 1127\n",
" },\n",
" \"weighted avg\": {\n",
" \"f1-score\": 0.9279213601549715,\n",
" \"precision\": 0.9290922105520572,\n",
" \"recall\": 0.9277397260273973,\n",
" \"support\": 2920\n",
" }\n",
"}\n"
]
}
],
"source": [
"report = classification_report(df_test[label_col], preds, target_names=labels, output_dict=True) \n",
"accuracy = accuracy_score(df_test[label_col], preds )\n",
"print(\"accuracy: {}\".format(accuracy))\n",
"print(json.dumps(report, indent=4, sort_keys=True))"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"application/scrapbook.scrap.json+json": {
"data": 0.9277397260273973,
"encoder": "json",
"name": "accuracy",
"version": 1
}
},
"metadata": {
"scrapbook": {
"data": true,
"display": false,
"name": "accuracy"
}
},
"output_type": "display_data"
},
{
"data": {
"application/scrapbook.scrap.json+json": {
"data": 0.9087524907040864,
"encoder": "json",
"name": "precision",
"version": 1
}
},
"metadata": {
"scrapbook": {
"data": true,
"display": false,
"name": "precision"
}
},
"output_type": "display_data"
},
{
"data": {
"application/scrapbook.scrap.json+json": {
"data": 0.9125256551533433,
"encoder": "json",
"name": "recall",
"version": 1
}
},
"metadata": {
"scrapbook": {
"data": true,
"display": false,
"name": "recall"
}
},
"output_type": "display_data"
},
{
"data": {
"application/scrapbook.scrap.json+json": {
"data": 0.9099850933798536,
"encoder": "json",
"name": "f1",
"version": 1
}
},
"metadata": {
"scrapbook": {
"data": true,
"display": false,
"name": "f1"
}
},
"output_type": "display_data"
}
],
"source": [
"# for testing\n",
"sb.glue(\"accuracy\", accuracy)\n",
"sb.glue(\"precision\", report[\"macro avg\"][\"precision\"])\n",
"sb.glue(\"recall\", report[\"macro avg\"][\"recall\"])\n",
"sb.glue(\"f1\", report[\"macro avg\"][\"f1-score\"])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "nlp_gpu",
"language": "python",
"name": "nlp_gpu"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
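For the record, since this diff deletes tc_dac_bert_ar.ipynb in full, the notebook's core flow condenses to the sketch below (all identifiers come from the removed cells and the old utils_nlp BERT API):

import sys

from sklearn.model_selection import train_test_split

sys.path.append("../../")
from utils_nlp.dataset.dac import load_pandas_df
from utils_nlp.models.bert.common import Language, Tokenizer
from utils_nlp.models.bert.sequence_classification import BERTSequenceClassifier

# Load the DAC news articles and keep a subset for quicker training
df = load_pandas_df("./temp").sample(15000, random_state=0)
text_col, label_col = df.columns[0], df.columns[1]
df = df[df[text_col].notna()]  # drop empty documents
df_train, df_test = train_test_split(df, train_size=0.8, random_state=0)

# Tokenize, then convert to BERT inputs: token ids padded/truncated to MAX_LEN, plus masks
tokenizer = Tokenizer(Language.MULTILINGUAL, cache_dir="./temp")
tokens_train = tokenizer.tokenize(list(df_train[text_col].astype(str)))
tokens_train, mask_train, _ = tokenizer.preprocess_classification_tokens(tokens_train, 200)

# Fine-tune multilingual BERT with a linear classification head on top
classifier = BERTSequenceClassifier(language=Language.MULTILINGUAL, num_labels=5, cache_dir="./temp")
classifier.fit(
    token_ids=tokens_train,
    input_mask=mask_train,
    labels=list(df_train[label_col]),
    num_gpus=2,
    num_epochs=1,
    batch_size=32,
    verbose=True,
)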

Some file diffs are hidden because one or more lines are too long

View file

@ -440,7 +440,7 @@
" test_labels, \n",
" preds,\n",
" digits=2,\n",
" labels=test_labels.unique(),\n",
" labels=np.unique(test_labels),\n",
" target_names=label_encoder.classes_\n",
")\n",
"\n",

View file

@ -29,7 +29,7 @@ setup(
),
author=AUTHOR,
author_email="teamsharat@microsoft.com",
url="https://github.com/microsoft/nlp",
url="https://github.com/microsoft/nlp-recipes",
packages=["utils_nlp"],
include_package_data=True,
zip_safe=True,
@ -56,10 +56,16 @@ setup(
"Intended Audience :: Telecommunications Industry",
],
project_urls={
"Documentation": "https://github.com/microsoft/nlp/",
"Issue Tracker": "https://github.com/microsoft/nlp/issues",
"Documentation": "https://github.com/microsoft/nlp-recipes/",
"Issue Tracker": "https://github.com/microsoft/nlp-recipes/issues",
},
keywords=["Microsoft NLP", "Natural Language Processing", "Text Processing", "Word Embedding"],
keywords=[
"Microsoft NLP",
"NLP Recipes",
"Natural Language Processing",
"Text Processing",
"Word Embedding",
],
python_requires=">=3.6",
install_requires=[],
dependency_links=[],

View file

@ -76,12 +76,6 @@ def notebooks():
"tc_mnli_transformers": os.path.join(
folder_notebooks, "text_classification", "tc_mnli_transformers.ipynb"
),
"tc_dac_bert_ar": os.path.join(
folder_notebooks, "text_classification", "tc_dac_bert_ar.ipynb"
),
"tc_bbc_bert_hi": os.path.join(
folder_notebooks, "text_classification", "tc_bbc_bert_hi.ipynb"
),
"tc_multi_languages_transformers": os.path.join(
folder_notebooks, "text_classification", "tc_multi_languages_transformers.ipynb"
),
@ -104,6 +98,15 @@ def tmp(tmp_path_factory):
td.cleanup()
@pytest.fixture(scope="module")
def tmp_module(tmp_path_factory):
td = TemporaryDirectory(dir=tmp_path_factory.getbasetemp())
try:
yield td.name
finally:
td.cleanup()
@pytest.fixture(scope="module")
def ner_test_data():
UNIQUE_LABELS = ["O", "I-LOC", "I-MISC", "I-PER", "I-ORG", "X"]

View file

@ -37,50 +37,6 @@ def test_tc_mnli_transformers(notebooks, tmp):
assert pytest.approx(result["f1"], 0.89, abs=ABS_TOL)
@pytest.mark.gpu
@pytest.mark.integration
def test_tc_dac_bert_ar(notebooks, tmp):
notebook_path = notebooks["tc_dac_bert_ar"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
kernel_name=KERNEL_NAME,
parameters=dict(
NUM_GPUS=1,
DATA_FOLDER=tmp,
BERT_CACHE_DIR=tmp,
MAX_LEN=175,
BATCH_SIZE=16,
NUM_EPOCHS=1,
TRAIN_SIZE=0.8,
NUM_ROWS=8000,
RANDOM_STATE=0,
),
)
result = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict
assert pytest.approx(result["accuracy"], 0.871, abs=ABS_TOL)
assert pytest.approx(result["precision"], 0.865, abs=ABS_TOL)
assert pytest.approx(result["recall"], 0.852, abs=ABS_TOL)
assert pytest.approx(result["f1"], 0.845, abs=ABS_TOL)
@pytest.mark.gpu
@pytest.mark.integration
def test_tc_bbc_bert_hi(notebooks, tmp):
notebook_path = notebooks["tc_bbc_bert_hi"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
kernel_name=KERNEL_NAME,
parameters=dict(NUM_GPUS=1, DATA_FOLDER=tmp, BERT_CACHE_DIR=tmp, NUM_EPOCHS=1),
)
result = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict
assert pytest.approx(result["accuracy"], 0.71, abs=ABS_TOL)
assert pytest.approx(result["precision"], 0.25, abs=ABS_TOL)
assert pytest.approx(result["recall"], 0.28, abs=ABS_TOL)
assert pytest.approx(result["f1"], 0.26, abs=ABS_TOL)
@pytest.mark.integration
@pytest.mark.azureml
@pytest.mark.gpu
@ -118,6 +74,7 @@ def test_tc_bert_azureml(
if os.path.exists("outputs"):
shutil.rmtree("outputs")
@pytest.mark.gpu
@pytest.mark.integration
def test_multi_languages_transformer(notebooks, tmp):
@ -126,10 +83,7 @@ def test_multi_languages_transformer(notebooks, tmp):
notebook_path,
OUTPUT_NOTEBOOK,
kernel_name=KERNEL_NAME,
parameters={
"QUICK_RUN": True,
"USE_DATASET": "dac"
},
parameters={"QUICK_RUN": True, "USE_DATASET": "dac"},
)
result = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict
assert pytest.approx(result["precision"], 0.94, abs=ABS_TOL)

View file

@ -18,8 +18,7 @@ def data():
]
@pytest.mark.cpu
def test_sentence_encoding(data):
def test_sentence_encoding(tmp, data):
se = BERTSentenceEncoder(
language=Language.ENGLISH,
num_gpus=0,
@ -27,6 +26,7 @@ def test_sentence_encoding(data):
max_len=128,
layer_index=-2,
pooling_strategy=PoolingStrategy.MEAN,
cache_dir=tmp,
)
result = se.encode(data, as_numpy=False)

View file

@ -17,8 +17,8 @@ NUM_GPUS = max(1, torch.cuda.device_count())
BATCH_SIZE = 8
@pytest.fixture()
def qa_test_data(qa_test_df, tmp):
@pytest.fixture(scope="module")
def qa_test_data(qa_test_df, tmp_module):
train_dataset = QADataset(
df=qa_test_df["test_df"],
@ -63,7 +63,7 @@ def qa_test_data(qa_test_df, tmp):
qa_id_col=qa_test_df["qa_id_col"],
)
qa_processor_bert = QAProcessor()
qa_processor_bert = QAProcessor(cache_dir=tmp_module)
train_features_bert = qa_processor_bert.preprocess(
train_dataset,
batch_size=BATCH_SIZE,
@ -72,7 +72,7 @@ def qa_test_data(qa_test_df, tmp):
max_question_length=16,
max_seq_length=64,
doc_stride=32,
feature_cache_dir=tmp,
feature_cache_dir=tmp_module,
)
test_features_bert = qa_processor_bert.preprocess(
@ -83,10 +83,10 @@ def qa_test_data(qa_test_df, tmp):
max_question_length=16,
max_seq_length=64,
doc_stride=32,
feature_cache_dir=tmp,
feature_cache_dir=tmp_module,
)
qa_processor_xlnet = QAProcessor(model_name="xlnet-base-cased")
qa_processor_xlnet = QAProcessor(model_name="xlnet-base-cased", cache_dir=tmp_module)
train_features_xlnet = qa_processor_xlnet.preprocess(
train_dataset,
batch_size=BATCH_SIZE,
@ -95,7 +95,7 @@ def qa_test_data(qa_test_df, tmp):
max_question_length=16,
max_seq_length=64,
doc_stride=32,
feature_cache_dir=tmp,
feature_cache_dir=tmp_module,
)
test_features_xlnet = qa_processor_xlnet.preprocess(
@ -106,10 +106,12 @@ def qa_test_data(qa_test_df, tmp):
max_question_length=16,
max_seq_length=64,
doc_stride=32,
feature_cache_dir=tmp,
feature_cache_dir=tmp_module,
)
qa_processor_distilbert = QAProcessor(model_name="distilbert-base-uncased")
qa_processor_distilbert = QAProcessor(
model_name="distilbert-base-uncased", cache_dir=tmp_module
)
train_features_distilbert = qa_processor_distilbert.preprocess(
train_dataset,
batch_size=BATCH_SIZE,
@ -118,7 +120,7 @@ def qa_test_data(qa_test_df, tmp):
max_question_length=16,
max_seq_length=64,
doc_stride=32,
feature_cache_dir=tmp,
feature_cache_dir=tmp_module,
)
test_features_distilbert = qa_processor_distilbert.preprocess(
@ -129,7 +131,7 @@ def qa_test_data(qa_test_df, tmp):
max_question_length=16,
max_seq_length=64,
doc_stride=32,
feature_cache_dir=tmp,
feature_cache_dir=tmp_module,
)
return {
@ -147,103 +149,136 @@ def qa_test_data(qa_test_df, tmp):
}
def test_QAProcessor(qa_test_data, tmp):
@pytest.mark.gpu
def test_QAProcessor(qa_test_data, tmp_module):
for model_name in ["bert-base-cased", "xlnet-base-cased", "distilbert-base-uncased"]:
qa_processor = QAProcessor(model_name=model_name)
qa_processor.preprocess(qa_test_data["train_dataset"], is_training=True)
qa_processor.preprocess(qa_test_data["train_dataset_list"], is_training=True)
qa_processor.preprocess(qa_test_data["test_dataset"], is_training=False)
qa_processor = QAProcessor(model_name=model_name, cache_dir=tmp_module)
qa_processor.preprocess(
qa_test_data["train_dataset"], is_training=True, feature_cache_dir=tmp_module
)
qa_processor.preprocess(
qa_test_data["train_dataset_list"], is_training=True, feature_cache_dir=tmp_module
)
qa_processor.preprocess(
qa_test_data["test_dataset"], is_training=False, feature_cache_dir=tmp_module
)
# test unsupported model type
with pytest.raises(ValueError):
qa_processor = QAProcessor(model_name="abc")
qa_processor = QAProcessor(model_name="abc", cache_dir=tmp_module)
# test training data has no ground truth exception
with pytest.raises(Exception):
qa_processor.preprocess(qa_test_data["test_dataset"], is_training=True)
qa_processor.preprocess(
qa_test_data["test_dataset"], is_training=True, feature_cache_dir=tmp_module
)
# test when answer start is a list, but answer text is not
with pytest.raises(Exception):
qa_processor.preprocess(qa_test_data["train_dataset_start_text_mismatch"], is_training=True)
qa_processor.preprocess(
qa_test_data["train_dataset_start_text_mismatch"],
is_training=True,
feature_cache_dir=tmp_module,
)
# test when training data has multiple answers
with pytest.raises(Exception):
qa_processor.preprocess(qa_test_data["train_dataset_multi_answers"], is_training=True)
qa_processor.preprocess(
qa_test_data["train_dataset_multi_answers"],
is_training=True,
feature_cache_dir=tmp_module,
)
def test_AnswerExtractor(qa_test_data, tmp):
def test_AnswerExtractor(qa_test_data, tmp_module):
# test bert
qa_extractor_bert = AnswerExtractor(cache_dir=tmp)
qa_extractor_bert = AnswerExtractor(cache_dir=tmp_module)
qa_extractor_bert.fit(qa_test_data["train_features_bert"], cache_model=True)
# test saving fine-tuned model
model_output_dir = os.path.join(tmp, "fine_tuned")
model_output_dir = os.path.join(tmp_module, "fine_tuned")
assert os.path.exists(os.path.join(model_output_dir, "pytorch_model.bin"))
assert os.path.exists(os.path.join(model_output_dir, "config.json"))
qa_extractor_from_cache = AnswerExtractor(cache_dir=tmp, load_model_from_dir=model_output_dir)
qa_extractor_from_cache = AnswerExtractor(
cache_dir=tmp_module, load_model_from_dir=model_output_dir
)
qa_extractor_from_cache.predict(qa_test_data["test_features_bert"])
qa_extractor_xlnet = AnswerExtractor(model_name="xlnet-base-cased", cache_dir=tmp)
qa_extractor_xlnet = AnswerExtractor(model_name="xlnet-base-cased", cache_dir=tmp_module)
qa_extractor_xlnet.fit(qa_test_data["train_features_xlnet"], cache_model=False)
qa_extractor_xlnet.predict(qa_test_data["test_features_xlnet"])
qa_extractor_distilbert = AnswerExtractor(model_name="distilbert-base-uncased", cache_dir=tmp)
qa_extractor_distilbert = AnswerExtractor(
model_name="distilbert-base-uncased", cache_dir=tmp_module
)
qa_extractor_distilbert.fit(qa_test_data["train_features_distilbert"], cache_model=False)
qa_extractor_distilbert.predict(qa_test_data["test_features_distilbert"])
def test_postprocess_bert_answer(qa_test_data, tmp):
qa_processor = QAProcessor()
def test_postprocess_bert_answer(qa_test_data, tmp_module):
qa_processor = QAProcessor(cache_dir=tmp_module)
test_features = qa_processor.preprocess(
qa_test_data["test_dataset"],
is_training=False,
max_question_length=16,
max_seq_length=64,
doc_stride=32,
feature_cache_dir=tmp,
feature_cache_dir=tmp_module,
)
qa_extractor = AnswerExtractor(cache_dir=tmp)
qa_extractor = AnswerExtractor(cache_dir=tmp_module)
predictions = qa_extractor.predict(test_features)
qa_processor.postprocess(
results=predictions,
examples_file=os.path.join(tmp, CACHED_EXAMPLES_TEST_FILE),
features_file=os.path.join(tmp, CACHED_FEATURES_TEST_FILE),
examples_file=os.path.join(tmp_module, CACHED_EXAMPLES_TEST_FILE),
features_file=os.path.join(tmp_module, CACHED_FEATURES_TEST_FILE),
output_prediction_file=os.path.join(tmp_module, "qa_predictions.json"),
output_nbest_file=os.path.join(tmp_module, "nbest_predictions.json"),
output_null_log_odds_file=os.path.join(tmp_module, "null_odds.json"),
)
qa_processor.postprocess(
results=predictions,
examples_file=os.path.join(tmp, CACHED_EXAMPLES_TEST_FILE),
features_file=os.path.join(tmp, CACHED_FEATURES_TEST_FILE),
examples_file=os.path.join(tmp_module, CACHED_EXAMPLES_TEST_FILE),
features_file=os.path.join(tmp_module, CACHED_FEATURES_TEST_FILE),
unanswerable_exists=True,
verbose_logging=True,
output_prediction_file=os.path.join(tmp_module, "qa_predictions.json"),
output_nbest_file=os.path.join(tmp_module, "nbest_predictions.json"),
output_null_log_odds_file=os.path.join(tmp_module, "null_odds.json"),
)
def test_postprocess_xlnet_answer(qa_test_data, tmp):
qa_processor = QAProcessor(model_name="xlnet-base-cased")
def test_postprocess_xlnet_answer(qa_test_data, tmp_module):
qa_processor = QAProcessor(model_name="xlnet-base-cased", cache_dir=tmp_module)
test_features = qa_processor.preprocess(
qa_test_data["test_dataset"],
is_training=False,
max_question_length=16,
max_seq_length=64,
doc_stride=32,
feature_cache_dir=tmp,
feature_cache_dir=tmp_module,
)
qa_extractor = AnswerExtractor(model_name="xlnet-base-cased", cache_dir=tmp)
qa_extractor = AnswerExtractor(model_name="xlnet-base-cased", cache_dir=tmp_module)
predictions = qa_extractor.predict(test_features)
qa_processor.postprocess(
results=predictions,
examples_file=os.path.join(tmp, CACHED_EXAMPLES_TEST_FILE),
features_file=os.path.join(tmp, CACHED_FEATURES_TEST_FILE),
examples_file=os.path.join(tmp_module, CACHED_EXAMPLES_TEST_FILE),
features_file=os.path.join(tmp_module, CACHED_FEATURES_TEST_FILE),
output_prediction_file=os.path.join(tmp_module, "qa_predictions.json"),
output_nbest_file=os.path.join(tmp_module, "nbest_predictions.json"),
output_null_log_odds_file=os.path.join(tmp_module, "null_odds.json"),
)
qa_processor.postprocess(
results=predictions,
examples_file=os.path.join(tmp, CACHED_EXAMPLES_TEST_FILE),
features_file=os.path.join(tmp, CACHED_FEATURES_TEST_FILE),
examples_file=os.path.join(tmp_module, CACHED_EXAMPLES_TEST_FILE),
features_file=os.path.join(tmp_module, CACHED_FEATURES_TEST_FILE),
unanswerable_exists=True,
verbose_logging=True,
output_prediction_file=os.path.join(tmp_module, "qa_predictions.json"),
output_nbest_file=os.path.join(tmp_module, "nbest_predictions.json"),
output_null_log_odds_file=os.path.join(tmp_module, "null_odds.json"),
)

View file

@ -9,17 +9,13 @@ from utils_nlp.models.bert.common import Language
@pytest.mark.notebooks
def test_bert_encoder(notebooks):
def test_bert_encoder(notebooks, tmp):
notebook_path = notebooks["bert_encoder"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
kernel_name=KERNEL_NAME,
parameters=dict(
NUM_GPUS=0,
LANGUAGE=Language.ENGLISH,
TO_LOWER=True,
MAX_SEQ_LENGTH=128,
CACHE_DIR="./temp",
NUM_GPUS=0, LANGUAGE=Language.ENGLISH, TO_LOWER=True, MAX_SEQ_LENGTH=128, CACHE_DIR=tmp
),
)

View file

@ -10,17 +10,13 @@ from utils_nlp.models.bert.common import Language
@pytest.mark.notebooks
@pytest.mark.gpu
def test_bert_encoder(notebooks):
def test_bert_encoder(notebooks, tmp):
notebook_path = notebooks["bert_encoder"]
pm.execute_notebook(
notebook_path,
OUTPUT_NOTEBOOK,
kernel_name=KERNEL_NAME,
parameters=dict(
NUM_GPUS=1,
LANGUAGE=Language.ENGLISH,
TO_LOWER=True,
MAX_SEQ_LENGTH=128,
CACHE_DIR="./temp",
NUM_GPUS=1, LANGUAGE=Language.ENGLISH, TO_LOWER=True, MAX_SEQ_LENGTH=128, CACHE_DIR=tmp
),
)

View file

@ -1,27 +0,0 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
import pytest
def test_preprocess_classification_tokens(xlnet_english_tokenizer):
text = ["Hello World.",
"How you doing?",
"greatttt",
"The quick, brown fox jumps over a lazy dog.",
" DJs flock by when MTV ax quiz prog",
"Quick wafting zephyrs vex bold Jim",
"Quick, Baz, get my woven flax jodhpurs!"
]
seq_length = 5
input_ids, input_mask, segment_ids = xlnet_english_tokenizer.preprocess_classification_tokens(text, seq_length)
assert len(input_ids) == len(text)
assert len(input_mask) == len(text)
assert len(segment_ids) == len(text)
for sentence in range(len(text)):
assert len(input_ids[sentence]) == seq_length
assert len(input_mask[sentence]) == seq_length
assert len(segment_ids[sentence]) == seq_length

View file

@ -1,44 +0,0 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
import pytest
from utils_nlp.models.xlnet.common import Language
from utils_nlp.models.xlnet.sequence_classification import XLNetSequenceClassifier
@pytest.fixture()
def data():
return (
["hi", "hello", "what's wrong with us", "can I leave?"],
[0, 0, 1, 2],
["hey", "i will", "be working from", "home today"],
[2, 1, 1, 0],
)
def test_classifier(xlnet_english_tokenizer, data):
token_ids, input_mask, segment_ids = xlnet_english_tokenizer.preprocess_classification_tokens(
data[0], max_seq_length=10
)
val_data = xlnet_english_tokenizer.preprocess_classification_tokens(data[2], max_seq_length=10)
val_token_ids, val_input_mask, val_segment_ids = val_data
classifier = XLNetSequenceClassifier(language=Language.ENGLISHCASED, num_labels=3)
classifier.fit(
token_ids=token_ids,
input_mask=input_mask,
token_type_ids=segment_ids,
labels=data[1],
val_token_ids=val_token_ids,
val_input_mask=val_input_mask,
val_labels=data[3],
val_token_type_ids=val_segment_ids,
)
preds = classifier.predict(
token_ids=token_ids, input_mask=input_mask, token_type_ids=segment_ids
)
assert len(preds) == len(data[1])

View file

@ -29,6 +29,7 @@ $ python -m ipykernel install --user --name {conda_env} \
--display-name "Python ({conda_env})"
"""
CHANNELS = ["defaults", "conda-forge", "pytorch"]
CONDA_BASE = {
@ -63,14 +64,14 @@ PIP_BASE = {
"azureml-train-automl": "azureml-train-automl==1.0.57",
"azureml-dataprep": "azureml-dataprep==1.1.8",
"azureml-widgets": "azureml-widgets==1.0.57",
"azureml-mlflow": "azureml-mlflow>=1.0.43.1",
"azureml-mlflow": "azureml-mlflow==1.0.57",
"black": "black>=18.6b4",
"cached-property": "cached-property==1.5.1",
"jsonlines": "jsonlines>=1.2.0",
"nteract-scrapbook": "nteract-scrapbook>=0.2.1",
"pydocumentdb": "pydocumentdb>=2.3.3",
"pytorch-pretrained-bert": "pytorch-pretrained-bert>=0.6",
"tqdm": "tqdm==4.31.1",
"tqdm": "tqdm==4.32.2",
"pyemd": "pyemd==0.5.1",
"ipywebrtc": "ipywebrtc==0.4.3",
"pre-commit": "pre-commit>=1.14.4",
@ -82,7 +83,7 @@ PIP_BASE = {
"https://github.com/explosion/spacy-models/releases/download/"
"en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz"
),
"transformers": "transformers>=2.0.0",
"transformers": "transformers==2.1.1",
"gensim": "gensim>=3.7.0",
"nltk": "nltk>=3.4",
"seqeval": "seqeval>=0.0.12",

View file

@ -26,7 +26,7 @@ ws = get_or_create_workspace(
This submodule contains high-level utilities that are commonly used in multiple algorithms, as well as helper functions for managing frameworks like PyTorch.
### [Dataset](dataset)
This submodule includes helper functions for interacting with well-known datasets, utility functions to process datasets for different NLP tasks, as well as utilities for splitting data for training/testing. For example, the [snli module](snli.py) will allow you to load a dataframe in pandas from the Stanford Natural Language Inference (SNLI) Corpus dataset, with the option to set the number of rows to load in order to test algorithms and evaluate performance benchmarks. Information on the datasets used in the repo can be found [here](https://github.com/microsoft/nlp/tree/staging/utils_nlp/dataset#datasets).
This submodule includes helper functions for interacting with well-known datasets, utility functions to process datasets for different NLP tasks, as well as utilities for splitting data for training/testing. For example, the [snli module](snli.py) will allow you to load a dataframe in pandas from the Stanford Natural Language Inference (SNLI) Corpus dataset, with the option to set the number of rows to load in order to test algorithms and evaluate performance benchmarks. Information on the datasets used in the repo can be found [here](https://github.com/microsoft/nlp-recipes/tree/staging/utils_nlp/dataset#datasets).
Most datasets may be split into `train`, `dev`, and `test`.
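As a hedged illustration of the dataset helpers described above, the snli call pattern below mirrors the dac loader used elsewhere in this PR; the exact snli parameter names are an assumption:

from utils_nlp.dataset.snli import load_pandas_df  # module named in the text above

# Cap the number of rows for quick benchmarking; parameter names are an
# assumption based on the repo's other dataset loaders.
df = load_pandas_df(local_cache_path="./temp", nrows=1000)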