Merge branch 'staging' of https://github.com/microsoft/nlp into kehuan-transformers

Ke Huang 2019-11-15 16:34:12 -05:00
Parent 6e19905a13 36638c15e8
Commit dc2ee5b963
5 changed files: 0 additions and 1586 deletions

View file

@@ -1,909 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Copyright (c) Microsoft Corporation. All rights reserved.* \n",
"\n",
"*Licensed under the MIT License.*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Natural Language Inference on MultiNLI Dataset using BERT"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Before You Start\n",
"\n",
"The running time shown in this notebook is on a Standard_NC24s_v3 Azure Deep Learning Virtual Machine with 4 NVIDIA Tesla V100 GPUs. \n",
"> **Tip:** If you want to run through the notebook quickly, you can set the **`QUICK_RUN`** flag in the cell below to **`True`** to run the notebook on a small subset of the data and a smaller number of epochs. \n",
"\n",
"The table below provides some reference running time on different machine configurations. \n",
"\n",
"|QUICK_RUN|Machine Configurations|Running time|\n",
"|:---------|:----------------------|:------------|\n",
"|True|4 **CPU**s, 14GB memory| ~ 15 minutes|\n",
"|True|1 NVIDIA Tesla K80 GPUs, 12GB GPU memory| ~ 5 minutes|\n",
"|False|1 NVIDIA Tesla K80 GPUs, 12GB GPU memory| ~ 10.5 hours|\n",
"|False|4 NVIDIA Tesla V100 GPUs, 64GB GPU memory| ~ 2.5 hours|\n",
"\n",
"If you run into CUDA out-of-memory error, try reducing the `BATCH_SIZE` and `MAX_SEQ_LENGTH`, but note that model performance will be compromised. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"## Set QUICK_RUN = True to run the notebook on a small subset of data and a smaller number of epochs.\n",
"QUICK_RUN = False"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary\n",
"In this notebook, we demostrate using [BERT](https://arxiv.org/abs/1810.04805) to perform Natural Language Inference (NLI). We use the [MultiNLI](https://www.nyu.edu/projects/bowman/multinli/) dataset and the task is to classify sentence pairs into three classes: contradiction, entailment, and neutral. \n",
"The figure below shows how [BERT](https://arxiv.org/abs/1810.04805) classifies sentence pairs. It concatenates the tokens in each sentence pairs and separates the sentences by the [SEP] token. A [CLS] token is prepended to the token list and used as the aggregate sequence representation for the classification task.\n",
"<img src=\"https://nlpbp.blob.core.windows.net/images/bert_two_sentence.PNG\">"
]
},
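{
"cell_type": "markdown",
"metadata": {},
"source": [
"For illustration, the cell below is a minimal, self-contained sketch of the sentence-pair input layout described above: [CLS], the tokens of the first sentence, [SEP], the tokens of the second sentence, and a final [SEP], with segment ids distinguishing the two sentences. It uses hand-made token lists and is not part of the training workflow."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch only: how a premise/hypothesis pair is laid out as a single BERT input.\n",
"premise_tokens = [\"the\", \"cat\", \"sat\", \"on\", \"the\", \"mat\"]\n",
"hypothesis_tokens = [\"a\", \"cat\", \"is\", \"sitting\"]\n",
"\n",
"bert_input = [\"[CLS]\"] + premise_tokens + [\"[SEP]\"] + hypothesis_tokens + [\"[SEP]\"]\n",
"# Segment (token type) ids: 0 for [CLS], the first sentence, and its [SEP]; 1 for the rest.\n",
"segment_ids = [0] * (len(premise_tokens) + 2) + [1] * (len(hypothesis_tokens) + 1)\n",
"print(bert_input)\n",
"print(segment_ids)"
]
},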
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"import sys\n",
"import os\n",
"import scrapbook as sb\n",
"import random\n",
"import numpy as np\n",
"from sklearn.metrics import classification_report\n",
"from sklearn.preprocessing import LabelEncoder\n",
"\n",
"import torch\n",
"\n",
"nlp_path = os.path.abspath('../../')\n",
"if nlp_path not in sys.path:\n",
" sys.path.insert(0, nlp_path)\n",
"\n",
"from utils_nlp.models.bert.sequence_classification import BERTSequenceClassifier\n",
"from utils_nlp.models.bert.common import Language, Tokenizer\n",
"from utils_nlp.dataset.multinli import load_pandas_df\n",
"from utils_nlp.common.timer import Timer"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Configurations"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"tags": [
"parameters"
]
},
"outputs": [],
"source": [
"TRAIN_DATA_USED_PERCENT = 1\n",
"DEV_DATA_USED_PERCENT = 1\n",
"NUM_EPOCHS = 2\n",
"\n",
"if QUICK_RUN:\n",
" TRAIN_DATA_USED_PERCENT = 0.001\n",
" DEV_DATA_USED_PERCENT = 0.01\n",
" NUM_EPOCHS = 1\n",
"\n",
"if torch.cuda.is_available():\n",
" BATCH_SIZE = 32\n",
"else:\n",
" BATCH_SIZE = 16\n",
"\n",
"# set random seeds\n",
"RANDOM_SEED = 42\n",
"random.seed(RANDOM_SEED)\n",
"np.random.seed(RANDOM_SEED)\n",
"torch.manual_seed(RANDOM_SEED)\n",
"num_cuda_devices = torch.cuda.device_count()\n",
"if num_cuda_devices > 1:\n",
" torch.cuda.manual_seed_all(RANDOM_SEED)\n",
"\n",
"# model configurations\n",
"LANGUAGE = Language.ENGLISH\n",
"TO_LOWER = True\n",
"MAX_SEQ_LENGTH = 128\n",
"\n",
"# optimizer configurations\n",
"LEARNING_RATE= 5e-5\n",
"WARMUP_PROPORTION= 0.1\n",
"\n",
"# data configurations\n",
"TEXT_COL = \"text\"\n",
"LABEL_COL = \"gold_label\"\n",
"\n",
"CACHE_DIR = \"./temp\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Data\n",
"The MultiNLI dataset comes with three subsets: train, dev_matched, dev_mismatched. The dev_matched dataset are from the same genres as the train dataset, while the dev_mismatched dataset are from genres not seen in the training dataset. \n",
"The `load_pandas_df` function downloads and extracts the zip files if they don't already exist in `local_cache_path` and returns the data subset specified by `file_split`."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"train_df = load_pandas_df(local_cache_path=CACHE_DIR, file_split=\"train\")\n",
"dev_df_matched = load_pandas_df(local_cache_path=CACHE_DIR, file_split=\"dev_matched\")\n",
"dev_df_mismatched = load_pandas_df(local_cache_path=CACHE_DIR, file_split=\"dev_mismatched\")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"dev_df_matched = dev_df_matched.loc[dev_df_matched['gold_label'] != '-']\n",
"dev_df_mismatched = dev_df_mismatched.loc[dev_df_mismatched['gold_label'] != '-']"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training dataset size: 392702\n",
"Development (matched) dataset size: 9815\n",
"Development (mismatched) dataset size: 9832\n",
"\n",
" gold_label sentence1 \\\n",
"0 neutral Conceptually cream skimming has two basic dime... \n",
"1 entailment you know during the season and i guess at at y... \n",
"2 entailment One of our number will carry out your instruct... \n",
"3 entailment How do you know? All this is their information... \n",
"4 neutral yeah i tell you what though if you go price so... \n",
"\n",
" sentence2 \n",
"0 Product and geography are what make cream skim... \n",
"1 You lose the things to the following level if ... \n",
"2 A member of my team will execute your orders w... \n",
"3 This information belongs to them. \n",
"4 The tennis shoes have a range of prices. \n"
]
}
],
"source": [
"print(\"Training dataset size: {}\".format(train_df.shape[0]))\n",
"print(\"Development (matched) dataset size: {}\".format(dev_df_matched.shape[0]))\n",
"print(\"Development (mismatched) dataset size: {}\".format(dev_df_mismatched.shape[0]))\n",
"print()\n",
"print(train_df[['gold_label', 'sentence1', 'sentence2']].head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Concatenate the first and second sentences to form the input text."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>text</th>\n",
" <th>gold_label</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>(Conceptually cream skimming has two basic dim...</td>\n",
" <td>neutral</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>(you know during the season and i guess at at ...</td>\n",
" <td>entailment</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>(One of our number will carry out your instruc...</td>\n",
" <td>entailment</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>(How do you know? All this is their informatio...</td>\n",
" <td>entailment</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>(yeah i tell you what though if you go price s...</td>\n",
" <td>neutral</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" text gold_label\n",
"0 (Conceptually cream skimming has two basic dim... neutral\n",
"1 (you know during the season and i guess at at ... entailment\n",
"2 (One of our number will carry out your instruc... entailment\n",
"3 (How do you know? All this is their informatio... entailment\n",
"4 (yeah i tell you what though if you go price s... neutral"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_df[TEXT_COL] = list(zip(train_df['sentence1'], train_df['sentence2']))\n",
"dev_df_matched[TEXT_COL] = list(zip(dev_df_matched['sentence1'], dev_df_matched['sentence2']))\n",
"dev_df_mismatched[TEXT_COL] = list(zip(dev_df_mismatched['sentence1'], dev_df_mismatched['sentence2']))\n",
"train_df[[TEXT_COL, LABEL_COL]].head()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"train_df = train_df.sample(frac=TRAIN_DATA_USED_PERCENT).reset_index(drop=True)\n",
"dev_df_matched = dev_df_matched.sample(frac=DEV_DATA_USED_PERCENT).reset_index(drop=True)\n",
"dev_df_mismatched = dev_df_mismatched.sample(frac=DEV_DATA_USED_PERCENT).reset_index(drop=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tokenize and Preprocess\n",
"Before training, we tokenize the sentence texts and convert them to lists of tokens. The following steps instantiate a BERT tokenizer given the language, and tokenize the text of the training and testing sets."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 392702/392702 [03:25<00:00, 1907.47it/s]\n",
"100%|██████████| 9815/9815 [00:05<00:00, 1961.13it/s]\n",
"100%|██████████| 9832/9832 [00:05<00:00, 1837.42it/s]\n"
]
}
],
"source": [
"tokenizer= Tokenizer(LANGUAGE, to_lower=TO_LOWER, cache_dir=CACHE_DIR)\n",
"\n",
"train_tokens = tokenizer.tokenize(train_df[TEXT_COL])\n",
"dev_matched_tokens = tokenizer.tokenize(dev_df_matched[TEXT_COL])\n",
"dev_mismatched_tokens = tokenizer.tokenize(dev_df_mismatched[TEXT_COL])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In addition, we perform the following preprocessing steps in the cell below:\n",
"\n",
"* Convert the tokens into token indices corresponding to the BERT tokenizer's vocabulary\n",
"* Add the special tokens [CLS] and [SEP] to mark the beginning and end of a sentence\n",
"* Pad or truncate the token lists to the specified max length\n",
"* Return mask lists that indicate paddings' positions\n",
"* Return token type id lists that indicate which sentence the tokens belong to\n",
"\n",
"*See the original [implementation](https://github.com/google-research/bert/blob/master/run_classifier.py) for more information on BERT's input format.*"
]
},
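{
"cell_type": "markdown",
"metadata": {},
"source": [
"For illustration, the cell below is a minimal sketch of what these preprocessing outputs look like for a single short sentence pair. It reuses the `tokenizer` created above and assumes a plain list of sentence-pair tuples is accepted, just like the pandas columns of tuples used elsewhere in this notebook; it is not part of the training workflow."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch only: inspect the preprocessing outputs for one sentence pair.\n",
"example_pair = [(\"The cat sat on the mat.\", \"A cat is sitting.\")]\n",
"example_tokens = tokenizer.tokenize(example_pair)\n",
"example_ids, example_mask, example_type_ids = \\\n",
"    tokenizer.preprocess_classification_tokens(example_tokens, max_len=MAX_SEQ_LENGTH)\n",
"\n",
"print(example_ids[0][:20])        # token indices, zero-padded up to MAX_SEQ_LENGTH\n",
"print(example_mask[0][:20])       # 1 for real tokens, 0 for padding positions\n",
"print(example_type_ids[0][:20])   # 0 for the first sentence, 1 for the second"
]
},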
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"train_token_ids, train_input_mask, train_token_type_ids = \\\n",
" tokenizer.preprocess_classification_tokens(train_tokens, max_len=MAX_SEQ_LENGTH)\n",
"dev_matched_token_ids, dev_matched_input_mask, dev_matched_token_type_ids = \\\n",
" tokenizer.preprocess_classification_tokens(dev_matched_tokens, max_len=MAX_SEQ_LENGTH)\n",
"dev_mismatched_token_ids, dev_mismatched_input_mask, dev_mismatched_token_type_ids = \\\n",
" tokenizer.preprocess_classification_tokens(dev_mismatched_tokens, max_len=MAX_SEQ_LENGTH)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"label_encoder = LabelEncoder()\n",
"train_labels = label_encoder.fit_transform(train_df[LABEL_COL])\n",
"num_labels = len(np.unique(train_labels))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train and Predict"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create Classifier"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"classifier = BERTSequenceClassifier(language=LANGUAGE,\n",
" num_labels=num_labels,\n",
" cache_dir=CACHE_DIR)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Train Classifier"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 0%| | 1/12272 [00:10<35:06:53, 10.30s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/2; batch:1->1228/12272; average training loss:1.199178\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 10%|█ | 1229/12272 [07:20<1:03:16, 2.91it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/2; batch:1229->2456/12272; average training loss:0.783637\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 20%|██ | 2457/12272 [14:28<55:44, 2.93it/s] "
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/2; batch:2457->3684/12272; average training loss:0.692243\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 30%|███ | 3685/12272 [21:37<48:36, 2.94it/s] "
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/2; batch:3685->4912/12272; average training loss:0.653206\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 40%|████ | 4913/12272 [28:45<41:36, 2.95it/s] "
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/2; batch:4913->6140/12272; average training loss:0.625751\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 50%|█████ | 6141/12272 [35:54<34:44, 2.94it/s] "
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/2; batch:6141->7368/12272; average training loss:0.605123\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 60%|██████ | 7369/12272 [42:58<27:46, 2.94it/s] "
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/2; batch:7369->8596/12272; average training loss:0.590521\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 70%|███████ | 8597/12272 [50:07<20:52, 2.93it/s] "
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/2; batch:8597->9824/12272; average training loss:0.577829\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 80%|████████ | 9825/12272 [57:14<13:46, 2.96it/s] "
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/2; batch:9825->11052/12272; average training loss:0.566418\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 90%|█████████ | 11053/12272 [1:04:20<06:53, 2.95it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:1/2; batch:11053->12272/12272; average training loss:0.556558\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 100%|██████████| 12272/12272 [1:11:21<00:00, 2.88it/s]\n",
"Iteration: 0%| | 1/12272 [00:00<1:12:29, 2.82it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:2/2; batch:1->1228/12272; average training loss:0.319802\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 10%|█ | 1229/12272 [07:09<1:02:29, 2.95it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:2/2; batch:1229->2456/12272; average training loss:0.331876\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 20%|██ | 2457/12272 [14:15<55:22, 2.95it/s] "
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:2/2; batch:2457->3684/12272; average training loss:0.333463\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 30%|███ | 3685/12272 [21:21<48:41, 2.94it/s] "
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:2/2; batch:3685->4912/12272; average training loss:0.331817\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 40%|████ | 4913/12272 [28:25<41:26, 2.96it/s] "
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:2/2; batch:4913->6140/12272; average training loss:0.327940\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 50%|█████ | 6141/12272 [35:31<34:34, 2.96it/s] "
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:2/2; batch:6141->7368/12272; average training loss:0.325802\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 60%|██████ | 7369/12272 [42:36<27:48, 2.94it/s] "
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:2/2; batch:7369->8596/12272; average training loss:0.324641\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 70%|███████ | 8597/12272 [49:42<20:53, 2.93it/s] "
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:2/2; batch:8597->9824/12272; average training loss:0.322036\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 80%|████████ | 9825/12272 [56:44<13:50, 2.95it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:2/2; batch:9825->11052/12272; average training loss:0.321205\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 90%|█████████ | 11053/12272 [1:03:49<06:54, 2.94it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch:2/2; batch:11053->12272/12272; average training loss:0.319237\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 100%|██████████| 12272/12272 [1:10:52<00:00, 2.94it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training time : 2.374 hrs\n"
]
}
],
"source": [
"with Timer() as t:\n",
" classifier.fit(token_ids=train_token_ids,\n",
" input_mask=train_input_mask,\n",
" token_type_ids=train_token_type_ids,\n",
" labels=train_labels,\n",
" num_epochs=NUM_EPOCHS,\n",
" batch_size=BATCH_SIZE,\n",
" lr=LEARNING_RATE,\n",
" warmup_proportion=WARMUP_PROPORTION)\n",
"print(\"Training time : {:.3f} hrs\".format(t.interval / 3600))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Predict on Test Data"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 100%|██████████| 307/307 [00:40<00:00, 8.15it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Prediction time : 0.011 hrs\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"with Timer() as t:\n",
" predictions_matched = classifier.predict(token_ids=dev_matched_token_ids,\n",
" input_mask=dev_matched_input_mask,\n",
" token_type_ids=dev_matched_token_type_ids,\n",
" batch_size=BATCH_SIZE)\n",
"print(\"Prediction time : {:.3f} hrs\".format(t.interval / 3600))"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Iteration: 100%|██████████| 308/308 [00:38<00:00, 8.30it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Prediction time : 0.011 hrs\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"with Timer() as t:\n",
" predictions_mismatched = classifier.predict(token_ids=dev_mismatched_token_ids,\n",
" input_mask=dev_mismatched_input_mask,\n",
" token_type_ids=dev_mismatched_token_type_ids,\n",
" batch_size=BATCH_SIZE)\n",
"print(\"Prediction time : {:.3f} hrs\".format(t.interval / 3600))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluate"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
"contradiction 0.848 0.865 0.857 3213\n",
" entailment 0.894 0.828 0.860 3479\n",
" neutral 0.783 0.831 0.806 3123\n",
"\n",
" micro avg 0.841 0.841 0.841 9815\n",
" macro avg 0.842 0.841 0.841 9815\n",
" weighted avg 0.844 0.841 0.842 9815\n",
"\n"
]
}
],
"source": [
"predictions_matched = label_encoder.inverse_transform(predictions_matched)\n",
"print(classification_report(dev_df_matched[LABEL_COL], predictions_matched, digits=3))"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
"contradiction 0.862 0.863 0.863 3240\n",
" entailment 0.878 0.853 0.865 3463\n",
" neutral 0.791 0.815 0.803 3129\n",
"\n",
" micro avg 0.844 0.844 0.844 9832\n",
" macro avg 0.844 0.844 0.844 9832\n",
" weighted avg 0.845 0.844 0.845 9832\n",
"\n"
]
}
],
"source": [
"predictions_mismatched = label_encoder.inverse_transform(predictions_mismatched)\n",
"print(classification_report(dev_df_mismatched[LABEL_COL], predictions_mismatched, digits=3))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import scrapbook as sb\n",
"result_matched_dict = classification_report(dev_df_matched[LABEL_COL], predictions_matched, digits=3, output_dict=True)\n",
"result_mismatched_dict = classification_report(dev_df_mismatched[LABEL_COL], predictions_mismatched, digits=3, output_dict=True)\n",
"\n",
"sb.glue(\"matched_precision\", result_matched_dict[\"weighted avg\"][\"precision\"])\n",
"sb.glue(\"matched_recall\", result_matched_dict[\"weighted avg\"][\"recall\"])\n",
"sb.glue(\"matched_f1\", result_matched_dict[\"weighted avg\"][\"f1-score\"])\n",
"\n",
"sb.glue(\"mismatched_precision\", result_mismatched_dict[\"weighted avg\"][\"precision\"])\n",
"sb.glue(\"mismatched_recall\", result_mismatched_dict[\"weighted avg\"][\"recall\"])\n",
"sb.glue(\"mismatched_f1\", result_mismatched_dict[\"weighted avg\"][\"f1-score\"])"
]
}
],
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "nlp_gpu",
"language": "python",
"name": "nlp_gpu"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View file

@@ -23,5 +23,4 @@ The following summarizes each notebook for Text Classification. Each notebook pr
|[BERT for text classification of Hindi BBC News](tc_bbc_bert_hi.ipynb)|Local| A notebook which walks through fine-tuning and evaluating a pre-trained BERT model on Hindi BBC news data|[BBC Hindi News](https://github.com/NirantK/hindi2vec/releases/tag/bbc-hindi-v0.1)|
|[BERT for text classification of Arabic News](tc_dac_bert_ar.ipynb)|Local| A notebook which walks through fine-tuning and evaluating a pre-trained BERT model on Arabic news articles|[DAC](https://data.mendeley.com/datasets/v524p5dhpj/2)|
|[Text Classification of MultiNLI Sentences using Multiple Transformer Models](tc_mnli_transformers.ipynb)|Local| A notebook which walks through fine-tuning and evaluating a number of pre-trained transformer models|[MultiNLI](https://www.nyu.edu/projects/bowman/multinli/)|
|[Text Classification Pipelines with Azure Machine Learning](tc_transformers_azureml_pipelines/tc_transformers_azureml_pipelines.ipynb)|Azure ML| A notebook which walks through building Azure ML pipelines for fine-tuning multiple transformer models|[MultiNLI](https://www.nyu.edu/projects/bowman/multinli/)|
|[Text Classification of Multi Language Datasets using Transformer Model](tc_multi_languages_transformers.ipynb)|Local|A notebook which walks through fine-tuning and evaluating a pre-trained transformer model for multiple datasets in different languages|[MultiNLI](https://www.nyu.edu/projects/bowman/multinli/) <br> [BBC Hindi News](https://github.com/NirantK/hindi2vec/releases/tag/bbc-hindi-v0.1) <br> [DAC](https://data.mendeley.com/datasets/v524p5dhpj/2)|

View file

@@ -1,42 +0,0 @@
from utils_nlp.models.transformers.sequence_classification import SequenceClassifier, Processor
import sys
import pandas as pd
import os
import pickle
from sklearn.preprocessing import LabelEncoder
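# Positional command-line arguments, in the order the AzureML pipeline notebook
# passes them to this script via PythonScriptStep:
# input_dir, input_data_file, output_dir, text_col, label_col, model_name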
input_dir = sys.argv[1]
input_data_file = sys.argv[2]
output_dir = sys.argv[3]
text_col = sys.argv[4]
label_col = sys.argv[5]
model_name = sys.argv[6]
max_len = 150
cache_dir = "."
dataset_suffix = "_ds"
label_encoder_suffix = "_le"
if output_dir is not None:
os.makedirs(output_dir, exist_ok=True)
# read data
df = pd.read_csv(os.path.join(input_dir, input_data_file), quoting=1)
# encode labels
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(df[label_col])
# preprocess
processor = Processor(model_name=model_name, cache_dir=cache_dir)
ds = processor.preprocess(df[text_col], labels, max_len=max_len)
# write preprocessed dataset
output_data_file = model_name + dataset_suffix
pickle.dump(ds, open(os.path.join(output_dir, output_data_file), "wb"))
# write label encoder
label_encoder_file = model_name + label_encoder_suffix
pickle.dump(label_encoder, open(os.path.join(output_dir, label_encoder_file), "wb"))

View file

@@ -1,36 +0,0 @@
import os
import pickle
import shutil
import sys
import torch
from utils_nlp.models.transformers.sequence_classification import SequenceClassifier
print("CUDA is{} available".format("" if torch.cuda.is_available() else " not"))
input_dir = sys.argv[1]
output_dir = sys.argv[2]
model_name = sys.argv[3]
num_gpus = int(sys.argv[4])
cache_dir = "."
num_labels = 5
batch_size = 16
dataset_suffix = "_ds"
trained_model_suffix = "_clf"
label_encoder_suffix = "_le"
if output_dir is not None:
os.makedirs(output_dir, exist_ok=True)
# load dataset
ds = pickle.load(open(os.path.join(input_dir, model_name + dataset_suffix), "rb"))
# fine-tune
classifier = SequenceClassifier(model_name=model_name, num_labels=num_labels, cache_dir=cache_dir)
classifier.fit(ds, batch_size=batch_size, num_gpus=num_gpus, verbose=False)
# write classifier
pickle.dump(classifier, open(os.path.join(output_dir, model_name + trained_model_suffix), "wb"))
# write label encoder
shutil.move(
os.path.join(input_dir, model_name + label_encoder_suffix),
os.path.join(output_dir, model_name + label_encoder_suffix),
)

View file

@@ -1,598 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Text Classification Pipelines with Azure Machine Learning"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example, we fine-tune and evaluate a number of pretrained models on a subset of the [MultiNLI](https://www.nyu.edu/projects/bowman/multinli/) dataset using [Azure Machine Learning Pipelines](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-ml-pipelines). Pipelines allow us to create sequential steps for preprocessing and training workflows, in addition to parallel steps that run independenly on a cluster of nodes. We demonstrate how one can submit model training jobs for multiple models, each consisting of multiple steps.\n",
"\n",
"We use a [sequence classifier](../../../utils_nlp/models/transformers/sequence_classification.py) that wraps [Hugging Face's PyTorch implementation](https://github.com/huggingface/transformers) of different transformers, like [BERT](https://github.com/google-research/bert), [XLNet](https://github.com/zihangdai/xlnet), and [RoBERTa](https://github.com/pytorch/fairseq).\n",
"\n",
"Below is a general illustration of the pipeline and its preprocessing and training steps.\n",
"\n",
"<img src=\"https://nlpbp.blob.core.windows.net/images/tc_pipeline_graph.PNG\" width=\"500\">\n",
"\n",
"The pipeline steps we chose are generic [Python script steps](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.python_script_step.pythonscriptstep?view=azure-ml-py) of the Azure ML SDK. This allows us to run parametrized Python scripts on a remote target. For this example, we will create pipeline steps that execute the preprocessing and training scripts provided in the [scripts](scripts) folder, with different arguments for different model types.\n",
"\n",
"# Table of Contents\n",
"\n",
"- [Define Parameters](#Define-Parameters)\n",
"- [Create AML Workspace and Compute Target](#Create-AML-Workspace-and-Compute-Target)\n",
"- [Upload Training Data to Workspace](#Upload-Training-Data-to-Workspace)\n",
"- [Setup Execution Environment](#Setup-Execution-Environment)\n",
"- [Define Pipeline Graph](#Define-Pipeline-Graph)\n",
"- [Run Pipeline](#Run-Pipeline)\n",
"- [Retrieve a Trained Model from Pipeline](#Retrieve-a-Trained-Model-from-Pipeline)\n",
"- [Test Model](#Test-Model)\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from datetime import datetime\n",
"import os\n",
"import pandas as pd\n",
"import pickle\n",
"from azureml.core import Datastore, Environment, Experiment\n",
"from azureml.core.compute import AmlCompute, ComputeTarget\n",
"from azureml.core.conda_dependencies import CondaDependencies\n",
"from azureml.core.runconfig import RunConfiguration\n",
"from azureml.data.data_reference import DataReference\n",
"from azureml.exceptions import ComputeTargetException\n",
"from azureml.pipeline.core import Pipeline, PipelineData, PipelineRun\n",
"from azureml.pipeline.steps import PythonScriptStep\n",
"from utils_nlp.azureml import azureml_utils\n",
"from utils_nlp.dataset.multinli import load_pandas_df\n",
"from utils_nlp.models.transformers.sequence_classification import Processor, SequenceClassifier"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define Parameters"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"SUBSCRIPTION_ID = \"\"\n",
"RESOURCE_GROUP = \"\"\n",
"WORKSPACE_NAME = \"\"\n",
"WORKSPACE_REGION = \"\" # \"southcentralus\", \"eastus\", etc\n",
"\n",
"# remote target\n",
"CLUSTER_NAME = \"\" # 2-16 chars\n",
"VM_SIZE = \"STANDARD_NC12\"\n",
"MIN_NODES = 0\n",
"MAX_NODES = 2\n",
"\n",
"# local data\n",
"TEMP_DIR = \"temp\"\n",
"TRAIN_FILE = \"train.csv\"\n",
"TEXT_COL = \"text\"\n",
"LABEL_COL = \"label\"\n",
"TRAIN_SAMPLE_SIZE = 10000\n",
"# remote data\n",
"REMOTE_DATA_CONTAINER = \"data\"\n",
"\n",
"# remote env config\n",
"PIP_PACKAGES = [\"azureml-sdk==1.0.65\", \"torch==1.1\", \"tqdm==4.31.1\", \"transformers==2.1.1\"]\n",
"CONDA_PACKAGES = [\"numpy\", \"scikit-learn\", \"pandas\"]\n",
"UTILS_NLP_WHL_DIR = \"../../../dist\"\n",
"PYTHON_VERSION = \"3.6.8\"\n",
"USE_GPU = True\n",
"\n",
"# pipeline scripts\n",
"SCRIPTS_DIR = \"scripts\"\n",
"PREPROCESS_SCRIPT = \"preprocess.py\"\n",
"TRAIN_SCRIPT = \"train.py\"\n",
"TRAINED_MODEL_SUFFIX = \"_clf\"\n",
"LABEL_ENCODER_SUFFIX = \"_le\"\n",
"\n",
"# pretrained models\n",
"MODEL_NAMES = [\"bert-base-uncased\", \"xlnet-base-cased\"]\n",
"\n",
"# experiment\n",
"EXPERIMENT_NAME = \"tc_pipeline_exp\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create AML Workspace and Compute Target\n",
"\n",
"The following code block creates or retrieves an existing Azure ML workspace and a corresponding Azure ML compute target. For deep learning tasks, it is recommended that your compute nodes are GPU-enabled. Here, we're using a scalable cluster of size *(min_nodes, max_nodes)*. Setting *min_nodes* to zero ensures that the nodes are shutdown when not in use. Azure ML will allocate nodes as needed, up to *max_nodes*, and based on the jobs submitted to the compute target."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# create/get AML workspace\n",
"ws = azureml_utils.get_or_create_workspace(\n",
" subscription_id=SUBSCRIPTION_ID,\n",
" resource_group=RESOURCE_GROUP,\n",
" workspace_name=WORKSPACE_NAME,\n",
" workspace_region=WORKSPACE_REGION,\n",
")\n",
"\n",
"# create/get compute target\n",
"try:\n",
" compute_target = ComputeTarget(workspace=ws, name=CLUSTER_NAME)\n",
"except ComputeTargetException:\n",
" compute_config = AmlCompute.provisioning_configuration(\n",
" vm_size=VM_SIZE, min_nodes=MIN_NODES, max_nodes=MAX_NODES, vm_priority=\"lowpriority\"\n",
" )\n",
" compute_target = ComputeTarget.create(\n",
" workspace=ws, name=CLUSTER_NAME, provisioning_configuration=compute_config\n",
" )\n",
" compute_target.wait_for_completion(show_output=True)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Upload Training Data to Workspace"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example, we use a subset of the MultiNLI dataset for fine-tuning the specified pre-trained models. The dataset contains a column of sentences (*sentence1*) which we will use as text input, and a *genre* column which we use as class labels."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>text</th>\n",
" <th>label</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>44423</th>\n",
" <td>Perhaps they had.</td>\n",
" <td>fiction</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15924</th>\n",
" <td>James and Hemingway give way to the shabby bac...</td>\n",
" <td>slate</td>\n",
" </tr>\n",
" <tr>\n",
" <th>244226</th>\n",
" <td>For the next two decades boatloads of people c...</td>\n",
" <td>travel</td>\n",
" </tr>\n",
" <tr>\n",
" <th>292929</th>\n",
" <td>'Seems fair enough,' I effortlessly broke free...</td>\n",
" <td>fiction</td>\n",
" </tr>\n",
" <tr>\n",
" <th>351247</th>\n",
" <td>The real reason, said the Pioneer , was that t...</td>\n",
" <td>slate</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" text label\n",
"44423 Perhaps they had. fiction\n",
"15924 James and Hemingway give way to the shabby bac... slate\n",
"244226 For the next two decades boatloads of people c... travel\n",
"292929 'Seems fair enough,' I effortlessly broke free... fiction\n",
"351247 The real reason, said the Pioneer , was that t... slate"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# create training data sample\n",
"os.makedirs(TEMP_DIR, exist_ok=True)\n",
"df = load_pandas_df(TEMP_DIR, \"train\")\n",
"df = df[df[\"gold_label\"] == \"neutral\"] # filter duplicate sentences\n",
"df = df.sample(TRAIN_SAMPLE_SIZE)\n",
"df[TEXT_COL] = df[\"sentence1\"]\n",
"df[LABEL_COL] = df[\"genre\"]\n",
"df[[TEXT_COL, LABEL_COL]].to_csv(\n",
" os.path.join(TEMP_DIR, TRAIN_FILE), header=True, index=None, quoting=1\n",
")\n",
"# inspect dataset\n",
"df[[TEXT_COL, LABEL_COL]].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Azure ML workspace comes with a default datastore that is linked to an Azure Blob storage in the same resource group. We will use this datastore to upload the CSV data file. We will also use it for the intermediate output of the pipeline steps, as well as for the final output of the training step. In practice, one can create other datastores and link them to existing Blob Storage containers."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Uploading an estimated of 1 files\n",
"Uploading temp/train.csv\n",
"Uploaded temp/train.csv, 1 files out of an estimated total of 1\n",
"Uploaded 1 files\n"
]
},
{
"data": {
"text/plain": [
"$AZUREML_DATAREFERENCE_35a15e0fc32e4ff2bd8fa3c5a42e2426"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# upload data to datastore\n",
"ds = ws.get_default_datastore()\n",
"ds.upload_files(\n",
" files=[os.path.join(TEMP_DIR, TRAIN_FILE)],\n",
" target_path=REMOTE_DATA_CONTAINER,\n",
" overwrite=True,\n",
" show_progress=True,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup Execution Environment"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In addition to the *pip* and *conda* dependencies listed in the parameters section, we would need to include the packaged utils_nlp wheel file (this can be created by running *python3 setup.py bdist_wheel* from the root dir). The utils_nlp folder of this repo includes the transformer procesor and the classifier that we will fine-tune on the remote target. The *preprocess.py* and *train.py* [scripts](scripts) import the *utils_nlp* package, as they call the preprocessing and classification functions of its wrapper classes."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# locate utils_nlp whl file\n",
"utils_nlp_whl_file = [x for x in os.listdir(UTILS_NLP_WHL_DIR) if x.endswith(\".whl\")][0]"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# conda env setup\n",
"conda_dependencies = CondaDependencies.create(\n",
" conda_packages=CONDA_PACKAGES,\n",
" pip_packages=PIP_PACKAGES,\n",
" python_version=PYTHON_VERSION,\n",
")\n",
"nlp_repo_whl = Environment.add_private_pip_wheel(\n",
" workspace=ws,\n",
" file_path=os.path.join(UTILS_NLP_WHL_DIR, utils_nlp_whl_file),\n",
" exist_ok=True,\n",
")\n",
"conda_dependencies.add_pip_package(nlp_repo_whl)\n",
"run_config = RunConfiguration(conda_dependencies=conda_dependencies)\n",
"run_config.environment.docker.enabled = True\n",
"if USE_GPU:\n",
" run_config.environment.docker.base_image = azureml.core.runconfig.DEFAULT_GPU_IMAGE"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define Pipeline Graph\n",
"\n",
"As shown in the diagram earlier, the pipeline can be represented as a graph, where nodes represent execution steps. In this example we create a pipeline with two steps for each pretrained model we want to fine-tune. The processing and fine-tuning steps need to be executed in order. However, each sequence of these two steps can be executed in parallel for many types of models on multiple nodes of the compute cluster.\n",
"\n",
"For text classification, a number of pretrained-models are available from [Hugging Face's transformers package](https://github.com/huggingface/transformers), which is used within *utils_nlp*. Here, we include preprocessing and training steps for the *MODEL_NAMES* defined in the parameters section. You can list the supported pretrained models using the following code."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['bert-base-uncased', 'bert-large-uncased', 'bert-base-cased', 'bert-large-cased', 'bert-base-multilingual-uncased', 'bert-base-multilingual-cased', 'bert-base-chinese', 'bert-base-german-cased', 'bert-large-uncased-whole-word-masking', 'bert-large-cased-whole-word-masking', 'bert-large-uncased-whole-word-masking-finetuned-squad', 'bert-large-cased-whole-word-masking-finetuned-squad', 'bert-base-cased-finetuned-mrpc', 'bert-base-german-dbmdz-cased', 'bert-base-german-dbmdz-uncased', 'roberta-base', 'roberta-large', 'roberta-large-mnli', 'xlnet-base-cased', 'xlnet-large-cased', 'distilbert-base-uncased', 'distilbert-base-uncased-distilled-squad']\n"
]
}
],
"source": [
"print(SequenceClassifier.list_supported_models())"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# pipeline input\n",
"input_dir = DataReference(\n",
" datastore=ds,\n",
" data_reference_name=\"input_dir\",\n",
" path_on_datastore=REMOTE_DATA_CONTAINER,\n",
" overwrite=False,\n",
")\n",
"\n",
"# create 2 pipeline steps (preprocessing & training) for each model\n",
"all_steps = []\n",
"\n",
"for model_name in MODEL_NAMES:\n",
"\n",
" # intermediate output\n",
" preprocess_dir = PipelineData(\n",
" name=\"preprocessed\",\n",
" datastore=ds,\n",
" output_path_on_compute=REMOTE_DATA_CONTAINER + \"/\" + \"preprocessed_\" + model_name,\n",
" )\n",
" # final output\n",
" output_dir = PipelineData(\n",
" name=\"trained\",\n",
" datastore=ds,\n",
" output_path_on_compute=REMOTE_DATA_CONTAINER + \"/\" + \"trained_\" + model_name,\n",
" )\n",
"\n",
" preprocess_step = PythonScriptStep(\n",
" name=\"preprocess_step_{}\".format(model_name),\n",
" arguments=[input_dir, TRAIN_FILE, preprocess_dir, TEXT_COL, LABEL_COL, model_name],\n",
" script_name=PREPROCESS_SCRIPT,\n",
" inputs=[input_dir],\n",
" outputs=[preprocess_dir],\n",
" source_directory=SCRIPTS_DIR,\n",
" compute_target=compute_target,\n",
" runconfig=run_config,\n",
" allow_reuse=False,\n",
" )\n",
"\n",
" train_step = PythonScriptStep(\n",
" name=\"train_step_{}\".format(model_name),\n",
" arguments=[preprocess_dir, output_dir, model_name, MAX_NODES],\n",
" script_name=TRAIN_SCRIPT,\n",
" inputs=[preprocess_dir],\n",
" outputs=[output_dir],\n",
" source_directory=SCRIPTS_DIR,\n",
" compute_target=compute_target,\n",
" runconfig=run_config,\n",
" allow_reuse=False,\n",
" )\n",
"\n",
" # set execution order of steps\n",
" train_step.run_after(preprocess_step)\n",
"\n",
" all_steps.append(preprocess_step)\n",
" all_steps.append(train_step)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following image is an example of how the pipeline graph generated by Azure ML looks like. This particular graph example represents a pipeline submitted for 2 models with a total of 4 steps.\n",
"\n",
"<img src=\"https://nlpbp.blob.core.windows.net/images/pipeline_graph_example.PNG\" width=\"700\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Run Pipeline\n",
"\n",
"Once the pipeline and its steps are defined, we can create an experiment in the Azure ML workspace and submit a pipeline run as shown below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# create pipeline\n",
"pipeline = Pipeline(workspace=ws, steps=[all_steps])\n",
"experiment_name = EXPERIMENT_NAME + datetime.now().strftime(\"%H%M%S\")\n",
"experiment = Experiment(ws, experiment_name)\n",
"pipeline_run = experiment.submit(pipeline)\n",
"pipeline_run.wait_for_completion(show_output=False)\n",
"pipeline_run_id = pipeline_run.id"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Retrieve a Trained Model from Pipeline\n",
"\n",
"The Azure ML SDK allows retrieving the pipeline runs and steps using the run id and step name. The following example downloads the output of the training step of the first model in *MODEL_NAMES*, which includes the fine-tuned classifier and the label_encoder used earlier."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# retrieve an existing training step & download corresponding model\n",
"# (from an existing experiment and pipeline run)\n",
"experiment = Experiment(ws, experiment_name)\n",
"pipeline_run = PipelineRun(experiment, pipeline_run_id)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"# find the training step for the first model\n",
"train_step_run = pipeline_run.find_step_run(\"train_step_{}\".format(MODEL_NAMES[0]))[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# download its output (a traind model & a label encoder)\n",
"train_step_run.get_output_data(output_dir.name).download(local_path=TEMP_DIR)\n",
"\n",
"# load classifier and label encoder\n",
"trained_dir = (\n",
" \"./temp/azureml/\" + train_step_run.id + \"/\" + output_dir.name \n",
")\n",
"classifier = pickle.load(open(trained_dir + \"/\" + MODEL_NAMES[0] + TRAINED_MODEL_SUFFIX, \"rb\"))\n",
"label_encoder = pickle.load(open(trained_dir + \"/\" + MODEL_NAMES[0] + LABEL_ENCODER_SUFFIX, \"rb\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Test Model\n",
"Finally, we can test the model by scoring some text input."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Evaluating: 100%|██████████| 1/1 [00:01<00:00, 1.56s/it]\n"
]
},
{
"data": {
"text/plain": [
"array(['fiction'], dtype=object)"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# test\n",
"test_input = [\"Let's go to Orlando. I've heard it's a nice place\"]\n",
"processor = Processor(model_name=MODEL_NAMES[0], cache_dir=TEMP_DIR)\n",
"test_ds = processor.preprocess(test_input, max_len=150)\n",
"pred = classifier.predict(test_ds, device=\"cpu\")\n",
"label_encoder.inverse_transform(pred)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"file_extension": ".py",
"kernelspec": {
"display_name": "Python [conda env:nlp_gpu]",
"language": "python",
"name": "conda-env-nlp_gpu-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
},
"mimetype": "text/x-python",
"name": "python",
"npconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": 3
},
"nbformat": 4,
"nbformat_minor": 2
}