Merge pull request #254 from microsoft/sequence-classifier-dist

Entailment using BERT and AzureML, along with Text Classification updates
This commit is contained in:
Said Bleik 2019-08-13 14:09:20 -04:00 committed by GitHub
Parents f99d34a05e 3789c15e15
Commit 0642773154
No key found matching this signature
GPG key ID: 4AEE18F83AFDEB23
7 changed files with 1033 additions and 320 deletions

View file

@ -0,0 +1,531 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Natural Language Inference on XNLI Dataset using BERT with Azure Machine Learning\n",
"\n",
"## 1. Summary\n",
"In this notebook, we demostrate using the BERT model to do language inference in English. We use the [XNLI](https://github.com/facebookresearch/XNLI) dataset and the task is to classify sentence pairs into three classes: contradiction, entailment, and neutral. \n",
"The figure below shows how [BERT](https://arxiv.org/abs/1810.04805) classifies sentence pairs. It concatenates the tokens in each sentence pairs and separates the sentences by the [SEP] token. A [CLS] token is prepended to the token list and used as the aggregate sequence representation for the classification task.\n",
"<img src=\"https://nlpbp.blob.core.windows.net/images/bert_two_sentence.PNG\">\n",
"\n",
"Azure Machine Learning features higlighted in the notebook : \n",
"\n",
"- Distributed training with Horovod"
]
},
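{
"cell_type": "markdown",
"metadata": {},
"source": [
"The short sketch below is illustrative only and not part of the training pipeline: it shows how a premise/hypothesis pair is packed into a single BERT input sequence with [CLS]/[SEP] tokens and segment ids. It assumes the `pytorch-pretrained-bert` package is installed locally, and the example sentences are made up."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: pack a premise/hypothesis pair into one BERT input.\n",
"from pytorch_pretrained_bert import BertTokenizer\n",
"\n",
"bert_tokenizer = BertTokenizer.from_pretrained(\"bert-base-uncased\", do_lower_case=True)\n",
"premise = bert_tokenizer.tokenize(\"A man inspects a uniform.\")\n",
"hypothesis = bert_tokenizer.tokenize(\"The man is sleeping.\")\n",
"\n",
"# [CLS] premise tokens [SEP] hypothesis tokens [SEP]\n",
"pair_tokens = [\"[CLS]\"] + premise + [\"[SEP]\"] + hypothesis + [\"[SEP]\"]\n",
"pair_token_ids = bert_tokenizer.convert_tokens_to_ids(pair_tokens)\n",
"\n",
"# Segment (token type) ids mark which sentence each token belongs to (0 or 1).\n",
"segment_ids = [0] * (len(premise) + 2) + [1] * (len(hypothesis) + 1)\n",
"print(pair_tokens)\n",
"print(segment_ids)"
]
},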
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Imports\n",
"\n",
"import sys\n",
"\n",
"sys.path.append(\"../..\")\n",
"\n",
"import os\n",
"import shutil\n",
"import torch\n",
"import json\n",
"import pandas as pd\n",
"\n",
"import azureml.core\n",
"from azureml.train.dnn import PyTorch\n",
"from azureml.core.runconfig import MpiConfiguration\n",
"from azureml.core import Experiment\n",
"from azureml.widgets import RunDetails\n",
"from azureml.core.compute import ComputeTarget\n",
"from utils_nlp.azureml.azureml_utils import get_or_create_workspace, get_output_files"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"tags": [
"parameters"
]
},
"outputs": [],
"source": [
"# Parameters\n",
"\n",
"DEBUG = True\n",
"NODE_COUNT = 4\n",
"NUM_PROCESS = 1\n",
"DATA_PERCENT_USED = 1.0\n",
"\n",
"config_path = (\n",
" \"./.azureml\"\n",
") # Path to the directory containing config.json with azureml credentials\n",
"\n",
"# Azure resources\n",
"subscription_id = \"YOUR_SUBSCRIPTION_ID\"\n",
"resource_group = \"YOUR_RESOURCE_GROUP_NAME\" \n",
"workspace_name = \"YOUR_WORKSPACE_NAME\" \n",
"workspace_region = \"YOUR_WORKSPACE_REGION\" # eg: eastus, eastus2.\n",
"cluster_name = \"gpu-entail\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. AzureML Setup\n",
"\n",
"### 2.1 Link to or create a Workspace\n",
"\n",
"First, go through the [Configuration](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) notebook to install the Azure Machine Learning Python SDK and create an Azure ML `Workspace`. This will create a config.json file containing the values needed below to create a workspace.\n",
"\n",
"**Note**: you do not need to fill in these values if you have a config.json in the same folder as this notebook"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"ws = get_or_create_workspace(\n",
" config_path=config_path,\n",
" subscription_id=subscription_id,\n",
" resource_group=resource_group,\n",
" workspace_name=workspace_name,\n",
" workspace_region=workspace_region,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\n",
" \"Workspace name: \" + ws.name,\n",
" \"Azure region: \" + ws.location,\n",
" \"Subscription id: \" + ws.subscription_id,\n",
" \"Resource group: \" + ws.resource_group,\n",
" sep=\"\\n\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.3 Link AmlCompute Compute Target\n",
"\n",
"We need to link a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training our model (see [compute options](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets#supported-compute-targets) for explanation of the different options). We will use an [AmlCompute](https://docs.microsoft.com/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute) target and link to an existing target (if the cluster_name exists) or create a STANDARD_NC6 GPU cluster (autoscales from 0 to 4 nodes) in this example. Creating a new AmlComputes takes approximately 5 minutes. \n",
"\n",
"As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Found compute target: gpu-entail\n",
"{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2019-08-03T13:43:20.068000+00:00', 'errors': None, 'creationTime': '2019-07-27T02:14:46.127092+00:00', 'modifiedTime': '2019-07-27T02:15:07.181277+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_NC6S_V2'}\n"
]
}
],
"source": [
"try:\n",
" compute_target = ComputeTarget(workspace=ws, name=cluster_name)\n",
" print(\"Found compute target: {}\".format(cluster_name))\n",
"except ComputeTargetException:\n",
" print(\"Creating new compute target: {}\".format(cluster_name))\n",
" compute_config = AmlCompute.provisioning_configuration(\n",
" vm_size=\"STANDARD_NC6\", max_nodes=1\n",
" )\n",
" compute_target = ComputeTarget.create(ws, cluster_name, compute_config)\n",
" compute_target.wait_for_completion(show_output=True)\n",
"\n",
"\n",
"print(compute_target.get_status().serialize())"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'./entail_utils\\\\utils_nlp'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"project_dir = \"./entail_utils\"\n",
"if DEBUG and os.path.exists(project_dir):\n",
" shutil.rmtree(project_dir)\n",
"shutil.copytree(\"../../utils_nlp\", os.path.join(project_dir, \"utils_nlp\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Prepare Training Script"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Writing ./entail_utils/train.py\n"
]
}
],
"source": [
"%%writefile $project_dir/train.py\n",
"import horovod.torch as hvd\n",
"import torch\n",
"import numpy as np\n",
"import time\n",
"import argparse\n",
"from utils_nlp.common.timer import Timer\n",
"from utils_nlp.dataset.xnli_torch_dataset import XnliDataset\n",
"from utils_nlp.models.bert.common import Language\n",
"from utils_nlp.models.bert.sequence_classification_distributed import (\n",
" BERTSequenceClassifier,\n",
")\n",
"from sklearn.metrics import classification_report\n",
"\n",
"print(\"Torch version:\", torch.__version__)\n",
"\n",
"hvd.init()\n",
"\n",
"LANGUAGE_ENGLISH = \"en\"\n",
"TRAIN_FILE_SPLIT = \"train\"\n",
"TEST_FILE_SPLIT = \"test\"\n",
"TO_LOWERCASE = True\n",
"PRETRAINED_BERT_LNG = Language.ENGLISH\n",
"LEARNING_RATE = 5e-5\n",
"WARMUP_PROPORTION = 0.1\n",
"BATCH_SIZE = 32\n",
"NUM_GPUS = 1\n",
"OUTPUT_DIR = \"./outputs/\"\n",
"LABELS = [\"contradiction\", \"entailment\", \"neutral\"]\n",
"\n",
"## each machine gets it's own copy of data\n",
"CACHE_DIR = \"./xnli-%d\" % hvd.rank()\n",
"\n",
"parser = argparse.ArgumentParser()\n",
"# Training settings\n",
"parser.add_argument(\n",
" \"--seed\", type=int, default=42, metavar=\"S\", help=\"random seed (default: 42)\"\n",
")\n",
"parser.add_argument(\n",
" \"--epochs\", type=int, default=2, metavar=\"S\", help=\"random seed (default: 2)\"\n",
")\n",
"parser.add_argument(\n",
" \"--no-cuda\", action=\"store_true\", default=False, help=\"disables CUDA training\"\n",
")\n",
"parser.add_argument(\n",
" \"--data_percent_used\",\n",
" type=float,\n",
" default=1.0,\n",
" metavar=\"S\",\n",
" help=\"data percent used (default: 1.0)\",\n",
")\n",
"\n",
"args = parser.parse_args()\n",
"args.cuda = not args.no_cuda and torch.cuda.is_available()\n",
"\n",
"\"\"\"\n",
"Note: For example, you have 4 nodes and 4 GPUs each node, so you spawn 16 workers. \n",
"Every worker will have a rank [0, 15], and every worker will have a local_rank [0, 3]\n",
"\"\"\"\n",
"if args.cuda:\n",
" torch.cuda.set_device(hvd.local_rank())\n",
" torch.cuda.manual_seed(args.seed)\n",
"\n",
"# num_workers - this is equal to number of gpus per machine\n",
"kwargs = {\"num_workers\": NUM_GPUS, \"pin_memory\": True} if args.cuda else {}\n",
"\n",
"train_dataset = XnliDataset(\n",
" file_split=TRAIN_FILE_SPLIT,\n",
" cache_dir=CACHE_DIR,\n",
" language=LANGUAGE_ENGLISH,\n",
" to_lowercase=TO_LOWERCASE,\n",
" tok_language=PRETRAINED_BERT_LNG,\n",
" data_percent_used=args.data_percent_used,\n",
")\n",
"\n",
"\n",
"# set the label_encoder for evaluation\n",
"label_encoder = train_dataset.label_encoder\n",
"num_labels = len(np.unique(train_dataset.labels))\n",
"\n",
"# Train\n",
"classifier = BERTSequenceClassifier(\n",
" language=Language.ENGLISH,\n",
" num_labels=num_labels,\n",
" cache_dir=CACHE_DIR,\n",
" use_distributed=True,\n",
")\n",
"\n",
"\n",
"train_loader = classifier.create_data_loader(\n",
" train_dataset, BATCH_SIZE, mode=\"train\", **kwargs\n",
")\n",
"\n",
"\n",
"num_samples = len(train_loader.dataset)\n",
"num_batches = int(num_samples / BATCH_SIZE)\n",
"num_train_optimization_steps = num_batches * args.epochs\n",
"optimizer = classifier.create_optimizer(\n",
" num_train_optimization_steps, lr=LEARNING_RATE, warmup_proportion=WARMUP_PROPORTION\n",
")\n",
"\n",
"with Timer() as t:\n",
" for epoch in range(1, args.epochs + 1):\n",
"\n",
" # to allow data shuffling for DistributedSampler\n",
" train_loader.sampler.set_epoch(epoch)\n",
"\n",
" # epoch and num_epochs is passed in the fit function to print loss at regular batch intervals\n",
" classifier.fit(\n",
" train_loader,\n",
" epoch=epoch,\n",
" num_epochs=args.epochs,\n",
" bert_optimizer=optimizer,\n",
" num_gpus=NUM_GPUS,\n",
" )\n",
"\n",
"#if machine has multiple gpus then run predictions on only on 1 gpu since test_dataset is small.\n",
"if hvd.rank() == 0:\n",
" NUM_GPUS = 1\n",
" \n",
" test_dataset = XnliDataset(\n",
" file_split=TEST_FILE_SPLIT,\n",
" cache_dir=CACHE_DIR,\n",
" language=LANGUAGE_ENGLISH,\n",
" to_lowercase=TO_LOWERCASE,\n",
" tok_language=PRETRAINED_BERT_LNG,\n",
" )\n",
"\n",
" test_loader = classifier.create_data_loader(test_dataset, mode=\"test\")\n",
"\n",
" # predict\n",
" predictions, pred_labels = classifier.predict(test_loader, NUM_GPUS)\n",
"\n",
" predictions = label_encoder.inverse_transform(predictions)\n",
"\n",
" # Evaluate\n",
" results = classification_report(\n",
" pred_labels, predictions, target_names=LABELS, output_dict=True\n",
" )\n",
"\n",
" result_file = os.path.join(OUTPUT_DIR, \"results.json\")\n",
" with open(result_file, \"w+\") as fp:\n",
" json.dump(results, fp)\n",
"\n",
" # save model\n",
" classifier.save_model()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Create a PyTorch Estimator\n",
"\n",
"BERT is built on PyTorch, so we will use the AzureML SDK's PyTorch estimator to easily submit PyTorch training jobs for both single-node and distributed runs. For more information on the PyTorch estimator, see [How to Train Pytorch Models on AzureML](https://docs.microsoft.com/azure/machine-learning/service/how-to-train-pytorch). "
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"mpiConfig = MpiConfiguration()\n",
"mpiConfig.process_count_per_node = NUM_PROCESS\n",
"\n",
"script_params = {\n",
" '--data_percent_used': DATA_PERCENT_USED\n",
"}\n",
"\n",
"est = PyTorch(\n",
" source_directory=project_dir,\n",
" compute_target=compute_target,\n",
" entry_script=\"train.py\",\n",
" script_params = script_params,\n",
" node_count=NODE_COUNT,\n",
" distributed_training=mpiConfig,\n",
" use_gpu=True,\n",
" framework_version=\"1.0\",\n",
" conda_packages=[\"scikit-learn=0.20.3\", \"numpy\", \"spacy\", \"nltk\"],\n",
" pip_packages=[\"pandas\", \"seqeval[gpu]\", \"pytorch-pretrained-bert\"],\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Create Experiment and Submit a Job\n",
"Submit the estimator object to run your experiment. Results can be monitored using a Jupyter widget. The widget and run are asynchronous and update every 10-15 seconds until job completion.\n",
"\n",
"**Note**: The experiment takes ~4 hours with 2 NC24 nodes and ~7hours with 4 NC6 nodes. The overhead is due to the communication time between nodes. "
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"experiment = Experiment(ws, name=\"NLP-Entailment-BERT\")\n",
"run = experiment.submit(est)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "c8e7a44fa8804e95b21eea74d7694b1e",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"RunDetails(run).show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since the above cell is an async call, the below cell is a blocking call to stop the cells below it to execute."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"run.wait_for_completion()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 6. Analyze Results\n",
"\n",
"Download result.json from portal and open to view results. "
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Downloading file outputs/results.json to ./outputs\\results.json...\n"
]
}
],
"source": [
"file_names = [\"outputs/results.json\"]\n",
"get_output_files(run, \"./outputs\", file_names=file_names)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" f1-score precision recall support\n",
"contradiction 0.838749 0.859296 0.819162 1670.0\n",
"entailment 0.817280 0.877663 0.764671 1670.0\n",
"neutral 0.777870 0.719817 0.846108 1670.0\n",
"micro avg 0.809980 0.809980 0.809980 5010.0\n",
"macro avg 0.811300 0.818925 0.809980 5010.0\n",
"weighted avg 0.811300 0.818925 0.809980 5010.0\n"
]
}
],
"source": [
"with open(\"outputs/results.json\", \"r\") as handle:\n",
" parsed = json.load(handle)\n",
" print(pd.DataFrame.from_dict(parsed).transpose())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
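A quick sanity check on the distributed settings used in the notebook above (a hedged sketch, not part of the PR; the learning-rate scaling by hvd.size() comes from create_optimizer in sequence_classification_distributed.py further below):

NODE_COUNT = 4
NUM_PROCESS = 1                      # Horovod processes per node
workers = NODE_COUNT * NUM_PROCESS   # hvd.size() == 4
global_batch = 32 * workers          # 128 examples consumed per optimization step
scaled_lr = 5e-5 * workers           # 2e-4 effective learning rate when use_distributed=True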

View file

@ -32,7 +32,7 @@
},
{
"cell_type": "code",
"execution_count": 19,
"execution_count": 29,
"metadata": {},
"outputs": [
{
@ -85,7 +85,7 @@
},
{
"cell_type": "code",
"execution_count": 20,
"execution_count": 30,
"metadata": {
"tags": [
"parameters"
@ -135,7 +135,7 @@
},
{
"cell_type": "code",
"execution_count": 21,
"execution_count": 31,
"metadata": {
"scrolled": false
},
@ -164,7 +164,7 @@
},
{
"cell_type": "code",
"execution_count": 22,
"execution_count": 32,
"metadata": {},
"outputs": [
{
@ -172,7 +172,7 @@
"output_type": "stream",
"text": [
"Found existing compute target.\n",
"{'currentNodeCount': 2, 'targetNodeCount': 2, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 2, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2019-07-31T22:29:42.732000+00:00', 'errors': None, 'creationTime': '2019-07-25T04:16:20.598768+00:00', 'modifiedTime': '2019-07-25T04:16:36.486727+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 2, 'maxNodeCount': 10, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_NC12'}\n"
"{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2019-08-11T08:53:18.284000+00:00', 'errors': None, 'creationTime': '2019-07-25T04:16:20.598768+00:00', 'modifiedTime': '2019-08-05T06:40:12.292030+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 10, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_NC12'}\n"
]
}
],
@ -213,7 +213,7 @@
},
{
"cell_type": "code",
"execution_count": 23,
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
@ -234,7 +234,7 @@
},
{
"cell_type": "code",
"execution_count": 24,
"execution_count": 34,
"metadata": {
"scrolled": true
},
@ -271,9 +271,20 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 35,
"metadata": {},
"outputs": [],
"outputs": [
{
"data": {
"text/plain": [
"$AZUREML_DATAREFERENCE_9609849b541244d396d06017b5729edb"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ds = ws.get_default_datastore()\n",
"ds.upload(src_dir=TRAIN_FOLDER, target_path=\"mnli_data/train\", overwrite=True, show_progress=False)\n",
@ -282,7 +293,7 @@
},
{
"cell_type": "code",
"execution_count": 26,
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
@ -299,7 +310,7 @@
},
{
"cell_type": "code",
"execution_count": 27,
"execution_count": 37,
"metadata": {
"scrolled": true
},
@ -404,7 +415,7 @@
},
{
"cell_type": "code",
"execution_count": 28,
"execution_count": 38,
"metadata": {},
"outputs": [
{
@ -413,7 +424,7 @@
"'../../utils_nlp/models/bert/preprocess.py'"
]
},
"execution_count": 28,
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
@ -432,7 +443,7 @@
},
{
"cell_type": "code",
"execution_count": 29,
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
@ -461,7 +472,7 @@
},
{
"cell_type": "code",
"execution_count": 30,
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
@ -543,7 +554,7 @@
},
{
"cell_type": "code",
"execution_count": 31,
"execution_count": 41,
"metadata": {},
"outputs": [
{
@ -564,13 +575,14 @@
"import logging\n",
"import os\n",
"import torch\n",
"\n",
"from sklearn.metrics import classification_report\n",
"\n",
"from utils_nlp.models.bert.common import Language\n",
"from utils_nlp.models.bert.sequence_classification_distributed import (\n",
" BERTSequenceDistClassifier,\n",
")\n",
"from utils_nlp.common.timer import Timer\n",
"from utils_nlp.models.bert.common import Language, get_dataset_multiple_files\n",
"from utils_nlp.models.bert.sequence_classification_distributed import (\n",
" BERTSequenceClassifier,\n",
")\n",
"\n",
"BATCH_SIZE = 32\n",
"NUM_GPUS = 2\n",
@ -602,46 +614,64 @@
"# Handle square brackets from train list\n",
"train_files[0] = train_files[0][1:]\n",
"train_files[len(train_files) - 1] = train_files[len(train_files) - 1][:-1]\n",
"train_dataset = get_dataset_multiple_files(train_files)\n",
"\n",
"# Handle square brackets from test list\n",
"test_files[0] = test_files[0][1:]\n",
"test_files[len(test_files) - 1] = test_files[len(test_files) - 1][:-1]\n",
"test_dataset = get_dataset_multiple_files(test_files)\n",
"\n",
"# Train\n",
"classifier = BERTSequenceDistClassifier(\n",
" language=Language.ENGLISH, num_labels=len(LABELS)\n",
"classifier = BERTSequenceClassifier(\n",
" language=Language.ENGLISH, num_labels=len(LABELS), use_distributed=True\n",
")\n",
"\n",
"# Create data loaders.\n",
"kwargs = (\n",
" {\"num_workers\": NUM_GPUS, \"pin_memory\": True} if torch.cuda.is_available() else {}\n",
")\n",
"train_data_loader = classifier.create_data_loader(\n",
" train_dataset, batch_size=BATCH_SIZE, **kwargs\n",
")\n",
"test_data_loader = classifier.create_data_loader(\n",
" test_dataset, batch_size=BATCH_SIZE, mode=\"test\", **kwargs\n",
")\n",
"\n",
"# Create optimizer\n",
"num_examples = len(train_dataset)\n",
"num_batches = int(num_examples / BATCH_SIZE)\n",
"num_train_optimization_steps = num_batches * NUM_EPOCHS\n",
"optimizer = classifier.create_optimizer(num_train_optimization_steps)\n",
"\n",
"with Timer() as t:\n",
" classifier.fit(\n",
" train_files,\n",
" num_gpus=NUM_GPUS,\n",
" num_epochs=NUM_EPOCHS,\n",
" batch_size=BATCH_SIZE,\n",
" verbose=True,\n",
" )\n",
" for epoch in range(1, NUM_EPOCHS + 1):\n",
" train_data_loader.sampler.set_epoch(epoch)\n",
" classifier.fit(\n",
" train_data_loader,\n",
" epoch=epoch,\n",
" bert_optimizer=optimizer,\n",
" num_gpus=NUM_GPUS,\n",
" num_epochs=NUM_EPOCHS,\n",
" )\n",
"\n",
"# Predict\n",
"preds, labels_test = classifier.predict(\n",
" test_files, num_gpus=NUM_GPUS, batch_size=BATCH_SIZE\n",
")\n",
"preds, labels_test = classifier.predict(test_data_loader, num_gpus=NUM_GPUS)\n",
"\n",
"# Evaluate\n",
"results = classification_report(\n",
" labels_test, preds, target_names=LABELS, output_dict=True\n",
")\n",
"\n",
"# Write out results.\n",
"classifier.save_model()\n",
"result_file = os.path.join(OUTPUT_DIR, \"results.json\")\n",
"with open(result_file, \"w+\") as fp:\n",
" json.dump(results, fp)\n",
"\n",
"# Save model\n",
"model_file = os.path.join(OUTPUT_DIR, \"model.pt\")\n",
"torch.save(classifier.model.state_dict(), model_file)"
" json.dump(results, fp)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"execution_count": 42,
"metadata": {},
"outputs": [
{
@ -650,7 +680,7 @@
"'../../utils_nlp/models/bert/train.py'"
]
},
"execution_count": 32,
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
@ -675,15 +705,14 @@
},
{
"cell_type": "code",
"execution_count": 33,
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"WARNING - framework_version is not specified, defaulting to version 1.1.\n",
"WARNING - 'process_count_per_node' parameter will be deprecated. Please use it as part of 'distributed_training' parameter.\n"
"WARNING - framework_version is not specified, defaulting to version 1.1.\n"
]
}
],
@ -692,8 +721,7 @@
" compute_target=compute_target,\n",
" entry_script='utils_nlp/models/bert/train.py',\n",
" node_count= NODE_COUNT,\n",
" distributed_training=MpiConfiguration(),\n",
" process_count_per_node=2,\n",
" distributed_training= MpiConfiguration(),\n",
" use_gpu=True,\n",
" conda_packages=['scikit-learn=0.20.3', 'numpy>=1.16.0', 'pandas'],\n",
" pip_packages=[\"tqdm==4.31.1\",\"pytorch-pretrained-bert>=0.6\"]\n",
@ -702,7 +730,7 @@
},
{
"cell_type": "code",
"execution_count": 34,
"execution_count": 44,
"metadata": {},
"outputs": [],
"source": [
@ -742,7 +770,7 @@
},
{
"cell_type": "code",
"execution_count": 36,
"execution_count": 46,
"metadata": {
"scrolled": false
},
@ -750,7 +778,7 @@
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "48df85f533834264a8a8b65a57d60d59",
"model_id": "060659321062486694c0acbb0184eeed",
"version_major": 2,
"version_minor": 0
},
@ -768,7 +796,7 @@
},
{
"cell_type": "code",
"execution_count": 37,
"execution_count": 47,
"metadata": {},
"outputs": [],
"source": [
@ -797,7 +825,7 @@
},
{
"cell_type": "code",
"execution_count": 39,
"execution_count": 49,
"metadata": {
"scrolled": false
},
@ -807,19 +835,20 @@
"output_type": "stream",
"text": [
"Downloading file outputs/results.json to ./outputs\\results.json...\n",
"Downloading file outputs/model.pt to ./outputs\\model.pt...\n"
"Downloading file outputs/bert-large-uncased to ./outputs\\bert-large-uncased...\n",
"Downloading file outputs/bert_config.json to ./outputs\\bert_config.json...\n"
]
}
],
"source": [
"step_run = pipeline_run.find_step_run(\"Estimator-Train\")[0]\n",
"file_names = ['outputs/results.json', 'outputs/model.pt']\n",
"file_names = ['outputs/results.json', 'outputs/bert-large-uncased', 'outputs/bert_config.json' ]\n",
"azureml_utils.get_output_files(step_run, './outputs', file_names=file_names)"
]
},
{
"cell_type": "code",
"execution_count": 40,
"execution_count": 50,
"metadata": {},
"outputs": [
{
@ -827,14 +856,14 @@
"output_type": "stream",
"text": [
" f1-score precision recall support\n",
"telephone 0.920217 0.897281 0.944356 629.0\n",
"government 0.967905 0.979487 0.956594 599.0\n",
"travel 0.856683 0.900169 0.817204 651.0\n",
"slate 0.991093 0.991896 0.990291 618.0\n",
"fiction 0.936434 0.906907 0.967949 624.0\n",
"micro avg 0.933996 0.933996 0.933996 3121.0\n",
"macro avg 0.934466 0.935148 0.935279 3121.0\n",
"weighted avg 0.933394 0.934321 0.933996 3121.0\n"
"telephone 0.904130 0.843191 0.974563 629.0\n",
"government 0.955857 0.972366 0.939900 599.0\n",
"travel 0.839966 0.935849 0.761905 651.0\n",
"slate 0.986411 0.974724 0.998382 618.0\n",
"fiction 0.938871 0.918712 0.959936 624.0\n",
"micro avg 0.925344 0.925344 0.925344 3121.0\n",
"macro avg 0.925047 0.928968 0.926937 3121.0\n",
"weighted avg 0.923913 0.928455 0.925344 3121.0\n"
]
}
],
@ -869,7 +898,7 @@
},
{
"cell_type": "code",
"execution_count": 41,
"execution_count": 51,
"metadata": {},
"outputs": [],
"source": [

View file

@ -69,6 +69,9 @@ def notebooks():
"entailment_multinli_bert": os.path.join(
folder_notebooks, "entailment", "entailment_multinli_bert.ipynb"
),
"entailment_bert_azureml": os.path.join(
folder_notebooks, "entailment", "entailment_xnli_bert_azureml.ipynb"
),
"tc_bert_azureml": os.path.join(
folder_notebooks, "text_classification", "tc_bert_azureml.ipynb"
),

View file

@ -3,8 +3,12 @@
import pytest
import papermill as pm
import os
import json
import shutil
from tests.notebooks_common import OUTPUT_NOTEBOOK, KERNEL_NAME
ABS_TOL = 0.1
@pytest.mark.gpu
@pytest.mark.integration
@ -20,3 +24,30 @@ def test_entailment_multinli_bert(notebooks):
},
kernel_name=KERNEL_NAME,
)
@pytest.mark.integration
@pytest.mark.azureml
def test_entailment_bert_azureml(notebooks,
subscription_id,
resource_group,
workspace_name,
workspace_region,
cluster_name):
notebook_path = notebooks["entailment_bert_azureml"]
pm.execute_notebook(notebook_path,
OUTPUT_NOTEBOOK,
parameters={'DATA_PERCENT_USED': 0.0025,
"subscription_id": subscription_id,
"resource_group": resource_group,
"workspace_name": workspace_name,
"workspace_region": workspace_region,
"cluster_name": cluster_name},
kernel_name=KERNEL_NAME,)
with open("outputs/results.json", "r") as handle:
result_dict = json.load(handle)
assert result_dict["weighted avg"]["f1-score"] == pytest.approx(0.2, abs=ABS_TOL)
if os.path.exists("outputs"):
shutil.rmtree("outputs")

View file

@ -0,0 +1,119 @@
import numpy as np
import torch
from utils_nlp.models.bert.common import Language, Tokenizer
from torch.utils import data
from utils_nlp.dataset.xnli import load_pandas_df
from sklearn.preprocessing import LabelEncoder
MAX_SEQ_LENGTH = 128
TEXT_COL = "text"
LABEL_COL = "label"
DATA_PERCENT_USED = 1.0
TRAIN_FILE_SPLIT = "train"
TEST_FILE_SPLIT = "test"
VALIDATION_FILE_SPLIT = "dev"
CACHE_DIR = "./"
LANGUAGE_ENGLISH = "en"
TO_LOWER_CASE = False
TOK_ENGLISH = Language.ENGLISH
VALID_FILE_SPLIT = [TRAIN_FILE_SPLIT, VALIDATION_FILE_SPLIT, TEST_FILE_SPLIT]
def _load_pandas_df(cache_dir, file_split, language, data_percent_used):
df = load_pandas_df(local_cache_path=cache_dir, file_split=file_split, language=language)
data_used_count = round(data_percent_used * df.shape[0])
df = df.loc[:data_used_count]
return df
def _tokenize(tok_language, to_lowercase, cache_dir, df):
print("Create a tokenizer...")
tokenizer = Tokenizer(language=tok_language, to_lower=to_lowercase, cache_dir=cache_dir)
tokens = tokenizer.tokenize(df[TEXT_COL])
print("Tokenize and preprocess text...")
# tokenize
token_ids, input_mask, token_type_ids = tokenizer.preprocess_classification_tokens(
tokens, max_len=MAX_SEQ_LENGTH
)
return token_ids, input_mask, token_type_ids
def _fit_train_labels(df):
label_encoder = LabelEncoder()
train_labels = label_encoder.fit_transform(df[LABEL_COL])
train_labels = np.array(train_labels)
return label_encoder, train_labels
class XnliDataset(data.Dataset):
def __init__(
self,
file_split=TRAIN_FILE_SPLIT,
cache_dir=CACHE_DIR,
language=LANGUAGE_ENGLISH,
to_lowercase=TO_LOWER_CASE,
tok_language=TOK_ENGLISH,
data_percent_used=DATA_PERCENT_USED,
):
"""
Load the dataset here
Args:
file_split (str, optional):The subset to load.
One of: {"train", "dev", "test"}
Defaults to "train".
cache_dir (str, optional):Path to store the data.
Defaults to "./".
language (str): Language of the XNLI file to load (e.g. "en", "zh").
to_lowercase (bool): Whether to convert the text samples to lowercase.
tok_language (Language, optional): The pretrained model's language.
Defaults to Language.ENGLISH.
data_percent_used (float, optional): Fraction of the data used to create the Torch Dataset.
Defaults to 1.0, i.e. 100% of the data.
"""
if file_split not in VALID_FILE_SPLIT:
raise ValueError("The file split is not part of ", VALID_FILE_SPLIT)
self.file_split = file_split
self.cache_dir = cache_dir
self.language = language
self.to_lowercase = to_lowercase
self.tok_language = tok_language
self.data_percent_used = data_percent_used
df = _load_pandas_df(self.cache_dir, self.file_split, self.language, self.data_percent_used)
self.df = df
token_ids, input_mask, token_type_ids = _tokenize(
tok_language, to_lowercase, cache_dir, self.df
)
self.token_ids = token_ids
self.input_mask = input_mask
self.token_type_ids = token_type_ids
if file_split == TRAIN_FILE_SPLIT:
label_encoder, train_labels = _fit_train_labels(self.df)
self.label_encoder = label_encoder
self.labels = train_labels
else:
# for the dev/test split, keep the raw labels; encode/decode them with the label_encoder fit on the training split
self.labels = self.df[LABEL_COL]
def __len__(self):
""" Denotes the total number of samples """
return len(self.df)
def __getitem__(self, index):
""" Generates one sample of data """
token_ids = self.token_ids[index]
input_mask = self.input_mask[index]
token_type_ids = self.token_type_ids[index]
labels = self.labels[index]
return {
"token_ids": torch.tensor(token_ids, dtype=torch.long),
"input_mask": torch.tensor(input_mask, dtype=torch.long),
"token_type_ids": torch.tensor(token_type_ids, dtype=torch.long),
"labels": labels,
}
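A minimal usage sketch for the new XnliDataset class (not part of the PR; assumes the XNLI files can be downloaded to a local cache directory, and the split, batch size and data fraction are arbitrary):

from torch.utils.data import DataLoader

from utils_nlp.dataset.xnli_torch_dataset import XnliDataset
from utils_nlp.models.bert.common import Language

# Build the training split; a small data fraction keeps the smoke test quick.
train_dataset = XnliDataset(
    file_split="train",
    cache_dir="./xnli",
    language="en",
    to_lowercase=True,
    tok_language=Language.ENGLISH,
    data_percent_used=0.01,
)

# Each item is a dict with token_ids, input_mask, token_type_ids and labels.
loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
batch = next(iter(loader))
print(batch["token_ids"].shape, batch["labels"][:5])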

View file

@ -418,7 +418,8 @@ def create_data_loader(
class TextDataset(Dataset):
"""
Characterizes a dataset for PyTorch which can be used to load a file containing multiple rows
where each row is a training example.
where each row is a training example. The format of each line in the file is assumed to be
tokens, mask and label.
"""
def __init__(self, filename):
@ -457,11 +458,13 @@ class TextDataset(Dataset):
tokens = self._cast(row[0][1:-1].split(","))
mask = self._cast(row[1][1:-1].split(","))
return (
torch.tensor(tokens, dtype=torch.long),
torch.tensor(mask, dtype=torch.long),
torch.tensor(int(row[2]), dtype=torch.long),
)
data = {
"token_ids": torch.tensor(tokens, dtype=torch.long),
"input_mask": torch.tensor(mask, dtype=torch.long),
"labels": torch.tensor(int(row[2]), dtype=torch.long),
}
return data
def get_dataset_multiple_files(files):

View file

@ -1,44 +1,41 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
import logging
import horovod.torch as hvd
# This script reuses some code from
# https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_classifier.py
import os
import warnings
import numpy as np
import torch.nn as nn
from torch.utils.data import TensorDataset
import torch.utils.data.distributed
import torch.utils.data
from pytorch_pretrained_bert.modeling import BertForSequenceClassification
from pytorch_pretrained_bert.optimization import BertAdam
from tqdm import tqdm
from utils_nlp.common.pytorch_utils import get_device, move_to_device
from utils_nlp.models.bert.common import Language
from utils_nlp.models.bert.common import get_dataset_multiple_files
from utils_nlp.common.pytorch_utils import get_device, move_to_device
logger = logging.getLogger(__name__)
hvd.init()
torch.manual_seed(42)
if torch.cuda.is_available():
# Horovod: pin GPU to local rank.
torch.cuda.set_device(hvd.local_rank())
torch.cuda.manual_seed(42)
try:
import horovod.torch as hvd
except ImportError:
warnings.warn("No Horovod found! Can't do distributed training..")  # warn only; hvd is needed only when use_distributed=True
class BERTSequenceDistClassifier:
"""Distributed BERT-based sequence classifier"""
class BERTSequenceClassifier:
"""BERT-based sequence classifier"""
def __init__(self, language=Language.ENGLISH, num_labels=2, cache_dir="."):
"""Initializes the classifier and the underlying pretrained model.
def __init__(
self, language=Language.ENGLISH, num_labels=2, cache_dir=".", use_distributed=False
):
"""
Args:
language (Language, optional): The pretrained model's language.
Defaults to Language.ENGLISH.
num_labels (int, optional): The number of unique labels in the
training data. Defaults to 2.
cache_dir (str, optional): Location of BERT's cache directory.
Defaults to ".".
language: Language passed to the pre-trained BERT model to pick the appropriate model.
num_labels: Number of unique labels in the training dataset.
cache_dir: Directory used to cache the pre-trained BERT model. Defaults to ".".
use_distributed: If True, initialize Horovod for distributed training. Defaults to False.
"""
if num_labels < 2:
raise ValueError("Number of labels should be at least 2.")
@ -46,280 +43,280 @@ class BERTSequenceDistClassifier:
self.language = language
self.num_labels = num_labels
self.cache_dir = cache_dir
self.kwargs = (
{"num_workers": 1, "pin_memory": True}
if torch.cuda.is_available()
else {}
)
self.use_distributed = use_distributed
# create classifier
self.model = BertForSequenceClassification.from_pretrained(
language.value, num_labels=num_labels
language.value, cache_dir=cache_dir, num_labels=num_labels
)
def fit(
self,
token_ids,
input_mask,
labels,
token_type_ids=None,
input_files,
num_gpus=1,
num_epochs=1,
batch_size=32,
lr=2e-5,
warmup_proportion=None,
verbose=True,
fp16_allreduce=False,
):
"""fine-tunes the bert classifier using the given training data.
args:
input_files(list, required): list of paths to the training data files.
token_ids (list): List of training token id lists.
input_mask (list): List of input mask lists.
labels (list): List of training labels.
token_type_ids (list, optional): List of lists. Each sublist
contains segment ids indicating if the token belongs to
the first sentence(0) or second sentence(1). Only needed
for two-sentence tasks.
num_gpus (int, optional): the number of gpus to use.
if none is specified, all available gpus
will be used. defaults to none.
num_epochs (int, optional): number of training epochs.
defaults to 1.
batch_size (int, optional): training batch size. defaults to 32.
lr (float): learning rate of the adam optimizer. defaults to 2e-5.
warmup_proportion (float, optional): proportion of training to
perform linear learning rate warmup for. e.g., 0.1 = 10% of
training. defaults to none.
verbose (bool, optional): if true, shows the training progress and
loss values. defaults to true.
fp16_allreduce(bool, optional)L if true, use fp16 compression during allreduce
"""
if input_files is not None:
train_dataset = get_dataset_multiple_files(input_files)
else:
token_ids_tensor = torch.tensor(token_ids, dtype=torch.long)
input_mask_tensor = torch.tensor(input_mask, dtype=torch.long)
labels_tensor = torch.tensor(labels, dtype=torch.long)
if token_type_ids:
token_type_ids_tensor = torch.tensor(
token_type_ids, dtype=torch.long
)
train_dataset = TensorDataset(
token_ids_tensor,
input_mask_tensor,
token_type_ids_tensor,
labels_tensor,
)
else:
train_dataset = TensorDataset(
token_ids_tensor, input_mask_tensor, labels_tensor
)
train_sampler = torch.utils.data.distributed.DistributedSampler(
train_dataset, num_replicas=hvd.size(), rank=hvd.rank()
)
train_loader = torch.utils.data.DataLoader(
train_dataset,
batch_size=batch_size,
sampler=train_sampler,
**self.kwargs
)
device = get_device()
self.model.cuda()
hvd.broadcast_parameters(self.model.state_dict(), root_rank=0)
# hvd.broadcast_optimizer_state(optimizer, root_rank=0)
# define loss function
loss_func = nn.CrossEntropyLoss().to(device)
# define optimizer and model parameters
param_optimizer = list(self.model.named_parameters())
no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
{
"params": [
p
for n, p in param_optimizer
if not any(nd in n for nd in no_decay)
],
"params": [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
"weight_decay": 0.01,
},
{
"params": [
p
for n, p in param_optimizer
if any(nd in n for nd in no_decay)
]
},
{"params": [p for n, p in param_optimizer if any(nd in n for nd in no_decay)]},
]
self.optimizer_params = optimizer_grouped_parameters
self.name_parameters = self.model.named_parameters()
self.state_dict = self.model.state_dict()
num_examples = len(train_dataset)
num_batches = int(num_examples / batch_size)
num_train_optimization_steps = num_batches * num_epochs
if use_distributed:
hvd.init()
if torch.cuda.is_available():
torch.cuda.set_device(hvd.local_rank())
else:
warnings.warn("No GPU available! Using CPU.")
def create_optimizer(
self, num_train_optimization_steps, lr=2e-5, fp16_allreduce=False, warmup_proportion=None
):
"""
Method to create a BERT optimizer based on the inputs from the user.
Args:
num_train_optimization_steps(int): Number of optimization steps.
lr (float): learning rate of the adam optimizer. defaults to 2e-5.
warmup_proportion (float, optional): proportion of training to
perform linear learning rate warmup for. e.g., 0.1 = 10% of
training. defaults to none.
fp16_allreduce (bool, optional): if true, use fp16 compression during allreduce
Returns:
pytorch_pretrained_bert.optimization.BertAdam : A BertAdam optimizer with user
specified config.
"""
if self.use_distributed:
lr = lr * hvd.size()
if warmup_proportion is None:
optimizer = BertAdam(
optimizer_grouped_parameters, lr=lr * hvd.size()
)
optimizer = BertAdam(self.optimizer_params, lr=lr)
else:
optimizer = BertAdam(
optimizer_grouped_parameters,
lr=lr * hvd.size(),
self.optimizer_params,
lr=lr,
t_total=num_train_optimization_steps,
warmup=warmup_proportion,
)
# Horovod: (optional) compression algorithm.
compression = (
hvd.Compression.fp16 if fp16_allreduce else hvd.Compression.none
)
if self.use_distributed:
compression = hvd.Compression.fp16 if fp16_allreduce else hvd.Compression.none
optimizer = hvd.DistributedOptimizer(
optimizer, named_parameters=self.model.named_parameters(), compression=compression
)
# Horovod: wrap optimizer with DistributedOptimizer.
optimizer = hvd.DistributedOptimizer(
optimizer,
named_parameters=self.model.named_parameters(),
compression=compression,
)
return optimizer
# Horovod: set epoch to sampler for shuffling.
for epoch in range(num_epochs):
self.model.train()
train_sampler.set_epoch(epoch)
for batch_idx, batch in enumerate(train_loader):
if token_type_ids:
x_batch, mask_batch, token_type_ids_batch, y_batch = tuple(
t.to(device) for t in batch
)
else:
token_type_ids_batch = None
x_batch, mask_batch, y_batch = tuple(
t.to(device) for t in batch
)
optimizer.zero_grad()
output = self.model(
input_ids=x_batch, attention_mask=mask_batch, labels=None
)
loss = loss_func(output, y_batch).mean()
loss.backward()
optimizer.step()
if verbose and (batch_idx % ((num_batches // 10) + 1)) == 0:
# Horovod: use train_sampler to determine the number of examples in
# this worker's partition.
print(
"Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}".format(
epoch,
batch_idx * len(x_batch),
len(train_sampler),
100.0 * batch_idx / len(train_loader),
loss.item(),
)
)
# empty cache
torch.cuda.empty_cache()
def predict(
self,
input_files = None,
token_ids,
input_mask,
token_type_ids=None,
input_files, num_gpus=1, batch_size=32, probabilities=False
):
"""Scores the given set of train files and returns the predicted classes.
def create_data_loader(self, dataset, batch_size=32, mode="train", **kwargs):
"""
Method to create a data loader for a given Tensor dataset.
Args:
input_files(list, required): list of paths to the test data files.
token_ids (list): List of training token lists.
input_mask (list): List of input mask lists.
token_type_ids (list, optional): List of lists. Each sublist
contains segment ids indicating if the token belongs to
the first sentence(0) or second sentence(1). Only needed
for two-sentence tasks.
mode(str): Mode for creating data loader. Could be train or test.
dataset(torch.utils.data.Dataset): A Tensor dataset.
batch_size(int): Batch size.
Returns:
torch.utils.data.DataLoader: A torch data loader to the given dataset.
"""
if mode == "test":
sampler = torch.utils.data.sampler.SequentialSampler(dataset)
elif self.use_distributed:
sampler = torch.utils.data.distributed.DistributedSampler(
dataset, num_replicas=hvd.size(), rank=hvd.rank()
)
else:
sampler = torch.utils.data.RandomSampler(dataset)
data_loader = torch.utils.data.DataLoader(
dataset, batch_size=batch_size, sampler=sampler, **kwargs
)
return data_loader
def save_model(self):
"""
Method to save the trained model.
#ToDo: Works for English Language now. Multiple language support needs to be added.
"""
# Save the model to the outputs directory for capture
output_dir = "outputs"
os.makedirs(output_dir, exist_ok=True)
# Save a trained model, configuration and tokenizer
model_to_save = self.model.module if hasattr(self.model, "module") else self.model
# If we save using the predefined names, we can load using `from_pretrained`
output_model_file = "outputs/bert-large-uncased"
output_config_file = "outputs/bert_config.json"
torch.save(model_to_save.state_dict(), output_model_file)
model_to_save.config.to_json_file(output_config_file)
def fit(
self,
train_loader,
epoch,
bert_optimizer=None,
num_epochs=1,
num_gpus=0,
lr=2e-5,
warmup_proportion=None,
fp16_allreduce=False,
num_train_optimization_steps=10,
):
"""
Method to fine-tune the bert classifier using the given training data
Args:
train_loader(torch.DataLoader): Torch Dataloader created from Torch Dataset
epoch(int): Current epoch number of training.
bert_optimizer(optimizer): BertAdam optimizer, wrapped by Horovod's DistributedOptimizer when use_distributed is True.
num_epochs(int): the number of epochs to run
num_gpus(int): the number of gpus
lr (float): learning rate of the adam optimizer. defaults to 2e-5.
warmup_proportion (float, optional): proportion of training to
perform linear learning rate warmup for. e.g., 0.1 = 10% of
training. defaults to none.
fp16_allreduce(bool): if true, use fp16 compression during allreduce
num_train_optimization_steps: number of steps the optimizer should take.
"""
device = get_device("cpu" if num_gpus == 0 else "gpu")
if device:
self.model.cuda()
if bert_optimizer is None:
bert_optimizer = self.create_optimizer(
num_train_optimization_steps=num_train_optimization_steps,
lr=lr,
warmup_proportion=warmup_proportion,
fp16_allreduce=fp16_allreduce,
)
if self.use_distributed:
hvd.broadcast_parameters(self.model.state_dict(), root_rank=0)
loss_func = nn.CrossEntropyLoss().to(device)
# train
self.model.train() # training mode
token_type_ids_batch = None
num_print = 1000
for batch_idx, data in enumerate(train_loader):
x_batch = data["token_ids"]
x_batch = x_batch.cuda()
y_batch = data["labels"]
y_batch = y_batch.cuda()
mask_batch = data["input_mask"]
mask_batch = mask_batch.cuda()
if "token_type_ids" in data and data["token_type_ids"] is not None:
token_type_ids_batch = data["token_type_ids"]
token_type_ids_batch = token_type_ids_batch.cuda()
bert_optimizer.zero_grad()
y_h = self.model(
input_ids=x_batch,
token_type_ids=token_type_ids_batch,
attention_mask=mask_batch,
labels=None,
)
loss = loss_func(y_h, y_batch).mean()
loss.backward()
# synchronize() only exists on Horovod's DistributedOptimizer, not on plain BertAdam
if self.use_distributed: bert_optimizer.synchronize()
bert_optimizer.step()
if batch_idx % num_print == 0:
print(
"Train Epoch: {}/{} ({:.0f}%) \t Batch:{} \tLoss: {:.6f}".format(
epoch,
num_epochs,
100.0 * batch_idx / len(train_loader),
batch_idx + 1,
loss.item(),
)
)
del [x_batch, y_batch, mask_batch, token_type_ids_batch]
torch.cuda.empty_cache()
def predict(self, test_loader, num_gpus=None, probabilities=False):
"""
Method to predict results on the test loader. In a distributed setup, prediction runs
as a non-distributed workload on the head node only.
Args:
test_loader(torch Dataloader): Torch Dataloader created from Torch Dataset
num_gpus (int, optional): The number of gpus to use.
If None is specified, all available GPUs
will be used. Defaults to None.
probabilities (bool, optional):
If True, the predicted probability distribution
is also returned. Defaults to False.
Returns:
1darray, dict(1darray, 1darray, ndarray): Predicted classes and target labels or
a dictionary with classes, target labels, probabilities) if probabilities is True.
"""
if input_files is not None:
test_dataset = get_dataset_multiple_files(input_files)
else:
token_ids_tensor = torch.tensor(token_ids, dtype=torch.long)
input_mask_tensor = torch.tensor(input_mask, dtype=torch.long)
if token_type_ids:
token_type_ids_tensor = torch.tensor(
token_type_ids, dtype=torch.long
)
test_dataset = TensorDataset(
token_ids_tensor, input_mask_tensor, token_type_ids_tensor
)
else:
test_dataset = TensorDataset(token_ids_tensor, input_mask_tensor)
# Horovod: use DistributedSampler to partition the test data.
test_sampler = torch.utils.data.sampler.SequentialSampler(test_dataset)
test_loader = torch.utils.data.DataLoader(
test_dataset,
batch_size=batch_size,
sampler=test_sampler,
**self.kwargs
)
device = get_device()
device = get_device("cpu" if num_gpus == 0 else "gpu")
self.model = move_to_device(self.model, device, num_gpus)
# score
self.model.eval()
preds = []
labels_test = []
test_labels = []
for i, data in enumerate(tqdm(test_loader, desc="Iteration")):
x_batch = data["token_ids"]
x_batch = x_batch.cuda()
with tqdm(total=len(test_loader)) as pbar:
for i, (tokens, mask, target) in enumerate(test_loader):
if torch.cuda.is_available():
tokens, mask, target = (
tokens.cuda(),
mask.cuda(),
target.cuda(),
)
mask_batch = data["input_mask"]
mask_batch = mask_batch.cuda()
with torch.no_grad():
p_batch = self.model(
input_ids=tokens, attention_mask=mask, labels=None
)
preds.append(p_batch.cpu())
labels_test.append(target.cpu())
if i % batch_size == 0:
pbar.update(batch_size)
y_batch = data["labels"]
token_type_ids_batch = None
if "token_type_ids" in data and data["token_type_ids"] is not None:
token_type_ids_batch = data["token_type_ids"]
token_type_ids_batch = token_type_ids_batch.cuda()
with torch.no_grad():
p_batch = self.model(
input_ids=x_batch,
token_type_ids=token_type_ids_batch,
attention_mask=mask_batch,
labels=None,
)
preds.append(p_batch.cpu())
test_labels.append(y_batch)
preds = np.concatenate(preds)
labels_test = np.concatenate(labels_test)
test_labels = np.concatenate(test_labels)
if probabilities:
return {
"Predictions": preds.argmax(axis=1),
"Target": labels_test,
"classes probabilities": nn.Softmax(dim=1)(
torch.Tensor(preds)
).numpy(),
"Target": test_labels,
"classes probabilities": nn.Softmax(dim=1)(torch.Tensor(preds)).numpy(),
}
else:
return preds.argmax(axis=1), labels_test
return preds.argmax(axis=1), test_labels
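A minimal end-to-end sketch of the refactored BERTSequenceClassifier API on a single GPU machine with use_distributed=False (not part of the PR; assumes train_dataset and test_dataset are dict-style datasets such as XnliDataset, and the batch size, epoch count and label count are illustrative):

from utils_nlp.models.bert.common import Language
from utils_nlp.models.bert.sequence_classification_distributed import BERTSequenceClassifier

BATCH_SIZE, NUM_EPOCHS = 32, 1

classifier = BERTSequenceClassifier(
    language=Language.ENGLISH, num_labels=3, use_distributed=False
)

# Random sampling for training, sequential sampling for scoring.
train_loader = classifier.create_data_loader(train_dataset, batch_size=BATCH_SIZE, mode="train")
test_loader = classifier.create_data_loader(test_dataset, batch_size=BATCH_SIZE, mode="test")

# BertAdam needs the total number of optimization steps up front.
num_steps = (len(train_dataset) // BATCH_SIZE) * NUM_EPOCHS
optimizer = classifier.create_optimizer(num_steps, lr=2e-5, warmup_proportion=0.1)

for epoch in range(1, NUM_EPOCHS + 1):
    classifier.fit(
        train_loader,
        epoch=epoch,
        num_epochs=NUM_EPOCHS,
        bert_optimizer=optimizer,
        num_gpus=1,
    )

preds, true_labels = classifier.predict(test_loader, num_gpus=1)
classifier.save_model()  # writes outputs/bert-large-uncased and outputs/bert_config.json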