Merge pull request #254 from microsoft/sequence-classifier-dist
Entailment using BERT, AzureML along with Text Classification updates
This commit is contained in:
Commit
0642773154
|
@ -0,0 +1,531 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Natural Language Inference on XNLI Dataset using BERT with Azure Machine Learning\n",
|
||||
"\n",
|
||||
"## 1. Summary\n",
|
||||
"In this notebook, we demonstrate how to use the BERT model for natural language inference in English. We use the [XNLI](https://github.com/facebookresearch/XNLI) dataset; the task is to classify sentence pairs into three classes: contradiction, entailment, and neutral. \n",
|
||||
"The figure below shows how [BERT](https://arxiv.org/abs/1810.04805) classifies sentence pairs. It concatenates the tokens of each sentence pair and separates the two sentences with a [SEP] token. A [CLS] token is prepended to the token list, and its representation is used as the aggregate sequence representation for the classification task.\n",
|
||||
"<img src=\"https://nlpbp.blob.core.windows.net/images/bert_two_sentence.PNG\">\n",
|
||||
"\n",
|
||||
"Azure Machine Learning features highlighted in the notebook: \n",
|
||||
"\n",
|
||||
"- Distributed training with Horovod"
|
||||
]
|
||||
},
|
||||
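{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal illustration (not part of the training script below, and assuming the `pytorch-pretrained-bert` tokenizer that this notebook installs on the compute target), a premise/hypothesis pair is packed into a single BERT input sequence roughly as follows; the example sentences are of course only placeholders:\n",
"\n",
"```python\n",
"from pytorch_pretrained_bert import BertTokenizer\n",
"\n",
"tokenizer = BertTokenizer.from_pretrained(\"bert-base-uncased\", do_lower_case=True)\n",
"premise, hypothesis = \"The cat sat on the mat.\", \"A cat is on a mat.\"\n",
"p_tok, h_tok = tokenizer.tokenize(premise), tokenizer.tokenize(hypothesis)\n",
"tokens = [\"[CLS]\"] + p_tok + [\"[SEP]\"] + h_tok + [\"[SEP]\"]\n",
"token_ids = tokenizer.convert_tokens_to_ids(tokens)\n",
"# segment (token type) ids: 0 for [CLS] + premise + first [SEP], 1 for the rest\n",
"segment_ids = [0] * (len(p_tok) + 2) + [1] * (len(h_tok) + 1)\n",
"```"
]
},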
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Imports\n",
|
||||
"\n",
|
||||
"import sys\n",
|
||||
"\n",
|
||||
"sys.path.append(\"../..\")\n",
|
||||
"\n",
|
||||
"import os\n",
|
||||
"import shutil\n",
|
||||
"import torch\n",
|
||||
"import json\n",
|
||||
"import pandas as pd\n",
|
||||
"\n",
|
||||
"import azureml.core\n",
|
||||
"from azureml.train.dnn import PyTorch\n",
|
||||
"from azureml.core.runconfig import MpiConfiguration\n",
|
||||
"from azureml.core import Experiment\n",
|
||||
"from azureml.widgets import RunDetails\n",
|
||||
"from azureml.core.compute import ComputeTarget, AmlCompute\n",
"from azureml.core.compute_target import ComputeTargetException\n",
|
||||
"from utils_nlp.azureml.azureml_utils import get_or_create_workspace, get_output_files"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"metadata": {
|
||||
"tags": [
|
||||
"parameters"
|
||||
]
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Parameters\n",
|
||||
"\n",
|
||||
"DEBUG = True\n",
|
||||
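"# NODE_COUNT: number of AmlCompute nodes used for distributed training; NUM_PROCESS: Horovod worker processes per node\n",
"# DATA_PERCENT_USED: fraction of the XNLI training data to use (passed to train.py as --data_percent_used)\n",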
"NODE_COUNT = 4\n",
|
||||
"NUM_PROCESS = 1\n",
|
||||
"DATA_PERCENT_USED = 1.0\n",
|
||||
"\n",
|
||||
"config_path = (\n",
|
||||
" \"./.azureml\"\n",
|
||||
") # Path to the directory containing config.json with azureml credentials\n",
|
||||
"\n",
|
||||
"# Azure resources\n",
|
||||
"subscription_id = \"YOUR_SUBSCRIPTION_ID\"\n",
|
||||
"resource_group = \"YOUR_RESOURCE_GROUP_NAME\" \n",
|
||||
"workspace_name = \"YOUR_WORKSPACE_NAME\" \n",
|
||||
"workspace_region = \"YOUR_WORKSPACE_REGION\" # eg: eastus, eastus2.\n",
|
||||
"cluster_name = \"gpu-entail\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 2. AzureML Setup\n",
|
||||
"\n",
|
||||
"### 2.1 Link to or create a Workspace\n",
|
||||
"\n",
|
||||
"First, go through the [Configuration](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) notebook to install the Azure Machine Learning Python SDK and create an Azure ML `Workspace`. This will create a config.json file containing the values needed below to create a workspace.\n",
|
||||
"\n",
|
||||
"**Note**: you do not need to fill in these values if you have a config.json file in the same folder as this notebook."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ws = get_or_create_workspace(\n",
|
||||
" config_path=config_path,\n",
|
||||
" subscription_id=subscription_id,\n",
|
||||
" resource_group=resource_group,\n",
|
||||
" workspace_name=workspace_name,\n",
|
||||
" workspace_region=workspace_region,\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(\n",
|
||||
" \"Workspace name: \" + ws.name,\n",
|
||||
" \"Azure region: \" + ws.location,\n",
|
||||
" \"Subscription id: \" + ws.subscription_id,\n",
|
||||
" \"Resource group: \" + ws.resource_group,\n",
|
||||
" sep=\"\\n\",\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### 2.2 Link AmlCompute Compute Target\n",
|
||||
"\n",
|
||||
"We need to link a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training our model (see [compute options](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets#supported-compute-targets) for an explanation of the different options). In this example we use an [AmlCompute](https://docs.microsoft.com/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute) target: we link to an existing target if `cluster_name` exists, or otherwise create a STANDARD_NC6 GPU cluster that autoscales from 0 to 4 nodes. Creating a new AmlCompute cluster takes approximately 5 minutes. \n",
|
||||
"\n",
|
||||
"As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Found compute target: gpu-entail\n",
|
||||
"{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2019-08-03T13:43:20.068000+00:00', 'errors': None, 'creationTime': '2019-07-27T02:14:46.127092+00:00', 'modifiedTime': '2019-07-27T02:15:07.181277+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_NC6S_V2'}\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"try:\n",
|
||||
" compute_target = ComputeTarget(workspace=ws, name=cluster_name)\n",
|
||||
" print(\"Found compute target: {}\".format(cluster_name))\n",
|
||||
"except ComputeTargetException:\n",
|
||||
" print(\"Creating new compute target: {}\".format(cluster_name))\n",
|
||||
" compute_config = AmlCompute.provisioning_configuration(\n",
|
||||
" vm_size=\"STANDARD_NC6\", max_nodes=NODE_COUNT\n",
|
||||
" )\n",
|
||||
" compute_target = ComputeTarget.create(ws, cluster_name, compute_config)\n",
|
||||
" compute_target.wait_for_completion(show_output=True)\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"print(compute_target.get_status().serialize())"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'./entail_utils\\\\utils_nlp'"
|
||||
]
|
||||
},
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
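"# Copy the local utils_nlp package into the project directory so it is uploaded to AzureML along with the training script\n",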
"project_dir = \"./entail_utils\"\n",
|
||||
"if DEBUG and os.path.exists(project_dir):\n",
|
||||
" shutil.rmtree(project_dir)\n",
|
||||
"shutil.copytree(\"../../utils_nlp\", os.path.join(project_dir, \"utils_nlp\"))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 3. Prepare Training Script"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Writing ./entail_utils/train.py\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"%%writefile $project_dir/train.py\n",
|
||||
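"# Training script run on each AmlCompute node; the PyTorch estimator launches it under MPI so Horovod can coordinate the workers\n",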
"import horovod.torch as hvd\n",
|
||||
"import torch\n",
|
||||
"import numpy as np\n",
|
||||
"import time\n",
|
||||
"import argparse\n",
|
||||
"from utils_nlp.common.timer import Timer\n",
|
||||
"from utils_nlp.dataset.xnli_torch_dataset import XnliDataset\n",
|
||||
"from utils_nlp.models.bert.common import Language\n",
|
||||
"from utils_nlp.models.bert.sequence_classification_distributed import (\n",
|
||||
" BERTSequenceClassifier,\n",
|
||||
")\n",
|
||||
"from sklearn.metrics import classification_report\n",
|
||||
"\n",
|
||||
"print(\"Torch version:\", torch.__version__)\n",
|
||||
"\n",
|
||||
"hvd.init()\n",
|
||||
"\n",
|
||||
"LANGUAGE_ENGLISH = \"en\"\n",
|
||||
"TRAIN_FILE_SPLIT = \"train\"\n",
|
||||
"TEST_FILE_SPLIT = \"test\"\n",
|
||||
"TO_LOWERCASE = True\n",
|
||||
"PRETRAINED_BERT_LNG = Language.ENGLISH\n",
|
||||
"LEARNING_RATE = 5e-5\n",
|
||||
"WARMUP_PROPORTION = 0.1\n",
|
||||
"BATCH_SIZE = 32\n",
|
||||
"NUM_GPUS = 1\n",
|
||||
"OUTPUT_DIR = \"./outputs/\"\n",
|
||||
"LABELS = [\"contradiction\", \"entailment\", \"neutral\"]\n",
|
||||
"\n",
|
||||
"# Each Horovod worker gets its own copy of the data\n",
|
||||
"CACHE_DIR = \"./xnli-%d\" % hvd.rank()\n",
|
||||
"\n",
|
||||
"parser = argparse.ArgumentParser()\n",
|
||||
"# Training settings\n",
|
||||
"parser.add_argument(\n",
|
||||
" \"--seed\", type=int, default=42, metavar=\"S\", help=\"random seed (default: 42)\"\n",
|
||||
")\n",
|
||||
"parser.add_argument(\n",
|
||||
" \"--epochs\", type=int, default=2, metavar=\"S\", help=\"number of training epochs (default: 2)\"\n",
|
||||
")\n",
|
||||
"parser.add_argument(\n",
|
||||
" \"--no-cuda\", action=\"store_true\", default=False, help=\"disables CUDA training\"\n",
|
||||
")\n",
|
||||
"parser.add_argument(\n",
|
||||
" \"--data_percent_used\",\n",
|
||||
" type=float,\n",
|
||||
" default=1.0,\n",
|
||||
" metavar=\"S\",\n",
|
||||
" help=\"data percent used (default: 1.0)\",\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"args = parser.parse_args()\n",
|
||||
"args.cuda = not args.no_cuda and torch.cuda.is_available()\n",
|
||||
"\n",
|
||||
"\"\"\"\n",
|
||||
"Note: for example, with 4 nodes and 4 GPUs per node, Horovod spawns 16 workers. \n",
|
||||
"Each worker has a global rank in [0, 15] and a local_rank in [0, 3] on its own node.\n",
|
||||
"\"\"\"\n",
|
||||
"if args.cuda:\n",
|
||||
" torch.cuda.set_device(hvd.local_rank())\n",
|
||||
" torch.cuda.manual_seed(args.seed)\n",
|
||||
"\n",
|
||||
"# num_workers - this is equal to number of gpus per machine\n",
|
||||
"kwargs = {\"num_workers\": NUM_GPUS, \"pin_memory\": True} if args.cuda else {}\n",
|
||||
"\n",
|
||||
"train_dataset = XnliDataset(\n",
|
||||
" file_split=TRAIN_FILE_SPLIT,\n",
|
||||
" cache_dir=CACHE_DIR,\n",
|
||||
" language=LANGUAGE_ENGLISH,\n",
|
||||
" to_lowercase=TO_LOWERCASE,\n",
|
||||
" tok_language=PRETRAINED_BERT_LNG,\n",
|
||||
" data_percent_used=args.data_percent_used,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"# set the label_encoder for evaluation\n",
|
||||
"label_encoder = train_dataset.label_encoder\n",
|
||||
"num_labels = len(np.unique(train_dataset.labels))\n",
|
||||
"\n",
|
||||
"# Train\n",
|
||||
"classifier = BERTSequenceClassifier(\n",
|
||||
" language=Language.ENGLISH,\n",
|
||||
" num_labels=num_labels,\n",
|
||||
" cache_dir=CACHE_DIR,\n",
|
||||
" use_distributed=True,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"train_loader = classifier.create_data_loader(\n",
|
||||
" train_dataset, BATCH_SIZE, mode=\"train\", **kwargs\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"num_samples = len(train_loader.dataset)\n",
|
||||
"num_batches = int(num_samples / BATCH_SIZE)\n",
|
||||
"num_train_optimization_steps = num_batches * args.epochs\n",
|
||||
"optimizer = classifier.create_optimizer(\n",
|
||||
" num_train_optimization_steps, lr=LEARNING_RATE, warmup_proportion=WARMUP_PROPORTION\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"with Timer() as t:\n",
|
||||
" for epoch in range(1, args.epochs + 1):\n",
|
||||
"\n",
|
||||
" # to allow data shuffling for DistributedSampler\n",
|
||||
" train_loader.sampler.set_epoch(epoch)\n",
|
||||
"\n",
|
||||
" # epoch and num_epochs is passed in the fit function to print loss at regular batch intervals\n",
|
||||
" classifier.fit(\n",
|
||||
" train_loader,\n",
|
||||
" epoch=epoch,\n",
|
||||
" num_epochs=args.epochs,\n",
|
||||
" bert_optimizer=optimizer,\n",
|
||||
" num_gpus=NUM_GPUS,\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
"# If the machine has multiple GPUs, run predictions on only 1 GPU since the test dataset is small.\n",
|
||||
"if hvd.rank() == 0:\n",
|
||||
" NUM_GPUS = 1\n",
|
||||
" \n",
|
||||
" test_dataset = XnliDataset(\n",
|
||||
" file_split=TEST_FILE_SPLIT,\n",
|
||||
" cache_dir=CACHE_DIR,\n",
|
||||
" language=LANGUAGE_ENGLISH,\n",
|
||||
" to_lowercase=TO_LOWERCASE,\n",
|
||||
" tok_language=PRETRAINED_BERT_LNG,\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" test_loader = classifier.create_data_loader(test_dataset, mode=\"test\")\n",
|
||||
"\n",
|
||||
" # predict\n",
|
||||
" predictions, pred_labels = classifier.predict(test_loader, NUM_GPUS)\n",
|
||||
"\n",
|
||||
" predictions = label_encoder.inverse_transform(predictions)\n",
|
||||
"\n",
|
||||
" # Evaluate\n",
|
||||
" results = classification_report(\n",
|
||||
" pred_labels, predictions, target_names=LABELS, output_dict=True\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" result_file = os.path.join(OUTPUT_DIR, \"results.json\")\n",
|
||||
" with open(result_file, \"w+\") as fp:\n",
|
||||
" json.dump(results, fp)\n",
|
||||
"\n",
|
||||
" # save model\n",
|
||||
" classifier.save_model()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 4. Create a PyTorch Estimator\n",
|
||||
"\n",
|
||||
"The BERT implementation used here is built on PyTorch, so we will use the AzureML SDK's PyTorch estimator to easily submit PyTorch training jobs for both single-node and distributed runs. For more information on the PyTorch estimator, see [How to Train PyTorch Models on AzureML](https://docs.microsoft.com/azure/machine-learning/service/how-to-train-pytorch). "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"mpiConfig = MpiConfiguration()\n",
|
||||
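"# Launch NUM_PROCESS MPI worker processes (Horovod workers) on each node\n",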
"mpiConfig.process_count_per_node = NUM_PROCESS\n",
|
||||
"\n",
|
||||
"script_params = {\n",
|
||||
" '--data_percent_used': DATA_PERCENT_USED\n",
|
||||
"}\n",
|
||||
"\n",
|
||||
"est = PyTorch(\n",
|
||||
" source_directory=project_dir,\n",
|
||||
" compute_target=compute_target,\n",
|
||||
" entry_script=\"train.py\",\n",
|
||||
" script_params = script_params,\n",
|
||||
" node_count=NODE_COUNT,\n",
|
||||
" distributed_training=mpiConfig,\n",
|
||||
" use_gpu=True,\n",
|
||||
" framework_version=\"1.0\",\n",
|
||||
" conda_packages=[\"scikit-learn=0.20.3\", \"numpy\", \"spacy\", \"nltk\"],\n",
|
||||
" pip_packages=[\"pandas\", \"seqeval[gpu]\", \"pytorch-pretrained-bert\"],\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 5. Create Experiment and Submit a Job\n",
|
||||
"Submit the estimator object to run your experiment. Results can be monitored with a Jupyter widget; the submission is asynchronous and the widget refreshes every 10-15 seconds until the job completes.\n",
|
||||
"\n",
|
||||
"**Note**: The experiment takes ~4 hours with 2 NC24 nodes and ~7 hours with 4 NC6 nodes. The overhead is largely due to communication time between the nodes. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"experiment = Experiment(ws, name=\"NLP-Entailment-BERT\")\n",
|
||||
"run = experiment.submit(est)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"application/vnd.jupyter.widget-view+json": {
|
||||
"model_id": "c8e7a44fa8804e95b21eea74d7694b1e",
|
||||
"version_major": 2,
|
||||
"version_minor": 0
|
||||
},
|
||||
"text/plain": [
|
||||
"_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"RunDetails(run).show()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Since the cell above submits the run asynchronously, the cell below makes a blocking call so that the cells after it do not execute until the run has completed."
|
||||
]
|
||||
},
|
||||
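{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you prefer not to block, the run can also be polled or cancelled explicitly (a brief aside, assuming the standard `azureml-core` Run API):\n",
"\n",
"```python\n",
"print(run.get_status())  # e.g. 'Running' or 'Completed'\n",
"# run.cancel()           # stop the run early if needed\n",
"```"
]
},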
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"run.wait_for_completion()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 6. Analyze Results\n",
|
||||
"\n",
|
||||
"Download results.json from the run's outputs (via the portal or the helper below) and open it to view the results. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Downloading file outputs/results.json to ./outputs\\results.json...\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"file_names = [\"outputs/results.json\"]\n",
|
||||
"get_output_files(run, \"./outputs\", file_names=file_names)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
" f1-score precision recall support\n",
|
||||
"contradiction 0.838749 0.859296 0.819162 1670.0\n",
|
||||
"entailment 0.817280 0.877663 0.764671 1670.0\n",
|
||||
"neutral 0.777870 0.719817 0.846108 1670.0\n",
|
||||
"micro avg 0.809980 0.809980 0.809980 5010.0\n",
|
||||
"macro avg 0.811300 0.818925 0.809980 5010.0\n",
|
||||
"weighted avg 0.811300 0.818925 0.809980 5010.0\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"with open(\"outputs/results.json\", \"r\") as handle:\n",
|
||||
" parsed = json.load(handle)\n",
|
||||
" print(pd.DataFrame.from_dict(parsed).transpose())"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"celltoolbar": "Tags",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.8"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
|
@ -32,7 +32,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 19,
|
||||
"execution_count": 29,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
|
@ -85,7 +85,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 20,
|
||||
"execution_count": 30,
|
||||
"metadata": {
|
||||
"tags": [
|
||||
"parameters"
|
||||
|
@ -135,7 +135,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 21,
|
||||
"execution_count": 31,
|
||||
"metadata": {
|
||||
"scrolled": false
|
||||
},
|
||||
|
@ -164,7 +164,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 22,
|
||||
"execution_count": 32,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
|
@ -172,7 +172,7 @@
|
|||
"output_type": "stream",
|
||||
"text": [
|
||||
"Found existing compute target.\n",
|
||||
"{'currentNodeCount': 2, 'targetNodeCount': 2, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 2, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2019-07-31T22:29:42.732000+00:00', 'errors': None, 'creationTime': '2019-07-25T04:16:20.598768+00:00', 'modifiedTime': '2019-07-25T04:16:36.486727+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 2, 'maxNodeCount': 10, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_NC12'}\n"
|
||||
"{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2019-08-11T08:53:18.284000+00:00', 'errors': None, 'creationTime': '2019-07-25T04:16:20.598768+00:00', 'modifiedTime': '2019-08-05T06:40:12.292030+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 10, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_NC12'}\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
|
@ -213,7 +213,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 23,
|
||||
"execution_count": 33,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
|
@ -234,7 +234,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 24,
|
||||
"execution_count": 34,
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
|
@ -271,9 +271,20 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"execution_count": 35,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"$AZUREML_DATAREFERENCE_9609849b541244d396d06017b5729edb"
|
||||
]
|
||||
},
|
||||
"execution_count": 35,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"ds = ws.get_default_datastore()\n",
|
||||
"ds.upload(src_dir=TRAIN_FOLDER, target_path=\"mnli_data/train\", overwrite=True, show_progress=False)\n",
|
||||
|
@ -282,7 +293,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 26,
|
||||
"execution_count": 36,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
|
@ -299,7 +310,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 27,
|
||||
"execution_count": 37,
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
|
@ -404,7 +415,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 28,
|
||||
"execution_count": 38,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
|
@ -413,7 +424,7 @@
|
|||
"'../../utils_nlp/models/bert/preprocess.py'"
|
||||
]
|
||||
},
|
||||
"execution_count": 28,
|
||||
"execution_count": 38,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
|
@ -432,7 +443,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 29,
|
||||
"execution_count": 39,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
|
@ -461,7 +472,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 30,
|
||||
"execution_count": 40,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
|
@ -543,7 +554,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 31,
|
||||
"execution_count": 41,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
|
@ -564,13 +575,14 @@
|
|||
"import logging\n",
|
||||
"import os\n",
|
||||
"import torch\n",
|
||||
"\n",
|
||||
"from sklearn.metrics import classification_report\n",
|
||||
"\n",
|
||||
"from utils_nlp.models.bert.common import Language\n",
|
||||
"from utils_nlp.models.bert.sequence_classification_distributed import (\n",
|
||||
" BERTSequenceDistClassifier,\n",
|
||||
")\n",
|
||||
"from utils_nlp.common.timer import Timer\n",
|
||||
"from utils_nlp.models.bert.common import Language, get_dataset_multiple_files\n",
|
||||
"from utils_nlp.models.bert.sequence_classification_distributed import (\n",
|
||||
" BERTSequenceClassifier,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"BATCH_SIZE = 32\n",
|
||||
"NUM_GPUS = 2\n",
|
||||
|
@ -602,46 +614,64 @@
|
|||
"# Handle square brackets from train list\n",
|
||||
"train_files[0] = train_files[0][1:]\n",
|
||||
"train_files[len(train_files) - 1] = train_files[len(train_files) - 1][:-1]\n",
|
||||
"train_dataset = get_dataset_multiple_files(train_files)\n",
|
||||
"\n",
|
||||
"# Handle square brackets from test list\n",
|
||||
"test_files[0] = test_files[0][1:]\n",
|
||||
"test_files[len(test_files) - 1] = test_files[len(test_files) - 1][:-1]\n",
|
||||
"test_dataset = get_dataset_multiple_files(test_files)\n",
|
||||
"\n",
|
||||
"# Train\n",
|
||||
"classifier = BERTSequenceDistClassifier(\n",
|
||||
" language=Language.ENGLISH, num_labels=len(LABELS)\n",
|
||||
"classifier = BERTSequenceClassifier(\n",
|
||||
" language=Language.ENGLISH, num_labels=len(LABELS), use_distributed=True\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"# Create data loaders.\n",
|
||||
"kwargs = (\n",
|
||||
" {\"num_workers\": NUM_GPUS, \"pin_memory\": True} if torch.cuda.is_available() else {}\n",
|
||||
")\n",
|
||||
"train_data_loader = classifier.create_data_loader(\n",
|
||||
" train_dataset, batch_size=BATCH_SIZE, **kwargs\n",
|
||||
")\n",
|
||||
"test_data_loader = classifier.create_data_loader(\n",
|
||||
" test_dataset, batch_size=BATCH_SIZE, mode=\"test\", **kwargs\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"# Create optimizer\n",
|
||||
"num_examples = len(train_dataset)\n",
|
||||
"num_batches = int(num_examples / BATCH_SIZE)\n",
|
||||
"num_train_optimization_steps = num_batches * NUM_EPOCHS\n",
|
||||
"optimizer = classifier.create_optimizer(num_train_optimization_steps)\n",
|
||||
"\n",
|
||||
"with Timer() as t:\n",
|
||||
" classifier.fit(\n",
|
||||
" train_files,\n",
|
||||
" num_gpus=NUM_GPUS,\n",
|
||||
" num_epochs=NUM_EPOCHS,\n",
|
||||
" batch_size=BATCH_SIZE,\n",
|
||||
" verbose=True,\n",
|
||||
" )\n",
|
||||
" for epoch in range(1, NUM_EPOCHS + 1):\n",
|
||||
" train_data_loader.sampler.set_epoch(epoch)\n",
|
||||
" classifier.fit(\n",
|
||||
" train_data_loader,\n",
|
||||
" epoch=epoch,\n",
|
||||
" bert_optimizer=optimizer,\n",
|
||||
" num_gpus=NUM_GPUS,\n",
|
||||
" num_epochs=NUM_EPOCHS,\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
"# Predict\n",
|
||||
"preds, labels_test = classifier.predict(\n",
|
||||
" test_files, num_gpus=NUM_GPUS, batch_size=BATCH_SIZE\n",
|
||||
")\n",
|
||||
"preds, labels_test = classifier.predict(test_data_loader, num_gpus=NUM_GPUS)\n",
|
||||
"\n",
|
||||
"# Evaluate\n",
|
||||
"results = classification_report(\n",
|
||||
" labels_test, preds, target_names=LABELS, output_dict=True\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"# Write out results.\n",
|
||||
"classifier.save_model()\n",
|
||||
"result_file = os.path.join(OUTPUT_DIR, \"results.json\")\n",
|
||||
"with open(result_file, \"w+\") as fp:\n",
|
||||
" json.dump(results, fp)\n",
|
||||
"\n",
|
||||
"# Save model\n",
|
||||
"model_file = os.path.join(OUTPUT_DIR, \"model.pt\")\n",
|
||||
"torch.save(classifier.model.state_dict(), model_file)"
|
||||
" json.dump(results, fp)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 32,
|
||||
"execution_count": 42,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
|
@ -650,7 +680,7 @@
|
|||
"'../../utils_nlp/models/bert/train.py'"
|
||||
]
|
||||
},
|
||||
"execution_count": 32,
|
||||
"execution_count": 42,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
|
@ -675,15 +705,14 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 33,
|
||||
"execution_count": 43,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"WARNING - framework_version is not specified, defaulting to version 1.1.\n",
|
||||
"WARNING - 'process_count_per_node' parameter will be deprecated. Please use it as part of 'distributed_training' parameter.\n"
|
||||
"WARNING - framework_version is not specified, defaulting to version 1.1.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
|
@ -692,8 +721,7 @@
|
|||
" compute_target=compute_target,\n",
|
||||
" entry_script='utils_nlp/models/bert/train.py',\n",
|
||||
" node_count= NODE_COUNT,\n",
|
||||
" distributed_training=MpiConfiguration(),\n",
|
||||
" process_count_per_node=2,\n",
|
||||
" distributed_training= MpiConfiguration(),\n",
|
||||
" use_gpu=True,\n",
|
||||
" conda_packages=['scikit-learn=0.20.3', 'numpy>=1.16.0', 'pandas'],\n",
|
||||
" pip_packages=[\"tqdm==4.31.1\",\"pytorch-pretrained-bert>=0.6\"]\n",
|
||||
|
@ -702,7 +730,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 34,
|
||||
"execution_count": 44,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
|
@ -742,7 +770,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 36,
|
||||
"execution_count": 46,
|
||||
"metadata": {
|
||||
"scrolled": false
|
||||
},
|
||||
|
@ -750,7 +778,7 @@
|
|||
{
|
||||
"data": {
|
||||
"application/vnd.jupyter.widget-view+json": {
|
||||
"model_id": "48df85f533834264a8a8b65a57d60d59",
|
||||
"model_id": "060659321062486694c0acbb0184eeed",
|
||||
"version_major": 2,
|
||||
"version_minor": 0
|
||||
},
|
||||
|
@ -768,7 +796,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 37,
|
||||
"execution_count": 47,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
|
@ -797,7 +825,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 39,
|
||||
"execution_count": 49,
|
||||
"metadata": {
|
||||
"scrolled": false
|
||||
},
|
||||
|
@ -807,19 +835,20 @@
|
|||
"output_type": "stream",
|
||||
"text": [
|
||||
"Downloading file outputs/results.json to ./outputs\\results.json...\n",
|
||||
"Downloading file outputs/model.pt to ./outputs\\model.pt...\n"
|
||||
"Downloading file outputs/bert-large-uncased to ./outputs\\bert-large-uncased...\n",
|
||||
"Downloading file outputs/bert_config.json to ./outputs\\bert_config.json...\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"step_run = pipeline_run.find_step_run(\"Estimator-Train\")[0]\n",
|
||||
"file_names = ['outputs/results.json', 'outputs/model.pt']\n",
|
||||
"file_names = ['outputs/results.json', 'outputs/bert-large-uncased', 'outputs/bert_config.json' ]\n",
|
||||
"azureml_utils.get_output_files(step_run, './outputs', file_names=file_names)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 40,
|
||||
"execution_count": 50,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
|
@ -827,14 +856,14 @@
|
|||
"output_type": "stream",
|
||||
"text": [
|
||||
" f1-score precision recall support\n",
|
||||
"telephone 0.920217 0.897281 0.944356 629.0\n",
|
||||
"government 0.967905 0.979487 0.956594 599.0\n",
|
||||
"travel 0.856683 0.900169 0.817204 651.0\n",
|
||||
"slate 0.991093 0.991896 0.990291 618.0\n",
|
||||
"fiction 0.936434 0.906907 0.967949 624.0\n",
|
||||
"micro avg 0.933996 0.933996 0.933996 3121.0\n",
|
||||
"macro avg 0.934466 0.935148 0.935279 3121.0\n",
|
||||
"weighted avg 0.933394 0.934321 0.933996 3121.0\n"
|
||||
"telephone 0.904130 0.843191 0.974563 629.0\n",
|
||||
"government 0.955857 0.972366 0.939900 599.0\n",
|
||||
"travel 0.839966 0.935849 0.761905 651.0\n",
|
||||
"slate 0.986411 0.974724 0.998382 618.0\n",
|
||||
"fiction 0.938871 0.918712 0.959936 624.0\n",
|
||||
"micro avg 0.925344 0.925344 0.925344 3121.0\n",
|
||||
"macro avg 0.925047 0.928968 0.926937 3121.0\n",
|
||||
"weighted avg 0.923913 0.928455 0.925344 3121.0\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
|
@ -869,7 +898,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 41,
|
||||
"execution_count": 51,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
|
|
|
@ -69,6 +69,9 @@ def notebooks():
|
|||
"entailment_multinli_bert": os.path.join(
|
||||
folder_notebooks, "entailment", "entailment_multinli_bert.ipynb"
|
||||
),
|
||||
"entailment_bert_azureml": os.path.join(
|
||||
folder_notebooks, "entailment", "entailment_xnli_bert_azureml.ipynb"
|
||||
),
|
||||
"tc_bert_azureml": os.path.join(
|
||||
folder_notebooks, "text_classification", "tc_bert_azureml.ipynb"
|
||||
),
|
||||
|
|
|
@ -3,8 +3,12 @@
|
|||
|
||||
import pytest
|
||||
import papermill as pm
|
||||
import os
|
||||
import json
|
||||
import shutil
|
||||
from tests.notebooks_common import OUTPUT_NOTEBOOK, KERNEL_NAME
|
||||
|
||||
ABS_TOL = 0.1
|
||||
|
||||
@pytest.mark.gpu
|
||||
@pytest.mark.integration
|
||||
|
@ -20,3 +24,30 @@ def test_entailment_multinli_bert(notebooks):
|
|||
},
|
||||
kernel_name=KERNEL_NAME,
|
||||
)
|
||||
|
||||
|
||||
@pytest.mark.integration
|
||||
@pytest.mark.azureml
|
||||
def test_entailment_bert_azureml(notebooks,
|
||||
subscription_id,
|
||||
resource_group,
|
||||
workspace_name,
|
||||
workspace_region,
|
||||
cluster_name):
|
||||
notebook_path = notebooks["entailment_bert_azureml"]
|
||||
pm.execute_notebook(notebook_path,
|
||||
OUTPUT_NOTEBOOK,
|
||||
parameters={'DATA_PERCENT_USED': 0.0025,
|
||||
"subscription_id": subscription_id,
|
||||
"resource_group": resource_group,
|
||||
"workspace_name": workspace_name,
|
||||
"workspace_region": workspace_region,
|
||||
"cluster_name": cluster_name},
|
||||
kernel_name=KERNEL_NAME,)
|
||||
|
||||
with open("outputs/results.json", "r") as handle:
|
||||
result_dict = json.load(handle)
|
||||
assert result_dict["weighted avg"]["f1-score"] == pytest.approx(0.2, abs=ABS_TOL)
|
||||
|
||||
if os.path.exists("outputs"):
|
||||
shutil.rmtree("outputs")
|
||||
|
|
|
@ -0,0 +1,119 @@
|
|||
import numpy as np
|
||||
import torch
|
||||
from utils_nlp.models.bert.common import Language, Tokenizer
|
||||
from torch.utils import data
|
||||
from utils_nlp.dataset.xnli import load_pandas_df
|
||||
from sklearn.preprocessing import LabelEncoder
|
||||
|
||||
MAX_SEQ_LENGTH = 128
|
||||
TEXT_COL = "text"
|
||||
LABEL_COL = "label"
|
||||
DATA_PERCENT_USED = 1.0
|
||||
TRAIN_FILE_SPLIT = "train"
|
||||
TEST_FILE_SPLIT = "test"
|
||||
VALIDATION_FILE_SPLIT = "dev"
|
||||
CACHE_DIR = "./"
|
||||
LANGUAGE_ENGLISH = "en"
|
||||
TO_LOWER_CASE = False
|
||||
TOK_ENGLISH = Language.ENGLISH
|
||||
VALID_FILE_SPLIT = [TRAIN_FILE_SPLIT, VALIDATION_FILE_SPLIT, TEST_FILE_SPLIT]
|
||||
|
||||
|
||||
def _load_pandas_df(cache_dir, file_split, language, data_percent_used):
|
||||
df = load_pandas_df(local_cache_path=cache_dir, file_split=file_split, language=language)
|
||||
data_used_count = round(data_percent_used * df.shape[0])
|
||||
df = df.loc[:data_used_count]
|
||||
return df
|
||||
|
||||
|
||||
def _tokenize(tok_language, to_lowercase, cache_dir, df):
|
||||
print("Create a tokenizer...")
|
||||
tokenizer = Tokenizer(language=tok_language, to_lower=to_lowercase, cache_dir=cache_dir)
|
||||
tokens = tokenizer.tokenize(df[TEXT_COL])
|
||||
|
||||
print("Tokenize and preprocess text...")
|
||||
# tokenize
|
||||
token_ids, input_mask, token_type_ids = tokenizer.preprocess_classification_tokens(
|
||||
tokens, max_len=MAX_SEQ_LENGTH
|
||||
)
|
||||
return token_ids, input_mask, token_type_ids
|
||||
|
||||
|
||||
def _fit_train_labels(df):
|
||||
label_encoder = LabelEncoder()
|
||||
train_labels = label_encoder.fit_transform(df[LABEL_COL])
|
||||
train_labels = np.array(train_labels)
|
||||
return label_encoder, train_labels
|
||||
|
||||
|
||||
class XnliDataset(data.Dataset):
|
||||
def __init__(
|
||||
self,
|
||||
file_split=TRAIN_FILE_SPLIT,
|
||||
cache_dir=CACHE_DIR,
|
||||
language=LANGUAGE_ENGLISH,
|
||||
to_lowercase=TO_LOWER_CASE,
|
||||
tok_language=TOK_ENGLISH,
|
||||
data_percent_used=DATA_PERCENT_USED,
|
||||
):
|
||||
"""
|
||||
Load and tokenize the requested XNLI split.
|
||||
Args:
|
||||
file_split (str, optional):The subset to load.
|
||||
One of: {"train", "dev", "test"}
|
||||
Defaults to "train".
|
||||
cache_dir (str, optional):Path to store the data.
|
||||
Defaults to "./".
|
||||
language (str): Language of the XNLI file to load (e.g. "en", "zh").
|
||||
to_lowercase (bool): If True, convert the samples in the dataset to lowercase.
|
||||
tok_language (Language, optional): The pretrained model's language.
|
||||
Defaults to Language.ENGLISH.
|
||||
data_percent_used (float, optional): Fraction of the data used to create the Torch Dataset. Defaults to 1.0, i.e. 100% of the data.
|
||||
"""
|
||||
if file_split not in VALID_FILE_SPLIT:
|
||||
raise ValueError("The file split is not part of ", VALID_FILE_SPLIT)
|
||||
|
||||
self.file_split = file_split
|
||||
self.cache_dir = cache_dir
|
||||
self.language = language
|
||||
self.to_lowercase = to_lowercase
|
||||
self.tok_language = tok_language
|
||||
self.data_percent_used = data_percent_used
|
||||
|
||||
df = _load_pandas_df(self.cache_dir, self.file_split, self.language, self.data_percent_used)
|
||||
|
||||
self.df = df
|
||||
|
||||
token_ids, input_mask, token_type_ids = _tokenize(
|
||||
tok_language, to_lowercase, cache_dir, self.df
|
||||
)
|
||||
|
||||
self.token_ids = token_ids
|
||||
self.input_mask = input_mask
|
||||
self.token_type_ids = token_type_ids
|
||||
|
||||
if file_split == TRAIN_FILE_SPLIT:
|
||||
label_encoder, train_labels = _fit_train_labels(self.df)
|
||||
self.label_encoder = label_encoder
|
||||
self.labels = train_labels
|
||||
else:
|
||||
# use the label_encoder passed when you create the test/validate dataset
|
||||
self.labels = self.df[LABEL_COL]
|
||||
|
||||
def __len__(self):
|
||||
""" Denotes the total number of samples """
|
||||
return len(self.df)
|
||||
|
||||
def __getitem__(self, index):
|
||||
""" Generates one sample of data """
|
||||
token_ids = self.token_ids[index]
|
||||
input_mask = self.input_mask[index]
|
||||
token_type_ids = self.token_type_ids[index]
|
||||
labels = self.labels[index]
|
||||
|
||||
return {
|
||||
"token_ids": torch.tensor(token_ids, dtype=torch.long),
|
||||
"input_mask": torch.tensor(input_mask, dtype=torch.long),
|
||||
"token_type_ids": torch.tensor(token_type_ids, dtype=torch.long),
|
||||
"labels": labels,
|
||||
}
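
# Example usage (an illustrative sketch only; it simply exercises the XnliDataset class above
# with hypothetical paths and settings):
#
#   train_ds = XnliDataset(file_split="train", cache_dir="./xnli", language="en",
#                          to_lowercase=True, tok_language=Language.ENGLISH,
#                          data_percent_used=0.01)
#   sample = train_ds[0]
#   # sample is a dict with "token_ids", "input_mask", "token_type_ids", and "labels"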
|
|
@ -418,7 +418,8 @@ def create_data_loader(
|
|||
class TextDataset(Dataset):
|
||||
"""
|
||||
Characterizes a dataset for PyTorch which can be used to load a file containing multiple rows
|
||||
where each row is a training example.
|
||||
where each row is a training example. The format of each line in the file is assumed to be
|
||||
tokens, mask and label.
|
||||
"""
|
||||
|
||||
def __init__(self, filename):
|
||||
|
@ -457,11 +458,13 @@ class TextDataset(Dataset):
|
|||
tokens = self._cast(row[0][1:-1].split(","))
|
||||
mask = self._cast(row[1][1:-1].split(","))
|
||||
|
||||
return (
|
||||
torch.tensor(tokens, dtype=torch.long),
|
||||
torch.tensor(mask, dtype=torch.long),
|
||||
torch.tensor(int(row[2]), dtype=torch.long),
|
||||
)
|
||||
data = {
|
||||
"token_ids": torch.tensor(tokens, dtype=torch.long),
|
||||
"input_mask": torch.tensor(mask, dtype=torch.long),
|
||||
"labels": torch.tensor(int(row[2]), dtype=torch.long),
|
||||
}
|
||||
|
||||
return data
|
||||
|
||||
|
||||
def get_dataset_multiple_files(files):
|
||||
|
|
|
@ -1,44 +1,41 @@
|
|||
# Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
# Licensed under the MIT License.
|
||||
import logging
|
||||
|
||||
import horovod.torch as hvd
|
||||
# This script reuses some code from
|
||||
# https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_classifier.py
|
||||
|
||||
import os
|
||||
import warnings
|
||||
|
||||
import numpy as np
|
||||
import torch.nn as nn
|
||||
from torch.utils.data import TensorDataset
|
||||
import torch.utils.data.distributed
|
||||
import torch.utils.data
|
||||
from pytorch_pretrained_bert.modeling import BertForSequenceClassification
|
||||
from pytorch_pretrained_bert.optimization import BertAdam
|
||||
from tqdm import tqdm
|
||||
|
||||
from utils_nlp.common.pytorch_utils import get_device, move_to_device
|
||||
from utils_nlp.models.bert.common import Language
|
||||
|
||||
from utils_nlp.models.bert.common import get_dataset_multiple_files
|
||||
from utils_nlp.common.pytorch_utils import get_device, move_to_device
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
hvd.init()
|
||||
torch.manual_seed(42)
|
||||
|
||||
if torch.cuda.is_available():
|
||||
# Horovod: pin GPU to local rank.
|
||||
torch.cuda.set_device(hvd.local_rank())
|
||||
torch.cuda.manual_seed(42)
|
||||
try:
|
||||
import horovod.torch as hvd
|
||||
except ImportError:
|
||||
warnings.warn("Horovod not found! Distributed training will not be available.")
|
||||
|
||||
|
||||
class BERTSequenceDistClassifier:
|
||||
"""Distributed BERT-based sequence classifier"""
|
||||
class BERTSequenceClassifier:
|
||||
"""BERT-based sequence classifier"""
|
||||
|
||||
def __init__(self, language=Language.ENGLISH, num_labels=2, cache_dir="."):
|
||||
"""Initializes the classifier and the underlying pretrained model.
|
||||
def __init__(
|
||||
self, language=Language.ENGLISH, num_labels=2, cache_dir=".", use_distributed=False
|
||||
):
|
||||
|
||||
"""
|
||||
|
||||
Args:
|
||||
language (Language, optional): The pretrained model's language.
|
||||
Defaults to Language.ENGLISH.
|
||||
num_labels (int, optional): The number of unique labels in the
|
||||
training data. Defaults to 2.
|
||||
cache_dir (str, optional): Location of BERT's cache directory.
|
||||
Defaults to ".".
|
||||
language: Language used to select the appropriate pre-trained BERT model.
|
||||
num_labels: number of unique labels in train dataset
|
||||
cache_dir: Directory used to cache the pre-trained BERT model. Defaults to ".".
use_distributed: If True, initialize Horovod and use a distributed sampler/optimizer. Defaults to False.
|
||||
"""
|
||||
if num_labels < 2:
|
||||
raise ValueError("Number of labels should be at least 2.")
|
||||
|
@ -46,280 +43,280 @@ class BERTSequenceDistClassifier:
|
|||
self.language = language
|
||||
self.num_labels = num_labels
|
||||
self.cache_dir = cache_dir
|
||||
self.kwargs = (
|
||||
{"num_workers": 1, "pin_memory": True}
|
||||
if torch.cuda.is_available()
|
||||
else {}
|
||||
)
|
||||
self.use_distributed = use_distributed
|
||||
|
||||
# create classifier
|
||||
self.model = BertForSequenceClassification.from_pretrained(
|
||||
language.value, num_labels=num_labels
|
||||
language.value, cache_dir=cache_dir, num_labels=num_labels
|
||||
)
|
||||
|
||||
def fit(
|
||||
self,
|
||||
token_ids,
|
||||
input_mask,
|
||||
labels,
|
||||
token_type_ids=None,
|
||||
input_files,
|
||||
num_gpus=1,
|
||||
num_epochs=1,
|
||||
batch_size=32,
|
||||
lr=2e-5,
|
||||
warmup_proportion=None,
|
||||
verbose=True,
|
||||
fp16_allreduce=False,
|
||||
):
|
||||
"""fine-tunes the bert classifier using the given training data.
|
||||
|
||||
args:
|
||||
input_files(list, required): list of paths to the training data files.
|
||||
token_ids (list): List of training token id lists.
|
||||
input_mask (list): List of input mask lists.
|
||||
labels (list): List of training labels.
|
||||
token_type_ids (list, optional): List of lists. Each sublist
|
||||
contains segment ids indicating if the token belongs to
|
||||
the first sentence(0) or second sentence(1). Only needed
|
||||
for two-sentence tasks.
|
||||
num_gpus (int, optional): the number of gpus to use.
|
||||
if none is specified, all available gpus
|
||||
will be used. defaults to none.
|
||||
num_epochs (int, optional): number of training epochs.
|
||||
defaults to 1.
|
||||
batch_size (int, optional): training batch size. defaults to 32.
|
||||
lr (float): learning rate of the adam optimizer. defaults to 2e-5.
|
||||
warmup_proportion (float, optional): proportion of training to
|
||||
perform linear learning rate warmup for. e.g., 0.1 = 10% of
|
||||
training. defaults to none.
|
||||
verbose (bool, optional): if true, shows the training progress and
|
||||
loss values. defaults to true.
|
||||
fp16_allreduce (bool, optional): If True, use fp16 compression during allreduce.
|
||||
"""
|
||||
|
||||
if input_files is not None:
|
||||
train_dataset = get_dataset_multiple_files(input_files)
|
||||
else:
|
||||
token_ids_tensor = torch.tensor(token_ids, dtype=torch.long)
|
||||
input_mask_tensor = torch.tensor(input_mask, dtype=torch.long)
|
||||
labels_tensor = torch.tensor(labels, dtype=torch.long)
|
||||
|
||||
if token_type_ids:
|
||||
token_type_ids_tensor = torch.tensor(
|
||||
token_type_ids, dtype=torch.long
|
||||
)
|
||||
train_dataset = TensorDataset(
|
||||
token_ids_tensor,
|
||||
input_mask_tensor,
|
||||
token_type_ids_tensor,
|
||||
labels_tensor,
|
||||
)
|
||||
else:
|
||||
train_dataset = TensorDataset(
|
||||
token_ids_tensor, input_mask_tensor, labels_tensor
|
||||
)
|
||||
|
||||
train_sampler = torch.utils.data.distributed.DistributedSampler(
|
||||
train_dataset, num_replicas=hvd.size(), rank=hvd.rank()
|
||||
)
|
||||
train_loader = torch.utils.data.DataLoader(
|
||||
train_dataset,
|
||||
batch_size=batch_size,
|
||||
sampler=train_sampler,
|
||||
**self.kwargs
|
||||
)
|
||||
|
||||
device = get_device()
|
||||
self.model.cuda()
|
||||
|
||||
hvd.broadcast_parameters(self.model.state_dict(), root_rank=0)
|
||||
# hvd.broadcast_optimizer_state(optimizer, root_rank=0)
|
||||
|
||||
# define loss function
|
||||
loss_func = nn.CrossEntropyLoss().to(device)
|
||||
|
||||
# define optimizer and model parameters
|
||||
param_optimizer = list(self.model.named_parameters())
|
||||
no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
|
||||
optimizer_grouped_parameters = [
|
||||
{
|
||||
"params": [
|
||||
p
|
||||
for n, p in param_optimizer
|
||||
if not any(nd in n for nd in no_decay)
|
||||
],
|
||||
"params": [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
|
||||
"weight_decay": 0.01,
|
||||
},
|
||||
{
|
||||
"params": [
|
||||
p
|
||||
for n, p in param_optimizer
|
||||
if any(nd in n for nd in no_decay)
|
||||
]
|
||||
},
|
||||
{"params": [p for n, p in param_optimizer if any(nd in n for nd in no_decay)]},
|
||||
]
|
||||
self.optimizer_params = optimizer_grouped_parameters
|
||||
self.name_parameters = self.model.named_parameters()
|
||||
self.state_dict = self.model.state_dict()
|
||||
|
||||
num_examples = len(train_dataset)
|
||||
num_batches = int(num_examples / batch_size)
|
||||
num_train_optimization_steps = num_batches * num_epochs
|
||||
if use_distributed:
|
||||
hvd.init()
|
||||
if torch.cuda.is_available():
|
||||
torch.cuda.set_device(hvd.local_rank())
|
||||
else:
|
||||
warnings.warn("No GPU available! Using CPU.")
|
||||
|
||||
def create_optimizer(
|
||||
self, num_train_optimization_steps, lr=2e-5, fp16_allreduce=False, warmup_proportion=None
|
||||
):
|
||||
|
||||
"""
|
||||
Method to create an BERT Optimizer based on the inputs from the user.
|
||||
|
||||
Args:
|
||||
num_train_optimization_steps(int): Number of optimization steps.
|
||||
lr (float): learning rate of the adam optimizer. defaults to 2e-5.
|
||||
warmup_proportion (float, optional): proportion of training to
|
||||
perform linear learning rate warmup for. e.g., 0.1 = 10% of
|
||||
training. defaults to none.
|
||||
fp16_allreduce (bool, optional): If True, use fp16 compression during allreduce.
|
||||
|
||||
Returns:
|
||||
pytorch_pretrained_bert.optimization.BertAdam : A BertAdam optimizer with user
|
||||
specified config.
|
||||
|
||||
"""
|
||||
if self.use_distributed:
|
||||
lr = lr * hvd.size()
|
||||
|
||||
if warmup_proportion is None:
|
||||
optimizer = BertAdam(
|
||||
optimizer_grouped_parameters, lr=lr * hvd.size()
|
||||
)
|
||||
optimizer = BertAdam(self.optimizer_params, lr=lr)
|
||||
else:
|
||||
optimizer = BertAdam(
|
||||
optimizer_grouped_parameters,
|
||||
lr=lr * hvd.size(),
|
||||
self.optimizer_params,
|
||||
lr=lr,
|
||||
t_total=num_train_optimization_steps,
|
||||
warmup=warmup_proportion,
|
||||
)
|
||||
|
||||
# Horovod: (optional) compression algorithm.
|
||||
compression = (
|
||||
hvd.Compression.fp16 if fp16_allreduce else hvd.Compression.none
|
||||
)
|
||||
if self.use_distributed:
|
||||
compression = hvd.Compression.fp16 if fp16_allreduce else hvd.Compression.none
|
||||
optimizer = hvd.DistributedOptimizer(
|
||||
optimizer, named_parameters=self.model.named_parameters(), compression=compression
|
||||
)
|
||||
|
||||
# Horovod: wrap optimizer with DistributedOptimizer.
|
||||
optimizer = hvd.DistributedOptimizer(
|
||||
optimizer,
|
||||
named_parameters=self.model.named_parameters(),
|
||||
compression=compression,
|
||||
)
|
||||
return optimizer
|
||||
|
||||
# Horovod: set epoch to sampler for shuffling.
|
||||
for epoch in range(num_epochs):
|
||||
self.model.train()
|
||||
train_sampler.set_epoch(epoch)
|
||||
for batch_idx, batch in enumerate(train_loader):
|
||||
|
||||
if token_type_ids:
|
||||
x_batch, mask_batch, token_type_ids_batch, y_batch = tuple(
|
||||
t.to(device) for t in batch
|
||||
)
|
||||
else:
|
||||
token_type_ids_batch = None
|
||||
x_batch, mask_batch, y_batch = tuple(
|
||||
t.to(device) for t in batch
|
||||
)
|
||||
|
||||
optimizer.zero_grad()
|
||||
|
||||
output = self.model(
|
||||
input_ids=x_batch, attention_mask=mask_batch, labels=None
|
||||
)
|
||||
|
||||
loss = loss_func(output, y_batch).mean()
|
||||
loss.backward()
|
||||
optimizer.step()
|
||||
if verbose and (batch_idx % ((num_batches // 10) + 1)) == 0:
|
||||
# Horovod: use train_sampler to determine the number of examples in
|
||||
# this worker's partition.
|
||||
print(
|
||||
"Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}".format(
|
||||
epoch,
|
||||
batch_idx * len(x_batch),
|
||||
len(train_sampler),
|
||||
100.0 * batch_idx / len(train_loader),
|
||||
loss.item(),
|
||||
)
|
||||
)
|
||||
|
||||
# empty cache
|
||||
torch.cuda.empty_cache()
|
||||
|
||||
def predict(
|
||||
self,
|
||||
input_files = None,
|
||||
token_ids,
|
||||
input_mask,
|
||||
token_type_ids=None,
|
||||
input_files, num_gpus=1, batch_size=32, probabilities=False
|
||||
):
|
||||
"""Scores the given set of train files and returns the predicted classes.
|
||||
def create_data_loader(self, dataset, batch_size=32, mode="train", **kwargs):
|
||||
"""
|
||||
Method to create a data loader for a given Tensor dataset.
|
||||
|
||||
Args:
|
||||
input_files(list, required): list of paths to the test data files.
|
||||
token_ids (list): List of training token lists.
|
||||
input_mask (list): List of input mask lists.
|
||||
token_type_ids (list, optional): List of lists. Each sublist
|
||||
contains segment ids indicating if the token belongs to
|
||||
the first sentence(0) or second sentence(1). Only needed
|
||||
for two-sentence tasks.
|
||||
mode(str): Mode for creating data loader. Could be train or test.
|
||||
dataset(torch.utils.data.Dataset): A Tensor dataset.
|
||||
batch_size(int): Batch size.
|
||||
|
||||
Returns:
|
||||
torch.utils.data.DataLoader: A torch data loader to the given dataset.
|
||||
|
||||
"""
|
||||
|
||||
if mode == "test":
|
||||
sampler = torch.utils.data.sampler.SequentialSampler(dataset)
|
||||
elif self.use_distributed:
|
||||
sampler = torch.utils.data.distributed.DistributedSampler(
|
||||
dataset, num_replicas=hvd.size(), rank=hvd.rank()
|
||||
)
|
||||
else:
|
||||
sampler = torch.utils.data.RandomSampler(dataset)
|
||||
|
||||
data_loader = torch.utils.data.DataLoader(
|
||||
dataset, batch_size=batch_size, sampler=sampler, **kwargs
|
||||
)
|
||||
|
||||
return data_loader
|
||||
|
||||
def save_model(self):
|
||||
"""
|
||||
Method to save the trained model.
|
||||
#ToDo: Works for English Language now. Multiple language support needs to be added.
|
||||
|
||||
"""
|
||||
# Save the model to the outputs directory for capture
|
||||
output_dir = "outputs"
|
||||
os.makedirs(output_dir, exist_ok=True)
|
||||
|
||||
# Save a trained model, configuration and tokenizer
|
||||
model_to_save = self.model.module if hasattr(self.model, "module") else self.model
|
||||
|
||||
# If we save using the predefined names, we can load using `from_pretrained`
|
||||
output_model_file = "outputs/bert-large-uncased"
|
||||
output_config_file = "outputs/bert_config.json"
|
||||
|
||||
torch.save(model_to_save.state_dict(), output_model_file)
|
||||
model_to_save.config.to_json_file(output_config_file)
|
||||
|
||||
def fit(
|
||||
self,
|
||||
train_loader,
|
||||
epoch,
|
||||
bert_optimizer=None,
|
||||
num_epochs=1,
|
||||
num_gpus=0,
|
||||
lr=2e-5,
|
||||
warmup_proportion=None,
|
||||
fp16_allreduce=False,
|
||||
num_train_optimization_steps=10,
|
||||
):
|
||||
"""
|
||||
Method to fine-tune the bert classifier using the given training data
|
||||
|
||||
Args:
|
||||
train_loader(torch.DataLoader): Torch Dataloader created from Torch Dataset
|
||||
epoch(int): Current epoch number of training.
|
||||
bert_optimizer(optimizer): BertAdam optimizer, wrapped by Horovod's DistributedOptimizer when training is distributed.
|
||||
num_epochs(int): the number of epochs to run
|
||||
num_gpus(int): the number of gpus
|
||||
lr (float): learning rate of the adam optimizer. defaults to 2e-5.
|
||||
warmup_proportion (float, optional): proportion of training to
|
||||
perform linear learning rate warmup for. e.g., 0.1 = 10% of
|
||||
training. defaults to none.
|
||||
fp16_allreduce(bool): if true, use fp16 compression during allreduce
|
||||
num_train_optimization_steps: number of steps the optimizer should take.
|
||||
"""
|
||||
|
||||
device = get_device("cpu" if num_gpus == 0 else "gpu")
|
||||
|
||||
if device:
|
||||
self.model.cuda()
|
||||
|
||||
if bert_optimizer is None:
|
||||
bert_optimizer = self.create_optimizer(
|
||||
num_train_optimization_steps=num_train_optimization_steps,
|
||||
lr=lr,
|
||||
warmup_proportion=warmup_proportion,
|
||||
fp16_allreduce=fp16_allreduce,
|
||||
)
|
||||
|
||||
if self.use_distributed:
|
||||
hvd.broadcast_parameters(self.model.state_dict(), root_rank=0)
|
||||
|
||||
loss_func = nn.CrossEntropyLoss().to(device)
|
||||
|
||||
# train
|
||||
self.model.train() # training mode
|
||||
|
||||
token_type_ids_batch = None
|
||||
|
||||
num_print = 1000
|
||||
for batch_idx, data in enumerate(train_loader):
|
||||
|
||||
x_batch = data["token_ids"]
|
||||
x_batch = x_batch.cuda()
|
||||
|
||||
y_batch = data["labels"]
|
||||
y_batch = y_batch.cuda()
|
||||
|
||||
mask_batch = data["input_mask"]
|
||||
mask_batch = mask_batch.cuda()
|
||||
|
||||
if "token_type_ids" in data and data["token_type_ids"] is not None:
|
||||
token_type_ids_batch = data["token_type_ids"]
|
||||
token_type_ids_batch = token_type_ids_batch.cuda()
|
||||
|
||||
bert_optimizer.zero_grad()
|
||||
|
||||
y_h = self.model(
|
||||
input_ids=x_batch,
|
||||
token_type_ids=token_type_ids_batch,
|
||||
attention_mask=mask_batch,
|
||||
labels=None,
|
||||
)
|
||||
|
||||
loss = loss_func(y_h, y_batch).mean()
|
||||
loss.backward()
|
||||
|
||||
bert_optimizer.synchronize()
|
||||
bert_optimizer.step()
|
||||
|
||||
if batch_idx % num_print == 0:
|
||||
print(
|
||||
"Train Epoch: {}/{} ({:.0f}%) \t Batch:{} \tLoss: {:.6f}".format(
|
||||
epoch,
|
||||
num_epochs,
|
||||
100.0 * batch_idx / len(train_loader),
|
||||
batch_idx + 1,
|
||||
loss.item(),
|
||||
)
|
||||
)
|
||||
|
||||
del [x_batch, y_batch, mask_batch, token_type_ids_batch]
|
||||
torch.cuda.empty_cache()
|
||||
|
||||
def predict(self, test_loader, num_gpus=None, probabilities=False):
|
||||
"""
|
||||
|
||||
Method to predict results on the test loader. In a distributed setup, evaluation runs
|
||||
as a non-distributed workload on the head node only.
|
||||
|
||||
Args:
|
||||
test_loader(torch Dataloader): Torch Dataloader created from Torch Dataset
|
||||
num_gpus (int, optional): The number of gpus to use.
|
||||
If None is specified, all available GPUs
|
||||
will be used. Defaults to None.
|
||||
batch_size (int, optional): Scoring batch size. Defaults to 32.
|
||||
probabilities (bool, optional):
|
||||
If True, the predicted probability distribution
|
||||
is also returned. Defaults to False.
|
||||
|
||||
Returns:
|
||||
1darray, dict(1darray, 1darray, ndarray): Predicted classes and target labels or
|
||||
a dictionary with classes, target labels, probabilities) if probabilities is True.
|
||||
"""
|
||||
|
||||
if input_files is not None:
|
||||
test_dataset = get_dataset_multiple_files(input_files)
|
||||
|
||||
else:
|
||||
token_ids_tensor = torch.tensor(token_ids, dtype=torch.long)
|
||||
input_mask_tensor = torch.tensor(input_mask, dtype=torch.long)
|
||||
|
||||
if token_type_ids:
|
||||
token_type_ids_tensor = torch.tensor(
|
||||
token_type_ids, dtype=torch.long
|
||||
)
|
||||
test_dataset = TensorDataset(
|
||||
token_ids_tensor, input_mask_tensor, token_type_ids_tensor
|
||||
)
|
||||
else:
|
||||
test_dataset = TensorDataset(token_ids_tensor, input_mask_tensor)
|
||||
|
||||
# Horovod: use DistributedSampler to partition the test data.
|
||||
test_sampler = torch.utils.data.sampler.SequentialSampler(test_dataset)
|
||||
|
||||
test_loader = torch.utils.data.DataLoader(
|
||||
test_dataset,
|
||||
batch_size=batch_size,
|
||||
sampler=test_sampler,
|
||||
**self.kwargs
|
||||
)
|
||||
|
||||
device = get_device()
|
||||
device = get_device("cpu" if num_gpus == 0 else "gpu")
|
||||
self.model = move_to_device(self.model, device, num_gpus)
|
||||
|
||||
# score
|
||||
self.model.eval()
|
||||
|
||||
preds = []
|
||||
labels_test = []
|
||||
test_labels = []
|
||||
for i, data in enumerate(tqdm(test_loader, desc="Iteration")):
|
||||
x_batch = data["token_ids"]
|
||||
x_batch = x_batch.cuda()
|
||||
|
||||
with tqdm(total=len(test_loader)) as pbar:
|
||||
for i, (tokens, mask, target) in enumerate(test_loader):
|
||||
if torch.cuda.is_available():
|
||||
tokens, mask, target = (
|
||||
tokens.cuda(),
|
||||
mask.cuda(),
|
||||
target.cuda(),
|
||||
)
|
||||
mask_batch = data["input_mask"]
|
||||
mask_batch = mask_batch.cuda()
|
||||
|
||||
with torch.no_grad():
|
||||
p_batch = self.model(
|
||||
input_ids=tokens, attention_mask=mask, labels=None
|
||||
)
|
||||
preds.append(p_batch.cpu())
|
||||
labels_test.append(target.cpu())
|
||||
if i % batch_size == 0:
|
||||
pbar.update(batch_size)
|
||||
y_batch = data["labels"]
|
||||
|
||||
token_type_ids_batch = None
|
||||
if "token_type_ids" in data and data["token_type_ids"] is not None:
|
||||
token_type_ids_batch = data["token_type_ids"]
|
||||
token_type_ids_batch = token_type_ids_batch.cuda()
|
||||
|
||||
with torch.no_grad():
|
||||
p_batch = self.model(
|
||||
input_ids=x_batch,
|
||||
token_type_ids=token_type_ids_batch,
|
||||
attention_mask=mask_batch,
|
||||
labels=None,
|
||||
)
|
||||
preds.append(p_batch.cpu())
|
||||
test_labels.append(y_batch)
|
||||
|
||||
preds = np.concatenate(preds)
|
||||
labels_test = np.concatenate(labels_test)
|
||||
test_labels = np.concatenate(test_labels)
|
||||
|
||||
if probabilities:
|
||||
return {
|
||||
"Predictions": preds.argmax(axis=1),
|
||||
"Target": labels_test,
|
||||
"classes probabilities": nn.Softmax(dim=1)(
|
||||
torch.Tensor(preds)
|
||||
).numpy(),
|
||||
"Target": test_labels,
|
||||
"classes probabilities": nn.Softmax(dim=1)(torch.Tensor(preds)).numpy(),
|
||||
}
|
||||
else:
|
||||
return preds.argmax(axis=1), labels_test
|
||||
return preds.argmax(axis=1), test_labels
|
||||
|
|