Merge pull request #85 from microsoft/casey-senteval

SentEval examples (local and with azureml support)
This commit is contained in:
Said Bleik 2019-06-11 14:39:51 -04:00 committed by GitHub
Parents 2afdb73a29 ae67549185
Commit 98a7071294
No key found matching this signature
GPG key ID: 4AEE18F83AFDEB23
8 changed files with 944 additions and 3 deletions

View file

@@ -0,0 +1,404 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# SentEval with AzureML\n",
"[SentEval](https://github.com/facebookresearch/SentEval) is a widely used benchmarking tool for evaluating general-purpose sentence embeddings. It provides a simple interface for evaluating your embeddings on up to 17 supported downstream tasks (such as sentiment classification, natural language inference, semantic similarity, etc.)\n",
"\n",
"This notebook shows how to use SentEval with the AzureML SDK. Running SentEval locally is easy, but not necessarily efficient depending on the model specs. For example, it can quickly become expensive if you are trying to benchmark a model that runs on GPU, even if you are starting with pretrained weights (loading the embeddings and vocabulary for inferencing can take a nontrivial amount of time). In this example we show how to run SentEval for [Gensen](https://github.com/Maluuba/gensen), where\n",
"- the model weights are on AzureML Datastore. To download the pre-trained Gensen model, run `bash download_models.sh` from the gensen/data/models directory. \n",
"- the embeddings are on AzureML Datastore. To download the pre-trained embeddings, run `bash glove2h5.sh` from the gensen/data/embedding directory.\n",
"- the data for the SentEval transfer tasks are on AzureML Datastore. To download these datasets, run `bash get_transfer_data.bash` from the SentEval/data/downstream directory.\n",
"- evaluation runs on the AzureML Workspace GPU Compute Target (no extra provisioning/config needed)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Global Settings"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import sys\n",
"\n",
"import azureml.core\n",
"from azureml.core.workspace import Workspace\n",
"\n",
"from azureml.core.compute import ComputeTarget, AmlCompute\n",
"from azureml.core.compute_target import ComputeTargetException\n",
"\n",
"from azureml.core import Datastore\n",
"import azureml.data\n",
"from azureml.data.azure_storage_datastore import AzureFileDatastore\n",
"\n",
"from azureml.train.dnn import PyTorch\n",
"from azureml.core.runconfig import MpiConfiguration\n",
"from azureml.core import Experiment\n",
"from azureml.widgets import RunDetails\n",
"\n",
"sys.path.append(\"../../\")\n",
"from utils_nlp.azureml.azureml_utils import get_or_create_workspace\n",
"\n",
"AZUREML_VERBOSE = False\n",
"\n",
"PATH_TO_GENSEN = (\n",
" \"../../../gensen\"\n",
") # Set this path to where you have cloned the gensen source code\n",
"PATH_TO_SENTEVAL = (\n",
" \"../../../SentEval\"\n",
") # Set this path to where you have cloned the senteval source code\n",
"\n",
"cluster_name = \"eval-gpu\" # Name of AzureML Compute Target cluster\n",
"ds_root = \"senteval_pytorch_gensen\" # Name of root directory for the datastore"
]
},
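{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, before defining any AzureML resources, confirm that the locally downloaded GenSen and SentEval assets listed in the introduction are in place. The next cell is a minimal sketch: the exact paths are assumptions based on the defaults of the download scripts above, so adjust them if your local layout differs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sanity check (optional): verify the local dependencies exist before uploading.\n",
"# These paths assume the default output locations of the download scripts.\n",
"expected_paths = [\n",
"    os.path.join(PATH_TO_GENSEN, \"data/models\"),\n",
"    os.path.join(PATH_TO_GENSEN, \"data/embedding/glove.840B.300d.h5\"),\n",
"    os.path.join(PATH_TO_SENTEVAL, \"data/downstream\"),\n",
"]\n",
"for p in expected_paths:\n",
"    print(\"{} exists: {}\".format(p, os.path.exists(p)))"
]
},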
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Define the AzureML Workspace"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"ws = get_or_create_workspace(\n",
" subscription_id=\"<SUBSCRIPTION_ID>\",\n",
" resource_group=\"<RESOURCE_GROUP>\",\n",
" workspace_name=\"<WORKSPACE_NAME>\",\n",
" workspace_region=\"<WORKSPACE_REGION>\",\n",
")\n",
"\n",
"if AZUREML_VERBOSE:\n",
" print(\"Workspace name: {}\".format(ws.name))\n",
" print(\"Resource group: {}\".format(ws.resource_group))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Attach the gpu-enabled compute target, or create a new one if it doesn't already exist."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"try:\n",
" compute_target = ComputeTarget(workspace=ws, name=cluster_name)\n",
" print(\"Found compute target: {}\".format(cluster_name))\n",
"except ComputeTargetException:\n",
" print(\"Creating new compute target: {}\".format(cluster_name))\n",
" compute_config = AmlCompute.provisioning_configuration(\n",
" vm_size=\"STANDARD_NC6\", max_nodes=4\n",
" )\n",
" compute_target = ComputeTarget.create(ws, cluster_name, compute_config)\n",
" compute_target.wait_for_completion(show_output=True)\n",
"\n",
"if AZUREML_VERBOSE:\n",
" print(compute_target.get_status().serialize())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Define the datastore. Here we will use the default datastore and then upload our external dependencies. \n",
"\n",
"If your data is already on the cloud, you can register your resource on any Azure storage account as the datastore. (Currently, the list of supported Azure storage services that can be registered as datastores are Azure Blob Container, Azure File Share, Azure Data Lake, Azure Data Lake Gen2, Azure SQL Database, Azure PostgreSQL, and Databricks File System. Learn more about the Datastore module [here](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.datastore?view=azure-ml-py).)"
]
},
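{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next cell is an optional sketch of registering your own Azure Blob container as a datastore instead of uploading to the default one; the datastore, container, and account names (and the key) are placeholders, not values used elsewhere in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch: register an existing Azure Blob container as a datastore.\n",
"# The names and key below are placeholders; replace them with your own values.\n",
"REGISTER_EXTERNAL_DATASTORE = False  # set to True to use your own storage account\n",
"\n",
"if REGISTER_EXTERNAL_DATASTORE:\n",
"    ds = Datastore.register_azure_blob_container(\n",
"        workspace=ws,\n",
"        datastore_name=\"senteval_blob\",\n",
"        container_name=\"<CONTAINER_NAME>\",\n",
"        account_name=\"<STORAGE_ACCOUNT_NAME>\",\n",
"        account_key=\"<STORAGE_ACCOUNT_KEY>\",\n",
"    )"
]
},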
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"ds = ws.get_default_datastore()\n",
"if AZUREML_VERBOSE:\n",
" print(\"Default datastore: {}\".format(ds.name))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Upload the gensen dependency\n",
"ds.upload(\n",
" src_dir=os.path.join(PATH_TO_GENSEN),\n",
" target_path=os.path.join(ds_root, \"gensen_lib\"),\n",
" overwrite=False,\n",
" show_progress=AZUREML_VERBOSE,\n",
")\n",
"\n",
"# Upload the senteval dependency\n",
"ds.upload(\n",
" src_dir=os.path.join(PATH_TO_SENTEVAL),\n",
" target_path=os.path.join(ds_root, \"senteval_lib\"),\n",
" overwrite=False,\n",
" show_progress=AZUREML_VERBOSE,\n",
")\n",
"\n",
"# Upload the utils_nlp/eval/senteval.py dependency (this defines the azureml-compatible wrapper for senteval)\n",
"ds.upload_files(\n",
" files=[\"../../utils_nlp/eval/senteval.py\"],\n",
" target_path=os.path.join(ds_root, \"utils_nlp/eval\"),\n",
" overwrite=False,\n",
" show_progress=AZUREML_VERBOSE,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that after the upload is complete, you can safely delete the dependencies from your local machine to free up some memory."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create the evaluation script"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"src_dir = \"./senteval-pytorch-gensen\"\n",
"os.makedirs(src_dir, exist_ok=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile $src_dir/evaluate.py\n",
"import os\n",
"import sys\n",
"import argparse\n",
"import torch\n",
"import pandas as pd\n",
"\n",
"if __name__ == \"__main__\":\n",
" parser = argparse.ArgumentParser()\n",
" parser.add_argument(\"--ds_gensen\", type=str, dest=\"ds_gensen\")\n",
" parser.add_argument(\"--ds_senteval\", type=str, dest=\"ds_senteval\")\n",
" parser.add_argument(\"--ds_utils\", type=str, dest=\"ds_utils\")\n",
" args = parser.parse_args()\n",
" \n",
" # Import the dependencies\n",
" sys.path.insert(0, args.ds_gensen)\n",
" from gensen import GenSen, GenSenSingle\n",
" sys.path.insert(0, args.ds_utils)\n",
" from eval.senteval import SentEvalRunner\n",
"\n",
" # Define the model\n",
" model_params = {}\n",
" model_params[\"folder_path\"] = os.path.join(args.ds_gensen, \"data/models\")\n",
" model_params[\"prefix_1\"] = \"nli_large_bothskip_parse\"\n",
" model_params[\"prefix_2\"] = \"nli_large_bothskip\"\n",
" model_params[\"pretrain\"] = os.path.join(\n",
" args.ds_gensen, \"data/embedding/glove.840B.300d.h5\"\n",
" )\n",
" model_params[\"cuda\"] = torch.cuda.is_available()\n",
"\n",
" gensen_1 = GenSenSingle(\n",
" model_folder=model_params[\"folder_path\"],\n",
" filename_prefix=model_params[\"prefix_1\"],\n",
" pretrained_emb=model_params[\"pretrain\"],\n",
" cuda=model_params[\"cuda\"],\n",
" )\n",
" gensen_2 = GenSenSingle(\n",
" model_folder=model_params[\"folder_path\"],\n",
" filename_prefix=model_params[\"prefix_2\"],\n",
" pretrained_emb=model_params[\"pretrain\"],\n",
" cuda=model_params[\"cuda\"],\n",
" )\n",
" gensen = GenSen(gensen_1, gensen_2)\n",
"\n",
" # Define the SentEval Runner, an AzureML-compatible wrapper class for SentEval\n",
" ser = SentEvalRunner(path_to_senteval=args.ds_senteval, use_azureml=True)\n",
" ser.set_transfer_data_path(relative_path=\"data\")\n",
" ser.set_transfer_tasks(\n",
" [\"STSBenchmark\", \"STS12\", \"STS13\", \"STS14\", \"STS15\", \"STS16\"]\n",
" )\n",
" ser.set_model(gensen)\n",
" ser.set_params_senteval() # accepts defaults\n",
"\n",
" # Define the batcher and prepare functions for SentEval\n",
" def prepare(params, samples):\n",
" vocab = set()\n",
" for sample in samples:\n",
" if params.current_task != \"TREC\":\n",
" sample = \" \".join(sample).lower().split()\n",
" else:\n",
" sample = \" \".join(sample).split()\n",
" for word in sample:\n",
" if word not in vocab:\n",
" vocab.add(word)\n",
"\n",
" vocab.add(\"<s>\")\n",
" vocab.add(\"<pad>\")\n",
" vocab.add(\"<unk>\")\n",
" vocab.add(\"</s>\")\n",
" # Optional vocab expansion\n",
" # params[\"model\"].vocab_expansion(vocab)\n",
"\n",
" def batcher(params, batch):\n",
" # batch contains list of words\n",
" max_tasks = [\"MR\", \"CR\", \"SUBJ\", \"MPQA\", \"ImageCaptionRetrieval\"]\n",
" if params.current_task in max_tasks:\n",
" strategy = \"max\"\n",
" else:\n",
" strategy = \"last\"\n",
"\n",
" sentences = [\" \".join(s).lower() for s in batch]\n",
" _, embeddings = params[\"model\"].get_representation(\n",
" sentences, pool=strategy, return_numpy=True\n",
" )\n",
" return embeddings\n",
"\n",
" # Run SentEval\n",
" results = ser.run(batcher, prepare)\n",
"\n",
" # Print results as table\n",
" eval_metrics = ser.print_mean(\n",
" results,\n",
" selected_metrics=[\"pearson\", \"spearman\"],\n",
" )\n",
" print(eval_metrics.head(eval_metrics.shape[0]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a Pytorch Estimator to submit the evaluation script to the compute target"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"est = PyTorch(\n",
" source_directory=src_dir,\n",
" script_params={\n",
" \"--ds_gensen\": ds.path(\"{}/gensen_lib\".format(ds_root)).as_mount(),\n",
" \"--ds_senteval\": ds.path(\"{}/senteval_lib\".format(ds_root)).as_mount(),\n",
" \"--ds_utils\": ds.path(\"{}/utils_nlp\".format(ds_root)).as_mount(),\n",
" },\n",
" compute_target=compute_target,\n",
" entry_script=\"evaluate.py\",\n",
" node_count=4,\n",
" process_count_per_node=1,\n",
" distributed_training=MpiConfiguration(),\n",
" use_gpu=True,\n",
" framework_version=\"1.0\",\n",
" conda_packages=[\"h5py\", \"nltk\"],\n",
" pip_packages=[\"pandas\"],\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Run Evaluation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"experiment = Experiment(ws, name=\"senteval-pytorch-gensen\")\n",
"run = experiment.submit(est)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Visualize the run via a Jupyter widget."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"RunDetails(run).show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, block until the script has completed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#run.wait_for_completion(show_output=AZUREML_VERBOSE)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python (nlp_cpu)",
"language": "python",
"name": "nlp_cpu"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View file

@@ -0,0 +1,336 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# SentEval on Local\n",
"\n",
"SentEval is a widely used benchmarking tool for evaluating general-purpose sentence embeddings. It provides a simple interface for evaluating your embeddings on up to 17 supported downstream tasks (such as sentiment classification, natural language inference, semantic similarity, etc.)\n",
"\n",
"Running SentEval locally is simple. Clone the [repository](https://github.com/facebookresearch/SentEval), follow their setup instructions to get the data for the transfer tasks, and implement two functions `prepare(params, samples)` and `batcher(params, batch)` specific to your model. The authors provide some guidance on how to do this in the [examples](https://github.com/facebookresearch/SentEval/tree/master/examples) directory of their repository. In this notebook we show an example for evaluating the GenSen model on the available STS downstream tasks."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 00 Global Settings"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"System version: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) \n",
"[GCC 7.3.0]\n",
"Torch version: 1.0.1\n"
]
}
],
"source": [
"import os\n",
"import sys\n",
"import json\n",
"import torch\n",
"import pandas as pd\n",
"\n",
"sys.path.append(\"../../\")\n",
"from utils_nlp.eval.senteval import SentEvalRunner\n",
"\n",
"print(\"System version: {}\".format(sys.version))\n",
"print(\"Torch version: {}\".format(torch.__version__))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 01 SentEval Settings"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"PATH_TO_SENTEVAL = (\n",
" \"../../../SentEval\"\n",
") # Set this path to where you have cloned the senteval source code\n",
"sys.path.insert(0, PATH_TO_SENTEVAL)\n",
"import senteval\n",
"\n",
"transfer_tasks = [\"STSBenchmark\", \"STS12\", \"STS13\", \"STS14\", \"STS15\", \"STS16\"]\n",
"\n",
"params_senteval = {\n",
" \"task_path\": os.path.join(PATH_TO_SENTEVAL, \"data\"),\n",
" \"usepytorch\": True,\n",
" \"kfold\": 10,\n",
"}\n",
"params_senteval[\"classifier\"] = {\n",
" \"nhid\": 0,\n",
" \"optim\": \"adam\",\n",
" \"batch_size\": 64,\n",
" \"tenacity\": 5,\n",
" \"epoch_size\": 4,\n",
"}"
]
},
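{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optional sanity check (a sketch, assuming the default SentEval data layout): verify that the transfer-task data downloaded by `get_transfer_data.bash` is at the path SentEval expects."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: confirm the SentEval transfer-task data is where task_path points.\n",
"assert os.path.isdir(\n",
"    params_senteval[\"task_path\"]\n",
"), \"SentEval data not found at {}\".format(params_senteval[\"task_path\"])"
]
},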
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 02 GenSen Settings"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"model params: {\n",
" \"folder_path\": \"../../../gensen/data/models\",\n",
" \"prefix_1\": \"nli_large_bothskip_parse\",\n",
" \"prefix_2\": \"nli_large_bothskip\",\n",
" \"pretrain\": \"../../../gensen/data/embedding/glove.840B.300d.h5\",\n",
" \"cuda\": true\n",
"}\n"
]
}
],
"source": [
"PATH_TO_GENSEN = (\n",
" \"../../../gensen\"\n",
") # Set this path to where you have cloned the gensen source code\n",
"sys.path.append(PATH_TO_GENSEN)\n",
"from gensen import GenSen, GenSenSingle\n",
"\n",
"model_params = {}\n",
"model_params[\"folder_path\"] = os.path.join(PATH_TO_GENSEN, \"data/models\")\n",
"model_params[\"prefix_1\"] = \"nli_large_bothskip_parse\"\n",
"model_params[\"prefix_2\"] = \"nli_large_bothskip\"\n",
"model_params[\"pretrain\"] = os.path.join(\n",
" PATH_TO_GENSEN, \"data/embedding/glove.840B.300d.h5\"\n",
")\n",
"model_params[\"cuda\"] = torch.cuda.is_available()\n",
"\n",
"print(\"model params: {}\".format(json.dumps(model_params, indent=4)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 03 SentEval Functions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As specified in the SentEval [repo](https://github.com/facebookresearch/SentEval#how-to-use-senteval), we implement 2 functions:\n",
"\n",
"<b>prepare</b> (sees the whole dataset of each task and can thus construct the word vocabulary, the dictionary of word vectors etc) \n",
"<b>batcher</b> (transforms a batch of text sentences into sentence embeddings)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"def prepare(params, samples):\n",
" vocab = set()\n",
" for sample in samples:\n",
" if params.current_task != \"TREC\":\n",
" sample = \" \".join(sample).lower().split()\n",
" else:\n",
" sample = \" \".join(sample).split()\n",
" for word in sample:\n",
" if word not in vocab:\n",
" vocab.add(word)\n",
"\n",
" vocab.add(\"<s>\")\n",
" vocab.add(\"<pad>\")\n",
" vocab.add(\"<unk>\")\n",
" vocab.add(\"</s>\")\n",
" # Optional vocab expansion\n",
" # params[\"model\"].vocab_expansion(vocab)\n",
"\n",
"\n",
"def batcher(params, batch):\n",
" # batch contains list of words\n",
" max_tasks = [\"MR\", \"CR\", \"SUBJ\", \"MPQA\", \"ImageCaptionRetrieval\"]\n",
" if params.current_task in max_tasks:\n",
" strategy = \"max\"\n",
" else:\n",
" strategy = \"last\"\n",
"\n",
" sentences = [\" \".join(s).lower() for s in batch]\n",
" _, embeddings = params[\"model\"].get_representation(\n",
" sentences, pool=strategy, return_numpy=True\n",
" )\n",
" return embeddings"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 04 Run SentEval on GenSen"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"gensen_1 = GenSenSingle(\n",
" model_folder=model_params[\"folder_path\"],\n",
" filename_prefix=model_params[\"prefix_1\"],\n",
" pretrained_emb=model_params[\"pretrain\"],\n",
" cuda=model_params[\"cuda\"],\n",
")\n",
"gensen_2 = GenSenSingle(\n",
" model_folder=model_params[\"folder_path\"],\n",
" filename_prefix=model_params[\"prefix_2\"],\n",
" pretrained_emb=model_params[\"pretrain\"],\n",
" cuda=model_params[\"cuda\"],\n",
")\n",
"gensen = GenSen(gensen_1, gensen_2)\n",
"\n",
"ser = SentEvalRunner(path_to_senteval=PATH_TO_SENTEVAL, use_azureml=False)\n",
"ser.set_transfer_data_path(\"data\")\n",
"ser.set_transfer_tasks(transfer_tasks)\n",
"ser.set_model(gensen)\n",
"ser.set_params_senteval()\n",
"results = ser.run(batcher, prepare)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Print selected metrics from the model's results on the transfer tasks as a table."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>pearson</th>\n",
" <th>spearman</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>STSBenchmark</th>\n",
" <td>0.782</td>\n",
" <td>0.786</td>\n",
" </tr>\n",
" <tr>\n",
" <th>STS12</th>\n",
" <td>0.608</td>\n",
" <td>0.609</td>\n",
" </tr>\n",
" <tr>\n",
" <th>STS13</th>\n",
" <td>0.540</td>\n",
" <td>0.551</td>\n",
" </tr>\n",
" <tr>\n",
" <th>STS14</th>\n",
" <td>0.651</td>\n",
" <td>0.636</td>\n",
" </tr>\n",
" <tr>\n",
" <th>STS15</th>\n",
" <td>0.736</td>\n",
" <td>0.738</td>\n",
" </tr>\n",
" <tr>\n",
" <th>STS16</th>\n",
" <td>0.668</td>\n",
" <td>0.672</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" pearson spearman\n",
"STSBenchmark 0.782 0.786\n",
"STS12 0.608 0.609\n",
"STS13 0.540 0.551\n",
"STS14 0.651 0.636\n",
"STS15 0.736 0.738\n",
"STS16 0.668 0.672"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"eval_metrics = ser.print_mean(results, selected_metrics=[\"pearson\", \"spearman\"])\n",
"eval_metrics.head(eval_metrics.shape[0])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View file

@@ -36,6 +36,7 @@ def test_load_pretrained_vectors_word2vec():
    shutil.rmtree(os.path.join(os.getcwd(), dir_path))

    assert isinstance(load_word2vec(dir_path), Word2VecKeyedVectors)


def test_load_pretrained_vectors_glove():
    dir_path = "temp_data/"

View file

@@ -0,0 +1,75 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

import os

from azureml.core import Workspace


def get_or_create_workspace(
    config_path=None,
    subscription_id=None,
    resource_group=None,
    workspace_name=None,
    workspace_region=None,
):
    """Get or create an AzureML Workspace. This will save the config to the path specified for later use.

    Args:
        config_path (str): optional directory to look for / store config.json file (defaults to current directory)
        subscription_id (str): subscription id
        resource_group (str): resource group
        workspace_name (str): workspace name
        workspace_region (str): region

    Returns:
        Workspace
    """

    # use environment variables if needed
    if subscription_id is None:
        subscription_id = os.getenv("SUBSCRIPTION_ID")
    if resource_group is None:
        resource_group = os.getenv("RESOURCE_GROUP")
    if workspace_name is None:
        workspace_name = os.getenv("WORKSPACE_NAME")
    if workspace_region is None:
        workspace_region = os.getenv("WORKSPACE_REGION")

    # define fallback options in order to try
    options = [
        (
            Workspace,
            dict(
                subscription_id=subscription_id,
                resource_group=resource_group,
                workspace_name=workspace_name,
            ),
        ),
        (Workspace.from_config, dict(path=config_path)),
        (
            Workspace.create,
            dict(
                subscription_id=subscription_id,
                resource_group=resource_group,
                name=workspace_name,
                location=workspace_region,
                create_resource_group=True,
                exist_ok=True,
            ),
        ),
    ]

    for function, kwargs in options:
        try:
            ws = function(**kwargs)
            break
        except Exception:
            continue
    else:
        raise ValueError(
            "Failed to get or create AzureML Workspace with the configuration information provided"
        )

    ws.write_config(path=config_path)
    return ws

utils_nlp/eval/senteval.py (normal file, 128 lines)
View file

@@ -0,0 +1,128 @@
import os
import sys

import pandas as pd


class SentEvalRunner:
    def __init__(self, path_to_senteval=".", use_azureml=False):
        """AzureML-compatible wrapper class that interfaces with the original implementation of SentEval

        Args:
            path_to_senteval (str, optional): Path to the SentEval source code.
            use_azureml (bool, optional): Defaults to false.
        """
        self.path_to_senteval = path_to_senteval
        self.use_azureml = use_azureml

    def set_transfer_data_path(self, relative_path):
        """Set the datapath that contains the datasets for the SentEval transfer tasks

        Args:
            relative_path (str): Relative datapath
        """
        self.transfer_data_path = os.path.join(
            self.path_to_senteval, relative_path
        )

    def set_transfer_tasks(self, task_list):
        """Set the transfer tasks to use for evaluation

        Args:
            task_list (list(str)): List of downstream transfer tasks
        """
        self.transfer_tasks = task_list

    def set_model(self, model):
        """Set the model to evaluate"""
        self.model = model

    def set_params_senteval(
        self,
        use_pytorch=True,
        kfold=10,
        nhid=0,
        optim="adam",
        batch_size=64,
        tenacity=5,
        epoch_size=4,
    ):
        """
        Define the required parameters for SentEval (model, task_path, usepytorch, kfold).
        Also gives the option to directly set parameters for a classifier if necessary.
        """
        self.params_senteval = {
            "model": self.model,
            "task_path": self.transfer_data_path,
            "usepytorch": use_pytorch,
            "kfold": kfold,
        }

        classifying_tasks = {
            "MR",
            "CR",
            "SUBJ",
            "MPQA",
            "SST2",
            "SST5",
            "TREC",
            "SICKEntailment",
            "SNLI",
            "MRPC",
        }

        if any(t in classifying_tasks for t in self.transfer_tasks):
            self.params_senteval["classifier"] = {
                "nhid": nhid,
                "optim": optim,
                "batch_size": batch_size,
                "tenacity": tenacity,
                "epoch_size": epoch_size,
            }

    def run(self, batcher_func, prepare_func):
        """Run the SentEval engine on the model on the transfer tasks

        Args:
            batcher_func (function): Function required by SentEval that transforms a batch of text sentences into
                sentence embeddings
            prepare_func (function): Function that sees the whole dataset of each task and can thus construct the word
                vocabulary, the dictionary of word vectors, etc

        Returns:
            dict: Dictionary of results
        """
        if self.use_azureml:
            sys.path.insert(
                0, os.path.relpath(self.path_to_senteval, os.getcwd())
            )
            import senteval
        else:
            sys.path.insert(0, self.path_to_senteval)
            import senteval

        se = senteval.engine.SE(
            self.params_senteval, batcher_func, prepare_func
        )
        return se.eval(self.transfer_tasks)

    def print_mean(self, results, selected_metrics=[], round_decimals=3):
        """Print the means of selected metrics of the transfer tasks as a table

        Args:
            results (dict): Results from the SentEval evaluation engine
            selected_metrics (list(str), optional): List of metric names
            round_decimals (int, optional): Number of decimal digits to round to; defaults to 3
        """
        data = []
        for task in self.transfer_tasks:
            if "all" in results[task]:
                row = [
                    results[task]["all"][metric]["mean"]
                    for metric in selected_metrics
                ]
            else:
                row = [results[task][metric] for metric in selected_metrics]
            data.append(row)

        table = pd.DataFrame(
            data=data, columns=selected_metrics, index=self.transfer_tasks
        )
        return table.round(round_decimals)

View file

@@ -31,7 +31,6 @@ def _extract_fasttext_vectors(zip_path, dest_path="."):
    os.remove(zip_path)
    return dest_path


def _download_fasttext_vectors(download_dir, file_name="wiki.simple.zip"):
    """ Downloads pre-trained word vectors for English, trained on Wikipedia using
    fastText. You can directly download the vectors from here:

View file

@@ -87,7 +87,6 @@ def load_pretrained_vectors(dir_path, file_name="glove.840B.300d.txt", limit=Non
    Returns:
        gensim.models.keyedvectors.Word2VecKeyedVectors: Loaded word2vectors
    """
    file_path = _maybe_download_and_extract(dir_path, file_name)

View file

@@ -16,7 +16,6 @@ def _extract_word2vec_vectors(zip_path, dest_filepath):
    Args:
        zip_path: Path to the downloaded compressed file.
        dest_filepath: Final destination file path to the extracted zip file.
    """
    if os.path.exists(zip_path):