Merge pull request #179 from microsoft/abhiram-gensen
Notebook test for Gensen Local.
This commit is contained in:
Commit 79e99dc826
scenarios/sentence_similarity/gensen_local.ipynb
@@ -2,11 +2,7 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "metadata": {
-    "nbpresent": {
-     "id": "75caf421-c00a-4d6d-8a3d-47ebe7493af5"
-    }
-   },
+   "metadata": {},
    "source": [
     "\n",
     "Copyright (c) Microsoft Corporation. All rights reserved.\n",
@@ -16,11 +12,7 @@
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "nbpresent": {
-     "id": "0738bb22-14af-45ca-9ad7-e0c068f280cf"
-    }
-   },
+   "metadata": {},
    "source": [
     "# GenSen with Pytorch\n",
     "In this tutorial, you will train a GenSen model for the sentence similarity task. We use the [SNLI](https://nlp.stanford.edu/projects/snli/) dataset in this example. For a more detailed walkthrough about data processing jump to [SNLI Data Prep](../01-prep-data/snli.ipynb). A quickstart version of this notebook can be found [here](../00-quick-start/)\n",
@@ -59,21 +51,17 @@
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "nbpresent": {
-     "id": "e91468d4-7bb8-469b-95a6-4e6f4dfcdf55"
-    }
-   },
+   "metadata": {},
    "source": [
     "## 0. Global Settings"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": 11,
    "metadata": {
-    "nbpresent": {
-     "id": "a6e277ee-edbb-44a5-81d4-93565d2f3a83"
+    "pycharm": {
+     "name": "#%%\n"
     }
    },
    "outputs": [
@@ -96,42 +84,46 @@
     "from utils_nlp.dataset.preprocess import to_lowercase, to_nltk_tokens\n",
     "from utils_nlp.dataset import snli, preprocess\n",
     "from scenarios.sentence_similarity.gensen_wrapper import GenSenClassifier\n",
-    "from utils_nlp.models.pretrained_embeddings.glove import download_and_extract \n",
+    "from utils_nlp.models.pretrained_embeddings.glove import download_and_extract\n",
+    "import scrapbook as sb\n",
     "\n",
     "\n",
-    "print(\"System version: {}\".format(sys.version))\n",
-    "BASE_DATA_PATH = '../../data'"
+    "print(\"System version: {}\".format(sys.version))"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {
+    "tags": [
+     "parameters"
+    ]
+   },
+   "outputs": [],
+   "source": [
+    "max_epoch = None\n",
+    "config_filepath = 'gensen_config.json'\n",
+    "base_data_path = '../../data'\n",
+    "nrows = None"
+   ]
+  },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "nbpresent": {
-     "id": "aee768e5-f317-4dfb-807c-cb4f5f0c0204"
-    }
-   },
+   "metadata": {},
    "source": [
     "## 1. Data Preparation and inspection"
    ]
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "nbpresent": {
-     "id": "4c570c1b-0e4e-41e9-aa27-5ab1ce8c13a1"
-    }
-   },
+   "metadata": {},
    "source": [
     "The [SNLI](https://nlp.stanford.edu/projects/snli/) corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE). "
    ]
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "nbpresent": {
-     "id": "99c241e1-2f23-4fb3-9d3c-8f479c6b0030"
-    }
-   },
+   "metadata": {},
    "source": [
     "### 1.1 Load the dataset\n",
     "\n",
@@ -152,10 +144,10 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 7,
+   "execution_count": 3,
    "metadata": {
-    "nbpresent": {
-     "id": "5952e06d-1dae-462d-8fce-66eb7ef536dd"
+    "pycharm": {
+     "name": "#%%\n"
     }
    },
    "outputs": [
@@ -337,38 +329,34 @@
        "4  2267923837.jpg#2r1e  entailment  NaN  NaN  NaN  NaN  "
       ]
      },
-     "execution_count": 7,
+     "execution_count": 3,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
-    "train = snli.load_pandas_df(BASE_DATA_PATH, file_split=\"train\")\n",
-    "dev = snli.load_pandas_df(BASE_DATA_PATH, file_split=\"dev\")\n",
-    "test = snli.load_pandas_df(BASE_DATA_PATH, file_split=\"test\")\n",
+    "train = snli.load_pandas_df(base_data_path, file_split=\"train\", nrows=nrows)\n",
+    "dev = snli.load_pandas_df(base_data_path, file_split=\"dev\", nrows=nrows)\n",
+    "test = snli.load_pandas_df(base_data_path, file_split=\"test\", nrows=nrows)\n",
     "\n",
     "train.head()"
    ]
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "nbpresent": {
-     "id": "6d5f7565-1f84-4489-8d06-dabd6bd99190"
-    }
-   },
+   "metadata": {},
    "source": [
     "### 1.2 Tokenize\n",
     "\n",
-    "We have loaded the dataset into pandas.DataFrame, we now convert sentences to tokens. We also clean the data before tokenizing. This includes dropping unneccessary columns and renaming the relevant columns as score, sentence_1, and sentence_2. Once we have the clean pandas dataframes, we do lowercase standardization and tokenization. We use the [NLTK] (https://www.nltk.org/) library for tokenization."
+    "We have loaded the dataset into pandas.DataFrame, we now convert sentences to tokens. We also clean the data before tokenizing. This includes dropping unneccessary columns and renaming the relevant columns as score, sentence_1, and sentence_2."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": 4,
    "metadata": {
-    "nbpresent": {
-     "id": "e6160617-03f0-4809-9360-8b040dc4395f"
+    "pycharm": {
+     "name": "#%%\n"
     }
    },
    "outputs": [],
@@ -385,12 +373,19 @@
     "test = clean_and_tokenize(test)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Once we have the clean pandas dataframes, we do lowercase standardization and tokenization. We use the [NLTK] (https://www.nltk.org/) library for tokenization."
+   ]
+  },
   {
    "cell_type": "code",
-   "execution_count": 9,
+   "execution_count": 5,
    "metadata": {
-    "nbpresent": {
-     "id": "4912b609-8141-4212-a6ad-814d73f724ed"
+    "pycharm": {
+     "name": "#%%\n"
     }
    },
    "outputs": [
@@ -497,7 +492,7 @@
        "4  [two, kids, at, a, ballgame, wash, their, hand...  "
       ]
      },
-     "execution_count": 9,
+     "execution_count": 5,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -508,11 +503,7 @@
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "nbpresent": {
-     "id": "59494d88-c7c9-4efc-a191-f16d6ac2ac40"
-    }
-   },
+   "metadata": {},
    "source": [
     "## 2. Model application, performance and analysis of the results\n",
     "The model has been implemented as a GenSen class with the specifics hidden inside the fit() method, so that no explicit call is needed. The algorithm operates in three different steps:\n",
@@ -538,8 +529,12 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 10,
-   "metadata": {},
+   "execution_count": 14,
+   "metadata": {
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   },
    "outputs": [
     {
      "name": "stdout",
@@ -550,69 +545,47 @@
     }
    ],
    "source": [
-    "pretrained_embedding_path = download_and_extract(BASE_DATA_PATH)"
+    "pretrained_embedding_path = download_and_extract(base_data_path)"
    ]
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "nbpresent": {
-     "id": "ab565124-43de-4862-b286-2b5db3a868fe"
-    }
-   },
+   "metadata": {},
    "source": [
     "### 2.1 Initialize Model"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 11,
+   "execution_count": 15,
    "metadata": {
     "nbpresent": {
      "id": "641a9c74-974c-4aac-8c16-3b44d686f0f3"
     },
-    "scrolled": true
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "The autoreload extension is already loaded. To reload it, use:\n",
-      "  %reload_ext autoreload\n"
-     ]
-    }
-   ],
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   },
+   "outputs": [],
    "source": [
     "%load_ext autoreload\n",
     "%autoreload 2\n",
     "\n",
-    "config_filepath = 'gensen_config.json'\n",
     "clf = GenSenClassifier(config_file = config_filepath, \n",
     "                       pretrained_embedding_path = pretrained_embedding_path,\n",
     "                       learning_rate = 0.0001, \n",
-    "                       cache_dir=BASE_DATA_PATH)"
+    "                       cache_dir=base_data_path,\n",
+    "                       max_epoch=max_epoch)"
    ]
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "nbpresent": {
-     "id": "5f87d13c-d04f-4d38-820e-fb82082153c4"
-    }
-   },
+   "metadata": {},
    "source": [
     "### 2.2 Train Model"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 12,
+   "execution_count": 8,
    "metadata": {
-    "nbpresent": {
-     "id": "6ea45671-c7a5-4fe8-a450-8b54161f26c5"
-    },
-    "scrolled": false
+    "pycharm": {
+     "name": "#%%\n"
+    }
    },
    "outputs": [
     {
@@ -621,7 +594,7 @@
     "text": [
      "/data/anaconda/envs/nlp_gpu/lib/python3.6/site-packages/torch/nn/modules/rnn.py:46: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.8 and num_layers=1\n",
      "  \"num_layers={}\".format(dropout, num_layers))\n",
-     "../../scenarios/sentence_similarity/gensen_train.py:428: UserWarning: torch.nn.utils.clip_grad_norm is now deprecated in favor of torch.nn.utils.clip_grad_norm_.\n",
+     "../../scenarios/sentence_similarity/gensen_train.py:431: UserWarning: torch.nn.utils.clip_grad_norm is now deprecated in favor of torch.nn.utils.clip_grad_norm_.\n",
      "  torch.nn.utils.clip_grad_norm(model.parameters(), 1.0)\n",
      "../../utils_nlp/models/gensen/utils.py:364: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.\n",
      "  Variable(torch.LongTensor(sorted_src_lens), volatile=True)\n",
@@ -629,13 +602,13 @@
      "  warnings.warn(\"nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.\")\n",
      "/data/anaconda/envs/nlp_gpu/lib/python3.6/site-packages/torch/nn/functional.py:1320: UserWarning: nn.functional.tanh is deprecated. Use torch.tanh instead.\n",
      "  warnings.warn(\"nn.functional.tanh is deprecated. Use torch.tanh instead.\")\n",
-     "../../scenarios/sentence_similarity/gensen_train.py:520: UserWarning: torch.nn.utils.clip_grad_norm is now deprecated in favor of torch.nn.utils.clip_grad_norm_.\n",
+     "../../scenarios/sentence_similarity/gensen_train.py:523: UserWarning: torch.nn.utils.clip_grad_norm is now deprecated in favor of torch.nn.utils.clip_grad_norm_.\n",
      "  torch.nn.utils.clip_grad_norm(model.parameters(), 1.0)\n",
      "/data/anaconda/envs/nlp_gpu/lib/python3.6/site-packages/horovod/torch/__init__.py:163: UserWarning: optimizer.step(synchronize=True) called after optimizer.synchronize(). This can cause training slowdown. You may want to consider using optimizer.step(synchronize=False) if you use optimizer.synchronize() in your code.\n",
      "  warnings.warn(\"optimizer.step(synchronize=True) called after \"\n",
-     "../../scenarios/sentence_similarity/gensen_train.py:241: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.\n",
+     "../../scenarios/sentence_similarity/gensen_train.py:243: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.\n",
      "  f.softmax(class_logits).data.cpu().numpy().argmax(axis=-1)\n",
-     "../../scenarios/sentence_similarity/gensen_train.py:260: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.\n",
+     "../../scenarios/sentence_similarity/gensen_train.py:262: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.\n",
      "  f.softmax(class_logits).data.cpu().numpy().argmax(axis=-1)\n"
     ]
    },
@@ -643,8 +616,8 @@
     "name": "stdout",
     "output_type": "stream",
     "text": [
-     "CPU times: user 29min 21s, sys: 8min 11s, total: 37min 32s\n",
-     "Wall time: 37min 29s\n"
+     "CPU times: user 1h 19min 28s, sys: 22min 1s, total: 1h 41min 30s\n",
+     "Wall time: 1h 41min 22s\n"
     ]
    }
   ],
@@ -657,13 +630,19 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### 2.3 Predict"
+    "### 2.3 Predict\n",
+    "\n",
+    "In the predict method we perform Pearson's Correlation computation [\\[2\\]](#References) on the outputs of the model. The predictions of the model can be further improved by hyperparameter tuning which we walk through in the other example [here](gensen_aml_deep_dive.ipynb). "
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 14,
-   "metadata": {},
+   "execution_count": 16,
+   "metadata": {
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   },
    "outputs": [
     {
      "name": "stdout",
@@ -671,20 +650,23 @@
     "text": [
      "******** Similarity Score for sentences **************\n",
      "          0         1\n",
-     "0  1.000000  0.936469\n",
-     "1  0.936469  1.000000\n"
+     "0  1.000000  0.966793\n",
+     "1  0.966793  1.000000\n"
     ]
    }
   ],
   "source": [
    "sentences = [\n",
-   "    'the quick brown fox jumped over the lazy dog',\n",
-   "    'bright sunshiny day tomorrow.'\n",
+   "    'The sky is blue and beautiful',\n",
+   "    'Love this blue and beautiful sky!'\n",
    "    ]\n",
    "\n",
    "results = clf.predict(sentences)\n",
    "print(\"******** Similarity Score for sentences **************\")\n",
-   "print(results)"
+   "print(results)\n",
+   "\n",
+   "# Record results with scrapbook for tests\n",
+   "sb.glue(\"results\", results.to_dict())"
   ]
  },
  {
@@ -694,11 +676,15 @@
     "## References\n",
     "\n",
     "1. Subramanian, Sandeep and Trischler, Adam and Bengio, Yoshua and Pal, Christopher J, [*Learning general purpose distributed sentence representations via large scale multi-task learning*](https://arxiv.org/abs/1804.00079), ICLR, 2018.\n",
-    "3. Semantic textual similarity. url: http://nlpprogress.com/english/semantic_textual_similarity.html"
+    "2. Pearson's Correlation Coefficient. url: https://en.wikipedia.org/wiki/Pearson_correlation_coefficient\n",
+    "3. Semantic textual similarity. url: http://nlpprogress.com/english/semantic_textual_similarity.html\n",
+    "4. Minh-Thang Luong, Quoc V Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. [*Multi-task sequence to sequence learning*](https://arxiv.org/abs/1511.06114), 2015.\n",
+    "5. Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. [*Learned in translation: Contextualized word vectors](https://arxiv.org/abs/1708.00107), 2017. "
    ]
   }
  ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python (nlp_gpu)",
   "language": "python",
@@ -715,6 +701,15 @@
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
    "version": "3.6.8"
-  }
+  },
+  "pycharm": {
+   "stem_cell": {
+    "cell_type": "raw",
+    "metadata": {
+     "collapsed": false
+    },
+    "source": []
+   }
+  }
  },
 "nbformat": 4,
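Note: the cell tagged "parameters" added above is what makes this notebook testable. papermill rewrites that cell at execution time, and the sb.glue("results", ...) call at the end of the predict cell records the similarity matrix so a test can read it back. A minimal sketch of that round trip, with illustrative paths (the real values live in the test changes further down):

    import papermill as pm
    import scrapbook as sb

    # Execute the notebook, overriding the "parameters"-tagged cell.
    pm.execute_notebook(
        "scenarios/sentence_similarity/gensen_local.ipynb",
        "output.ipynb",  # executed copy written by papermill
        parameters=dict(max_epoch=1, base_data_path="../../data"),
    )

    # sb.glue("results", ...) inside the notebook makes the matrix retrievable here.
    results = sb.read_notebook("output.ipynb").scraps.data_dict["results"]
    print(results["0"]["1"])  # pairwise Pearson score for the two example sentences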
scenarios/sentence_similarity/gensen_train.py
@@ -134,10 +134,12 @@ def evaluate(
     save_dir,
     starting_time,
     model_state,
+    max_epoch,
 ):
     """ Function to validate the model.
 
     Args:
+        max_epoch(int): Limit training to specified number of epochs.
         model_state(dict): Saved model weights.
         config(dict): Config object.
         train_iterator(BufferedDataIterator): BufferedDataIterator object.
@@ -197,7 +199,7 @@ def evaluate(
         )
         if (monitor_epoch - min_val_loss_epoch) > config["training"][
             "stop_patience"
-        ]:
+        ] or (max_epoch is not None and monitor_epoch >= max_epoch):
             logging.info("Saving model ...")
             # Save the name with validation loss.
             torch.save(
@@ -269,10 +271,11 @@ def evaluate_nli(nli_iterator, model, batch_size, n_gpus):
     logging.info("******************************************************")
 
 
-def train(config, data_folder, learning_rate=0.0001):
+def train(config, data_folder, learning_rate=0.0001, max_epoch=None):
     """ Train the Gensen model.
 
     Args:
+        max_epoch(int): Limit training to specified number of epochs.
         config(dict): Loaded json file as a python object.
         data_folder(str): Path to the folder containing the data.
         learning_rate(float): Learning rate for the model.
@@ -588,6 +591,7 @@ def train(config, data_folder, learning_rate=0.0001):
                 save_dir=save_dir,
                 starting_time=start,
                 model_state=model_state,
+                max_epoch=max_epoch,
             )
             if training_complete:
                 break
@@ -621,6 +625,12 @@ if __name__ == "__main__":
     parser.add_argument(
         "--learning_rate", type=float, default=0.0001, help="learning rate"
     )
+    parser.add_argument(
+        "--max_epoch",
+        type=int,
+        default=None,
+        help="Limit training to specified number of epochs.",
+    )
 
     args = parser.parse_args()
     data_path = args.data_folder
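The new stopping rule in evaluate() reads: stop when validation loss has not improved for stop_patience epochs, or when the optional max_epoch cap is reached. A standalone restatement of that condition, for illustration only (should_stop is not a function in this diff):

    def should_stop(monitor_epoch, min_val_loss_epoch, stop_patience, max_epoch=None):
        # Mirrors the condition added to evaluate() above.
        out_of_patience = (monitor_epoch - min_val_loss_epoch) > stop_patience
        hit_epoch_cap = max_epoch is not None and monitor_epoch >= max_epoch
        return out_of_patience or hit_epoch_cap

    assert should_stop(5, 1, 3)                # patience exhausted
    assert should_stop(1, 1, 10, max_epoch=1)  # capped, as in the notebook test
    assert not should_stop(2, 1, 10)           # keep training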
scenarios/sentence_similarity/gensen_wrapper.py
@@ -3,11 +3,11 @@
 import json
 import os
 
-import numpy as np
-import pandas as pd
-
 from scenarios.sentence_similarity.gensen_train import train
-from utils_nlp.models.gensen.create_gensen_model import create_multiseq2seq_model
+from utils_nlp.eval.classification import compute_correlation_coefficients
+from utils_nlp.models.gensen.create_gensen_model import (
+    create_multiseq2seq_model,
+)
 from utils_nlp.models.gensen.gensen import GenSenSingle
 from utils_nlp.models.gensen.preprocess_utils import gensen_preprocess
@@ -30,12 +30,14 @@ class GenSenClassifier:
         pretrained_embedding_path,
         learning_rate=0.0001,
         cache_dir=".",
+        max_epoch=None,
     ):
         self.learning_rate = learning_rate
         self.config_file = config_file
         self.cache_dir = cache_dir
         self.pretrained_embedding_path = pretrained_embedding_path
         self.model_name = "gensen_multiseq2seq"
+        self.max_epoch = max_epoch
 
         self._validate_params()
@@ -118,6 +120,7 @@ class GenSenClassifier:
             data_folder=os.path.abspath(self.cache_dir),
             config=self.config,
             learning_rate=self.learning_rate,
+            max_epoch=self.max_epoch,
         )
 
         self._create_multiseq2seq_model()
@@ -132,13 +135,13 @@ class GenSenClassifier:
             sentences(list) : List of sentences.
 
         Returns
-            array: A pairwise cosine similarity for the sentences provided based on their gensen
-            vector representations.
+            pd.Dataframe: A pairwise cosine similarity for the sentences provided based on their
+            gensen vector representations.
 
         """
 
         # self.cache_dir = os.path.join(self.cache_dir, "clean/snli_1.0")
-        self._create_multiseq2seq_model()
+        # self._create_multiseq2seq_model()
 
         gensen_model = GenSenSingle(
             model_folder=os.path.join(
@@ -149,7 +152,7 @@ class GenSenClassifier:
         )
 
         reps_h, reps_h_t = gensen_model.get_representation(
-            sentences, pool="last", return_numpy=True
+            sentences, pool="last", return_numpy=True, tokenize=True
         )
 
-        return pd.DataFrame(np.corrcoef(reps_h_t))
+        return compute_correlation_coefficients(reps_h_t)
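Taken together, the wrapper changes let a capped training run and a DataFrame of pairwise Pearson coefficients be obtained in a few lines. A hedged sketch only: the file paths below are placeholders, and the fit() call follows the notebook's usage rather than anything shown in this hunk:

    from scenarios.sentence_similarity.gensen_wrapper import GenSenClassifier

    clf = GenSenClassifier(
        config_file="gensen_config.json",              # placeholder path
        pretrained_embedding_path="path/to/glove.h5",  # placeholder path
        learning_rate=0.0001,
        cache_dir="../../data",
        max_epoch=1,  # new argument: cap training for quick test runs
    )
    # train/dev/test here are assumed to be the cleaned, tokenized DataFrames
    # produced in section 1 of the notebook.
    clf.fit(train, dev, test)
    results = clf.predict(
        ["The sky is blue and beautiful", "Love this blue and beautiful sky!"]
    )
    # predict() now returns a pandas DataFrame via compute_correlation_coefficients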
tests — notebooks path fixture (file path not shown in this view)
@@ -31,7 +31,12 @@ def notebooks():
         "embedding_trainer": os.path.join(
             folder_notebooks, "embeddings", "embedding_trainer.ipynb"
         ),
-        "bert_encoder": os.path.join(folder_notebooks, "sentence_similarity", "bert_encoder.ipynb")
+        "bert_encoder": os.path.join(
+            folder_notebooks, "sentence_similarity", "bert_encoder.ipynb"
+        ),
+        "gensen_local": os.path.join(
+            folder_notebooks, "sentence_similarity", "gensen_local.ipynb"
+        ),
     }
     return paths
tests — sentence similarity notebook tests (file path not shown in this view)
@@ -5,11 +5,14 @@ import sys
 import pytest
 import papermill as pm
+import scrapbook as sb
 
 from azureml.core import Experiment
 from azureml.core.run import Run
 from utils_nlp.azureml.azureml_utils import get_or_create_workspace
 from tests.notebooks_common import OUTPUT_NOTEBOOK
 
 
 sys.path.append("../../")
 ABS_TOL = 0.2
+ABS_TOL_PEARSONS = 0.05
 
 
 @pytest.fixture(scope="module")
@@ -42,3 +45,65 @@ def test_similarity_embeddings_baseline_runs(notebooks, baseline_results):
     for key, value in baseline_results.items():
         assert results[key] == pytest.approx(value, abs=ABS_TOL)
+
+
+@pytest.mark.notebooks
+@pytest.mark.gpu
+def test_similarity_senteval_local_runs(notebooks, gensen_senteval_results):
+    notebook_path = notebooks["senteval_local"]
+    pm.execute_notebook(
+        notebook_path,
+        OUTPUT_NOTEBOOK,
+        parameters=dict(
+            PATH_TO_SENTEVAL="../SentEval", PATH_TO_GENSEN="../gensen"
+        ),
+    )
+    out = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict["results"]
+    for key, val in gensen_senteval_results.items():
+        for task, result in val.items():
+            assert out[key][task] == result
+
+
+@pytest.mark.notebooks
+@pytest.mark.azureml
+def test_similarity_senteval_azureml_runs(notebooks, gensen_senteval_results):
+    notebook_path = notebooks["senteval_azureml"]
+    pm.execute_notebook(
+        notebook_path,
+        OUTPUT_NOTEBOOK,
+        parameters=dict(
+            PATH_TO_SENTEVAL="../SentEval",
+            PATH_TO_GENSEN="../gensen",
+            PATH_TO_SER="utils_nlp/eval/senteval.py",
+            AZUREML_VERBOSE=False,
+            config_path="tests/ci",
+        ),
+    )
+    result = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict
+    ws = get_or_create_workspace(config_path="tests/ci")
+    experiment = Experiment(ws, name=result["experiment_name"])
+    run = Run(experiment, result["run_id"])
+    assert run.get_metrics()["STSBenchmark::pearson"] == pytest.approx(
+        gensen_senteval_results["pearson"]["STSBenchmark"], abs=ABS_TOL
+    )
+
+
+@pytest.mark.notebooks
+@pytest.mark.gpu
+def test_gensen_local(notebooks):
+    notebook_path = notebooks["gensen_local"]
+    pm.execute_notebook(
+        notebook_path,
+        OUTPUT_NOTEBOOK,
+        parameters=dict(
+            max_epoch=1,
+            config_filepath="../../scenarios/sentence_similarity/gensen_config.json",
+            base_data_path="../../data",
+        ),
+    )
+
+    results = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.data_dict["results"]
+    expected = {"0": {"0": 1, "1": 0.95}, "1": {"0": 0.95, "1": 1}}
+
+    for key, value in expected.items():
+        for k, v in value.items():
+            assert results[key][k] == pytest.approx(v, abs=ABS_TOL_PEARSONS)
tests — new unit test for compute_correlation_coefficients (file path not shown in this view)
@@ -0,0 +1,16 @@
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+
+import numpy as np
+
+from utils_nlp.eval.classification import compute_correlation_coefficients
+
+
+def test_compute():
+    x = np.random.rand(2, 100)
+    df = compute_correlation_coefficients(x)
+    assert df.shape == (2, 2)
+
+    y = np.random.rand(2, 100)
+    df = compute_correlation_coefficients(x, y)
+    assert df.shape == (4, 4)
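The (4, 4) expectation follows from numpy's stacking behavior: corrcoef treats each row as a variable, and when y is supplied its rows are appended below x's, so two 2-row inputs yield four variables. A quick illustration:

    import numpy as np

    x = np.random.rand(2, 100)  # 2 variables, 100 observations each
    y = np.random.rand(2, 100)

    print(np.corrcoef(x).shape)     # (2, 2): correlations among x's rows
    print(np.corrcoef(x, y).shape)  # (4, 4): x's and y's rows combined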
utils_nlp/eval/classification.py
@@ -8,6 +8,9 @@ from sklearn.metrics import (
     f1_score,
 )
 
+from numpy import corrcoef
+import pandas as pd
+
 
 def eval_classification(actual, predicted, round_decimals=4):
     """Returns common classification evaluation metrics.
@@ -32,3 +35,23 @@ def eval_classification(actual, predicted, round_decimals=4):
             f1_score(actual, predicted, average=None).round(round_decimals)
         ),
     }
+
+
+def compute_correlation_coefficients(x, y=None):
+    """
+    Compute Pearson product-moment correlation coefficients.
+
+    Args:
+        x: array_like
+           A 1-D or 2-D array containing multiple variables and observations.
+           Each row of `x` represents a variable, and each column a single
+           observation of all those variables.
+
+        y: array_like, optional
+           An additional set of variables and observations. `y` has the same
+           shape as `x`.
+
+    Returns:
+        pd.DataFrame : A pandas dataframe from the correlation coefficient matrix of the variables.
+    """
+    return pd.DataFrame(corrcoef(x, y))
utils_nlp/models/gensen/utils.py
@@ -393,7 +393,7 @@ class NLIIterator(DataIterator):
         test(torch.Tensor): Testing dataset.
         vocab_size(int): The size of the vocabulary.
         lowercase(bool): If lowercase the dataset.
-        vocab(list): The list of the vocabulary.
+        vocab(Union[bytes,str): The list of the vocabulary.
     """
     self.seed = seed
     self.train = train