{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Testing Script\n",
"In this notebook, we create the testing script for a trained model. This script is stored alone in a `scripts` directory both for ease of reference and because the Azure ML SDK limits the contents of this directory to at most 300 MB.\n",
"\n",
"The notebook cells are each appended in turn to the testing script, so it is essential that you run the notebook's cells _in order_ for the script to run correctly. If you edit this notebook's cells, be sure to preserve the blank lines at the start and end of the cells, as they prevent the contents of consecutive cells from being improperly concatenated.\n",
"\n",
"The script sections are\n",
"- [import libraries](#import),\n",
"- [define utility functions and classes](#utility),\n",
"- [define the script input parameters](#parameters),\n",
"- [load and prepare the testing data](#data),\n",
"- [load the trained pipeline](#pipeline),\n",
"- [score the test data](#score), and\n",
"- [compute the trained pipeline's performance](#performance).\n",
"\n",
"Start by creating the `scripts` directory, if it does not already exist."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"os.makedirs(\"scripts\", exist_ok=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load libraries <a id='import'></a>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile scripts/TestClassifier.py\n",
"\n",
"from __future__ import print_function\n",
"import os\n",
"import argparse\n",
"import pandas as pd\n",
"from itertools import groupby\n",
"import joblib\n",
"from azureml.core import Run\n",
"import azureml.core\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define utility functions and classes <a id='utility'></a>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile --append scripts/TestClassifier.py\n",
"\n",
"def cumulative_gain(y_true, y_pred, groups, max_gain=1.0, score_at=1):\n",
"    \"\"\"\n",
"    Compute the normalized cumulative gain.\n",
"    This function assumes the data are sorted by groups.\n",
"    \"\"\"\n",
"    gain = sum([sum([v\n",
"                     for _, _, v in sorted(g,\n",
"                                           key=lambda x: x[1],\n",
"                                           reverse=True)[:score_at]])\n",
"                for _, g in groupby(zip(groups, y_pred, y_true),\n",
"                                    key=lambda x: x[0])])\n",
"    eval_result = gain / max_gain\n",
"    return eval_result\n"
]
},
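{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following cell is illustrative only and is not appended to the script. It is a minimal, self-contained sketch, on hypothetical toy data, of what `cumulative_gain` computes: for each group of candidate matches, take the labels of the top `score_at` candidates by predicted score, sum them, and normalize by `max_gain`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from itertools import groupby\n",
"\n",
"# Hypothetical toy data: two questions (groups), each with three candidate answers.\n",
"# A label of 1 marks the correct match; gain@1 asks whether the top-scored candidate is correct.\n",
"toy_groups = ['q1', 'q1', 'q1', 'q2', 'q2', 'q2']\n",
"toy_scores = [0.9, 0.2, 0.1, 0.3, 0.8, 0.4]\n",
"toy_labels = [1, 0, 0, 0, 0, 1]\n",
"\n",
"# For each group, sum the labels of the top-1 scored candidate, then normalize by the\n",
"# total number of correct matches (the same computation cumulative_gain performs).\n",
"toy_gain = sum(sum(v for _, _, v in sorted(g, key=lambda x: x[1], reverse=True)[:1])\n",
"               for _, g in groupby(zip(toy_groups, toy_scores, toy_labels),\n",
"                                   key=lambda x: x[0]))\n",
"print(toy_gain / sum(toy_labels))  # 0.5: q1's top candidate is correct, q2's is not"
]
},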
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define the input parameters <a id='parameters'></a>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile --append scripts/TestClassifier.py\n",
"\n",
"if __name__ == '__main__':\n",
"    print('azureml.core.VERSION={}'.format(azureml.core.VERSION))\n",
"\n",
"    parser = argparse.ArgumentParser(description='Test a model.')\n",
"    parser.add_argument('--data-folder', help='the path to the data',\n",
"                        dest='data_folder', default='.')\n",
"    parser.add_argument('--inputs', help='the inputs directory',\n",
"                        default='data')\n",
"    parser.add_argument('--test', help='the test dataset name',\n",
"                        default='balanced_pairs_test.tsv')\n",
"    parser.add_argument('--outputs', help='the outputs directory',\n",
"                        default='outputs')\n",
"    parser.add_argument('--model', help='the model file base name',\n",
"                        default='model')\n",
"    parser.add_argument(\"--rank\", help=\"the maximum rank of a correct match\",\n",
"                        type=int, default=3)\n",
"    args = parser.parse_args()\n",
"    "
]
},
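{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick, illustrative check (not appended to the script; the `default_*` names below are just local stand-ins for the defaults above): with the default arguments, the test data and the trained model would be read from the paths printed below. Any of these defaults can be overridden on the command line, as the [run cell](#run) at the end of this notebook does for `--rank`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# Mirror the script's default arguments to show the paths it would use.\n",
"default_data_folder, default_inputs, default_test = '.', 'data', 'balanced_pairs_test.tsv'\n",
"default_outputs, default_model = 'outputs', 'model'\n",
"print(os.path.join(default_data_folder, default_inputs, default_test))\n",
"print(os.path.join(default_outputs, '{}.pkl'.format(default_model)))"
]
},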
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load and prepare the testing data <a id='data'></a>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile --append scripts/TestClassifier.py\n",
"\n",
"    # Get a run logger.\n",
"    run = Run.get_context()\n",
"\n",
"    # What to name the logged metric.\n",
"    metric_name = \"accuracy\"\n",
"\n",
"    print('Prepare the testing data.')\n",
"\n",
"    # Paths to the input data.\n",
"    data_path = args.data_folder\n",
"    inputs_path = os.path.join(data_path, args.inputs)\n",
"    test_path = os.path.join(inputs_path, args.test)\n",
"\n",
"    # Define the input data columns.\n",
"    feature_columns = ['Text_x', 'Text_y']\n",
"    label_column = 'Label'\n",
"    group_column = 'Id_x'\n",
"    dupes_answerid_column = 'AnswerId_x'\n",
"    questions_answerid_column = 'AnswerId_y'\n",
"    score_column = 'score'\n",
"\n",
"    # Load the testing data.\n",
"    print('Reading {}'.format(test_path))\n",
"    test = pd.read_csv(test_path, sep='\\t', encoding='latin1')\n",
"\n",
"    # Sort the data by groups.\n",
"    test.sort_values(group_column, inplace=True)\n",
"\n",
"    # Report on the dataset.\n",
"    print('test: {:,} rows with {:.2%} matches'\n",
"          .format(test.shape[0], test[label_column].mean()))\n",
"\n",
"    # Select and format the testing data.\n",
"    test_X = test[feature_columns]\n",
"    test_y = test[label_column]\n",
"    "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the trained pipeline <a id='pipeline'></a>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile --append scripts/TestClassifier.py\n",
"\n",
"    print('Load the model pipeline.')\n",
"\n",
"    # Paths for the model data.\n",
"    outputs_path = args.outputs\n",
"    model_path = os.path.join(outputs_path, '{}.pkl'.format(args.model))\n",
"\n",
"    print('Loading the model from {}'.format(model_path))\n",
"    model = joblib.load(model_path)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Score the test data using the model <a id='score'></a>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile --append scripts/TestClassifier.py\n",
"\n",
"    # Collect the model predictions.\n",
"    print('Scoring the test data.')\n",
"    test[score_column] = model.predict_proba(test_X)[:, 1]\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Report the model's performance statistics on the test data <a id='performance'></a>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile --append scripts/TestClassifier.py\n",
"\n",
"    print(\"Evaluating the model's performance on the test data.\")\n",
"\n",
"    metric_name = \"gain\"\n",
"    max_gain = test[label_column].sum()\n",
"    for i in range(1, args.rank + 1):\n",
"        gain = cumulative_gain(y_true=test[label_column].values,\n",
"                               y_pred=test[score_column].values,\n",
"                               groups=test[group_column].values,\n",
"                               max_gain=max_gain,\n",
"                               score_at=i)\n",
"        print('{}@{} = {:.2%}'.format(metric_name, i, gain))\n",
"\n",
"        # Log the gain@rank.\n",
"        run.log(\"{}@{}\".format(metric_name, i), gain)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Run the script to see that it works <a id='run'></a>\n",
"This will take about a minute."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%run -t scripts/TestClassifier.py --rank 5"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In [the next notebook](03_Run_Locally.ipynb), we set up and use the AML SDK to run the training script."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}