{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Training Script\n",
    "In this notebook, we create the training script whose hyperparameters will be tuned. This script is stored alone in a `scripts` directory, both for ease of reference and because the Azure ML SDK limits the contents of this directory to at most 300 MB.\n",
    "\n",
    "The notebook cells are each appended in turn to the training script, so it is essential that you run the notebook's cells _in order_ for the script to run correctly. If you edit this notebook's cells, be sure to preserve the blank lines at the start and end of the cells, as they prevent the contents of consecutive cells from being improperly concatenated.\n",
    "\n",
    "The script sections are\n",
    "- [import libraries](#import),\n",
    "- [define utility functions and classes](#utility),\n",
    "- [define the script input parameters](#parameters),\n",
    "- [load and prepare the training and tuning data](#data),\n",
    "- [define the training pipeline](#pipeline),\n",
    "- [train the model](#train),\n",
    "- [score the tuning data](#score), and\n",
    "- [compute the tuning data performance](#performance).\n",
    "\n",
    "[The cell following the script](#run) runs that script using the training and tuning data created by [the first notebook](00_Data_Prep.ipynb).\n",
    "\n",
    "Start by creating the `scripts` directory, if it does not already exist."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.makedirs(\"scripts\", exist_ok=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load libraries <a id='import'></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile scripts/TrainClassifier.py\n",
    "\n",
    "from __future__ import print_function\n",
    "import os\n",
    "import warnings\n",
    "import argparse\n",
    "import json\n",
    "import pandas as pd\n",
    "from itertools import groupby\n",
    "import lightgbm as lgb\n",
    "from sklearn.feature_extraction import text\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.compose import ColumnTransformer\n",
    "import joblib\n",
    "from azureml.core import Run\n",
    "import azureml.core\n",
    "\n",
    "warnings.filterwarnings(action='ignore', category=UserWarning, module='lightgbm')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define utility functions and classes <a id='utility'></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile --append scripts/TrainClassifier.py\n",
    "\n",
    "def log_evaluation(logger, metric_index=0, period=1):\n",
    "    \"\"\"Create a callback that logs the evaluation results.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    logger : function\n",
    "        The logging function, e.g. a run logger's ``log`` method.\n",
    "    metric_index : int, optional (default=0)\n",
    "        The index of the evaluation result to log.\n",
    "    period : int, optional (default=1)\n",
    "        The period to print the evaluation results.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    callback : function\n",
    "        The callback that logs the evaluation results every ``period`` iteration(s).\n",
    "    \"\"\"\n",
    "    def callback(env):\n",
    "        \"\"\"internal function\"\"\"\n",
    "        if period > 0 and env.evaluation_result_list and (env.iteration + 1) % period == 0:\n",
    "            value = env.evaluation_result_list[metric_index]\n",
    "            logger(value[1], value[2])\n",
    "    callback.order = 10\n",
    "    return callback\n",
    "\n",
    "\n",
    "def cumulative_gain_metric(groups, max_gain, score_at=1, metric_name=None):\n",
    "    \"\"\"Return a function that computes the normalized cumulative gain metric.\"\"\"\n",
    "    if metric_name is None:\n",
    "        eval_name = \"gain@\" + str(score_at)\n",
    "    else:\n",
    "        eval_name = metric_name\n",
    "\n",
    "    def cumulative_gain(y_true, y_pred, weight):\n",
    "        \"\"\"\n",
    "        Compute the normalized cumulative gain. Returns the tuple:\n",
    "        (eval_name, eval_result, is_bigger_better)\n",
    "        This function assumes the data are sorted by groups.\n",
    "        \"\"\"\n",
    "        gain = sum([sum([v\n",
    "                         for _, _, v in sorted(g,\n",
    "                                               key=lambda x: x[1],\n",
    "                                               reverse=True)[:score_at]])\n",
    "                    for _, g in groupby(zip(groups, y_pred, y_true),\n",
    "                                        key=lambda x: x[0])])\n",
    "        eval_result = gain / max_gain\n",
    "        return (eval_name, eval_result, True)\n",
    "\n",
    "    return cumulative_gain\n"
   ]
  },
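  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what the normalized cumulative gain measures, here is a small, self-contained sketch (not part of the training script) that applies the same groupby logic to toy data; the group ids, scores, and labels are invented for illustration.\n",
    "\n",
    "```python\n",
    "from itertools import groupby\n",
    "\n",
    "# Toy data, sorted by group id: two groups of candidate matches.\n",
    "groups = [1, 1, 1, 2, 2, 2]              # question group ids\n",
    "y_pred = [0.9, 0.2, 0.5, 0.1, 0.8, 0.3]  # predicted match scores\n",
    "y_true = [1, 0, 0, 0, 1, 0]              # 1 marks the correct match\n",
    "\n",
    "# gain@1: for each group, count the true labels among the top-1 scored rows.\n",
    "gain = sum(sum(v for _, _, v in sorted(g, key=lambda x: x[1], reverse=True)[:1])\n",
    "           for _, g in groupby(zip(groups, y_pred, y_true), key=lambda x: x[0]))\n",
    "print(gain / sum(y_true))  # 1.0: the top-scored row is correct in both groups\n",
    "```"
   ]
  },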
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define the input parameters <a id='parameters'></a>\n",
    "One of the most important parameters is `estimators`, the number of estimators, which lets you trade off model gain, modeling time, and model size. The table below should give you an idea of the relationships between the number of estimators and these metrics. The default value is 100.\n",
    "\n",
    "| Estimators | Run time (s) | Size (MB) | Gain@1 | Gain@2 | Gain@3 |\n",
    "|------------|--------------|-----------|--------|--------|--------|\n",
    "| 100        | 40           | 2         | 25.02% | 38.72% | 47.83% |\n",
    "| 1000       | 177          | 4         | 46.79% | 60.80% | 69.11% |\n",
    "| 2000       | 359          | 7         | 51.38% | 65.93% | 73.09% |\n",
    "| 4000       | 628          | 12        | 53.39% | 67.40% | 74.74% |\n",
    "| 8000       | 904          | 22        | 54.62% | 67.77% | 75.35% |\n",
    "\n",
    "Other parameters that may be useful to tune include the following:\n",
    "* `ngrams`: the maximum n-gram size for features, an integer of at least 1 (default 1),\n",
    "* `min_child_samples`: the minimum number of samples in a leaf, an integer of at least 1 (default 20),\n",
    "* `match`: the maximum number of training examples per duplicate question, an integer of at least 2 (default 20), and\n",
    "* `unweighted`: whether to forgo the sample weights that compensate for unbalanced data, 'Yes' or 'No' (default 'No', i.e. weighted)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile --append scripts/TrainClassifier.py\n",
    "\n",
    "if __name__ == '__main__':\n",
    "    \n",
    "    print('azureml.core.VERSION={}'.format(azureml.core.VERSION))\n",
    "    \n",
    "    parser = argparse.ArgumentParser(description='Fit and evaluate a model'\n",
    "                                     ' based on train and tune datasets.')\n",
    "    parser.add_argument('--data-folder', help='the path to the data',\n",
    "                        dest='data_folder', default='.')\n",
    "    parser.add_argument('--inputs', help='the inputs directory',\n",
    "                        default='data')\n",
    "    parser.add_argument('--data', help='the training dataset name',\n",
    "                        default='balanced_pairs_train.tsv')\n",
    "    parser.add_argument('--tune', help='the tune dataset name',\n",
    "                        default='balanced_pairs_tune.tsv')\n",
    "    parser.add_argument('--estimators',\n",
    "                        help='the number of learner estimators',\n",
    "                        type=int, default=100)\n",
    "    parser.add_argument('--min_child_samples',\n",
    "                        help='the minimum number of samples in a child (leaf)',\n",
    "                        type=int, default=20)\n",
    "    parser.add_argument('--ngrams',\n",
    "                        help='the maximum size of word ngrams',\n",
    "                        type=int, default=1)\n",
    "    parser.add_argument('--match',\n",
    "                        help='the maximum number of duplicate matches',\n",
    "                        type=int, default=20)\n",
    "    parser.add_argument('--unweighted',\n",
    "                        help='whether or not to use instance weights',\n",
    "                        default='No')\n",
    "    parser.add_argument('--period',\n",
    "                        help='the period for performance reporting',\n",
    "                        type=int, default=100)\n",
    "    parser.add_argument(\"--rank\", help=\"the maximum rank of a correct match\",\n",
    "                        type=int, default=3)\n",
    "    parser.add_argument('--outputs', help='the outputs directory',\n",
    "                        default='outputs')\n",
    "    parser.add_argument('--save', help='the model file base name', default='None')\n",
    "    parser.add_argument('--verbose',\n",
    "                        help='the verbosity of the estimator',\n",
    "                        type=int, default=-1)\n",
    "    parser.add_argument('--input-steps-data', dest=\"input_steps_data\",\n",
    "                        help='to share data between different steps in a pipeline',\n",
    "                        default='data')\n",
    "    parser.add_argument('--hyperparameters',\n",
    "                        help='hyperparameter config file base name', default='None')\n",
    "    args = parser.parse_args()\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load and prepare the training data <a id='data'></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile --append scripts/TrainClassifier.py\n",
    "\n",
    "    # Get a run logger.\n",
    "    run = Run.get_context()\n",
    "\n",
    "    # What to name the metric logged\n",
    "    score_at = args.rank\n",
    "    metric_name = \"gain@\" + str(score_at)\n",
    "\n",
    "    print('Prepare the training data.')\n",
    "    \n",
    "    # Paths to the input data.\n",
    "    data_folder_path = args.data_folder\n",
    "    inputs_path = os.path.join(data_folder_path, args.inputs)\n",
    "    data_path = os.path.join(inputs_path, args.data)\n",
    "    tune_path = os.path.join(inputs_path, args.tune)\n",
    "\n",
    "    # Paths for the output data.\n",
    "    outputs_path = args.outputs\n",
    "    model_path = os.path.join(outputs_path, '{}.pkl'.format(args.save))\n",
    "    \n",
    "    # Paths for steps data\n",
    "    hyperparameters_path = os.path.join(args.input_steps_data, args.hyperparameters)\n",
    "\n",
    "    # Create the outputs folder.\n",
    "    os.makedirs(outputs_path, exist_ok=True)\n",
    "    \n",
    "    # Create a dict of hyperparameters from the input flags.\n",
    "    hyperparameters = {\n",
    "        \"estimators\": args.estimators,\n",
    "        \"ngrams\": args.ngrams,\n",
    "        \"min_child_samples\": args.min_child_samples,\n",
    "        \"match\": args.match,\n",
    "        \"unweighted\": args.unweighted\n",
    "    }\n",
    "    \n",
    "    # Update the hyperparameters with values from a config file.\n",
    "    if args.hyperparameters != \"None\" and os.path.isfile(hyperparameters_path):\n",
    "        with open(hyperparameters_path) as fp:\n",
    "            hyperparameters_config = json.load(fp)\n",
    "        for key in hyperparameters.keys():\n",
    "            if key in hyperparameters_config:\n",
    "                hyperparameters_config[key] = type(hyperparameters[key])(hyperparameters_config[key])\n",
    "        hyperparameters.update(hyperparameters_config)\n",
    "\n",
    "    # Define the input data columns.\n",
    "    feature_columns = ['Text_x', 'Text_y']\n",
    "    label_column = 'Label'\n",
    "    group_column = 'Id_x'\n",
    "    dupes_answerid_column = 'AnswerId_x'\n",
    "    questions_answerid_column = 'AnswerId_y'\n",
    "    name_columns = ['Id_x', 'Id_y']\n",
    "    weight_column = 'Weight'\n",
    "\n",
    "    # Load the training data.\n",
    "    print('Reading {}'.format(data_path))\n",
    "    train = pd.read_csv(data_path, sep='\\t', encoding='latin1')\n",
    "\n",
    "    # Limit the number of training duplicate matches.\n",
    "    train = train[train.n < hyperparameters[\"match\"]]\n",
    "    \n",
    "    # Sort the data by the group column (needed for computing the gain)\n",
    "    train.sort_values(group_column, inplace=True)\n",
    "\n",
    "    # Report on the dataset.\n",
    "    print('train: {:,} rows with {:.2%} matches'\n",
    "          .format(train.shape[0], train[label_column].mean()))\n",
    "    \n",
    "    # Load the tuning data.\n",
    "    print('Reading {}'.format(tune_path))\n",
    "    tune = pd.read_csv(tune_path, sep='\\t', encoding='latin1')\n",
    "    \n",
    "    # Sort the data by the group column (needed for computing the gain)\n",
    "    tune.sort_values(group_column, inplace=True)\n",
    "\n",
    "    # Report on the dataset.\n",
    "    print('tune: {:,} rows with {:.2%} matches'\n",
    "          .format(tune.shape[0], tune[label_column].mean()))\n",
    "    \n",
    "    # Compute instance weights.\n",
    "    if hyperparameters[\"unweighted\"] == 'Yes':\n",
    "        print('No sample weights.')\n",
    "        labels = train[label_column].unique()\n",
    "        weight = pd.Series([1.0] * labels.shape[0], labels)\n",
    "    else:\n",
    "        print('Using sample weights.')\n",
    "        label_counts = train[label_column].value_counts()\n",
    "        weight = train.shape[0] / (label_counts.shape[0] * label_counts)\n",
    "        print(weight)\n",
    "    train[weight_column] = train[label_column].apply(lambda x: weight[x])\n",
    "\n",
    "    # Select and format the training data.\n",
    "    train_X = train[feature_columns]\n",
    "    train_y = train[label_column]\n",
    "    train_group = train[group_column]\n",
    "    train_sample_weight = train[weight_column]\n",
    "    train_names = train[name_columns]\n",
    "    tune_X = tune[feature_columns]\n",
    "    tune_y = tune[label_column]\n",
    "    tune_group = tune[group_column]\n",
    "    "
   ]
  },
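  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check on the weighting formula above, here is a minimal sketch with made-up label counts; it mirrors the script's `n_rows / (n_classes * count)` computation, so the rare positive class gets a proportionally larger weight.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical label counts: 90 negatives and 10 positives out of 100 rows.\n",
    "label_counts = pd.Series({0: 90, 1: 10})\n",
    "n_rows = label_counts.sum()\n",
    "\n",
    "# Same formula as the script: n_rows / (n_classes * count_per_class).\n",
    "weight = n_rows / (label_counts.shape[0] * label_counts)\n",
    "print(weight)  # 0 -> ~0.56, 1 -> 5.0: each class now carries equal total weight\n",
    "```"
   ]
  },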
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define the featurization and estimator <a id='pipeline'></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile --append scripts/TrainClassifier.py\n",
    "\n",
    "    print('Define the model pipeline.')\n",
    "\n",
    "    # Select the training hyperparameters.\n",
    "    n_estimators = hyperparameters[\"estimators\"]\n",
    "    min_child_samples = hyperparameters[\"min_child_samples\"]\n",
    "    if hyperparameters[\"ngrams\"] > 0:\n",
    "        ngram_range = (1, hyperparameters[\"ngrams\"])\n",
    "    else:\n",
    "        ngram_range = None\n",
    "    period = args.period\n",
    "\n",
    "    # Verify that the hyperparameter settings are valid.\n",
    "    if n_estimators <= 0:\n",
    "        raise Exception('n_estimators must be > 0')\n",
    "    if min_child_samples <= 0:\n",
    "        raise Exception('min_child_samples must be > 0')\n",
    "    if (ngram_range is None\n",
    "            or type(ngram_range) is not tuple\n",
    "            or len(ngram_range) != 2\n",
    "            or ngram_range[0] < 1\n",
    "            or ngram_range[0] > ngram_range[1]):\n",
    "        raise Exception('ngram_range must be a tuple with two integers (a, b) where a > 0 and a <= b')\n",
    "\n",
    "    # Define the featurization pipeline\n",
    "    featurization = [\n",
    "        (column, text.TfidfVectorizer(ngram_range=ngram_range), column)\n",
    "        for column in feature_columns]\n",
    "    features = ColumnTransformer(featurization)\n",
    "\n",
    "    # Define the estimator.\n",
    "    estimator = lgb.LGBMClassifier(n_estimators=n_estimators,\n",
    "                                   min_child_samples=min_child_samples,\n",
    "                                   verbose=args.verbose)\n",
    "\n",
    "    # Put them together into the model pipeline.\n",
    "    model = Pipeline([\n",
    "        ('features', features),\n",
    "        ('estimator', estimator)\n",
    "    ])\n",
    "    \n",
    "    # Report the featurization.\n",
    "    print('Estimators={:,}'.format(n_estimators))\n",
    "    print('Ngram range={}'.format(ngram_range))\n",
    "    print('Min child samples={}'.format(min_child_samples))\n",
    "    "
   ]
  },
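  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The featurization applies a separate `TfidfVectorizer` to each text column and concatenates the results. This minimal sketch, with an invented two-row DataFrame, shows the shape of the stacked feature matrix.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "from sklearn.compose import ColumnTransformer\n",
    "from sklearn.feature_extraction import text\n",
    "\n",
    "toy = pd.DataFrame({'Text_x': ['how to sort a list', 'read a csv file'],\n",
    "                    'Text_y': ['sorting lists in python', 'pandas read_csv usage']})\n",
    "features = ColumnTransformer([\n",
    "    (column, text.TfidfVectorizer(ngram_range=(1, 1)), column)\n",
    "    for column in ['Text_x', 'Text_y']])\n",
    "X = features.fit_transform(toy)\n",
    "print(X.shape)  # (2, n_terms): one tf-idf block per text column, side by side\n",
    "```"
   ]
  },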
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train the model <a id='train'></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile --append scripts/TrainClassifier.py\n",
    "\n",
    "    print('Fitting the model.')\n",
    "\n",
    "    # Featurize the train and tune dataset. It's important to only fit the\n",
    "    # featurizer on the training data, so that the tuning data is treated the\n",
    "    # same way the testing data will be later on.\n",
    "    train_X_features = model.named_steps[\"features\"].fit_transform(train_X)\n",
    "    tune_X_features = model.named_steps[\"features\"].transform(tune_X)\n",
    "\n",
    "    # Fit the model.\n",
    "    eval_callback = log_evaluation(run.log, metric_index=1, period=period)\n",
    "    max_gain = tune_y.sum()\n",
    "    eval_metric = cumulative_gain_metric(\n",
    "        tune_group, max_gain, score_at=score_at, metric_name=metric_name)\n",
    "    model.named_steps[\"estimator\"].fit(\n",
    "        train_X_features, train_y, sample_weight=train_sample_weight,\n",
    "        feature_name=model.named_steps[\"features\"].get_feature_names(),\n",
    "        eval_set=[(tune_X_features, tune_y)], eval_names=[\"tune\"],\n",
    "        eval_metric=eval_metric,\n",
    "        callbacks=[eval_callback], verbose=False\n",
    "    )\n",
    "    if period > 0 and n_estimators % period != 0:\n",
    "        metric_value = (model.named_steps[\"estimator\"]\n",
    "                        .evals_result_[\"tune\"][metric_name][-1])\n",
    "        run.log(metric_name, metric_value)\n",
    "\n",
    "    # Write the model to file.\n",
    "    if args.save != 'None':\n",
    "        print('Saving the model to {}'.format(model_path))\n",
    "        joblib.dump(model, model_path)\n",
    "        print('{}: {:.2f} MB'\n",
    "              .format(model_path, os.path.getsize(model_path)/(2**20)))"
   ]
  },
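  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the script has been run with `--save model`, the pickled pipeline can be reloaded for ad-hoc scoring. A minimal sketch, assuming the `outputs/model.pkl` file written above and an invented question pair:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "import joblib\n",
    "\n",
    "model = joblib.load('outputs/model.pkl')  # path produced by --save model\n",
    "pair = pd.DataFrame({'Text_x': ['how do I merge two dicts'],\n",
    "                     'Text_y': ['merging dictionaries in python']})\n",
    "print(model.predict_proba(pair)[:, 1])  # estimated probability of a match\n",
    "```"
   ]
  },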
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Run the script to see that it works <a id='run'></a>\n",
    "Set the effort to expend on training the classifier, i.e. the number of estimators."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "parameters"
    ]
   },
   "outputs": [],
   "source": [
    "estimators = 1000"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run the classifier script. This should take about 10 minutes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "%run -t scripts/TrainClassifier.py --estimators $estimators --match 5 --ngrams 2 --min_child_samples 10 --save model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In [the next notebook](02_Testing_Script.ipynb), we create and run the testing script."
   ]
  }
 ],
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.6.7"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
}
|