{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# HyperDrive Random Search\n",
"In this notebook, we create an AML cluster, and use it to search for the best set of hyperparameters for the model.\n",
"\n",
"The steps in this notebook are\n",
"- [import libraries](#import),\n",
"- [read in the Azure ML workspace](#workspace),\n",
"- [create an AML cluster](#cluster),\n",
"- [upload the data to the cloud](#upload),\n",
"- [define a hyperparameter search configuration](#configuration),\n",
"- [create an estimator](#estimator),\n",
"- [submit the estimator](#submit), and\n",
"- [get the results](#results).\n",
"\n",
"## Imports <a id='import'></a>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import pandas as pd\n",
"import time\n",
"from azureml.core import Workspace, Experiment\n",
"from azureml.core.compute import ComputeTarget, AmlCompute\n",
"from azureml.core.compute_target import ComputeTargetException\n",
"from azureml.train.estimator import Estimator\n",
"from azureml.train.hyperdrive import (\n",
"    RandomParameterSampling, choice, PrimaryMetricGoal,\n",
"    HyperDriveConfig, MedianStoppingPolicy)\n",
"from azureml.widgets import RunDetails\n",
"import azureml.core\n",
"from get_auth import get_auth\n",
"print('azureml.core.VERSION={}'.format(azureml.core.VERSION))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read in the Azure ML workspace <a id='workspace'></a>\n",
"Read in the workspace created in a previous notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"auth = get_auth()\n",
"ws = Workspace.from_config(auth=auth)\n",
"ws_details = ws.get_details()\n",
"print('Name:\\t\\t{}\\nLocation:\\t{}'\n",
"      .format(ws_details['name'],\n",
"              ws_details['location']))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create an AML cluster <a id='cluster'></a>\n",
"Define the properties of the cluster needed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cluster_name = 'hypetuning'\n",
"provisioning_config = AmlCompute.provisioning_configuration(\n",
"    vm_size='Standard_D4_v2',\n",
"    # vm_priority = 'lowpriority', # optional\n",
"    max_nodes=16)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create the configured cluster if it doesn't already exist, or retrieve it if it does exist. Creation can take about a minute."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"if cluster_name in ws.compute_targets:\n",
"    compute_target = ws.compute_targets[cluster_name]\n",
"    if type(compute_target) is not AmlCompute:\n",
"        raise Exception('Compute target {} is not an AML cluster.'\n",
"                        .format(cluster_name))\n",
"    print('Using pre-existing AML cluster {}'.format(cluster_name))\n",
"else:\n",
"    # Create the cluster\n",
"    compute_target = ComputeTarget.create(ws, cluster_name, provisioning_config)\n",
"\n",
"    # You can poll for a minimum number of nodes and set a specific timeout.\n",
"    # If min node count is provided, provisioning will use the scale settings for the cluster.\n",
"    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Print a detailed view of the cluster."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"pd.Series(compute_target.get_status().serialize(), name='Value').to_frame()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Upload the data to the cloud <a id='upload'></a>\n",
"We put the data in a particular directory on the workspace's default data store. This will show up in the same location in the file system of every job running on the AML cluster.\n",
"\n",
"Get a handle to the workspace's default data store."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ds = ws.get_default_datastore()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Upload the data. We use `overwrite=False` to avoid taking the time to re-upload the data if files with the same names are already present. If you change the data and want to refresh what's uploaded, use `overwrite=True`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ds.upload(src_dir=os.path.join('.', 'data'), target_path='data', overwrite=False, show_progress=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define a hyperparameter search configuration <a id='configuration'></a>\n",
"Define the hyperparameter space for a random search. We will use a constant value for the number of estimators that is enough to let us reliably identify the best of the parameter configurations. Once we have the best combination, we will build a model using a larger number of estimators to boost the performance. The table below should give you an idea of the trade-off between the number of estimators and the modeling run time, model size, and model gain.\n",
"\n",
"| Estimators | Run time (s) | Size (MB) | Gain@1 | Gain@2 | Gain@3 |\n",
"|------------|--------------|-----------|--------|--------|--------|\n",
"| 100        | 40           | 2         | 25.02% | 38.72% | 47.83% |\n",
"| 1000       | 177          | 4         | 46.79% | 60.80% | 69.11% |\n",
"| 2000       | 359          | 7         | 51.38% | 65.93% | 73.09% |\n",
"| 4000       | 628          | 12        | 53.39% | 67.40% | 74.74% |\n",
"| 8000       | 904          | 22        | 54.62% | 67.77% | 75.35% |\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"hyperparameter_sampling = RandomParameterSampling({\n",
"    'ngrams': choice(range(1, 5)),\n",
"    'match': choice(range(2, 41)),\n",
"    'min_child_samples': choice(range(1, 31)),\n",
"    'unweighted': choice('Yes', 'No')\n",
"})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This hyperparameter space specifies a grid of 9,360 unique configuration points (4 `ngrams` x 39 `match` x 30 `min_child_samples` x 2 `unweighted`). We control the resources used by the search by specifying a maximum number of configuration points to sample as `max_total_runs`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"parameters"
]
},
"outputs": [],
"source": [
"max_total_runs = 96"
]
},
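{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check of the 9,360 figure, here is a minimal sketch (not part of the original workflow) that enumerates the same grid the sampler draws from."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from itertools import product\n",
"\n",
"# Enumerate the full grid defined above:\n",
"# 4 ngrams x 39 match x 30 min_child_samples x 2 unweighted = 9,360\n",
"grid = list(product(range(1, 5), range(2, 41), range(1, 31), ['Yes', 'No']))\n",
"print('{:,} configuration points'.format(len(grid)))"
]
},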
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is also possible to specify a maximum duration for the tuning experiment by setting `max_duration_minutes`. If both of these parameters are specified, any remaining runs are terminated once `max_duration_minutes` has passed; a sketch of this appears after the HyperDrive configuration cell below.\n",
"\n",
"Specify the primary metric to be optimized as the gain at 3, and that it should be maximized. This metric is logged by the training script."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"primary_metric_name = \"gain@3\"\n",
"primary_metric_goal = PrimaryMetricGoal.MAXIMIZE"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The training script logs the metric throughout training, so we may specify an early termination policy. If no policy is specified, the hyperparameter tuning service lets all training runs run to completion. We use a median stopping policy that terminates runs whose best metrics on the tune dataset are worse than the median of the running averages of the metrics of all training runs, and we delay the policy's application until each run's fifth metric report."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"policy = MedianStoppingPolicy(delay_evaluation=5)"
]
},
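{
"cell_type": "markdown",
"metadata": {},
"source": [
"Median stopping is not the only option. As a sketch of an alternative, a bandit policy terminates runs whose metric falls outside a slack of the best run so far; it is commented out so that it does not replace the policy chosen above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# from azureml.train.hyperdrive import BanditPolicy\n",
"# Terminate runs more than 10% (slack_factor=0.1) short of the best run so far,\n",
"# checked at every metric report after the fifth.\n",
"# policy = BanditPolicy(slack_factor=0.1, evaluation_interval=1, delay_evaluation=5)"
]
},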
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create an estimator <a id='estimator'></a>\n",
"Create an estimator that specifies the location of the script, sets up its fixed parameters (including the location of the data), names the compute target, and lists the packages needed to run the script. It may take a while to prepare the run environment the first time an estimator is used, but that environment will be reused until the list of packages is changed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"estimator = Estimator(source_directory=os.path.join('.', 'scripts'),\n",
"                      entry_script='TrainClassifier.py',\n",
"                      script_params={'--data-folder': ds.as_mount(),\n",
"                                     '--estimators': 1000},\n",
"                      compute_target=compute_target,\n",
"                      conda_packages=['pandas==0.23.4',\n",
"                                      'scikit-learn==0.20.0'],\n",
"                      pip_packages=['lightgbm==2.1.2'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Put the estimator and the configuration information together into a HyperDrive run configuration object."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"hyperdrive_run_config = HyperDriveConfig(\n",
"    estimator=estimator,\n",
"    hyperparameter_sampling=hyperparameter_sampling,\n",
"    policy=policy,\n",
"    primary_metric_name=primary_metric_name,\n",
"    primary_metric_goal=primary_metric_goal,\n",
"    max_total_runs=max_total_runs)"
]
},
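{
"cell_type": "markdown",
"metadata": {},
"source": [
"As noted earlier, `max_duration_minutes` can cap the experiment's duration as well. A sketch of that variant follows; it is commented out, and the 360-minute limit is only an illustrative value, not from the original notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# hyperdrive_run_config = HyperDriveConfig(\n",
"#     estimator=estimator,\n",
"#     hyperparameter_sampling=hyperparameter_sampling,\n",
"#     policy=policy,\n",
"#     primary_metric_name=primary_metric_name,\n",
"#     primary_metric_goal=primary_metric_goal,\n",
"#     max_total_runs=max_total_runs,\n",
"#     max_duration_minutes=360)  # illustrative cap: stop remaining runs after 6 hours"
]
},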
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Run the search <a id='submit'></a>\n",
"Get an experiment to run the search; create it if it doesn't already exist."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"exp = Experiment(workspace=ws, name='hypetuning')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Submit the configuration to be run. This should return almost immediately, and the value will be a run object."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"run = exp.submit(hyperdrive_run_config)\n",
"run"
]
},
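{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `RunDetails` widget imported above can render a live progress view of the run inline; a minimal sketch, assuming the notebook is running in Jupyter with `azureml.widgets` installed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Live, auto-updating view of the parent run and its children.\n",
"RunDetails(run).show()"
]
},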
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The experiment returns a run that, when displayed, shows a table with a link to the `Details Page` in the Azure Portal. That page lets you monitor the status of this run and of its child runs. By clicking on a particular child run, you can see its details, the files output by the script for that configuration, and the logs of the run, including the `driver.log` with the script's printouts.\n",
"\n",
"If you want to cancel this trial, run the code in the cell below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# run.cancel()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Save the ID of the run in a file. You may use this at a later time to recover the run, as shown in the next notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"run_id = run.id\n",
"run_id_path = \"run_id.txt\"\n",
"with open(run_id_path, \"w\") as fp:\n",
"    fp.write(run_id)"
]
},
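{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the recovery that the next notebook performs in full, assuming the experiment above and the `run_id.txt` file just written; it is commented out since it is not needed in this session."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# from azureml.train.hyperdrive import HyperDriveRun\n",
"# with open(run_id_path) as fp:\n",
"#     recovered_run = HyperDriveRun(exp, fp.read())"
]
},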
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Until all child runs have either failed or completed, the parent run's status will not be `Completed`. Other possible run statuses include `Preparing`, `Running`, `Finalizing`, and `Failed`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"run.get_status()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can programmatically monitor the progress of the run. First obtain the list of its child runs, polling every 60 seconds until all of them are available."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"run_children = list(run.get_children())\n",
"while len(run_children) < max_total_runs:\n",
"    time.sleep(60)\n",
"    run_children = list(run.get_children())\n",
"print('{:,} child runs'.format(len(run_children)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can now report the status of the child runs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"run_children_status = pd.Series({child.id: child.get_status() for child in run_children}, name=\"Count\")\n",
"run_children_status.value_counts().to_frame().transpose()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wait for the runs to complete, which can take more than an hour. `wait_for_completion` returns a `dict` with detailed information about the run; here, we verify that the run has `Completed`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"\n",
"run_status = run.wait_for_completion()\n",
"print(run_status['status'])\n",
"if run_status['status'] != 'Completed':\n",
"    raise Exception('The run did not successfully complete.')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Select the best model <a id='results'></a>\n",
"We can automatically select the best run."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"best_run = run.get_best_run_by_primary_metric()\n",
"best_run"
]
},
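{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of inspecting the winner. `get_metrics` and `get_details` are standard `Run` methods, but the `runDefinition`/`arguments` path is an assumption about the run details schema, so inspect `get_details()` if it differs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"best_run_metrics = best_run.get_metrics()\n",
"print('Best gain@3: {}'.format(best_run_metrics.get('gain@3')))\n",
"# The winning hyperparameter arguments; 'runDefinition'/'arguments' is an\n",
"# assumed location in the details dict.\n",
"print(best_run.get_details().get('runDefinition', {}).get('arguments'))"
]
},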
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The [next notebook](05_Train_Best_Model.ipynb) shows how to use the best HyperDrive run to create, download, and evaluate the model on held-aside test data."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}