FLAML/notebook/tune_synapseml.ipynb

{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "nteract": {
          "transient": {
            "deleting": false
          }
        }
      },
      "source": [
        "# Hyperparameter Tuning with FLAML\n",
        "\n",
        "|  | | | |\n",
        "|-----|--------|--------|--------|\n",
        "|![synapse](https://microsoft.github.io/SynapseML/img/logo.svg)| <img src=\"https://www.microsoft.com/en-us/research/uploads/prod/2020/02/flaml-1024x406.png\" alt=\"drawing\" width=\"200\"/> | \n",
        "\n",
        "\n",
        "<style>\n",
        "td, th {\n",
        "   border: none!important;\n",
        "}\n",
        "</style>\n",
        "\n",
        "In this notebook, we use FLAML to fine-tune a SynapseML LightGBM regression model that predicts house prices. We use the [*california_housing* dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html#sklearn.datasets.fetch_california_housing), which consists of 20,640 entries with 8 features.\n",
        "\n",
        "The results show that with **2 minutes** of tuning, FLAML **improved** the R^2 metric **from 0.71 to 0.81**.\n",
        "\n",
        "We will perform the task in the following steps:\n",
        "- **Set up** the environment\n",
        "- **Prepare** train and test datasets\n",
        "- **Train** with initial parameters\n",
        "- **Fine-tune** with FLAML\n",
        "- **Check** results\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "nteract": {
          "transient": {
            "deleting": false
          }
        }
      },
      "source": [
        "## 1. Set up the environment\n",
        "\n",
        "In this step, we first install FLAML and MLflow, then set up MLflow autologging so that the environment is ready for the task."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "jupyter": {
          "outputs_hidden": true
        }
      },
      "outputs": [
        {
          "data": {
            "application/vnd.livy.statement-meta+json": {
              "execution_finish_time": null,
              "execution_start_time": null,
              "livy_statement_state": null,
              "parent_msg_id": "d48224ad-8201-4266-b8e0-8e9c198e9dd0",
              "queued_time": "2023-04-09T13:53:09.4702521Z",
              "session_id": null,
              "session_start_time": "2023-04-09T13:53:09.5127728Z",
              "spark_jobs": null,
              "spark_pool": null,
              "state": "waiting",
              "statement_id": null
            },
            "text/plain": [
              "StatementMeta(, , , Waiting, )"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {},
          "execution_count": 1,
          "metadata": {},
          "output_type": "execute_result"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Collecting flaml[synapse]==1.1.3\n",
            "  Downloading FLAML-1.1.3-py3-none-any.whl (224 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m224.2/224.2 KB\u001b[0m \u001b[31m10.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting xgboost==1.6.1\n",
            "  Downloading xgboost-1.6.1-py3-none-manylinux2014_x86_64.whl (192.9 MB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m192.9/192.9 MB\u001b[0m \u001b[31m34.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
            "\u001b[?25hCollecting pandas==1.5.1\n",
            "  Downloading pandas-1.5.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.2 MB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m12.2/12.2 MB\u001b[0m \u001b[31m8.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m00:01\u001b[0m\n",
            "\u001b[?25hCollecting numpy==1.23.4\n",
            "  Downloading numpy-1.23.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m17.1/17.1 MB\u001b[0m \u001b[31m135.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
            "\u001b[?25hCollecting openml\n",
            "  Downloading openml-0.13.1.tar.gz (127 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m127.6/127.6 KB\u001b[0m \u001b[31m70.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25h  Preparing metadata (setup.py) ... \u001b[?25l-\b \bdone\n",
            "\u001b[?25hCollecting scipy>=1.4.1\n",
            "  Downloading scipy-1.10.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.5 MB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m34.5/34.5 MB\u001b[0m \u001b[31m120.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
            "\u001b[?25hCollecting lightgbm>=2.3.1\n",
            "  Downloading lightgbm-3.3.5-py3-none-manylinux1_x86_64.whl (2.0 MB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.0/2.0 MB\u001b[0m \u001b[31m170.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting scikit-learn>=0.24\n",
            "  Downloading scikit_learn-1.2.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.8 MB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m9.8/9.8 MB\u001b[0m \u001b[31m186.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m\n",
            "\u001b[?25hCollecting pyspark>=3.0.0\n",
            "  Downloading pyspark-3.3.2.tar.gz (281.4 MB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m281.4/281.4 MB\u001b[0m \u001b[31m26.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
            "\u001b[?25h  Preparing metadata (setup.py) ... \u001b[?25l-\b \bdone\n",
            "\u001b[?25hCollecting joblibspark>=0.5.0\n",
            "  Downloading joblibspark-0.5.1-py3-none-any.whl (15 kB)\n",
            "Collecting optuna==2.8.0\n",
            "  Downloading optuna-2.8.0-py3-none-any.whl (301 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m302.0/302.0 KB\u001b[0m \u001b[31m104.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting python-dateutil>=2.8.1\n",
            "  Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m247.7/247.7 KB\u001b[0m \u001b[31m98.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting pytz>=2020.1\n",
            "  Downloading pytz-2023.3-py2.py3-none-any.whl (502 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m502.3/502.3 KB\u001b[0m \u001b[31m126.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting alembic\n",
            "  Downloading alembic-1.10.3-py3-none-any.whl (212 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m212.3/212.3 KB\u001b[0m \u001b[31m88.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting colorlog\n",
            "  Downloading colorlog-6.7.0-py2.py3-none-any.whl (11 kB)\n",
            "Collecting tqdm\n",
            "  Downloading tqdm-4.65.0-py3-none-any.whl (77 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m77.1/77.1 KB\u001b[0m \u001b[31m39.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting cliff\n",
            "  Downloading cliff-4.2.0-py3-none-any.whl (81 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m81.0/81.0 KB\u001b[0m \u001b[31m37.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting sqlalchemy>=1.1.0\n",
            "  Downloading SQLAlchemy-2.0.9-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.8 MB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.8/2.8 MB\u001b[0m \u001b[31m190.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting cmaes>=0.8.2\n",
            "  Downloading cmaes-0.9.1-py3-none-any.whl (21 kB)\n",
            "Collecting packaging>=20.0\n",
            "  Downloading packaging-23.0-py3-none-any.whl (42 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m42.7/42.7 KB\u001b[0m \u001b[31m25.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting liac-arff>=2.4.0\n",
            "  Downloading liac-arff-2.5.0.tar.gz (13 kB)\n",
            "  Preparing metadata (setup.py) ... \u001b[?25l-\b \bdone\n",
            "\u001b[?25hCollecting xmltodict\n",
            "  Downloading xmltodict-0.13.0-py2.py3-none-any.whl (10.0 kB)\n",
            "Collecting requests\n",
            "  Downloading requests-2.28.2-py3-none-any.whl (62 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m62.8/62.8 KB\u001b[0m \u001b[31m25.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting minio\n",
            "  Downloading minio-7.1.14-py3-none-any.whl (77 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m77.2/77.2 KB\u001b[0m \u001b[31m40.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting pyarrow\n",
            "  Downloading pyarrow-11.0.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (35.0 MB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m35.0/35.0 MB\u001b[0m \u001b[31m119.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
            "\u001b[?25hCollecting joblib>=0.14\n",
            "  Downloading joblib-1.2.0-py3-none-any.whl (297 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m298.0/298.0 KB\u001b[0m \u001b[31m104.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting wheel\n",
            "  Downloading wheel-0.40.0-py3-none-any.whl (64 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m64.5/64.5 KB\u001b[0m \u001b[31m35.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting py4j==0.10.9.5\n",
            "  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m199.7/199.7 KB\u001b[0m \u001b[31m88.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting six>=1.5\n",
            "  Downloading six-1.16.0-py2.py3-none-any.whl (11 kB)\n",
            "Collecting threadpoolctl>=2.0.0\n",
            "  Downloading threadpoolctl-3.1.0-py3-none-any.whl (14 kB)\n",
            "Collecting urllib3\n",
            "  Downloading urllib3-1.26.15-py2.py3-none-any.whl (140 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m140.9/140.9 KB\u001b[0m \u001b[31m70.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting certifi\n",
            "  Downloading certifi-2022.12.7-py3-none-any.whl (155 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m155.3/155.3 KB\u001b[0m \u001b[31m78.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting charset-normalizer<4,>=2\n",
            "  Downloading charset_normalizer-3.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (195 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m195.9/195.9 KB\u001b[0m \u001b[31m86.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting idna<4,>=2.5\n",
            "  Downloading idna-3.4-py3-none-any.whl (61 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m61.5/61.5 KB\u001b[0m \u001b[31m34.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting greenlet!=0.4.17\n",
            "  Downloading greenlet-2.0.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (618 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m618.5/618.5 KB\u001b[0m \u001b[31m137.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting typing-extensions>=4.2.0\n",
            "  Downloading typing_extensions-4.5.0-py3-none-any.whl (27 kB)\n",
            "Collecting Mako\n",
            "  Downloading Mako-1.2.4-py3-none-any.whl (78 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m78.7/78.7 KB\u001b[0m \u001b[31m44.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting importlib-resources\n",
            "  Downloading importlib_resources-5.12.0-py3-none-any.whl (36 kB)\n",
            "Collecting importlib-metadata\n",
            "  Downloading importlib_metadata-6.2.0-py3-none-any.whl (21 kB)\n",
            "Collecting stevedore>=2.0.1\n",
            "  Downloading stevedore-5.0.0-py3-none-any.whl (49 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m49.6/49.6 KB\u001b[0m \u001b[31m27.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting PyYAML>=3.12\n",
            "  Downloading PyYAML-6.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (701 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m701.2/701.2 KB\u001b[0m \u001b[31m136.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting autopage>=0.4.0\n",
            "  Downloading autopage-0.5.1-py3-none-any.whl (29 kB)\n",
            "Collecting cmd2>=1.0.0\n",
            "  Downloading cmd2-2.4.3-py3-none-any.whl (147 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m147.2/147.2 KB\u001b[0m \u001b[31m71.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting PrettyTable>=0.7.2\n",
            "  Downloading prettytable-3.6.0-py3-none-any.whl (27 kB)\n",
            "Collecting attrs>=16.3.0\n",
            "  Downloading attrs-22.2.0-py3-none-any.whl (60 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m60.0/60.0 KB\u001b[0m \u001b[31m38.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting pyperclip>=1.6\n",
            "  Downloading pyperclip-1.8.2.tar.gz (20 kB)\n",
            "  Preparing metadata (setup.py) ... \u001b[?25l-\b \bdone\n",
            "\u001b[?25hCollecting wcwidth>=0.1.7\n",
            "  Downloading wcwidth-0.2.6-py2.py3-none-any.whl (29 kB)\n",
            "Collecting zipp>=0.5\n",
            "  Downloading zipp-3.15.0-py3-none-any.whl (6.8 kB)\n",
            "Collecting pbr!=2.1.0,>=2.0.0\n",
            "  Downloading pbr-5.11.1-py2.py3-none-any.whl (112 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m112.7/112.7 KB\u001b[0m \u001b[31m59.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hCollecting MarkupSafe>=0.9.2\n",
            "  Downloading MarkupSafe-2.1.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB)\n",
            "Building wheels for collected packages: openml, liac-arff, pyspark, pyperclip\n",
            "  Building wheel for openml (setup.py) ... \u001b[?25l-\b \b\\\b \bdone\n",
            "\u001b[?25h  Created wheel for openml: filename=openml-0.13.1-py3-none-any.whl size=142787 sha256=a8434d2ac76ac96031814803c3e41204c26927e9f4429117e59a494e4b592adb\n",
            "  Stored in directory: /home/trusted-service-user/.cache/pip/wheels/c4/1c/5e/5775d391b42f19ce45a465873d8ce87da9ea56f0cd3af920c4\n",
            "  Building wheel for liac-arff (setup.py) ... \u001b[?25l-\b \bdone\n",
            "\u001b[?25h  Created wheel for liac-arff: filename=liac_arff-2.5.0-py3-none-any.whl size=11731 sha256=07dd6471e0004d4f00aec033896502af0b23e073f0c43e95afa97db2b545ce83\n",
            "  Stored in directory: /home/trusted-service-user/.cache/pip/wheels/a2/de/68/bf3972de3ecb31e32bef59a7f4c75f0687a3674c476b347c14\n",
            "  Building wheel for pyspark (setup.py) ... \u001b[?25l-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \b\\\b \b|\b \b/\b \b-\b \bdone\n",
            "\u001b[?25h  Created wheel for pyspark: filename=pyspark-3.3.2-py2.py3-none-any.whl size=281824026 sha256=a0064b8d2ed7587f48ff6c4bc6afd36c683af7c568084f16ebd143aa6955a0a8\n",
            "  Stored in directory: /home/trusted-service-user/.cache/pip/wheels/b1/59/a0/a1a0624b5e865fd389919c1a10f53aec9b12195d6747710baf\n",
            "  Building wheel for pyperclip (setup.py) ... \u001b[?25l-\b \b\\\b \bdone\n",
            "\u001b[?25h  Created wheel for pyperclip: filename=pyperclip-1.8.2-py3-none-any.whl size=11107 sha256=b3ad4639c1af2d7f2e4c5c8c0e40b4ff849b5c5b26730285f3d7ad320badd2c3\n",
            "  Stored in directory: /home/trusted-service-user/.cache/pip/wheels/7f/1a/65/84ff8c386bec21fca6d220ea1f5498a0367883a78dd5ba6122\n",
            "Successfully built openml liac-arff pyspark pyperclip\n",
            "Installing collected packages: wcwidth, pytz, pyperclip, py4j, zipp, xmltodict, wheel, urllib3, typing-extensions, tqdm, threadpoolctl, six, PyYAML, pyspark, PrettyTable, pbr, packaging, numpy, MarkupSafe, liac-arff, joblib, idna, greenlet, colorlog, charset-normalizer, certifi, autopage, attrs, stevedore, sqlalchemy, scipy, requests, python-dateutil, pyarrow, minio, Mako, joblibspark, importlib-resources, importlib-metadata, cmd2, cmaes, xgboost, scikit-learn, pandas, cliff, alembic, optuna, openml, lightgbm, flaml\n",
            "  Attempting uninstall: wcwidth\n",
            "    Found existing installation: wcwidth 0.2.5\n",
            "    Not uninstalling wcwidth at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'wcwidth'. No files were found to uninstall.\n",
            "  Attempting uninstall: pytz\n",
            "    Found existing installation: pytz 2021.1\n",
            "    Not uninstalling pytz at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'pytz'. No files were found to uninstall.\n",
            "  Attempting uninstall: pyperclip\n",
            "    Found existing installation: pyperclip 1.8.2\n",
            "    Not uninstalling pyperclip at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'pyperclip'. No files were found to uninstall.\n",
            "  Attempting uninstall: py4j\n",
            "    Found existing installation: py4j 0.10.9.3\n",
            "    Not uninstalling py4j at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'py4j'. No files were found to uninstall.\n",
            "  Attempting uninstall: zipp\n",
            "    Found existing installation: zipp 3.5.0\n",
            "    Not uninstalling zipp at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'zipp'. No files were found to uninstall.\n",
            "  Attempting uninstall: wheel\n",
            "    Found existing installation: wheel 0.36.2\n",
            "    Not uninstalling wheel at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'wheel'. No files were found to uninstall.\n",
            "  Attempting uninstall: urllib3\n",
            "    Found existing installation: urllib3 1.26.4\n",
            "    Not uninstalling urllib3 at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'urllib3'. No files were found to uninstall.\n",
            "  Attempting uninstall: typing-extensions\n",
            "    Found existing installation: typing-extensions 3.10.0.0\n",
            "    Not uninstalling typing-extensions at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'typing-extensions'. No files were found to uninstall.\n",
            "  Attempting uninstall: tqdm\n",
            "    Found existing installation: tqdm 4.61.2\n",
            "    Not uninstalling tqdm at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'tqdm'. No files were found to uninstall.\n",
            "  Attempting uninstall: threadpoolctl\n",
            "    Found existing installation: threadpoolctl 2.1.0\n",
            "    Not uninstalling threadpoolctl at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'threadpoolctl'. No files were found to uninstall.\n",
            "  Attempting uninstall: six\n",
            "    Found existing installation: six 1.16.0\n",
            "    Not uninstalling six at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'six'. No files were found to uninstall.\n",
            "  Attempting uninstall: PyYAML\n",
            "    Found existing installation: PyYAML 5.4.1\n",
            "    Not uninstalling pyyaml at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'PyYAML'. No files were found to uninstall.\n",
            "  Attempting uninstall: pyspark\n",
            "    Found existing installation: pyspark 3.2.1\n",
            "    Not uninstalling pyspark at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'pyspark'. No files were found to uninstall.\n",
            "  Attempting uninstall: PrettyTable\n",
            "    Found existing installation: prettytable 2.4.0\n",
            "    Not uninstalling prettytable at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'prettytable'. No files were found to uninstall.\n",
            "  Attempting uninstall: packaging\n",
            "    Found existing installation: packaging 21.0\n",
            "    Not uninstalling packaging at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'packaging'. No files were found to uninstall.\n",
            "  Attempting uninstall: numpy\n",
            "    Found existing installation: numpy 1.19.4\n",
            "    Not uninstalling numpy at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'numpy'. No files were found to uninstall.\n",
            "  Attempting uninstall: MarkupSafe\n",
            "    Found existing installation: MarkupSafe 2.0.1\n",
            "    Not uninstalling markupsafe at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'MarkupSafe'. No files were found to uninstall.\n",
            "  Attempting uninstall: liac-arff\n",
            "    Found existing installation: liac-arff 2.5.0\n",
            "    Not uninstalling liac-arff at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'liac-arff'. No files were found to uninstall.\n",
            "  Attempting uninstall: joblib\n",
            "    Found existing installation: joblib 1.0.1\n",
            "    Not uninstalling joblib at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'joblib'. No files were found to uninstall.\n",
            "  Attempting uninstall: idna\n",
            "    Found existing installation: idna 2.10\n",
            "    Not uninstalling idna at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'idna'. No files were found to uninstall.\n",
            "  Attempting uninstall: greenlet\n",
            "    Found existing installation: greenlet 1.1.0\n",
            "    Not uninstalling greenlet at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'greenlet'. No files were found to uninstall.\n",
            "  Attempting uninstall: certifi\n",
            "    Found existing installation: certifi 2021.5.30\n",
            "    Not uninstalling certifi at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'certifi'. No files were found to uninstall.\n",
            "  Attempting uninstall: attrs\n",
            "    Found existing installation: attrs 21.2.0\n",
            "    Not uninstalling attrs at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'attrs'. No files were found to uninstall.\n",
            "  Attempting uninstall: sqlalchemy\n",
            "    Found existing installation: SQLAlchemy 1.4.20\n",
            "    Not uninstalling sqlalchemy at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'SQLAlchemy'. No files were found to uninstall.\n",
            "  Attempting uninstall: scipy\n",
            "    Found existing installation: scipy 1.5.3\n",
            "    Not uninstalling scipy at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'scipy'. No files were found to uninstall.\n",
            "  Attempting uninstall: requests\n",
            "    Found existing installation: requests 2.25.1\n",
            "    Not uninstalling requests at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'requests'. No files were found to uninstall.\n",
            "  Attempting uninstall: python-dateutil\n",
            "    Found existing installation: python-dateutil 2.8.1\n",
            "    Not uninstalling python-dateutil at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'python-dateutil'. No files were found to uninstall.\n",
            "  Attempting uninstall: pyarrow\n",
            "    Found existing installation: pyarrow 3.0.0\n",
            "    Not uninstalling pyarrow at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'pyarrow'. No files were found to uninstall.\n",
            "  Attempting uninstall: importlib-resources\n",
            "    Found existing installation: importlib-resources 5.10.0\n",
            "    Not uninstalling importlib-resources at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'importlib-resources'. No files were found to uninstall.\n",
            "  Attempting uninstall: importlib-metadata\n",
            "    Found existing installation: importlib-metadata 4.6.1\n",
            "    Not uninstalling importlib-metadata at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'importlib-metadata'. No files were found to uninstall.\n",
            "  Attempting uninstall: xgboost\n",
            "    Found existing installation: xgboost 1.4.0\n",
            "    Not uninstalling xgboost at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'xgboost'. No files were found to uninstall.\n",
            "  Attempting uninstall: scikit-learn\n",
            "    Found existing installation: scikit-learn 0.23.2\n",
            "    Not uninstalling scikit-learn at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'scikit-learn'. No files were found to uninstall.\n",
            "  Attempting uninstall: pandas\n",
            "    Found existing installation: pandas 1.2.3\n",
            "    Not uninstalling pandas at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'pandas'. No files were found to uninstall.\n",
            "  Attempting uninstall: lightgbm\n",
            "    Found existing installation: lightgbm 3.2.1\n",
            "    Not uninstalling lightgbm at /home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages, outside environment /nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39\n",
            "    Can't uninstall 'lightgbm'. No files were found to uninstall.\n",
            "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
            "tensorflow 2.4.1 requires six~=1.15.0, but you have six 1.16.0 which is incompatible.\n",
            "tensorflow 2.4.1 requires typing-extensions~=3.7.4, but you have typing-extensions 4.5.0 which is incompatible.\n",
            "pmdarima 1.8.2 requires numpy~=1.19.0, but you have numpy 1.23.4 which is incompatible.\n",
            "koalas 1.8.0 requires numpy<1.20.0,>=1.14, but you have numpy 1.23.4 which is incompatible.\n",
            "gevent 21.1.2 requires greenlet<2.0,>=0.4.17; platform_python_implementation == \"CPython\", but you have greenlet 2.0.2 which is incompatible.\n",
            "azureml-dataset-runtime 1.34.0 requires pyarrow<4.0.0,>=0.17.0, but you have pyarrow 11.0.0 which is incompatible.\n",
            "azureml-core 1.34.0 requires urllib3<=1.26.6,>=1.23, but you have urllib3 1.26.15 which is incompatible.\u001b[0m\u001b[31m\n",
            "\u001b[0mSuccessfully installed Mako-1.2.4 MarkupSafe-2.1.2 PrettyTable-3.6.0 PyYAML-6.0 alembic-1.10.3 attrs-22.2.0 autopage-0.5.1 certifi-2022.12.7 charset-normalizer-3.1.0 cliff-4.2.0 cmaes-0.9.1 cmd2-2.4.3 colorlog-6.7.0 flaml-1.1.3 greenlet-2.0.2 idna-3.4 importlib-metadata-6.2.0 importlib-resources-5.12.0 joblib-1.2.0 joblibspark-0.5.1 liac-arff-2.5.0 lightgbm-3.3.5 minio-7.1.14 numpy-1.23.4 openml-0.13.1 optuna-2.8.0 packaging-23.0 pandas-1.5.1 pbr-5.11.1 py4j-0.10.9.5 pyarrow-11.0.0 pyperclip-1.8.2 pyspark-3.3.2 python-dateutil-2.8.2 pytz-2023.3 requests-2.28.2 scikit-learn-1.2.2 scipy-1.10.1 six-1.16.0 sqlalchemy-2.0.9 stevedore-5.0.0 threadpoolctl-3.1.0 tqdm-4.65.0 typing-extensions-4.5.0 urllib3-1.26.15 wcwidth-0.2.6 wheel-0.40.0 xgboost-1.6.1 xmltodict-0.13.0 zipp-3.15.0\n",
            "\u001b[33mWARNING: You are using pip version 22.0.4; however, version 23.0.1 is available.\n",
            "You should consider upgrading via the '/nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39/bin/python -m pip install --upgrade pip' command.\u001b[0m\u001b[33m\n",
            "\u001b[0mNote: you may need to restart the kernel to use updated packages.\n"
          ]
        },
        {
          "data": {},
          "execution_count": 1,
          "metadata": {},
          "output_type": "execute_result"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Warning: PySpark kernel has been restarted to use updated packages.\n",
            "\n"
          ]
        }
      ],
      "source": [
        "%pip install flaml[synapse]==1.1.3 xgboost==1.6.1 pandas==1.5.1 numpy==1.23.4 openml --force-reinstall"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Uncomment the last line, `spark = _init_spark()`, if you run this notebook in a local Spark environment."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "def _init_spark():\n",
        "    import pyspark\n",
        "\n",
        "    spark = (\n",
        "        pyspark.sql.SparkSession.builder.appName(\"MyApp\")\n",
        "        .master(\"local[2]\")\n",
        "        .config(\n",
        "            \"spark.jars.packages\",\n",
        "            (\n",
        "                \"com.microsoft.azure:synapseml_2.12:0.10.2,\"\n",
        "                \"org.apache.hadoop:hadoop-azure:3.3.5,\"\n",
        "                \"com.microsoft.azure:azure-storage:8.6.6\"\n",
        "            ),\n",
        "        )\n",
        "        .config(\"spark.jars.repositories\", \"https://mmlspark.azureedge.net/maven\")\n",
        "        .config(\"spark.sql.debug.maxToStringFields\", \"100\")\n",
        "        .getOrCreate()\n",
        "    )\n",
        "    return spark\n",
        "\n",
        "# spark = _init_spark()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "nteract": {
          "transient": {
            "deleting": false
          }
        }
      },
      "source": [
        "## 2. Prepare train and test datasets\n",
        "In this step, we first download the dataset with `sklearn.datasets`, then convert it into a Spark DataFrame. After that, we split the data into train, validation, and test sets."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "jupyter": {
          "outputs_hidden": false,
          "source_hidden": false
        },
        "nteract": {
          "transient": {
            "deleting": false
          }
        }
      },
      "outputs": [
        {
          "data": {
            "application/vnd.livy.statement-meta+json": {
              "execution_finish_time": null,
              "execution_start_time": null,
              "livy_statement_state": null,
              "parent_msg_id": "b48443c1-a512-4624-b047-1a04eeba9a9d",
              "queued_time": "2023-04-09T13:53:09.3733824Z",
              "session_id": null,
              "session_start_time": null,
              "spark_jobs": null,
              "spark_pool": null,
              "state": "waiting",
              "statement_id": null
            },
            "text/plain": [
              "StatementMeta(, , , Waiting, )"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "/opt/spark/python/lib/pyspark.zip/pyspark/sql/pandas/conversion.py:471: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Dataframe has 20640 rows\n"
          ]
        }
      ],
      "source": [
        "import numpy as np\n",
        "import pandas as pd\n",
        "from sklearn.datasets import fetch_california_housing\n",
        "\n",
        "data = fetch_california_housing()\n",
        "\n",
        "feature_cols = [\"f\" + str(i) for i in range(data.data.shape[1])]\n",
        "header = [\"target\"] + feature_cols\n",
        "df = spark.createDataFrame(\n",
        "    pd.DataFrame(data=np.column_stack((data.target, data.data)), columns=header)\n",
        ").repartition(1)\n",
        "\n",
        "print(\"Dataframe has {} rows\".format(df.count()))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "nteract": {
          "transient": {
            "deleting": false
          }
        }
      },
      "source": [
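        "Here, we assemble the features into a single vector column and split the data randomly: 85% for training and 15% for testing. The training portion is split again, 85%/15%, into the train and validation subsets used for tuning."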
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "jupyter": {
          "outputs_hidden": false,
          "source_hidden": false
        },
        "nteract": {
          "transient": {
            "deleting": false
          }
        }
      },
      "outputs": [
        {
          "data": {
            "application/vnd.livy.statement-meta+json": {
              "execution_finish_time": null,
              "execution_start_time": null,
              "livy_statement_state": null,
              "parent_msg_id": "0600f529-d1d0-4132-a55c-24464a10a9c3",
              "queued_time": "2023-04-09T13:53:09.3762563Z",
              "session_id": null,
              "session_start_time": null,
              "spark_jobs": null,
              "spark_pool": null,
              "state": "waiting",
              "statement_id": null
            },
            "text/plain": [
              "StatementMeta(, , , Waiting, )"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/plain": [
              "Row(target=0.14999, features=DenseVector([2.1, 19.0, 3.7744, 1.4573, 490.0, 2.9878, 36.4, -117.02]))"
            ]
          },
          "execution_count": 7,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from pyspark.ml.feature import VectorAssembler\n",
        "\n",
        "# Convert features into a single vector column\n",
        "featurizer = VectorAssembler(inputCols=feature_cols, outputCol=\"features\")\n",
        "data = featurizer.transform(df)[\"target\", \"features\"]\n",
        "\n",
        "train_data, test_data = data.randomSplit([0.85, 0.15], seed=41)\n",
        "train_data_sub, val_data_sub = train_data.randomSplit([0.85, 0.15], seed=41)\n",
        "\n",
        "train_data.head()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "nteract": {
          "transient": {
            "deleting": false
          }
        }
      },
      "source": [
        "## 3. Train with initial parameters\n",
        "In this step, we define a `train` function that accepts different hyperparameter configurations, then train a model with initial parameters."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {
        "jupyter": {
          "outputs_hidden": false,
          "source_hidden": false
        },
        "nteract": {
          "transient": {
            "deleting": false
          }
        }
      },
      "outputs": [
        {
          "data": {
            "application/vnd.livy.statement-meta+json": {
              "execution_finish_time": null,
              "execution_start_time": null,
              "livy_statement_state": null,
              "parent_msg_id": "3c41f117-9de6-4f81-b9fe-697842cb7d87",
              "queued_time": "2023-04-09T13:53:09.377987Z",
              "session_id": null,
              "session_start_time": null,
              "spark_jobs": null,
              "spark_pool": null,
              "state": "waiting",
              "statement_id": null
            },
            "text/plain": [
              "StatementMeta(, , , Waiting, )"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "from synapse.ml.lightgbm import LightGBMRegressor\n",
        "from pyspark.ml.evaluation import RegressionEvaluator\n",
        "\n",
        "def train(alpha, learningRate, numLeaves, numIterations, train_data=train_data_sub, val_data=val_data_sub):\n",
        "    \"\"\"\n",
        "    This train() function:\n",
        "     - takes hyperparameters as inputs (for tuning later)\n",
        "     - returns the fitted model and its R2 score on the validation dataset\n",
        "\n",
        "    Wrapping code as a function makes it easier to reuse the code later for tuning.\n",
        "    \"\"\"\n",
        "\n",
        "    lgr = LightGBMRegressor(\n",
        "        objective=\"quantile\",\n",
        "        alpha=alpha,\n",
        "        learningRate=learningRate,\n",
        "        numLeaves=numLeaves,\n",
        "        labelCol=\"target\",\n",
        "        numIterations=numIterations,\n",
        "    )\n",
        "\n",
        "    model = lgr.fit(train_data)\n",
        "\n",
        "    # Define an evaluation metric and evaluate the model on the validation dataset.\n",
        "    predictions = model.transform(val_data)\n",
        "    evaluator = RegressionEvaluator(predictionCol=\"prediction\", labelCol=\"target\", metricName=\"r2\")\n",
        "    eval_metric = evaluator.evaluate(predictions)\n",
        "\n",
        "    return model, eval_metric"
      ]
    },
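    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a rough illustration of what `train` optimizes and reports, the sketch below computes both quantities with NumPy only. It assumes the usual definitions: the `quantile` objective minimizes the pinball loss at level `alpha`, and R2 is 1 minus the ratio of the residual sum of squares to the total sum of squares. The helper names `pinball_loss` and `r2_score` and the sample values are illustrative and not part of this notebook.\n",
        "\n",
        "```python\n",
        "import numpy as np\n",
        "\n",
        "\n",
        "def pinball_loss(y_true, y_pred, alpha):\n",
        "    # Mean pinball (quantile) loss at quantile level alpha in (0, 1).\n",
        "    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)\n",
        "    # Under-predictions are weighted by alpha, over-predictions by (1 - alpha).\n",
        "    return float(np.mean(np.maximum(alpha * diff, (alpha - 1.0) * diff)))\n",
        "\n",
        "\n",
        "def r2_score(y_true, y_pred):\n",
        "    # Coefficient of determination: 1 - SS_res / SS_tot.\n",
        "    y_true = np.asarray(y_true, dtype=float)\n",
        "    y_pred = np.asarray(y_pred, dtype=float)\n",
        "    ss_res = np.sum((y_true - y_pred) ** 2)\n",
        "    ss_tot = np.sum((y_true - y_true.mean()) ** 2)\n",
        "    return float(1.0 - ss_res / ss_tot)\n",
        "\n",
        "\n",
        "# Illustrative values only.\n",
        "y_true = np.array([1.0, 2.0, 3.0, 4.0])\n",
        "y_pred = np.array([1.5, 2.0, 2.5, 4.5])\n",
        "\n",
        "print('pinball loss (alpha=0.2):', pinball_loss(y_true, y_pred, alpha=0.2))\n",
        "print('R2:', r2_score(y_true, y_pred))\n",
        "```\n",
        "\n",
        "A small `alpha` targets a low quantile of the target, so the predictions tend to sit below most labels. That is one plausible reason R2 can drop sharply when `alpha` is far from 0.5."
      ]
    },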
    {
      "cell_type": "markdown",
      "metadata": {
        "nteract": {
          "transient": {
            "deleting": false
          }
        }
      },
      "source": [
        "Here, we train a model with the initial parameters on the full training set and evaluate it on the test set."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {
        "jupyter": {
          "outputs_hidden": false,
          "source_hidden": false
        },
        "nteract": {
          "transient": {
            "deleting": false
          }
        }
      },
      "outputs": [
        {
          "data": {
            "application/vnd.livy.statement-meta+json": {
              "execution_finish_time": null,
              "execution_start_time": null,
              "livy_statement_state": null,
              "parent_msg_id": "b936d629-6efc-4582-a4cc-24b55a8f1260",
              "queued_time": "2023-04-09T13:53:09.3794418Z",
              "session_id": null,
              "session_start_time": null,
              "spark_jobs": null,
              "spark_pool": null,
              "state": "waiting",
              "statement_id": null
            },
            "text/plain": [
              "StatementMeta(, , , Waiting, )"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "R2 of initial model on test dataset is:  0.7086364659469071\n"
          ]
        }
      ],
      "source": [
        "init_model, init_eval_metric = train(alpha=0.2, learningRate=0.3, numLeaves=31, numIterations=100, train_data=train_data, val_data=test_data)\n",
        "print(\"R2 of initial model on test dataset is: \", init_eval_metric)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "nteract": {
          "transient": {
            "deleting": false
          }
        }
      },
      "source": [
        "## 4. Tune with FLAML\n",
        "\n",
        "In this step, we define the search space for the hyperparameters and use FLAML to tune the model over it."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "metadata": {
        "jupyter": {
          "outputs_hidden": false,
          "source_hidden": false
        },
        "nteract": {
          "transient": {
            "deleting": false
          }
        }
      },
      "outputs": [
        {
          "data": {
            "application/vnd.livy.statement-meta+json": {
              "execution_finish_time": null,
              "execution_start_time": null,
              "livy_statement_state": null,
              "parent_msg_id": "5785d2f4-5945-45ec-865d-1cf62f1365f2",
              "queued_time": "2023-04-09T13:53:09.3808794Z",
              "session_id": null,
              "session_start_time": null,
              "spark_jobs": null,
              "spark_pool": null,
              "state": "waiting",
              "statement_id": null
            },
            "text/plain": [
              "StatementMeta(, , , Waiting, )"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/dask/dataframe/backends.py:187: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.\n",
            "  _numeric_index_types = (pd.Int64Index, pd.Float64Index, pd.UInt64Index)\n",
            "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/dask/dataframe/backends.py:187: FutureWarning: pandas.Float64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.\n",
            "  _numeric_index_types = (pd.Int64Index, pd.Float64Index, pd.UInt64Index)\n",
            "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/dask/dataframe/backends.py:187: FutureWarning: pandas.UInt64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.\n",
            "  _numeric_index_types = (pd.Int64Index, pd.Float64Index, pd.UInt64Index)\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Failure while loading azureml_run_type_providers. Failed to load entrypoint azureml.scriptrun = azureml.core.script_run:ScriptRun._from_run_dto with exception (urllib3 1.26.15 (/nfs4/pyenv-78360147-4170-4df6-b8c9-313b8eb68e39/lib/python3.8/site-packages), Requirement.parse('urllib3<=1.26.6,>=1.23')).\n"
          ]
        }
      ],
      "source": [
        "import flaml\n",
        "import time\n",
        "\n",
        "# define the search space\n",
        "params = {\n",
        "    \"alpha\": flaml.tune.uniform(0, 1),\n",
        "    \"learningRate\": flaml.tune.uniform(0.001, 1),\n",
        "    \"numLeaves\": flaml.tune.randint(30, 100),\n",
        "    \"numIterations\": flaml.tune.randint(100, 300),\n",
        "}\n",
        "\n",
        "# define the tune function\n",
        "def flaml_tune(config):\n",
        "    _, metric = train(**config)\n",
        "    return {\"r2\": metric}"
      ]
    },
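    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "For a more cost-frugal search, `flaml.tune.run` accepts a `low_cost_partial_config` argument that names cheap starting values for cost-related hyperparameters. The sketch below is a minimal, hedged illustration: it reuses the search space above but replaces model training with a hypothetical `toy_objective`, so it runs without Spark. The chosen low-cost values (the lower bounds of `numLeaves` and `numIterations`) and the 10-second budget are assumptions, not settings used in this notebook.\n",
        "\n",
        "```python\n",
        "from flaml import tune\n",
        "\n",
        "# Same hyperparameter names and ranges as the search space above.\n",
        "search_space = {\n",
        "    'alpha': tune.uniform(0, 1),\n",
        "    'learningRate': tune.uniform(0.001, 1),\n",
        "    'numLeaves': tune.randint(30, 100),\n",
        "    'numIterations': tune.randint(100, 300),\n",
        "}\n",
        "\n",
        "\n",
        "def toy_objective(config):\n",
        "    # Hypothetical stand-in for flaml_tune(): no model is trained,\n",
        "    # the score simply peaks at alpha = 0.5.\n",
        "    return {'r2': 1.0 - (config['alpha'] - 0.5) ** 2}\n",
        "\n",
        "\n",
        "analysis = tune.run(\n",
        "    toy_objective,\n",
        "    config=search_space,\n",
        "    # Assumed cheap starting point: smallest trees, fewest iterations.\n",
        "    low_cost_partial_config={'numLeaves': 30, 'numIterations': 100},\n",
        "    metric='r2',\n",
        "    mode='max',\n",
        "    time_budget_s=10,\n",
        "    num_samples=-1,  # no trial limit; stop on the time budget\n",
        "    verbose=0,\n",
        ")\n",
        "print(analysis.best_config)\n",
        "```\n",
        "\n",
        "To apply this to the real tuning task, `toy_objective` would be swapped for `flaml_tune` and the time budget raised accordingly."
      ]
    },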
    {
      "cell_type": "markdown",
      "metadata": {
        "nteract": {
          "transient": {
            "deleting": false
          }
        }
      },
      "source": [
        "Here, we optimize the hyperparameters with FLAML. We set the total tuning time to 120 seconds."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {
        "jupyter": {
          "outputs_hidden": false,
          "source_hidden": false
        },
        "nteract": {
          "transient": {
            "deleting": false
          }
        }
      },
      "outputs": [
        {
          "data": {
            "application/vnd.livy.statement-meta+json": {
              "execution_finish_time": null,
              "execution_start_time": null,
              "livy_statement_state": null,
              "parent_msg_id": "7f984630-2cd4-46f6-a029-df857503ac59",
              "queued_time": "2023-04-09T13:53:09.3823941Z",
              "session_id": null,
              "session_start_time": null,
              "spark_jobs": null,
              "spark_pool": null,
              "state": "waiting",
              "statement_id": null
            },
            "text/plain": [
              "StatementMeta(, , , Waiting, )"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "[flaml.tune.tune: 04-09 13:58:26] {523} INFO - Using search algorithm BlendSearch.\n",
            "No low-cost partial config given to the search algorithm. For cost-frugal search, consider providing low-cost values for cost-related hps via 'low_cost_partial_config'. More info can be found at https://microsoft.github.io/FLAML/docs/FAQ#about-low_cost_partial_config-in-tune\n",
            "You passed a `space` parameter to OptunaSearch that contained unresolved search space definitions. OptunaSearch should however be instantiated with fully configured search spaces only. To use Ray Tune's automatic search space conversion, pass the space definition as part of the `config` argument to `tune.run()` instead.\n",
            "[flaml.tune.tune: 04-09 13:58:26] {811} INFO - trial 1 config: {'alpha': 0.09743207287894917, 'learningRate': 0.64761881525086, 'numLeaves': 30, 'numIterations': 172}\n",
            "[flaml.tune.tune: 04-09 13:58:29] {215} INFO - result: {'r2': 0.687704619858422, 'training_iteration': 0, 'config': {'alpha': 0.09743207287894917, 'learningRate': 0.64761881525086, 'numLeaves': 30, 'numIterations': 172}, 'config/alpha': 0.09743207287894917, 'config/learningRate': 0.64761881525086, 'config/numLeaves': 30, 'config/numIterations': 172, 'experiment_tag': 'exp', 'time_total_s': 2.9537112712860107}\n",
            "[flaml.tune.tune: 04-09 13:58:29] {811} INFO - trial 2 config: {'alpha': 0.771320643266746, 'learningRate': 0.021731197410042098, 'numLeaves': 74, 'numIterations': 249}\n",
            "[flaml.tune.tune: 04-09 13:58:34] {215} INFO - result: {'r2': 0.8122065159182567, 'training_iteration': 0, 'config': {'alpha': 0.771320643266746, 'learningRate': 0.021731197410042098, 'numLeaves': 74, 'numIterations': 249}, 'config/alpha': 0.771320643266746, 'config/learningRate': 0.021731197410042098, 'config/numLeaves': 74, 'config/numIterations': 249, 'experiment_tag': 'exp', 'time_total_s': 5.294095993041992}\n",
            "[flaml.tune.tune: 04-09 13:58:34] {811} INFO - trial 3 config: {'alpha': 0.4985070123025904, 'learningRate': 0.2255718488853168, 'numLeaves': 43, 'numIterations': 252}\n",
            "[flaml.tune.tune: 04-09 13:58:38] {215} INFO - result: {'r2': 0.8601164308675, 'training_iteration': 0, 'config': {'alpha': 0.4985070123025904, 'learningRate': 0.2255718488853168, 'numLeaves': 43, 'numIterations': 252}, 'config/alpha': 0.4985070123025904, 'config/learningRate': 0.2255718488853168, 'config/numLeaves': 43, 'config/numIterations': 252, 'experiment_tag': 'exp', 'time_total_s': 3.6809208393096924}\n",
            "[flaml.tune.tune: 04-09 13:58:38] {811} INFO - trial 4 config: {'alpha': 0.5940316589938806, 'learningRate': 0.22926504794631342, 'numLeaves': 35, 'numIterations': 279}\n",
            "[flaml.tune.tune: 04-09 13:58:41] {215} INFO - result: {'r2': 0.8645092967530056, 'training_iteration': 0, 'config': {'alpha': 0.5940316589938806, 'learningRate': 0.22926504794631342, 'numLeaves': 35, 'numIterations': 279}, 'config/alpha': 0.5940316589938806, 'config/learningRate': 0.22926504794631342, 'config/numLeaves': 35, 'config/numIterations': 279, 'experiment_tag': 'exp', 'time_total_s': 3.345020294189453}\n",
            "[flaml.tune.tune: 04-09 13:58:41] {811} INFO - trial 5 config: {'alpha': 0.16911083656253545, 'learningRate': 0.08925147435983626, 'numLeaves': 77, 'numIterations': 290}\n",
            "[flaml.tune.tune: 04-09 13:58:47] {215} INFO - result: {'r2': 0.7628328927228814, 'training_iteration': 0, 'config': {'alpha': 0.16911083656253545, 'learningRate': 0.08925147435983626, 'numLeaves': 77, 'numIterations': 290}, 'config/alpha': 0.16911083656253545, 'config/learningRate': 0.08925147435983626, 'config/numLeaves': 77, 'config/numIterations': 290, 'experiment_tag': 'exp', 'time_total_s': 5.498648643493652}\n",
            "[flaml.tune.tune: 04-09 13:58:47] {811} INFO - trial 6 config: {'alpha': 0.7613139607545752, 'learningRate': 0.001, 'numLeaves': 82, 'numIterations': 244}\n",
            "[flaml.tune.tune: 04-09 13:58:52] {215} INFO - result: {'r2': 0.05495941941983151, 'training_iteration': 0, 'config': {'alpha': 0.7613139607545752, 'learningRate': 0.001, 'numLeaves': 82, 'numIterations': 244}, 'config/alpha': 0.7613139607545752, 'config/learningRate': 0.001, 'config/numLeaves': 82, 'config/numIterations': 244, 'experiment_tag': 'exp', 'time_total_s': 5.299764394760132}\n",
            "[flaml.tune.tune: 04-09 13:58:52] {811} INFO - trial 7 config: {'alpha': 0.003948266327914451, 'learningRate': 0.5126800711223909, 'numLeaves': 86, 'numIterations': 222}\n",
            "[flaml.tune.tune: 04-09 13:58:57] {215} INFO - result: {'r2': -0.13472888652710457, 'training_iteration': 0, 'config': {'alpha': 0.003948266327914451, 'learningRate': 0.5126800711223909, 'numLeaves': 86, 'numIterations': 222}, 'config/alpha': 0.003948266327914451, 'config/learningRate': 0.5126800711223909, 'config/numLeaves': 86, 'config/numIterations': 222, 'experiment_tag': 'exp', 'time_total_s': 4.852660417556763}\n",
            "[flaml.tune.tune: 04-09 13:58:57] {811} INFO - trial 8 config: {'alpha': 0.7217553174317995, 'learningRate': 0.2925841921024625, 'numLeaves': 94, 'numIterations': 242}\n",
            "[flaml.tune.tune: 04-09 13:59:02] {215} INFO - result: {'r2': 0.841125964017654, 'training_iteration': 0, 'config': {'alpha': 0.7217553174317995, 'learningRate': 0.2925841921024625, 'numLeaves': 94, 'numIterations': 242}, 'config/alpha': 0.7217553174317995, 'config/learningRate': 0.2925841921024625, 'config/numLeaves': 94, 'config/numIterations': 242, 'experiment_tag': 'exp', 'time_total_s': 5.44955039024353}\n",
            "[flaml.tune.tune: 04-09 13:59:02] {811} INFO - trial 9 config: {'alpha': 0.8650568165408982, 'learningRate': 0.20965040368499302, 'numLeaves': 92, 'numIterations': 221}\n",
            "[flaml.tune.tune: 04-09 13:59:07] {215} INFO - result: {'r2': 0.764342272362222, 'training_iteration': 0, 'config': {'alpha': 0.8650568165408982, 'learningRate': 0.20965040368499302, 'numLeaves': 92, 'numIterations': 221}, 'config/alpha': 0.8650568165408982, 'config/learningRate': 0.20965040368499302, 'config/numLeaves': 92, 'config/numIterations': 221, 'experiment_tag': 'exp', 'time_total_s': 4.9519362449646}\n",
            "[flaml.tune.tune: 04-09 13:59:07] {811} INFO - trial 10 config: {'alpha': 0.5425443680112613, 'learningRate': 0.14302787755392543, 'numLeaves': 56, 'numIterations': 234}\n",
            "[flaml.tune.tune: 04-09 13:59:11] {215} INFO - result: {'r2': 0.8624550670698988, 'training_iteration': 0, 'config': {'alpha': 0.5425443680112613, 'learningRate': 0.14302787755392543, 'numLeaves': 56, 'numIterations': 234}, 'config/alpha': 0.5425443680112613, 'config/learningRate': 0.14302787755392543, 'config/numLeaves': 56, 'config/numIterations': 234, 'experiment_tag': 'exp', 'time_total_s': 3.658425807952881}\n",
            "[flaml.tune.tune: 04-09 13:59:11] {811} INFO - trial 11 config: {'alpha': 0.5736011364335467, 'learningRate': 0.28259755916943197, 'numLeaves': 48, 'numIterations': 218}\n",
            "[flaml.tune.tune: 04-09 13:59:14] {215} INFO - result: {'r2': 0.8605136490358005, 'training_iteration': 0, 'config': {'alpha': 0.5736011364335467, 'learningRate': 0.28259755916943197, 'numLeaves': 48, 'numIterations': 218}, 'config/alpha': 0.5736011364335467, 'config/learningRate': 0.28259755916943197, 'config/numLeaves': 48, 'config/numIterations': 218, 'experiment_tag': 'exp', 'time_total_s': 3.052793502807617}\n",
            "[flaml.tune.tune: 04-09 13:59:14] {811} INFO - trial 12 config: {'alpha': 0.5114875995889758, 'learningRate': 0.003458195938418919, 'numLeaves': 64, 'numIterations': 250}\n",
            "[flaml.tune.tune: 04-09 13:59:18] {215} INFO - result: {'r2': 0.570491367756149, 'training_iteration': 0, 'config': {'alpha': 0.5114875995889758, 'learningRate': 0.003458195938418919, 'numLeaves': 64, 'numIterations': 250}, 'config/alpha': 0.5114875995889758, 'config/learningRate': 0.003458195938418919, 'config/numLeaves': 64, 'config/numIterations': 250, 'experiment_tag': 'exp', 'time_total_s': 4.374900579452515}\n",
            "[flaml.tune.tune: 04-09 13:59:18] {811} INFO - trial 13 config: {'alpha': 0.4545232529799527, 'learningRate': 0.12259729414043312, 'numLeaves': 52, 'numIterations': 268}\n",
            "[flaml.tune.tune: 04-09 13:59:22] {215} INFO - result: {'r2': 0.8548999617455493, 'training_iteration': 0, 'config': {'alpha': 0.4545232529799527, 'learningRate': 0.12259729414043312, 'numLeaves': 52, 'numIterations': 268}, 'config/alpha': 0.4545232529799527, 'config/learningRate': 0.12259729414043312, 'config/numLeaves': 52, 'config/numIterations': 268, 'experiment_tag': 'exp', 'time_total_s': 4.0238401889801025}\n",
            "[flaml.tune.tune: 04-09 13:59:22] {811} INFO - trial 14 config: {'alpha': 0.6305654830425699, 'learningRate': 0.16345846096741776, 'numLeaves': 60, 'numIterations': 200}\n",
            "[flaml.tune.tune: 04-09 13:59:26] {215} INFO - result: {'r2': 0.8601984046769122, 'training_iteration': 0, 'config': {'alpha': 0.6305654830425699, 'learningRate': 0.16345846096741776, 'numLeaves': 60, 'numIterations': 200}, 'config/alpha': 0.6305654830425699, 'config/learningRate': 0.16345846096741776, 'config/numLeaves': 60, 'config/numIterations': 200, 'experiment_tag': 'exp', 'time_total_s': 3.4227209091186523}\n",
            "[flaml.tune.tune: 04-09 13:59:26] {811} INFO - trial 15 config: {'alpha': 0.37308018496384865, 'learningRate': 0.2146450219293334, 'numLeaves': 51, 'numIterations': 230}\n",
            "[flaml.tune.tune: 04-09 13:59:29] {215} INFO - result: {'r2': 0.8447822051728697, 'training_iteration': 0, 'config': {'alpha': 0.37308018496384865, 'learningRate': 0.2146450219293334, 'numLeaves': 51, 'numIterations': 230}, 'config/alpha': 0.37308018496384865, 'config/learningRate': 0.2146450219293334, 'config/numLeaves': 51, 'config/numIterations': 230, 'experiment_tag': 'exp', 'time_total_s': 3.3695919513702393}\n",
            "[flaml.tune.tune: 04-09 13:59:29] {811} INFO - trial 16 config: {'alpha': 0.7120085510586739, 'learningRate': 0.07141073317851748, 'numLeaves': 61, 'numIterations': 238}\n",
            "[flaml.tune.tune: 04-09 13:59:33] {215} INFO - result: {'r2': 0.8502914796218052, 'training_iteration': 0, 'config': {'alpha': 0.7120085510586739, 'learningRate': 0.07141073317851748, 'numLeaves': 61, 'numIterations': 238}, 'config/alpha': 0.7120085510586739, 'config/learningRate': 0.07141073317851748, 'config/numLeaves': 61, 'config/numIterations': 238, 'experiment_tag': 'exp', 'time_total_s': 3.8938868045806885}\n",
            "[flaml.tune.tune: 04-09 13:59:33] {811} INFO - trial 17 config: {'alpha': 0.6950187212596339, 'learningRate': 0.04860046789642168, 'numLeaves': 56, 'numIterations': 216}\n",
            "[flaml.tune.tune: 04-09 13:59:36] {215} INFO - result: {'r2': 0.8507495957886304, 'training_iteration': 0, 'config': {'alpha': 0.6950187212596339, 'learningRate': 0.04860046789642168, 'numLeaves': 56, 'numIterations': 216}, 'config/alpha': 0.6950187212596339, 'config/learningRate': 0.04860046789642168, 'config/numLeaves': 56, 'config/numIterations': 216, 'experiment_tag': 'exp', 'time_total_s': 3.4858739376068115}\n",
            "[flaml.tune.tune: 04-09 13:59:36] {811} INFO - trial 18 config: {'alpha': 0.3900700147628886, 'learningRate': 0.23745528721142917, 'numLeaves': 56, 'numIterations': 252}\n",
            "[flaml.tune.tune: 04-09 13:59:40] {215} INFO - result: {'r2': 0.8448561963142436, 'training_iteration': 0, 'config': {'alpha': 0.3900700147628886, 'learningRate': 0.23745528721142917, 'numLeaves': 56, 'numIterations': 252}, 'config/alpha': 0.3900700147628886, 'config/learningRate': 0.23745528721142917, 'config/numLeaves': 56, 'config/numIterations': 252, 'experiment_tag': 'exp', 'time_total_s': 3.8567142486572266}\n",
            "[flaml.tune.tune: 04-09 13:59:40] {811} INFO - trial 19 config: {'alpha': 0.6652445360947545, 'learningRate': 0.035981262663243294, 'numLeaves': 63, 'numIterations': 225}\n",
            "[flaml.tune.tune: 04-09 13:59:44] {215} INFO - result: {'r2': 0.8513605547375983, 'training_iteration': 0, 'config': {'alpha': 0.6652445360947545, 'learningRate': 0.035981262663243294, 'numLeaves': 63, 'numIterations': 225}, 'config/alpha': 0.6652445360947545, 'config/learningRate': 0.035981262663243294, 'config/numLeaves': 63, 'config/numIterations': 225, 'experiment_tag': 'exp', 'time_total_s': 3.984147071838379}\n",
            "[flaml.tune.tune: 04-09 13:59:44] {811} INFO - trial 20 config: {'alpha': 0.419844199927768, 'learningRate': 0.25007449244460755, 'numLeaves': 49, 'numIterations': 243}\n",
            "[flaml.tune.tune: 04-09 13:59:48] {215} INFO - result: {'r2': 0.8489881682927205, 'training_iteration': 0, 'config': {'alpha': 0.419844199927768, 'learningRate': 0.25007449244460755, 'numLeaves': 49, 'numIterations': 243}, 'config/alpha': 0.419844199927768, 'config/learningRate': 0.25007449244460755, 'config/numLeaves': 49, 'config/numIterations': 243, 'experiment_tag': 'exp', 'time_total_s': 3.3616762161254883}\n",
            "[flaml.tune.tune: 04-09 13:59:48] {811} INFO - trial 21 config: {'alpha': 0.6440889733602198, 'learningRate': 0.028339066191258172, 'numLeaves': 65, 'numIterations': 240}\n",
            "[flaml.tune.tune: 04-09 13:59:52] {215} INFO - result: {'r2': 0.8495512334801718, 'training_iteration': 0, 'config': {'alpha': 0.6440889733602198, 'learningRate': 0.028339066191258172, 'numLeaves': 65, 'numIterations': 240}, 'config/alpha': 0.6440889733602198, 'config/learningRate': 0.028339066191258172, 'config/numLeaves': 65, 'config/numIterations': 240, 'experiment_tag': 'exp', 'time_total_s': 4.202790021896362}\n",
            "[flaml.tune.tune: 04-09 13:59:52] {811} INFO - trial 22 config: {'alpha': 0.44099976266230273, 'learningRate': 0.2577166889165927, 'numLeaves': 47, 'numIterations': 228}\n",
            "[flaml.tune.tune: 04-09 13:59:55] {215} INFO - result: {'r2': 0.8488734669877886, 'training_iteration': 0, 'config': {'alpha': 0.44099976266230273, 'learningRate': 0.2577166889165927, 'numLeaves': 47, 'numIterations': 228}, 'config/alpha': 0.44099976266230273, 'config/learningRate': 0.2577166889165927, 'config/numLeaves': 47, 'config/numIterations': 228, 'experiment_tag': 'exp', 'time_total_s': 3.127204656600952}\n",
            "[flaml.tune.tune: 04-09 13:59:55] {811} INFO - trial 23 config: {'alpha': 0.42121699403087287, 'learningRate': 0.001, 'numLeaves': 59, 'numIterations': 230}\n",
            "[flaml.tune.tune: 04-09 13:59:59] {215} INFO - result: {'r2': 0.06286187614238248, 'training_iteration': 0, 'config': {'alpha': 0.42121699403087287, 'learningRate': 0.001, 'numLeaves': 59, 'numIterations': 230}, 'config/alpha': 0.42121699403087287, 'config/learningRate': 0.001, 'config/numLeaves': 59, 'config/numIterations': 230, 'experiment_tag': 'exp', 'time_total_s': 4.033763885498047}\n",
            "[flaml.tune.tune: 04-09 13:59:59] {811} INFO - trial 24 config: {'alpha': 0.6638717419916497, 'learningRate': 0.2948532436523798, 'numLeaves': 53, 'numIterations': 238}\n",
            "[flaml.tune.tune: 04-09 14:00:02] {215} INFO - result: {'r2': 0.8498368376396829, 'training_iteration': 0, 'config': {'alpha': 0.6638717419916497, 'learningRate': 0.2948532436523798, 'numLeaves': 53, 'numIterations': 238}, 'config/alpha': 0.6638717419916497, 'config/learningRate': 0.2948532436523798, 'config/numLeaves': 53, 'config/numIterations': 238, 'experiment_tag': 'exp', 'time_total_s': 3.476837396621704}\n",
            "[flaml.tune.tune: 04-09 14:00:02] {811} INFO - trial 25 config: {'alpha': 0.5053650827127543, 'learningRate': 0.2864282425481766, 'numLeaves': 57, 'numIterations': 207}\n",
            "[flaml.tune.tune: 04-09 14:00:06] {215} INFO - result: {'r2': 0.8638166525272971, 'training_iteration': 0, 'config': {'alpha': 0.5053650827127543, 'learningRate': 0.2864282425481766, 'numLeaves': 57, 'numIterations': 207}, 'config/alpha': 0.5053650827127543, 'config/learningRate': 0.2864282425481766, 'config/numLeaves': 57, 'config/numIterations': 207, 'experiment_tag': 'exp', 'time_total_s': 3.355837106704712}\n",
            "[flaml.tune.tune: 04-09 14:00:06] {811} INFO - trial 26 config: {'alpha': 0.6747046166960979, 'learningRate': 0.10854042236738932, 'numLeaves': 32, 'numIterations': 253}\n",
            "[flaml.tune.tune: 04-09 14:00:09] {215} INFO - result: {'r2': 0.8547648297991456, 'training_iteration': 0, 'config': {'alpha': 0.6747046166960979, 'learningRate': 0.10854042236738932, 'numLeaves': 32, 'numIterations': 253}, 'config/alpha': 0.6747046166960979, 'config/learningRate': 0.10854042236738932, 'config/numLeaves': 32, 'config/numIterations': 253, 'experiment_tag': 'exp', 'time_total_s': 2.7572436332702637}\n",
            "[flaml.tune.tune: 04-09 14:00:09] {811} INFO - trial 27 config: {'alpha': 0.5784538183227009, 'learningRate': 0.375517980519932, 'numLeaves': 96, 'numIterations': 263}\n",
            "[flaml.tune.tune: 04-09 14:00:14] {215} INFO - result: {'r2': 0.8512614628125035, 'training_iteration': 0, 'config': {'alpha': 0.5784538183227009, 'learningRate': 0.375517980519932, 'numLeaves': 96, 'numIterations': 263}, 'config/alpha': 0.5784538183227009, 'config/learningRate': 0.375517980519932, 'config/numLeaves': 96, 'config/numIterations': 263, 'experiment_tag': 'exp', 'time_total_s': 5.738212823867798}\n",
            "[flaml.tune.tune: 04-09 14:00:14] {811} INFO - trial 28 config: {'alpha': 0.46593191048243093, 'learningRate': 0.2244884500377041, 'numLeaves': 99, 'numIterations': 269}\n",
            "[flaml.tune.tune: 04-09 14:00:20] {215} INFO - result: {'r2': 0.86197268492276, 'training_iteration': 0, 'config': {'alpha': 0.46593191048243093, 'learningRate': 0.2244884500377041, 'numLeaves': 99, 'numIterations': 269}, 'config/alpha': 0.46593191048243093, 'config/learningRate': 0.2244884500377041, 'config/numLeaves': 99, 'config/numIterations': 269, 'experiment_tag': 'exp', 'time_total_s': 5.934798240661621}\n",
            "[flaml.tune.tune: 04-09 14:00:20] {811} INFO - trial 29 config: {'alpha': 0.5784538183227009, 'learningRate': 0.375517980519932, 'numLeaves': 95, 'numIterations': 263}\n",
            "[flaml.tune.tune: 04-09 14:00:26] {215} INFO - result: {'r2': 0.8524397365306237, 'training_iteration': 0, 'config': {'alpha': 0.5784538183227009, 'learningRate': 0.375517980519932, 'numLeaves': 95, 'numIterations': 263}, 'config/alpha': 0.5784538183227009, 'config/learningRate': 0.375517980519932, 'config/numLeaves': 95, 'config/numIterations': 263, 'experiment_tag': 'exp', 'time_total_s': 5.699255704879761}\n"
          ]
        }
      ],
      "source": [
        "analysis = flaml.tune.run(\n",
        "    flaml_tune,\n",
        "    params,\n",
        "    time_budget_s=120,  # stop tuning after 120 seconds\n",
        "    num_samples=100,  # try at most 100 configurations\n",
        "    metric=\"r2\",\n",
        "    mode=\"max\",\n",
        "    verbose=5,\n",
        "    )"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "metadata": {
        "jupyter": {
          "outputs_hidden": false,
          "source_hidden": false
        },
        "nteract": {
          "transient": {
            "deleting": false
          }
        }
      },
      "outputs": [
        {
          "data": {
            "application/vnd.livy.statement-meta+json": {
              "execution_finish_time": null,
              "execution_start_time": null,
              "livy_statement_state": null,
              "parent_msg_id": "a17d5766-6cd3-4428-a1b2-7a3694ea5116",
              "queued_time": "2023-04-09T13:53:09.3839884Z",
              "session_id": null,
              "session_start_time": null,
              "spark_jobs": null,
              "spark_pool": null,
              "state": "waiting",
              "statement_id": null
            },
            "text/plain": [
              "StatementMeta(, , , Waiting, )"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Best config:  {'alpha': 0.5940316589938806, 'learningRate': 0.22926504794631342, 'numLeaves': 35, 'numIterations': 279}\n"
          ]
        }
      ],
      "source": [
        "flaml_config = analysis.best_config\n",
        "print(\"Best config: \", flaml_config)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "nteract": {
          "transient": {
            "deleting": false
          }
        }
      },
      "source": [
        "## 5. Check results\n",
        "In this step, we retrain the model on the full training dataset using the best hyperparameters found by FLAML, then compare the R^2 of the initial and tuned models on the test dataset."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 9,
      "metadata": {
        "jupyter": {
          "outputs_hidden": false,
          "source_hidden": false
        },
        "nteract": {
          "transient": {
            "deleting": false
          }
        }
      },
      "outputs": [
        {
          "data": {
            "application/vnd.livy.statement-meta+json": {
              "execution_finish_time": null,
              "execution_start_time": null,
              "livy_statement_state": null,
              "parent_msg_id": "8f4ef6a0-e516-449f-b4e4-59bb9dcffe09",
              "queued_time": "2023-04-09T13:53:09.3856221Z",
              "session_id": null,
              "session_start_time": null,
              "spark_jobs": null,
              "spark_pool": null,
              "state": "waiting",
              "statement_id": null
            },
            "text/plain": [
              "StatementMeta(, , , Waiting, )"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "On the test dataset, the initial (untuned) model achieved R^2:  0.7086364659469071\n",
            "On the test dataset, the final flaml (tuned) model achieved R^2:  0.8094330941991653\n"
          ]
        }
      ],
      "source": [
        "# Retrain on the full training set with the best config and evaluate on the test set\n",
        "flaml_model, flaml_metric = train(train_data=train_data, val_data=test_data, **flaml_config)\n",
        "\n",
        "print(\"On the test dataset, the initial (untuned) model achieved R^2: \", init_eval_metric)\n",
        "print(\"On the test dataset, the final flaml (tuned) model achieved R^2: \", flaml_metric)"
      ]
    }
  ],
  "metadata": {
    "description": null,
    "kernelspec": {
      "display_name": "Synapse PySpark",
      "name": "synapse_pyspark"
    },
    "language_info": {
      "name": "python"
    },
    "save_output": true,
    "synapse_widget": {
      "state": {},
      "version": "0.1"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 2
}