Chenhui/rename notebooks and update automl notebook (#106)

* removed unused module

* added outputs in automl notebook

* fixed a notebook name
This commit is contained in:
Chenhui Hu 2020-03-06 10:32:38 -05:00 коммит произвёл GitHub
Родитель e36b09ba97
Коммит 15d25213dc
Не найден ключ, соответствующий данной подписи
Идентификатор ключа GPG: 4AEE18F83AFDEB23
9 изменённых файлов: 1793 добавлений и 759 удалений

Просмотреть файл

@ -1,756 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<i>Copyright (c) Microsoft Corporation.</i>\n",
"\n",
"<i>Licensed under the MIT License.</i>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Automated Machine Learning (AutoML) on Azure for Retail Sales Forecasting\n",
"\n",
"This notebook demonstrates how to apply [AutoML in Azure Machine Learning services](https://docs.microsoft.com/en-us/azure/machine-learning/concept-automated-ml) to train and tune machine learning models for forecasting product sales in retail. We will use the Orange Juice dataset to illustrate the steps of utilizing AutoML as well as how to combine an AutoML model with a custom model for better performance.\n",
"\n",
"AutoML is a process of automating the tasks of machine learning model development. It helps data scientists and other practioners build machine learning models with high scalability and quality in less amount of time. AutoML in Azure Machine Learning allows you to train and tune a model using a target metric that you specify. This service iterates through machine learning algorithms and feature selection approaches, producing a score that measures the quality of each machine learning pipeline. The best model will then be selected based on the scores. For more technical details about Azure AutoML, please check [this paper](https://papers.nips.cc/paper/7595-probabilistic-matrix-factorization-for-automated-machine-learning.pdf).\n",
"\n",
"This notebook uses [Azure ML SDK](https://docs.microsoft.com/en-us/python/api/overview/azureml-sdk/?view=azure-ml-py) which is included in the `forecasting_env` conda environment. If you are running in Azure Notebooks or another Microsoft managed environment, the SDK is already installed. On the other hand, if you are running this notebook in your own environment, please follow [SDK installation instructions](https://docs.microsoft.com/azure/machine-learning/service/how-to-configure-environment) to install the SDK."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Global Settings and Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%load_ext autoreload\n",
"%autoreload 2\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import sys\n",
"import math\n",
"import datetime\n",
"import logging\n",
"import azureml.core\n",
"import azureml.automl\n",
"import pandas as pd\n",
"\n",
"from matplotlib import pyplot as plt\n",
"from fclib.common.utils import git_repo_path\n",
"from fclib.evaluation.evaluation_utils import MAPE\n",
"from fclib.dataset.ojdata import download_ojdata, FIRST_WEEK_START\n",
"from fclib.common.utils import align_outputs\n",
"from fclib.models.multiple_linear_regression import fit, predict\n",
"\n",
"from azureml.core import Workspace\n",
"from azureml.core.dataset import Dataset\n",
"from azureml.core.experiment import Experiment\n",
"from automl.client.core.common import constants\n",
"from azureml.train.automl import AutoMLConfig\n",
"from azureml.core.compute import ComputeTarget, AmlCompute\n",
"from azureml.core.compute_target import ComputeTargetException\n",
"from azureml.automl.core._vendor.automl.client.core.common import metrics\n",
"\n",
"print(\"System version: {}\".format(sys.version))\n",
"print(\"This notebook was created using version 1.0.85 of the Azure ML SDK\")\n",
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Use False if you've already downloaded and split the data\n",
"DOWNLOAD_SPLIT_DATA = True\n",
"\n",
"# Data directory\n",
"DATA_DIR = os.path.join(git_repo_path(), \"ojdata\")\n",
"\n",
"# Forecasting settings\n",
"GAP = 2\n",
"LAST_WEEK = 138\n",
"\n",
"# Number of test periods\n",
"NUM_TEST_PERIODS = 3\n",
"\n",
"# Column names\n",
"time_column_name = \"week_start\"\n",
"target_column_name = \"move\"\n",
"grain_column_names = [\"store\", \"brand\"]\n",
"index_column_names = [time_column_name] + grain_column_names\n",
"\n",
"# Subset of stores used in the notebook\n",
"USE_STORES = [2, 5, 8]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Set up Azure Machine Learning Workspace\n",
"\n",
"An Azure ML workspace is an Azure resource that organizes and coordinates the actions of many other Azure resources to assist in executing and sharing machine learning workflows. In particular, an Azure ML workspace coordinates storage, databases, and compute resources providing added functionality for machine learning experimentation, deployment, inference, and the monitoring of deployed models. To create an Azure ML workspace, first you need access to an Azure subscription. An Azure subscription allows you to manage storage, compute, and other assets in the Azure cloud. You can [create a new subscription](https://azure.microsoft.com/en-us/free/) or access existing subscription information from the [Azure portal](https://portal.azure.com/). Given that you have access to your Azure subscription, you can further create an Azure ML workspace by following the instructions [here](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-manage-workspace). You can also do so [using Azure CLI](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-manage-workspace-cli) or the `Workspace.create()` method in Azure SDK.\n",
"\n",
"In the following cell, please replace the value of each parameter with the value of the corresponding attribute of your workspace."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"subscription_id = \"<my-subscription-id>\"\n",
"resource_group = \"<my-resource-group>\"\n",
"workspace_name = \"<my-workspace-name>\"\n",
"workspace_region = \"eastus2\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Access Azure ML Workspace\n",
"\n",
"In what follows, we use Azure ML SDK to attempt to load the workspace specified by your parameters. The cell can fail if the specified workspace doesn't exist or you don't have permissions to access it. Hence, you may need to log into your Azure account and change the default subscription to the one which the workspace belongs to using Azure CLI `az account set --subscription <name or id>`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"try:\n",
" ws = Workspace.create(subscription_id=subscription_id, resource_group=resource_group, \n",
" name=workspace_name, create_resource_group=True, exist_ok=True, \n",
" location=workspace_region)\n",
" # write the details of the workspace to a configuration file to the notebook library\n",
" ws.write_config()\n",
" print(\"Workspace configuration succeeded. Skip the workspace creation steps below\")\n",
"except ValueError:\n",
" raise Exception(\"Workspace not accessible. Change your parameters or create a new workspace below\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create compute resources for your experiments\n",
"\n",
"We run AutoML on a dynamically scalable compute cluster. To create a compute cluster, you need to specify a compute configuration that specifies the type of machine to be used and the scalability behaviors. Then you choose a name for the cluster that is unique within the workspace that can be used to address the cluster later."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Choose a name for your CPU cluster\n",
"cpu_cluster_name = \"cpu-cluster\"\n",
"\n",
"# Verify that cluster does not exist already\n",
"workspace_compute = ws.compute_targets\n",
"if cpu_cluster_name in workspace_compute:\n",
" print(\"Found existing cpu-cluster\")\n",
" cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)\n",
"else: \n",
" print(\"Creating new cpu-cluster\")\n",
"\n",
" # Specify the configuration for the new cluster\n",
" compute_config = AmlCompute.provisioning_configuration(vm_size=\"STANDARD_D2_V2\", min_nodes=4, max_nodes=4)\n",
"\n",
" # Create the cluster with the specified name and configuration\n",
" cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)\n",
"\n",
" # Wait for the cluster to complete, show the output log\n",
" cpu_cluster.wait_for_completion(show_output=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Define Experiment\n",
"\n",
"To run AutoML, you need to create an Experiment. An Experiment corresponds to a prediction problem you are trying to solve, while a Run corresponds to a specific approach to the problem."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# choose a name for the run history container in the workspace\n",
"experiment_name = \"automl-ojforecasting\"\n",
"\n",
"experiment = Experiment(ws, experiment_name)\n",
"\n",
"output = {}\n",
"output[\"SDK version\"] = azureml.core.VERSION\n",
"output[\"Workspace\"] = ws.name\n",
"output[\"SKU\"] = ws.sku\n",
"output[\"Resource Group\"] = ws.resource_group\n",
"output[\"Location\"] = ws.location\n",
"output[\"Run History Name\"] = experiment_name\n",
"pd.set_option(\"display.max_colwidth\", -1)\n",
"outputDf = pd.DataFrame(data=output, index=[\"\"])\n",
"outputDf.T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Preparation\n",
"\n",
"We need to download the Orange Juice data and split it into training and test sets. By default, the following cell will download and spit the data. If you've already done so, you may skip this part by switching `DOWNLOAD_SPLIT_DATA` to `False`.\n",
"\n",
"We store the training data and test data using dataframes. The training data includes `train_df` and `aux_df` with `train_df` containing the historical sales up to week 135 (the time we make forecasts) and `aux_df` containing price/promotion information up until week 138. We assume that future price and promotion information up to a certain number of weeks ahead is predetermined and known. The test data is stored in `test_df` which contains the sales of each product in week 137 and 138. Assuming the current week is week 135, our goal is to forecast the sales in week 137 and 138 using the training data. There is a one-week gap between the current week and the first target week of forecasting as we want to leave time for planning inventory in practice."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data download and split"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"if DOWNLOAD_SPLIT_DATA:\n",
" download_ojdata(DATA_DIR)\n",
" df = pd.read_csv(os.path.join(DATA_DIR, \"yx.csv\"))\n",
" df = df.loc[df.week <= LAST_WEEK]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Convert logarithm of the unit sales to unit sales\n",
"df[\"move\"] = df[\"logmove\"].apply(lambda x: round(math.exp(x)))\n",
"# Add timestamp column\n",
"df[\"week_start\"] = df[\"week\"].apply(lambda x: FIRST_WEEK_START + datetime.timedelta(days=(x - 1) * 7))\n",
"# Select a subset of stores for demo purpose\n",
"df_sub = df[df.store.isin(USE_STORES)]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Split data into training and test sets\n",
"def split_last_n_by_grain(df, n):\n",
" \"\"\"Group df by grain and split on last n rows for each group.\"\"\"\n",
" df_grouped = df.sort_values(time_column_name).groupby( # Sort by ascending time\n",
" grain_column_names, group_keys=False\n",
" )\n",
" df_head = df_grouped.apply(lambda dfg: dfg.iloc[:-n])\n",
" df_tail = df_grouped.apply(lambda dfg: dfg.iloc[-n:])\n",
" return df_head, df_tail\n",
"\n",
"\n",
"train_df, test_df = split_last_n_by_grain(df_sub, NUM_TEST_PERIODS)\n",
"train_df.reset_index(drop=True)\n",
"test_df.reset_index(drop=True)\n",
"\n",
"# Save data locally\n",
"local_data_pathes = [\n",
" os.path.join(DATA_DIR, \"train.csv\"),\n",
" os.path.join(DATA_DIR, \"test.csv\"),\n",
"]\n",
"\n",
"train_df.to_csv(local_data_pathes[0], index=None, header=True)\n",
"test_df.to_csv(local_data_pathes[1], index=None, header=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Upload data to datastore\n",
"\n",
"The [Machine Learning service workspace](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-workspace), is paired with the storage account, which contains the default data store. We will use it to upload the train and test data and create [tabular datasets](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) for training and testing. A tabular dataset defines a series of lazily-evaluated, immutable operations to load data from the data source into tabular representation.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"datastore = ws.get_default_datastore()\n",
"datastore.upload_files(files=local_data_pathes, target_path=\"dataset/\", overwrite=True, show_progress=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create dataset for training"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_dataset = Dataset.Tabular.from_delimited_files(path=datastore.path(\"dataset/train.csv\"))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_dataset.to_pandas_dataframe().tail()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Modeling\n",
"\n",
"For forecasting tasks, AutoML uses pre-processing and estimation steps that are specific to time-series. AutoML will undertake the following pre-processing steps:\n",
"* Detect time-series sample frequency (e.g. hourly, daily, weekly) and create new records for absent time points to make the series regular. A regular time series has a well-defined frequency and has a value at every sample point in a contiguous time span\n",
"* Impute missing values in the target (via forward-fill) and feature columns (using median column values)\n",
"* Create grain-based features to enable fixed effects across different series\n",
"* Create time-based features to assist in learning seasonal patterns\n",
"* Encode categorical variables to numeric quantities\n",
"\n",
"In this notebook, AutoML will train a single, regression-type model across all time-series in a given training set. This allows the model to generalize across related series. To create a training job, we use AutoML Config object to define the settings and data. Here is a summary of the meanings of the AutoMLConfig parameters:\n",
"\n",
"|Property|Description|\n",
"|-|-|\n",
"|**task**|forecasting|\n",
"|**primary_metric**|This is the metric that you want to optimize.<br> Forecasting supports the following primary metrics <br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>\n",
"|**experiment_timeout_hours**|Experimentation timeout in hours.|\n",
"|**enable_early_stopping**|If early stopping is on, training will stop when the primary metric is no longer improving.|\n",
"|**training_data**|Input dataset, containing both features and label column.|\n",
"|**label_column_name**|The name of the label column.|\n",
"|**compute_target**|The remote compute for training.|\n",
"|**n_cross_validations**|Number of cross-validation folds to use for model/pipeline selection|\n",
"|**enable_voting_ensemble**|Allow AutoML to create a Voting ensemble of the best performing models|\n",
"|**enable_stack_ensemble**|Allow AutoML to create a Stack ensemble of the best performing models|\n",
"|**debug_log**|Log file path for writing debugging information|\n",
"|**time_column_name**|Name of the datetime column in the input data|\n",
"|**grain_column_names**|Name(s) of the columns defining individual series in the input data|\n",
"|**drop_column_names**|Name(s) of columns to drop prior to modeling|\n",
"|**max_horizon**|Maximum desired forecast horizon in units of time-series frequency|"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Model training"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"time_series_settings = {\n",
" \"time_column_name\": time_column_name,\n",
" \"grain_column_names\": grain_column_names,\n",
" \"drop_column_names\": [\"logmove\"], # 'logmove' is a leaky feature, so we remove it.\n",
" \"max_horizon\": NUM_TEST_PERIODS,\n",
"}\n",
"\n",
"automl_config = AutoMLConfig(\n",
" task=\"forecasting\",\n",
" debug_log=\"automl_oj_sales_errors.log\",\n",
" primary_metric=\"normalized_mean_absolute_error\",\n",
" experiment_timeout_hours=0.6, # You may increase this number to improve model accuracy\n",
" training_data=train_dataset,\n",
" label_column_name=target_column_name,\n",
" compute_target=cpu_cluster,\n",
" enable_early_stopping=True,\n",
" n_cross_validations=3,\n",
" verbosity=logging.INFO,\n",
" **time_series_settings\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"remote_run = experiment.submit(automl_config, show_output=False)\n",
"remote_run"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"remote_run.wait_for_completion()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Retrieve the best model\n",
"\n",
"Each run within an Experiment stores serialized (i.e. pickled) pipelines from the AutoML iterations. After the training job is done, we can retrieve the pipeline with the best performance on the validation dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"best_run, fitted_model = remote_run.get_output()\n",
"print(fitted_model.steps)\n",
"model_name = best_run.properties[\"model_name\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Forecasting\n",
"\n",
"Now that we have retrieved the best model pipeline, we can apply it to generate forecasts for the target weeks. To do this, we first remove the target values from the test set"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Generate forecasts"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_test = test_df\n",
"y_test = X_test.pop(target_column_name).values"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_test.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The featurized data, aligned to y, will also be returned. It contains the assumptions\n",
"# that were made in the forecast and helps align the forecast to the original data.\n",
"y_predictions, X_trans = fitted_model.forecast(X_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We need to align the output explicitly to the input, as the count and order of the rows may have changed during transformations that span multiple rows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pred_automl = align_outputs(y_predictions, X_trans, X_test, y_test, target_column_name)\n",
"pred_automl.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Results evaluation & visualization"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Use automl metrics module\n",
"scores = metrics.compute_metrics_regression(\n",
" pred_automl[\"predicted\"],\n",
" pred_automl[target_column_name],\n",
" list(constants.Metric.SCALAR_REGRESSION_SET),\n",
" None,\n",
" None,\n",
" None,\n",
")\n",
"\n",
"print(\"[Test data scores]\\n\")\n",
"for key, value in scores.items():\n",
" print(\"{}: {:.3f}\".format(key, value))\n",
"\n",
"# Plot outputs\n",
"%matplotlib inline\n",
"test_pred = plt.scatter(pred_automl[target_column_name], pred_automl[\"predicted\"], color=\"b\")\n",
"test_test = plt.scatter(pred_automl[target_column_name], pred_automl[target_column_name], color=\"g\")\n",
"plt.legend((test_pred, test_test), (\"prediction\", \"truth\"), loc=\"upper left\", fontsize=8)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We also compute MAPE of the forecasts in the last two weeks of the forecast period in order to be consistent with the evaluation period that is used in other quick start examples."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pred_automl_sub = pred_automl.loc[pred_automl.week >= max(test_df.week) - NUM_TEST_PERIODS + GAP]\n",
"mape_automl_sub = MAPE(pred_automl_sub[\"predicted\"], pred_automl_sub[\"move\"]) * 100\n",
"print(\"MAPE of forecasts obtained by AutoML in the last two weeks: \" + str(mape_automl_sub))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Combine AutoML Model with a Custom Model\n",
"\n",
"So far we have demonstrated how we can quickly build a forecasting model with AutoML in Azure. Next, we further show a simple way to achieve more robust and accurate forecasts by combining the forecasts from AutoML and a custom model that the user may have. Here we assume that the user have also constructed a series of linear regression models with each model forecasts the sales of a specfic store-brand using `scikit-learn` package."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Multiple linear regression models"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create price features\n",
"df_sub[\"price\"] = df_sub.apply(lambda x: x.loc[\"price\" + str(int(x.loc[\"brand\"]))], axis=1)\n",
"price_cols = [\n",
" \"price1\",\n",
" \"price2\",\n",
" \"price3\",\n",
" \"price4\",\n",
" \"price5\",\n",
" \"price6\",\n",
" \"price7\",\n",
" \"price8\",\n",
" \"price9\",\n",
" \"price10\",\n",
" \"price11\",\n",
"]\n",
"df_sub[\"avg_price\"] = df_sub[price_cols].sum(axis=1).apply(lambda x: x / len(price_cols))\n",
"df_sub[\"price_ratio\"] = df_sub.apply(lambda x: x[\"price\"] / x[\"avg_price\"], axis=1)\n",
"\n",
"# Create lag features on unit sales\n",
"df_sub[\"move_lag1\"] = df_sub[\"move\"].shift(1)\n",
"df_sub[\"move_lag2\"] = df_sub[\"move\"].shift(2)\n",
"\n",
"# Drop rows with NaN values\n",
"df_sub.dropna(inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After splitting the data, we use `fit()` and `predit()` functions from `fclib.models.multiple_linear_regression` to train separate linear regression model for each invididual time series and generate forecasts for the sales during the test period."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Split data into training and test sets\n",
"train_df, test_df = split_last_n_by_grain(df_sub, NUM_TEST_PERIODS)\n",
"train_df.reset_index(drop=True)\n",
"test_df.reset_index(drop=True)\n",
"\n",
"# Train multiple linear regression models\n",
"fea_column_names = [\"move_lag1\", \"move_lag2\", \"price\", \"price_ratio\"]\n",
"lr_models = fit(train_df, grain_column_names, fea_column_names, target_column_name)\n",
"\n",
"# Generate forecasts with the trained models\n",
"pred_all = predict(test_df, lr_models, time_column_name, grain_column_names, fea_column_names)\n",
"\n",
"pred_lr = pd.merge(pred_all, test_df, on=index_column_names)\n",
"pred_lr.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's check the accuracy of the predictions on the entire forecast period as well as in the last two weeks of the forecast period.\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mape_lr_entire = MAPE(pred_lr[\"prediction\"], pred_lr[\"move\"]) * 100\n",
"print(\"MAPE of forecasts obtained by multiple linear regression on entire test period: \" + str(mape_lr_entire))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pred_lr_sub = pred_lr.loc[pred_lr.week >= max(test_df.week) - NUM_TEST_PERIODS + GAP]\n",
"mape_lr_sub = MAPE(pred_lr_sub[\"prediction\"], pred_lr_sub[\"move\"]) * 100\n",
"print(\"MAPE of forecasts obtained by multiple linear regression in the last two weeks: \" + str(mape_lr_sub))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Combine forecasts from different methods\n",
"\n",
"We can combine the forecasts obtained by AutoML and multiple linear regression using weighted average and evaluate the final forecasts. Usually the combined forecasts will be more robust as a combination of two methods can reduce the chance of model overfitting. Here we use equal weights which can be further adjusted according to our confidence on each model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pred_final = pd.merge(\n",
" pred_automl[index_column_names + [\"predicted\", \"move\", \"week\"]],\n",
" pred_lr[index_column_names + [\"prediction\"]],\n",
" on=index_column_names,\n",
" how=\"left\",\n",
")\n",
"pred_final[\"combined_prediction\"] = pred_final[\"predicted\"] * 0.5 + pred_final[\"prediction\"] * 0.5"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mape_entire = MAPE(pred_final[\"combined_prediction\"], pred_final[\"move\"]) * 100\n",
"print(\"MAPE of forecasts obtained by the combined model on entire test period: \" + str(mape_entire))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pred_final_sub = pred_final.loc[pred_final.week >= max(test_df.week) - NUM_TEST_PERIODS + GAP]\n",
"mape_final_sub = MAPE(pred_final_sub[\"combined_prediction\"], pred_final_sub[\"move\"]) * 100\n",
"print(\"MAPE of forecasts obtained by the combined model in the last two weeks: \" + str(mape_final_sub))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Additional Reading\n",
"\n",
"\\[1\\] Nicolo Fusi, Rishit Sheth, and Melih Elibol. 2018. Probabilistic Matrix Factorization for Automated Machine Learning. In Advances in Neural Information Processing Systems. 3348-3357.<br>\n",
"\\[2\\] Azure AutoML Package Docs: https://docs.microsoft.com/en-us/python/api/azureml-train-automl/azureml.train.automl?view=azure-ml-py <br>\n",
"\\[3\\] Azure Automated Machine Learning Examples: https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/automated-machine-learning <br>\n",
"\n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "forecasting_env",
"language": "python",
"name": "forecasting_env"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

Различия файлов скрыты, потому что одна или несколько строк слишком длинны

Просмотреть файл

@ -9,8 +9,8 @@ The following summarizes each directory of the best practice notebooks.
| Directory | Content | Description |
| --- | --- | --- |
| [00_quick_start](./00_quick_start)| [auto_arima_forecasting.ipynb](./00_quick_start/auto_arima_forecasting.ipynb) <br>[azure_automl_forecast.ipynb](./00_quick_start/azure_automl_forecast.ipynb) <br> [lightgbm_point_forecast.ipynb](./00_quick_start/lightgbm_point_forecast.ipynb) | Quick start notebooks that demonstrate workflow of developing a forecasting model using one-round training and testing data|
| [01_prepare_data](./01_prepare_data) | [ojdata_exploration_retail.ipynb](./01_prepare_data/ojdata_exploration_retail.ipynb) <br> [ojdata_preparation_retail.ipynb](./01_prepare_data/ojdata_preparation_retail.ipynb) | Data exploration and preparation notebooks|
| [02_model](./02_model) | [dilatedcnn_point_forecast_multiround.ipynb](./02_model/dilatedcnn_point_forecast_multiround.ipynb) <br> [lightgbm_point_forecast_multiround.ipynb](./02_model/lightgbm_point_forecast_multiround.ipynb) | Deep dive notebooks that perform multi-round training and testing of various classical and deep learning forecast algorithms|
| [00_quick_start](./00_quick_start)| [autoarima_single_round.ipynb](./00_quick_start/autoarima_single_round.ipynb) <br>[azure_automl_single_round.ipynb](./00_quick_start/azure_automl_single_round.ipynb) <br> [lightgbm_single_round.ipynb](./00_quick_start/lightgbm_single_round.ipynb) | Quick start notebooks that demonstrate workflow of developing a forecasting model using one-round training and testing data|
| [01_prepare_data](./01_prepare_data) | [ojdata_exploration.ipynb](./01_prepare_data/ojdata_exploration.ipynb) <br> [ojdata_preparation.ipynb](./01_prepare_data/ojdata_preparation.ipynb) | Data exploration and preparation notebooks|
| [02_model](./02_model) | [dilatedcnn_multi_round.ipynb](./02_model/dilatedcnn_multi_round.ipynb) <br> [lightgbm_multi_round.ipynb](./02_model/lightgbm_multi_round.ipynb) | Deep dive notebooks that perform multi-round training and testing of various classical and deep learning forecast algorithms|
| [03_model_select_deploy](03_model_select_deploy) | Example notebook to be added soon | Best practice notebook for model selecting by using Azure Machine Learning Service and deploying the best model on Azure|