diff --git a/examples/01_prepare_data/data_prep_retail.ipynb b/examples/01_prepare_data/data_prep_retail.ipynb deleted file mode 100644 index e69de29b..00000000 diff --git a/examples/01_prepare_data/data_prepare_retail.ipynb b/examples/01_prepare_data/data_prepare_retail.ipynb new file mode 100644 index 00000000..dbc33a1a --- /dev/null +++ b/examples/01_prepare_data/data_prepare_retail.ipynb @@ -0,0 +1,1344 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Copyright (c) Microsoft Corporation.\n", + "\n", + "Licensed under the MIT License." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Data Preparation for Retail Sales Forecasting\n", + "\n", + "This notebook introduces how to split the Orange Juice dataset into training sets and test sets for training and evaluating different retail sales forecasting methods.\n", + "\n", + "We use backtesting, a method that tests a predictive model on historical data, to evaluate the forecasting methods. 
Unlike standard [K-fold cross-validation](https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29), which randomly splits the data into K folds, we split the data so that every time stamp in the training set is no later than any time stamp in the test set, ensuring that no future information is used (except certain information that we can know beforehand, e.g., the price of a product in the next few weeks, since we can set the price manually).\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Global Settings and Imports" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "%load_ext autoreload\n", + "%autoreload 2" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "System version: 3.6.7 | packaged by conda-forge | (default, Nov 6 2019, 16:19:42) \n", + "[GCC 7.3.0]\n" + ] + } + ], + "source": [ + "import os\n", + "import sys\n", + "\n", + "import forecasting_lib.common.forecast_settings as fs\n", + "from forecasting_lib.common.utils import git_repo_path\n", + "from forecasting_lib.dataset.ojdata import download_ojdata, split_train_test\n", + "\n", + "print(\"System version: {}\".format(sys.version))" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "# Use False if you've already downloaded and split the data\n", + "DOWNLOAD_DATA = True\n", + "\n", + "# Data directory\n", + "DATA_DIR = os.path.join(git_repo_path(), \"ojdata\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Download Data\n", + "\n", + "We need to download the Orange Juice data before splitting it into training and test sets. By default, the following cell will download the data. 
If you've already done so, you may skip this part by switching `DOWNLOAD_DATA` to `False`.\n", + "\n", + "The dataset is from the R package [bayesm](https://cran.r-project.org/web/packages/bayesm/index.html) and is part of the [Dominick's dataset](https://www.chicagobooth.edu/research/kilts/datasets/dominicks). It contains the following two csv files:\n", + "\n", + "1. `yx.csv` includes weekly sales of refrigerated orange juice at 83 stores. This file has 106139 rows and 19 columns. It contains weekly sales and prices of 11 orange juice brands as well as information about profit, deal, and advertisement for each brand. Note that the weekly sales are captured by a column named `logmove`, which corresponds to the natural logarithm of the number of units sold. To get the number of units sold, you need to apply an exponential transform to this column.\n", + "\n", + "2. `storedemo.csv` includes demographic information on those stores. This table has 83 rows and 13 columns. For every store, the table describes demographic information of its consumers, distance to the nearest warehouse store, average distance to the nearest 5 supermarkets, ratio of its sales to the nearest warehouse store, and ratio of its sales to the average of the nearest 5 stores.\n", + "\n", + "Note that the week number starts from 40 in this dataset, while the full Dominick's dataset has week numbers from 1 to 400. According to the [Dominick's Data Manual](https://www.chicagobooth.edu/-/media/enterprise/centers/kilts/datasets/dominicks-dataset/dominicks-manual-and-codebook_kiltscenter.aspx), week 1 starts on 09/14/1989. Please see pages 40 and 41 of the [bayesm reference manual](https://cran.r-project.org/web/packages/bayesm/bayesm.pdf) and the [Dominick's Data Manual](https://www.chicagobooth.edu/-/media/enterprise/centers/kilts/datasets/dominicks-dataset/dominicks-manual-and-codebook_kiltscenter.aspx) for more details about the data."
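Since `logmove` stores the natural logarithm of unit sales, recovering the actual number of units sold is a one-line inverse transform. A minimal sketch, using two `logmove` values that appear in the data shown later in this notebook (the list itself is just an illustration, not a real data load):

```python
import math

# Example logmove values from yx.csv (natural log of units sold)
logmove_values = [9.018695, 8.723231]

# Invert the log transform to recover unit sales; round because the
# original counts are integers
units = [round(math.exp(v)) for v in logmove_values]
print(units)
```

In a real workflow the same transform would be applied to the `logmove` column of the dataframe as a whole.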
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Starting data download ...\n", + "Data download completed. Data saved to /data/home/chenhui/work/forecasting/ojdata\n" + ] + } + ], + "source": [ + "if DOWNLOAD_DATA:\n", + " download_ojdata(DATA_DIR)\n", + " print(\"Data download completed. Data saved to \" + DATA_DIR)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Split Data for Single-Round Forecasting\n", + "\n", + "Next, we can use the `split_train_test()` utility function to split the data in `yx.csv` into training and test sets. If we want to do a one-time model training and evaluation, we can split the data using the default settings provided in `forecasting_lib.common.forecast_settings`.\n", + "\n", + "The data split function will return the training data and test data as dataframes. The training data includes `train_df` and `aux_df`, with `train_df` containing the historical sales up to week 135 (the time we make forecasts) and `aux_df` containing price/promotion information up until week 138. Here we assume that future price and promotion information up to a certain number of weeks ahead is predetermined and known. The test data is stored in `test_df`, which contains the sales of each product in weeks 137 and 138. Assuming the current week is week 135, our goal is to forecast the sales in weeks 137 and 138 using the training data. There is a one-week gap between the current week and the first target week of forecasting, as we want to leave time for planning inventory in practice.
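The backtesting guarantee described above can be stated as a simple invariant: the latest training week must precede the earliest test week, with one week in between left for planning. A minimal sketch with stand-in week lists matching the default single-round split (not the actual dataframes returned by `split_train_test()`):

```python
# Stand-in week columns matching the default single-round split
train_weeks = list(range(40, 136))  # training sales: weeks 40..135
test_weeks = [137, 138]             # forecast target weeks

# Backtesting invariant: every training time stamp precedes every test one,
# so no future information leaks into training
assert max(train_weeks) < min(test_weeks)

# Number of weeks left between training and test periods for inventory planning
gap_weeks = min(test_weeks) - max(train_weeks) - 1
print(gap_weeks)
```

The same check can be run on the real `train_df["week"]` and `test_df["week"]` columns after the split.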
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "data_generator = split_train_test(DATA_DIR, fs)\n", + "[train_df, test_df, aux_df] = next(data_generator)\n", + "train_df.reset_index(inplace=True)\n", + "test_df.reset_index(inplace=True)\n", + "aux_df.reset_index(inplace=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
storebrandweeklogmoveconstantprice1price2price3price4price5price6price7price8price9price10price11dealfeatprofit
021409.01869510.0604690.0604970.0420310.0295310.0495310.0530210.0389060.0414060.0289060.0248440.03898410.037.992326
121468.72323110.0604690.0603120.0451560.0467190.0495310.0478130.0457810.0279690.0429690.0420310.03898400.030.126667
221478.25322810.0604690.0603120.0451560.0467190.0373440.0530210.0457810.0414060.0481250.0326560.03898400.030.000000
321488.98719710.0604690.0603120.0498440.0373440.0495310.0530210.0457810.0414060.0423440.0326560.03898400.029.950000
421509.09335710.0604690.0603120.0435940.0310940.0495310.0530210.0466480.0414060.0423440.0326560.03820300.029.920000
............................................................
84178137111319.63115410.0279690.0519790.0490800.0398200.0310940.0483950.0375000.0389060.0232810.0221870.02570310.017.170000
84179137111329.70406110.0305040.0519790.0435940.0339270.0331670.0457290.0310940.0389060.0253130.0248440.02632811.018.630000
84180137111338.99516510.0430560.0519790.0455420.0310940.0372050.0465790.0334700.0379690.0201560.0256250.02960910.025.350000
84181137111348.91247310.0390620.0493010.0495880.0323000.0310940.0509370.0420310.0357810.0220310.0310940.02960910.025.320000
84182137111359.90188610.0404730.0457290.0469570.0452230.0334930.0509370.0339410.0357810.0264060.0229690.02335911.05.350000
\n", + "

84183 rows × 19 columns

\n", + "
" + ], + "text/plain": [ + " store brand week logmove constant price1 price2 price3 \\\n", + "0 2 1 40 9.018695 1 0.060469 0.060497 0.042031 \n", + "1 2 1 46 8.723231 1 0.060469 0.060312 0.045156 \n", + "2 2 1 47 8.253228 1 0.060469 0.060312 0.045156 \n", + "3 2 1 48 8.987197 1 0.060469 0.060312 0.049844 \n", + "4 2 1 50 9.093357 1 0.060469 0.060312 0.043594 \n", + "... ... ... ... ... ... ... ... ... \n", + "84178 137 11 131 9.631154 1 0.027969 0.051979 0.049080 \n", + "84179 137 11 132 9.704061 1 0.030504 0.051979 0.043594 \n", + "84180 137 11 133 8.995165 1 0.043056 0.051979 0.045542 \n", + "84181 137 11 134 8.912473 1 0.039062 0.049301 0.049588 \n", + "84182 137 11 135 9.901886 1 0.040473 0.045729 0.046957 \n", + "\n", + " price4 price5 price6 price7 price8 price9 price10 \\\n", + "0 0.029531 0.049531 0.053021 0.038906 0.041406 0.028906 0.024844 \n", + "1 0.046719 0.049531 0.047813 0.045781 0.027969 0.042969 0.042031 \n", + "2 0.046719 0.037344 0.053021 0.045781 0.041406 0.048125 0.032656 \n", + "3 0.037344 0.049531 0.053021 0.045781 0.041406 0.042344 0.032656 \n", + "4 0.031094 0.049531 0.053021 0.046648 0.041406 0.042344 0.032656 \n", + "... ... ... ... ... ... ... ... \n", + "84178 0.039820 0.031094 0.048395 0.037500 0.038906 0.023281 0.022187 \n", + "84179 0.033927 0.033167 0.045729 0.031094 0.038906 0.025313 0.024844 \n", + "84180 0.031094 0.037205 0.046579 0.033470 0.037969 0.020156 0.025625 \n", + "84181 0.032300 0.031094 0.050937 0.042031 0.035781 0.022031 0.031094 \n", + "84182 0.045223 0.033493 0.050937 0.033941 0.035781 0.026406 0.022969 \n", + "\n", + " price11 deal feat profit \n", + "0 0.038984 1 0.0 37.992326 \n", + "1 0.038984 0 0.0 30.126667 \n", + "2 0.038984 0 0.0 30.000000 \n", + "3 0.038984 0 0.0 29.950000 \n", + "4 0.038203 0 0.0 29.920000 \n", + "... ... ... ... ... 
\n", + "84178 0.025703 1 0.0 17.170000 \n", + "84179 0.026328 1 1.0 18.630000 \n", + "84180 0.029609 1 0.0 25.350000 \n", + "84181 0.029609 1 0.0 25.320000 \n", + "84182 0.023359 1 1.0 5.350000 \n", + "\n", + "[84183 rows x 19 columns]" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "train_df" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
storebrandweeklogmoveconstantprice1price2price3price4price5price6price7price8price9price10price11dealfeatprofit
0211379.18932110.0416450.0519790.0476560.0388010.0326560.0381250.0328610.0360940.0373440.0221870.03242200.020.425098
1211389.73861310.0373440.0389580.0476560.0357810.0435940.0509370.0420310.0389060.0373440.0310940.03242211.011.290000
2221378.73873510.0416450.0519790.0476560.0388010.0326560.0381250.0328610.0360940.0373440.0221870.03242200.033.300308
3221389.60130110.0373440.0389580.0476560.0357810.0435940.0509370.0420310.0389060.0373440.0310940.03242211.09.430000
4231377.56008010.0416450.0519790.0476560.0388010.0326560.0381250.0328610.0360940.0373440.0221870.03242200.030.506667
............................................................
182113791385.95064310.0373440.0389580.0476560.0357810.0435940.0509370.0420310.0389060.0373440.0310940.03242200.029.490000
18221371013710.60618910.0427850.0519790.0476560.0406210.0326560.0381250.0333530.0368750.0373440.0210940.03210910.05.110000
1823137101388.88627110.0373440.0389580.0476560.0357810.0435940.0509370.0420310.0389060.0373440.0310940.03242200.034.120000
1824137111378.91247310.0427850.0519790.0476560.0406210.0326560.0381250.0333530.0368750.0373440.0210940.03210900.031.720000
1825137111388.72323110.0373440.0389580.0476560.0357810.0435940.0509370.0420310.0389060.0373440.0310940.03242200.033.590000
\n", + "

1826 rows × 19 columns

\n", + "
" + ], + "text/plain": [ + " store brand week logmove constant price1 price2 price3 \\\n", + "0 2 1 137 9.189321 1 0.041645 0.051979 0.047656 \n", + "1 2 1 138 9.738613 1 0.037344 0.038958 0.047656 \n", + "2 2 2 137 8.738735 1 0.041645 0.051979 0.047656 \n", + "3 2 2 138 9.601301 1 0.037344 0.038958 0.047656 \n", + "4 2 3 137 7.560080 1 0.041645 0.051979 0.047656 \n", + "... ... ... ... ... ... ... ... ... \n", + "1821 137 9 138 5.950643 1 0.037344 0.038958 0.047656 \n", + "1822 137 10 137 10.606189 1 0.042785 0.051979 0.047656 \n", + "1823 137 10 138 8.886271 1 0.037344 0.038958 0.047656 \n", + "1824 137 11 137 8.912473 1 0.042785 0.051979 0.047656 \n", + "1825 137 11 138 8.723231 1 0.037344 0.038958 0.047656 \n", + "\n", + " price4 price5 price6 price7 price8 price9 price10 \\\n", + "0 0.038801 0.032656 0.038125 0.032861 0.036094 0.037344 0.022187 \n", + "1 0.035781 0.043594 0.050937 0.042031 0.038906 0.037344 0.031094 \n", + "2 0.038801 0.032656 0.038125 0.032861 0.036094 0.037344 0.022187 \n", + "3 0.035781 0.043594 0.050937 0.042031 0.038906 0.037344 0.031094 \n", + "4 0.038801 0.032656 0.038125 0.032861 0.036094 0.037344 0.022187 \n", + "... ... ... ... ... ... ... ... \n", + "1821 0.035781 0.043594 0.050937 0.042031 0.038906 0.037344 0.031094 \n", + "1822 0.040621 0.032656 0.038125 0.033353 0.036875 0.037344 0.021094 \n", + "1823 0.035781 0.043594 0.050937 0.042031 0.038906 0.037344 0.031094 \n", + "1824 0.040621 0.032656 0.038125 0.033353 0.036875 0.037344 0.021094 \n", + "1825 0.035781 0.043594 0.050937 0.042031 0.038906 0.037344 0.031094 \n", + "\n", + " price11 deal feat profit \n", + "0 0.032422 0 0.0 20.425098 \n", + "1 0.032422 1 1.0 11.290000 \n", + "2 0.032422 0 0.0 33.300308 \n", + "3 0.032422 1 1.0 9.430000 \n", + "4 0.032422 0 0.0 30.506667 \n", + "... ... ... ... ... 
\n", + "1821 0.032422 0 0.0 29.490000 \n", + "1822 0.032109 1 0.0 5.110000 \n", + "1823 0.032422 0 0.0 34.120000 \n", + "1824 0.032109 0 0.0 31.720000 \n", + "1825 0.032422 0 0.0 33.590000 \n", + "\n", + "[1826 rows x 19 columns]" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "test_df" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
storebrandweekprice1price2price3price4price5price6price7price8price9price10price11dealfeat
021400.0604690.0604970.0420310.0295310.0495310.0530210.0389060.0414060.0289060.0248440.03898410.0
121460.0604690.0603120.0451560.0467190.0495310.0478130.0457810.0279690.0429690.0420310.03898400.0
221470.0604690.0603120.0451560.0467190.0373440.0530210.0457810.0414060.0481250.0326560.03898400.0
321480.0604690.0603120.0498440.0373440.0495310.0530210.0457810.0414060.0423440.0326560.03898400.0
421500.0604690.0603120.0435940.0310940.0495310.0530210.0466480.0414060.0423440.0326560.03820300.0
...................................................
86906137111340.0390620.0493010.0495880.0323000.0310940.0509370.0420310.0357810.0220310.0310940.02960910.0
86907137111350.0404730.0457290.0469570.0452230.0334930.0509370.0339410.0357810.0264060.0229690.02335911.0
86908137111360.0498440.0474120.0476560.0465540.0435940.0509370.0310940.0357810.0268750.0201560.03242200.0
86909137111370.0427850.0519790.0476560.0406210.0326560.0381250.0333530.0368750.0373440.0210940.03210900.0
86910137111380.0373440.0389580.0476560.0357810.0435940.0509370.0420310.0389060.0373440.0310940.03242200.0
\n", + "

86911 rows × 16 columns

\n", + "
" + ], + "text/plain": [ + " store brand week price1 price2 price3 price4 price5 \\\n", + "0 2 1 40 0.060469 0.060497 0.042031 0.029531 0.049531 \n", + "1 2 1 46 0.060469 0.060312 0.045156 0.046719 0.049531 \n", + "2 2 1 47 0.060469 0.060312 0.045156 0.046719 0.037344 \n", + "3 2 1 48 0.060469 0.060312 0.049844 0.037344 0.049531 \n", + "4 2 1 50 0.060469 0.060312 0.043594 0.031094 0.049531 \n", + "... ... ... ... ... ... ... ... ... \n", + "86906 137 11 134 0.039062 0.049301 0.049588 0.032300 0.031094 \n", + "86907 137 11 135 0.040473 0.045729 0.046957 0.045223 0.033493 \n", + "86908 137 11 136 0.049844 0.047412 0.047656 0.046554 0.043594 \n", + "86909 137 11 137 0.042785 0.051979 0.047656 0.040621 0.032656 \n", + "86910 137 11 138 0.037344 0.038958 0.047656 0.035781 0.043594 \n", + "\n", + " price6 price7 price8 price9 price10 price11 deal feat \n", + "0 0.053021 0.038906 0.041406 0.028906 0.024844 0.038984 1 0.0 \n", + "1 0.047813 0.045781 0.027969 0.042969 0.042031 0.038984 0 0.0 \n", + "2 0.053021 0.045781 0.041406 0.048125 0.032656 0.038984 0 0.0 \n", + "3 0.053021 0.045781 0.041406 0.042344 0.032656 0.038984 0 0.0 \n", + "4 0.053021 0.046648 0.041406 0.042344 0.032656 0.038203 0 0.0 \n", + "... ... ... ... ... ... ... ... ... \n", + "86906 0.050937 0.042031 0.035781 0.022031 0.031094 0.029609 1 0.0 \n", + "86907 0.050937 0.033941 0.035781 0.026406 0.022969 0.023359 1 1.0 \n", + "86908 0.050937 0.031094 0.035781 0.026875 0.020156 0.032422 0 0.0 \n", + "86909 0.038125 0.033353 0.036875 0.037344 0.021094 0.032109 0 0.0 \n", + "86910 0.050937 0.042031 0.038906 0.037344 0.031094 0.032422 0 0.0 \n", + "\n", + "[86911 rows x 16 columns]" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "aux_df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Split Data for Multi-Round Forecasting\n", + "\n", + "We can also create training data and test data for multi-round forecasting. 
In this case, we gradually increase the length of the training data at each round. This allows us to retrain the forecasting model to achieve more accurate forecasts. Using the default settings in `forecasting_lib/common/forecast_settings.py` and updating `NUM_ROUNDS` to 12, we can generate the training and test sets as follows:\n", + "\n", + "\n", + "| **Round** | **Train period
start week** | **Train period
end week** | **Test period
start week** | **Test period
end week** |\n", + "| -------- | --------------- | ------------------ | ------------------------- | ----------------------- |\n", + "| 1 | 40 | 135 | 137 | 138 |\n", + "| 2 | 40 | 137 | 139 | 140 |\n", + "| 3 | 40 | 139 | 141 | 142 |\n", + "| 4 | 40 | 141 | 143 | 144 |\n", + "| 5 | 40 | 143 | 145 | 146 |\n", + "| 6 | 40 | 145 | 147 | 148 |\n", + "| 7 | 40 | 147 | 149 | 150 |\n", + "| 8 | 40 | 149 | 151 | 152 |\n", + "| 9 | 40 | 151 | 153 | 154 |\n", + "| 10 | 40 | 153 | 155 | 156 |\n", + "| 11 | 40 | 155 | 157 | 158 |\n", + "| 12 | 40 | 157 | 159 | 160 |\n", + "\n", + "The gap of one week between training period and test period allows store managers to prepare the stock based on the forecasted demand. Besides, we assume that the information about the price, deal, and advertisement up until the forecast period end week is available at each round." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Training data size: (84183, 19)\n", + "Testing data size: (1826, 19)\n", + "Auxiliary data size: (86911, 16)\n", + "Minimum training week number: 40\n", + "Maximum training week number: 135\n", + "Minimum testing week number: 137\n", + "Maximum testing week number: 138\n", + "Minimum auxiliary week number: 40\n", + "Maximum auxiliary week number: 138\n", + "\n", + "Training data size: (85998, 19)\n", + "Testing data size: (1793, 19)\n", + "Auxiliary data size: (88704, 16)\n", + "Minimum training week number: 40\n", + "Maximum training week number: 137\n", + "Minimum testing week number: 139\n", + "Maximum testing week number: 140\n", + "Minimum auxiliary week number: 40\n", + "Maximum auxiliary week number: 140\n", + "\n", + "Training data size: (87802, 19)\n", + "Testing data size: (1771, 19)\n", + "Auxiliary data size: (90475, 16)\n", + "Minimum training week number: 40\n", + "Maximum training week number: 139\n", + "Minimum testing week number: 141\n", + "Maximum 
testing week number: 142\n", + "Minimum auxiliary week number: 40\n", + "Maximum auxiliary week number: 142\n", + "\n", + "Training data size: (89617, 19)\n", + "Testing data size: (1749, 19)\n", + "Auxiliary data size: (92224, 16)\n", + "Minimum training week number: 40\n", + "Maximum training week number: 141\n", + "Minimum testing week number: 143\n", + "Maximum testing week number: 144\n", + "Minimum auxiliary week number: 40\n", + "Maximum auxiliary week number: 144\n", + "\n", + "Training data size: (91333, 19)\n", + "Testing data size: (1727, 19)\n", + "Auxiliary data size: (93951, 16)\n", + "Minimum training week number: 40\n", + "Maximum training week number: 143\n", + "Minimum testing week number: 145\n", + "Maximum testing week number: 146\n", + "Minimum auxiliary week number: 40\n", + "Maximum auxiliary week number: 146\n", + "\n", + "Training data size: (93071, 19)\n", + "Testing data size: (1749, 19)\n", + "Auxiliary data size: (95700, 16)\n", + "Minimum training week number: 40\n", + "Maximum training week number: 145\n", + "Minimum testing week number: 147\n", + "Maximum testing week number: 148\n", + "Minimum auxiliary week number: 40\n", + "Maximum auxiliary week number: 148\n", + "\n", + "Training data size: (94842, 19)\n", + "Testing data size: (1771, 19)\n", + "Auxiliary data size: (97471, 16)\n", + "Minimum training week number: 40\n", + "Maximum training week number: 147\n", + "Minimum testing week number: 149\n", + "Maximum testing week number: 150\n", + "Minimum auxiliary week number: 40\n", + "Maximum auxiliary week number: 150\n", + "\n", + "Training data size: (96591, 19)\n", + "Testing data size: (1738, 19)\n", + "Auxiliary data size: (99209, 16)\n", + "Minimum training week number: 40\n", + "Maximum training week number: 149\n", + "Minimum testing week number: 151\n", + "Maximum testing week number: 152\n", + "Minimum auxiliary week number: 40\n", + "Maximum auxiliary week number: 152\n", + "\n", + "Training data size: (98340, 19)\n", 
+ "Testing data size: (1705, 19)\n", + "Auxiliary data size: (100914, 16)\n", + "Minimum training week number: 40\n", + "Maximum training week number: 151\n", + "Minimum testing week number: 153\n", + "Maximum testing week number: 154\n", + "Minimum auxiliary week number: 40\n", + "Maximum auxiliary week number: 154\n", + "\n", + "Training data size: (100056, 19)\n", + "Testing data size: (1705, 19)\n", + "Auxiliary data size: (102619, 16)\n", + "Minimum training week number: 40\n", + "Maximum training week number: 153\n", + "Minimum testing week number: 155\n", + "Maximum testing week number: 156\n", + "Minimum auxiliary week number: 40\n", + "Maximum auxiliary week number: 156\n", + "\n", + "Training data size: (101772, 19)\n", + "Testing data size: (1749, 19)\n", + "Auxiliary data size: (104368, 16)\n", + "Minimum training week number: 40\n", + "Maximum training week number: 155\n", + "Minimum testing week number: 157\n", + "Maximum testing week number: 158\n", + "Minimum auxiliary week number: 40\n", + "Maximum auxiliary week number: 158\n", + "\n", + "Training data size: (103488, 19)\n", + "Testing data size: (1771, 19)\n", + "Auxiliary data size: (106139, 16)\n", + "Minimum training week number: 40\n", + "Maximum training week number: 157\n", + "Minimum testing week number: 159\n", + "Maximum testing week number: 160\n", + "Minimum auxiliary week number: 40\n", + "Maximum auxiliary week number: 160\n", + "\n" + ] + } + ], + "source": [ + "fs.NUM_ROUNDS = 12\n", + "for train_df, test_df, aux_df in split_train_test(DATA_DIR, fs):\n", + " train_df.reset_index(inplace=True)\n", + " test_df.reset_index(inplace=True)\n", + " aux_df.reset_index(inplace=True)\n", + " print(\"Training data size: {}\".format(train_df.shape))\n", + " print(\"Testing data size: {}\".format(test_df.shape))\n", + " print(\"Auxiliary data size: {}\".format(aux_df.shape))\n", + " print(\"Minimum training week number: {}\".format(min(train_df[\"week\"])))\n", + " print(\"Maximum training week 
number: {}\".format(max(train_df[\"week\"])))\n", + " print(\"Minimum testing week number: {}\".format(min(test_df[\"week\"])))\n", + " print(\"Maximum testing week number: {}\".format(max(test_df[\"week\"])))\n", + " print(\"Minimum auxiliary week number: {}\".format(min(aux_df[\"week\"])))\n", + " print(\"Maximum auxiliary week number: {}\".format(max(aux_df[\"week\"])))\n", + " print(\"\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Additional Reading\n", + "\n", + "\[1\] Christoph Bergmeir, Rob J. Hyndman, and Bonsoo Koo. 2018. A Note on the Validity of Cross-Validation for Evaluating Autoregressive Time Series Prediction. Computational Statistics & Data Analysis. 120, pp. 70-83.
\n", + "\\[2\\] How To Backtest Machine Learning Models for Time Series Forecasting: https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/Parameters.rst
\n", + "\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3.6.7 64-bit ('forecast': conda)", + "language": "python", + "name": "python36764bitforecastconda6547c53842644e9994cb8960c1d7107f" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.7-final" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}