Merge pull request #69 from microsoft/vapaunic/datasplit

Visuals for time series data splitting
2020-02-21 16:57:12 +00:00 · 2020-02-21 16:57:12 +00:00 · a0a6f6b107
--- a/assets/time_series_split_multiround.jpg
+++ b/assets/time_series_split_multiround.jpg
--- a/assets/time_series_split_singleround.jpg
+++ b/assets/time_series_split_singleround.jpg
--- a/contrib/tsperf/OrangeJuice_Pt_3Weeks_Weekly/data/data_explore_retail_r.ipynb
+++ b/contrib/tsperf/OrangeJuice_Pt_3Weeks_Weekly/data/data_explore_retail_r.ipynb
--- a/examples/01_prepare_data/data_explore_retail_python.ipynb
+++ b/examples/01_prepare_data/data_explore_retail_python.ipynb
--- a/examples/01_prepare_data/data_prepare_retail.ipynb
+++ b/examples/01_prepare_data/data_prepare_retail.ipynb
@ -40,7 +40,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
@ -64,7 +64,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 4,
+   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
@ -96,7 +96,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
@ -126,8 +126,8 @@
   "metadata": {},
   "source": [
    "Here we define parameters for data preparation and forecast settings. In particular:\n",
-    "- *N_SPLITS* defines the number of training/testing splits we want to divide our data into\n",
-    "- *HORIZON* or forecasting horizon determines the number of weeks to forecast in the future\n",
+    "- *N_SPLITS* defines the number of training/testing splits we want to create from our data\n",
+    "- *HORIZON* or forecasting horizon determines the number of weeks to forecast in the future, or the test period length\n",
    "- *GAP* defines the gap (in weeks) between the training and test data. This is to allow business managers to plan for the forecasted demand.\n",
    "- *FIRST_WEEK* is the first available week in the data\n",
    "- *LAST_WEEK* is the last available week in the data"
@ -135,7 +135,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 10,
+   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
@ -151,23 +151,45 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## Split Data for Single-Round Forecasting\n",
+    "## Splitting Time Series Data\n",
+    "\n",
+    "In all our examples we use the `split_train_test()` utility function to split the Orange Juice dataset. Splitting time series data into training and testing sets has to follow the rule that, in each split, test indices need to be higher than the train indices, and higher than the test indices in the previous split.\n",
+    "\n",
+    "We wrote the `split_train_test()` function to do just that. Given the parameters listed above, it creates `N_SPLITS` training/testing data splits, so that each test split is `HORIZON` weeks long, and the testing period starts `GAP` weeks after the end of the training period. The first available week in the data (or the first week we want to start modeling from) is given as `FIRST_WEEK`, and the last week we want to work with is `LAST_WEEK`.\n",
+    "\n",
+    "For demonstration, this is what the time series split on the Orange Juice dataset looks like, for the parameters listed above.\n",
+    "For `HORIZON = 2` and `GAP = 2`, assuming the current week is week `153`, our goal is to forecast the sales in weeks `155` and `156` using the training data. As you can see, the first forecasting week is two weeks away from the current week, as we want to leave time for planning inventory in practice.\n",
+    "\n",
+    "![Single split](../../assets/time_series_split_singleround.jpg)\n",
+    "\n",
+    "We also refer to splits as rounds, so for `N_SPLITS = 1`, we have single-round forecasting, and for `N_SPLITS > 1`, we have multi-round forecasting."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Single-Round Forecasting\n",
    "\n",
    "Next, we can use `split_train_test()` utility function to split the data in `yx.csv` into training and test sets based on the parameters described above. If we want to do a one-time model training and evaluation, we can split the data using the default settings provided in the function.\n",
    "\n",
-    "The data split function will return training data and test data as dataframes. The training data includes `train_df` and `aux_df` with `train_df` containing the historical sales and `aux_df` containing price/promotion information. Here we assume that future price and promotion information up to a certain number of weeks ahead is predetermined and known. The test data is stored in `test_df` which contains the sales of each product.\n",
-    "\n",
-    "Based on the above parameters, `HORIZON = 2` and `GAP = 2`, assuming the current week is week 135, our goal is to forecast the sales in week 137 and 138 using the training data. As you can see, the first forecasting week is 2 weeks away from the current week, as we want to leave time for planning inventory in practice."
+    "The data split function will return training data and test data as dataframes. The training data includes `train_df` and `aux_df` with `train_df` containing the historical sales and `aux_df` containing price/promotion information. Here we assume that future price and promotion information up to a certain number of weeks ahead is predetermined and known. The test data is stored in `test_df` which contains the sales of each product.\n"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 15,
+   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_df_list, test_df_list, aux_df_list = split_train_test(\n",
-    "    DATA_DIR, n_splits=N_SPLITS, horizon=HORIZON, gap=GAP, first_week=FIRST_WEEK, last_week=LAST_WEEK, write_csv=True,\n",
+    "    data_dir=DATA_DIR,\n",
+    "    n_splits=N_SPLITS,\n",
+    "    horizon=HORIZON,\n",
+    "    gap=GAP,\n",
+    "    first_week=FIRST_WEEK,\n",
+    "    last_week=LAST_WEEK,\n",
+    "    write_csv=True,\n",
    ")\n",
    "\n",
    "# Split returns a list, extract the dataframes from the list\n",
@ -178,7 +200,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 24,
+   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
@ -486,7 +508,7 @@
       "9  0.038984     1   0.0  27.061163  "
      ]
     },
-     "execution_count": 24,
+     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
@ -497,7 +519,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 25,
+   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
@ -805,7 +827,7 @@
       "9  0.028047     1  0.00000  24.212895  "
      ]
     },
-     "execution_count": 25,
+     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
@ -816,7 +838,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 26,
+   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
@ -1079,7 +1101,7 @@
       "9  0.053021  0.031094  0.041406  0.042344  0.042031  0.038984     1   0.0  "
      ]
     },
-     "execution_count": 26,
+     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
@ -1092,39 +1114,71 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## Split Data for Multi-Round Forecasting\n",
+    "## Multi-Round Forecasting\n",
    "\n",
-    "We can also create training data and test data for multi-round forecasting. In this case, we gradually increase the length of the training data at each round. Note that the forecasting period we generate in each round are non-overlapping. This allows us to retrain the forecasting model for achieving more accurate forecasts. Using the settings defined in the **Parameters** section above, and updating `N_SPLITS` to `10`, we can generate the training and test sets as follows:\n",
+    "To create training data and test data for multi-round forecasting, we use the same function, passing a number greater than `1` to the `n_splits` parameter. Note that the forecasting periods we generate in the test rounds are **non-overlapping**. This allows us to evaluate the forecasting model on multiple rounds of data and get a more robust estimate of our model's performance.\n",
    "\n",
+    "For demonstration, this is what the time series splits would look like for `N_SPLITS = 5`, and using other settings as above:\n",
    "\n",
-    "| **Round** | **Train period <br> start week** | **Train period <br> end week** | **Test period <br> start week** | **Test period <br> end week** |\n",
-    "| -------- | --------------- | ------------------ | ------------------------- | ----------------------- |\n",
-    "| 1 | 40 | 135 | 137 | 138 |\n",
-    "| 2 | 40 | 137 | 139 | 140 |\n",
-    "| 3 | 40 | 139 | 141 | 142 |\n",
-    "| 4 | 40 | 141 | 143 | 144 |\n",
-    "| 5 | 40 | 143 | 145 | 146 |\n",
-    "| 6 | 40 | 145 | 147 | 148 |\n",
-    "| 7 | 40 | 147 | 149 | 150 |\n",
-    "| 8 | 40 | 149 | 151 | 152 |\n",
-    "| 9 | 40 | 151 | 153 | 154 |\n",
-    "| 10 | 40 | 153 | 155 | 156 |\n",
+    "![Multi split](../../assets/time_series_split_multiround.jpg)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's now generate `10` rounds of training/testing data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {
+    "lines_to_next_cell": 2
+   },
+   "outputs": [],
+   "source": [
+    "## Generate 10 splits (rounds) of data\n",
+    "N_SPLITS = 10\n",
+    "\n",
+    "train_df_list, test_df_list, aux_df_list = split_train_test(\n",
+    "    DATA_DIR, n_splits=N_SPLITS, horizon=HORIZON, gap=GAP, first_week=FIRST_WEEK, last_week=LAST_WEEK, write_csv=True,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The above cell will generate the following splits.\n",
+    "\n",
+    "| **Round** | **Train period in weeks** | **Test period in weeks** |\n",
+    "| -------- | -------------------------- | ----------------------- |\n",
+    "| 1 | 40 - 135 | 137 - 138 |\n",
+    "| 2 | 40 - 137 | 139 - 140 |\n",
+    "| 3 | 40 - 139 | 141 - 142 |\n",
+    "| 4 | 40 - 141 | 143 - 144 |\n",
+    "| 5 | 40 - 143 | 145 - 146 |\n",
+    "| 6 | 40 - 145 | 147 - 148 |\n",
+    "| 7 | 40 - 147 | 149 - 150 |\n",
+    "| 8 | 40 - 149 | 151 - 152 |\n",
+    "| 9 | 40 - 151 | 153 - 154 |\n",
+    "| 10 | 40 - 153 | 155 - 156 |\n",
    "\n",
    "The gap of one week between training period and test period allows store managers to prepare the stock based on the forecasted demand. Besides, we assume that the information about the price, deal, and advertisement up until the forecast period end week is available at each round."
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 38,
-   "metadata": {
-    "lines_to_next_cell": 2
-   },
+   "execution_count": 11,
+   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ROUND 1\n",
+      "--------\n",
      "Training\n",
      "    Data shape: (84183, 18)\n",
      "    Week range: 40-135\n",
@ -1136,6 +1190,7 @@
      "    Week range: 40-138\n",
      "\n",
      "ROUND 2\n",
+      "--------\n",
      "Training\n",
      "    Data shape: (85998, 18)\n",
      "    Week range: 40-137\n",
@ -1147,6 +1202,7 @@
      "    Week range: 40-140\n",
      "\n",
      "ROUND 3\n",
+      "--------\n",
      "Training\n",
      "    Data shape: (87802, 18)\n",
      "    Week range: 40-139\n",
@ -1158,6 +1214,7 @@
      "    Week range: 40-142\n",
      "\n",
      "ROUND 4\n",
+      "--------\n",
      "Training\n",
      "    Data shape: (89617, 18)\n",
      "    Week range: 40-141\n",
@ -1169,6 +1226,7 @@
      "    Week range: 40-144\n",
      "\n",
      "ROUND 5\n",
+      "--------\n",
      "Training\n",
      "    Data shape: (91333, 18)\n",
      "    Week range: 40-143\n",
@ -1180,6 +1238,7 @@
      "    Week range: 40-146\n",
      "\n",
      "ROUND 6\n",
+      "--------\n",
      "Training\n",
      "    Data shape: (93071, 18)\n",
      "    Week range: 40-145\n",
@ -1191,6 +1250,7 @@
      "    Week range: 40-148\n",
      "\n",
      "ROUND 7\n",
+      "--------\n",
      "Training\n",
      "    Data shape: (94842, 18)\n",
      "    Week range: 40-147\n",
@ -1202,6 +1262,7 @@
      "    Week range: 40-150\n",
      "\n",
      "ROUND 8\n",
+      "--------\n",
      "Training\n",
      "    Data shape: (96591, 18)\n",
      "    Week range: 40-149\n",
@ -1213,6 +1274,7 @@
      "    Week range: 40-152\n",
      "\n",
      "ROUND 9\n",
+      "--------\n",
      "Training\n",
      "    Data shape: (98340, 18)\n",
      "    Week range: 40-151\n",
@ -1224,6 +1286,7 @@
      "    Week range: 40-154\n",
      "\n",
      "ROUND 10\n",
+      "--------\n",
      "Training\n",
      "    Data shape: (100056, 18)\n",
      "    Week range: 40-153\n",
@ -1238,18 +1301,12 @@
    }
   ],
   "source": [
-    "## Generate 10 splits (rounds) of data\n",
-    "N_SPLITS = 10\n",
-    "\n",
-    "train_df_list, test_df_list, aux_df_list = split_train_test(\n",
-    "    DATA_DIR, n_splits=N_SPLITS, horizon=HORIZON, gap=GAP, first_week=FIRST_WEEK, last_week=LAST_WEEK, write_csv=True,\n",
-    ")\n",
-    "\n",
    "for i in range(len(train_df_list)):\n",
    "    train_df = train_df_list[i]\n",
    "    test_df = test_df_list[i]\n",
    "    aux_df = aux_df_list[i]\n",
    "    print(f\"ROUND {i+1}\")\n",
+    "    print(\"--------\")\n",
    "    print(\"Training\")\n",
    "    print(f\"    Data shape: {train_df.shape}\")\n",
    "    print(f\"    Week range: {min(train_df['week'])}-{max(train_df['week'])}\")\n",
@ -1271,13 +1328,6 @@
    "\\[2\\] How To Backtest Machine Learning Models for Time Series Forecasting: https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/Parameters.rst <br>\n",
    "\n"
   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
  }
 ],
 "metadata": {
--- a/tools/environment.yml
+++ b/tools/environment.yml
@ -39,3 +39,4 @@ dependencies:
    - nteract-scrapbook==0.3.1
    - gitpython==3.0.8
    - azureml-sdk[explain,automl]==1.0.85
+    - statsmodels==0.11.1
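
Note on the split arithmetic described in the notebook text above: the week ranges in the round table can be reproduced with a few lines of Python. The sketch below is only an illustration inferred from that table and from the `HORIZON`/`GAP` description; `week_ranges` and `SplitWeeks` are hypothetical names, and this is not the implementation of `split_train_test()`, which also reads the data and returns dataframes.

```python
from typing import List, NamedTuple


class SplitWeeks(NamedTuple):
    """Week boundaries (inclusive) of one training/testing round."""
    train_start: int
    train_end: int
    test_start: int
    test_end: int


def week_ranges(
    n_splits: int, horizon: int, gap: int, first_week: int, last_week: int
) -> List[SplitWeeks]:
    """Hypothetical helper: compute week boundaries for each round.

    Assumption (inferred from the notebook's table): test periods are
    consecutive, non-overlapping blocks of `horizon` weeks that end at
    `last_week`, and each training period ends `gap` weeks before the
    first test week of its round.
    """
    splits = []
    for i in range(n_splits):
        # Rounds are laid out backwards from last_week, one horizon apart.
        test_start = last_week - horizon * (n_splits - i) + 1
        test_end = test_start + horizon - 1
        train_end = test_start - gap
        if train_end < first_week:
            raise ValueError("Not enough weeks for the requested splits")
        splits.append(SplitWeeks(first_week, train_end, test_start, test_end))
    return splits


if __name__ == "__main__":
    # Same settings as the multi-round example in the notebook.
    for round_no, s in enumerate(week_ranges(10, 2, 2, 40, 156), start=1):
        print(
            f"Round {round_no}: train {s.train_start}-{s.train_end}, "
            f"test {s.test_start}-{s.test_end}"
        )
```

With `n_splits=1` the same helper gives a training period ending at week 153 and a test period of weeks 155 to 156, which matches the single-round description in the notebook.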