Revert "Chenhui/add ci tests (#150)" (#151)

This reverts commit 21846168a7.
Chenhui Hu 2020-03-23 13:23:32 -04:00 committed by GitHub
Parent 21846168a7
Commit 89e986fe2c
No key found matching this signature
GPG key ID: 4AEE18F83AFDEB23
73 changed files with 1443 additions and 6719 deletions

21
.gitignore vendored
View file

@@ -2,27 +2,6 @@
**/.ipynb_checkpoints
*.egg-info/
.vscode/
*.pkl
*.h5
# Data
ojdata/*
*.Rdata
# AML Config
aml_config/
.azureml/
.config/
# Pytests
.pytest_cache/
# File for model deployment
score.py
# Environments
myenv.yml
# Logs
logs/
*.log

18
.lintr
View file

@@ -1,18 +0,0 @@
linters: with_defaults(
infix_spaces_linter = NULL,
spaces_left_parentheses_linter = NULL,
open_curly_linter = NULL,
line_length_linter = NULL,
camel_case_linter = NULL,
object_name_linter = NULL,
object_usage_linter = NULL,
object_length_linter = NULL,
trailing_blank_lines_linter = NULL,
absolute_paths_linter = NULL,
commented_code_linter = NULL,
implicit_integer_linter = NULL,
extraction_operator_linter = NULL,
single_quotes_linter = NULL,
pipe_continuation_linter = NULL,
cyclocomp_linter = NULL
)

View file

@@ -10,3 +10,4 @@ NumSpacesForTab: 4
Encoding: UTF-8
RnwWeave: knitr

View file

@@ -0,0 +1,83 @@
---
title: Data preparation
output: html_notebook
---
```{r, echo=FALSE, results="hide", message=FALSE}
library(tidyr)
library(dplyr)
library(tsibble)
library(feasts)
library(fable)
```
In this notebook, we generate the datasets that will be used for model training and validation. The experiment parameters are obtained from the file `ojdata_forecast_settings.json`; you can modify that file to vary the experimental setup, or just edit the values in this notebook.
The orange juice dataset comes from the bayesm package, and gives pricing and sales figures over time for a variety of orange juice brands in several stores in Florida.
A complicating factor is that the data is in a hybrid of long and wide format: while the sales figures are long (one column of sales data for every store and brand), the prices are wide (one price column for each brand). Therefore we need to reshape the data if we want to use prices for modelling. As part of this, we also compute a new column `maxpricediff`: this represents the log-ratio of the price of this brand compared to the best competing price. A positive `maxpricediff` means this brand is cheaper than all the other brands, and a negative `maxpricediff` means it is more expensive.
```{r}
settings <- jsonlite::fromJSON("ojdata_forecast_settings.json")
train_periods <- seq(settings$TRAIN_WINDOW, 160 - settings$STEP - 1, settings$STEP)
start_date <- as.Date(settings$START_DATE)
data(orangeJuice, package="bayesm")
oj_data <- orangeJuice$yx %>%
complete(store, brand, week) %>%
group_by(store, brand) %>%
group_modify(~ {
pricevars <- grep("price", names(.x), value=TRUE)
thispricevar <- paste0("price", .y$brand)
best_other_price <- do.call(pmin, .x[setdiff(pricevars, thispricevar)])
.x$price <- .x[[thispricevar]]
.x$maxpricediff <- log(best_other_price/.x$price)
select(.x, week, logmove, deal, feat, price, maxpricediff)
}) %>%
ungroup() %>%
mutate(week=yearweek(start_date + week*7)) %>% # do this separately because of tsibble/vctrs issues
as_tsibble(index=week, key=c(store, brand))
```
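As a quick sanity check on the sign convention of `maxpricediff` (a worked example on made-up prices, not part of the pipeline):

```r
# maxpricediff = log(best_other_price / price)
log(2.00 / 1.50)  #  0.29 > 0: our price of 1.50 beats the best competing price of 2.00
log(1.25 / 1.50)  # -0.18 < 0: a competitor at 1.25 undercuts our price of 1.50
```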
Here are some glimpses of what the data looks like. The dependent variable is `logmove`, the logarithm of the total sales for a given brand and store, in a particular week. Note that we do _not_ fill in the missing values in the data, as (with the exception of `ETS`) the modelling functions in the fable package can handle this innately.
```{r}
head(oj_data)
```
The time series plots for a small subset of brands and stores are shown below. It is clear that the statistical behaviour of the data varies by store and brand.
```{r}
library(ggplot2)
oj_data %>%
filter(store < 10, brand < 5) %>%
ggplot(aes(x=week, y=logmove)) +
geom_line() +
scale_x_date(labels=NULL) +
facet_grid(vars(store), vars(brand), labeller="label_both")
```
Finally, we split the dataset into separate samples for training and testing. The scheme used is broadly time series cross-validation, whereby we train a model on data up to time $t$, test it on data for times $t+1$ to $t+k$, then train on data up to time $t+k$, test it on data for times $t+k+1$ to $t+2k$, and so on.
In this specific case study we introduce a small extra piece of complexity. We train a model on data up to week $t$, then test it on weeks $t+2$ to $t+3$. Then we train on data up to week $t+2$, and test it on weeks $t+4$ to $t+5$, and so on. Thus there is always a gap of one week between the training and test samples, a complicating factor introduced after discussions with domain experts.
```{r}
subset_oj_data <- function(start, end)
{
start <- yearweek(start_date + start*7)
end <- yearweek(start_date + end*7)
filter(oj_data, week >= start, week <= end)
}
oj_train <- lapply(train_periods, function(i) subset_oj_data(40, i))
oj_test <- lapply(train_periods, function(i) subset_oj_data(i + 2, i + settings$STEP + 1))
save(oj_train, oj_test, file="oj_data.Rdata")
head(oj_train[[1]])
head(oj_test[[1]])
```
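To make the gap concrete, here is a small illustrative sketch that enumerates the first few train/test windows, using the `STEP = 2` and `TRAIN_WINDOW = 135` values from `ojdata_forecast_settings.json`:

```r
# Illustrative only: print the first few rolling-origin splits and the skipped week
STEP <- 2; TRAIN_WINDOW <- 135
train_periods <- seq(TRAIN_WINDOW, 160 - STEP - 1, STEP)
for(i in head(train_periods, 3))
    cat(sprintf("train: weeks 40-%d | skip: week %d | test: weeks %d-%d\n",
                i, i + 1, i + 2, i + STEP + 1))
```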

File diff suppressed because one or more lines are too long

View file

@@ -0,0 +1,74 @@
---
title: Simple models
output: html_notebook
encoding: utf8
---
```{r, echo=FALSE, results="hide", message=FALSE}
library(tidyr)
library(dplyr)
library(tsibble)
library(feasts)
library(fable)
```
We fit some simple models to the orange juice data. One model is fit for each combination of store and brand.
- `mean`: This is just a simple mean.
- `naive`: A random walk model without any other components. This amounts to setting all forecast values to the last observed value.
- `drift`: This adjusts the `naive` model to incorporate a trend (see the toy sketch below).
- `arima`: An ARIMA model with the parameter values estimated from the data.
- `ets`: An exponential smoothing (ETS) model, again with parameter values estimated from the data.
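As a toy illustration of how `naive` and `drift` differ (a sketch on made-up numbers, not part of the notebook):

```r
# naive repeats the last value; drift extends the average historical slope
y <- c(10, 12, 11, 13, 14)
h <- 1:3
rep(tail(y, 1), length(h))                               # naive: 14 14 14
tail(y, 1) + h * (tail(y, 1) - y[1]) / (length(y) - 1)   # drift: 15 16 17
```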
Note that the model training process is embarrassingly parallel on 3 levels:
- We have multiple independent training datasets;
- For which we fit multiple independent models;
- Within which we have independent sub-models for each store and brand.
This lets us speed up the training significantly. While the `fable::model` function can fit multiple models in parallel, we will run it sequentially here and instead parallelise by dataset. This avoids contention for cores, and also results in the simplest code.
```{r, results="hide"}
load("oj_data.Rdata")
ncores <- max(2, parallel::detectCores(logical=FALSE) - 2)
cl <- parallel::makeCluster(ncores)
parallel::clusterEvalQ(cl,
{
library(tidyr)
library(feasts)
library(fable)
library(tsibble)
})
```
First, we fit the models that can innately handle missing values.
```{r}
oj_modelset <- parallel::parLapply(cl, oj_train, function(df)
{
model(df,
mean=MEAN(logmove),
naive=NAIVE(logmove),
drift=RW(logmove ~ drift()),
arima=ARIMA(logmove ~ pdq() + PDQ(0, 0, 0))
)
})
```
Next, we fit models that require manual imputation (ETS).
```{r}
oj_modelset_ets <- parallel::parLapply(cl, oj_train, function(df)
{
df %>%
fill(everything()) %>%
model(ets=ETS(logmove ~ error("A") + trend("A") + season("N")))
})
parallel::stopCluster(cl)
save(oj_modelset, oj_modelset_ets, file="oj_modelset.Rdata")
head(oj_modelset[[1]])
head(oj_modelset_ets[[1]])
```

File diff suppressed because one or more lines are too long

View file

@@ -0,0 +1,39 @@
---
title: Regression models
output: html_notebook
---
```{r, echo=FALSE, results="hide", message=FALSE}
library(tidyr)
library(dplyr)
library(tsibble)
library(feasts)
library(fable)
```
This notebook builds on the output from "Simple models" by including regressor variables in the ARIMA model(s).
```{r, results="hide"}
load("oj_data.Rdata")
ncores <- max(2, parallel::detectCores(logical=FALSE) - 2)
cl <- parallel::makeCluster(ncores)
parallel::clusterEvalQ(cl,
{
library(feasts)
library(fable)
library(tsibble)
})
oj_modelset_reg <- parallel::parLapply(cl, oj_train, function(df)
{
model(df,
ar_reg=ARIMA(logmove ~ pdq() + PDQ(0, 0, 0) + deal + feat + price + maxpricediff),
ar_trend=ARIMA(logmove ~ pdq() + PDQ(0, 0, 0) + trend()),
ar_regtrend=ARIMA(logmove ~ pdq() + PDQ(0, 0, 0) + trend() + deal + feat + price + maxpricediff)
)
})
parallel::stopCluster(cl)
save(oj_modelset_reg, file="oj_modelset_reg.Rdata")
```

File diff suppressed because one or more lines are too long

View file

@@ -0,0 +1,58 @@
---
title: Model evaluation
output: html_notebook
encoding: utf8
---
```{r, echo=FALSE, results="hide", message=FALSE}
library(tidyr)
library(dplyr)
library(tsibble)
library(feasts)
library(fable)
```
Having fit the models, let's examine their rolling goodness of fit, using the MAPE (mean absolute percentage error) metric.
First, we compute the forecasts for each dataset and model, again in parallel.
```{r, results="hide"}
for(f in dir(pattern="Rdata$"))
load(f)
ncores <- max(2, parallel::detectCores(logical=FALSE) - 2)
cl <- parallel::makeCluster(ncores)
parallel::clusterEvalQ(cl,
{
library(feasts)
library(fable)
library(tsibble)
})
fcast_sets <- lapply(ls(pattern="^oj_modelset"), function(mod)
parallel::clusterMap(cl, function(mod, df) forecast(mod, df), get(mod), oj_test)
)
parallel::stopCluster(cl)
```
Next, we compute the MAPE for each model. It is apparent that adding independent variables as regressors improves the quality of the fit substantially. Adding a simple trend does _not_ improve the fit, indicating that the level of sales does not appear to change over time (at least over the period included in the data).
```{r}
orig <- do.call(rbind, oj_test) %>%
as_tibble() %>%
select(store, brand, week, logmove) %>%
mutate(move=exp(logmove))
gof <- function(fcast_data)
{
fcast_data <- do.call(rbind, fcast_data) %>%
as_tibble() %>%
select(store, brand, week, .model, logmove) %>%
pivot_wider(id_cols=c(store, brand, week), names_from=.model, values_from=logmove) %>%
select(-store, -brand, -week) %>%
summarise_all(function(x) MAPE(exp(x) - orig$move, orig$move))
}
lapply(fcast_sets, gof) %>% bind_cols()
```
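For reference, MAPE is the mean absolute percentage error, computed here on the original (unit-sales) scale. A minimal hand check of the metric, on illustrative values only:

```r
# MAPE by hand vs. fabletools::MAPE, which takes residuals and actuals
actual <- c(100, 200, 400)
fcast <- c(110, 180, 440)
mean(abs((fcast - actual) / actual)) * 100  # 10
fabletools::MAPE(fcast - actual, actual)    # same result, expressed as a percentage
```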

File diff suppressed because one or more lines are too long

32
R/orange_juice/README.md Normal file
View file

@@ -0,0 +1,32 @@
## Orange juice dataset
### Package installation
You'll need the following packages to run the notebooks in this directory:
- bayesm (the source of the data)
- ggplot2
- dplyr
- tidyr
- jsonlite
- tsibble
- urca
- fable
- fabletools
- feasts
The easiest way to install them is to run
```r
install.packages("bayesm")
install.packages("tidyverse") # installs all tidyverse packages
install.packages(c("fable", "feasts", "urca"))
```
The Rmarkdown notebooks in this directory are as follows. You should run them in sequence, as each will create output objects (datasets/models) that are used in later notebooks.
- [`01_dataprep.Rmd`](01_dataprep.Rmd) creates the training and test datasets
- [`02_simplemodels.Rmd`](02_simplemodels.Rmd) fits a range of simple time series models to the data, including ARIMA and ETS models.
- [`02a_simplereg_models.Rmd`](02a_simplereg_models.Rmd) adds independent variables as regressors to the ARIMA model.
- [`03_model_eval.Rmd`](03_model_eval.Rmd) evaluates the goodness of fit of the models on the test data.

View file

@@ -0,0 +1,5 @@
{
"STEP": 2,
"TRAIN_WINDOW": 135,
"START_DATE": "1989-09-14"
}

View file

@@ -1,87 +1,16 @@
# Forecasting Best Practices
Time series forecasting is one of the most important topics in data science. Almost every business needs to predict the future in order to make better decisions and allocate resources more effectively.
This repository contains examples and best practices for building Forecasting solutions and systems, provided as [Jupyter notebooks](examples) and [a library of utility functions](fclib). The focus of the repository is on state-of-the-art methods and common scenarios that are popular among researchers and practitioners working on forecasting problems.
This repository provides examples and best practice guidelines for building forecasting solutions. The goal of this repository is to build a comprehensive set of tools and examples that leverage recent advances in forecasting algorithms to build solutions and operationalize them. Rather than creating implementations from scratch, we draw from existing state-of-the-art libraries and build additional utilities around processing and featurizing the data, optimizing and evaluating models, and scaling up to the cloud.
## Getting Started
The examples and best practices are provided as [Python Jupyter notebooks and R markdown files](examples) and [a library of utility functions](fclib). We hope that these examples and utilities can significantly reduce the “time to market” by simplifying the experience from defining the business problem to developing solutions. In addition, the example notebooks serve as guidelines and showcase best practices and usage of the tools in a wide variety of languages.
## Content
The following is a summary of the examples related to the process of building forecasting solutions covered in this repository. The [examples](examples) are organized according to use cases. Currently, we focus on a retail sales forecasting use case as it is widely used in [assortment planning](https://repository.upenn.edu/cgi/viewcontent.cgi?article=1569&context=edissertations), [inventory optimization](https://en.wikipedia.org/wiki/Inventory_optimization), and [price optimization](https://en.wikipedia.org/wiki/Price_optimization).
| Example | Models/Methods | Description | Language |
|----------------------------------|-------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------|-----------|
| Quick Start | Auto ARIMA, Azure AutoML, Linear Regression, LightGBM | Quick start notebooks that demonstrate the workflow of developing a forecasting model using a single round of training and testing data | Python |
| Data Exploration and Preparation | Statistical Analysis and Data Transformation | Data exploration and preparation examples | Python, R |
| Model Training and Evaluation | Auto ARIMA, LightGBM, Dilated CNN | Deep dive notebooks that perform multi-round training and testing of various classical and deep learning forecast algorithms | Python |
| Model Tuning and Deployment | HyperDrive, LightGBM | Example notebook for model tuning using Azure Machine Learning Service and deploying the best model on Azure | Python |
| R Models | Mean Forecast, ARIMA, ETS, Prophet | Popular statistical forecast models and the Prophet model implemented in R | R |
## Getting Started in Python
To quickly get started with the repository on your local machine, use the following commands.
1. Install Anaconda with Python >= 3.6. [Miniconda](https://conda.io/miniconda.html) is a quick way to get started.
2. Clone the repository
```
git clone https://github.com/microsoft/forecasting
cd forecasting/
```
3. Run the setup scripts to create the conda environment. Please execute one of the following commands from the root of the Forecasting repo, based on your operating system.
- Linux
```
./tools/environment_setup.sh
```
- Windows
```
tools\environment_setup.bat
```
Note that for Windows you need to run the batch script from Anaconda Prompt. The script creates a conda environment `forecasting_env` and installs the forecasting utility library `fclib`.
4. Start the Jupyter notebook server
```
jupyter notebook
```
5. Run the [LightGBM single-round](examples/oj_retail/python/00_quick_start/lightgbm_single_round.ipynb) notebook under the `00_quick_start` folder. Make sure that the selected Jupyter kernel is `forecasting_env`.
If you have any issues with the above setup, or want to find more detailed instructions on how to set up your environment and run examples provided in the repository, on a local or remote machine, please navigate to the [Setup Guide](./docs/SETUP.md).
## Getting Started in R
We assume you already have R installed on your machine. If not, simply follow the [instructions on CRAN](https://cloud.r-project.org/) to download and install R.
The recommended editor is [RStudio](https://rstudio.com), which supports interactive editing and previewing of R notebooks. However, you can use any editor or IDE that supports RMarkdown. In particular, [Visual Studio Code](https://code.visualstudio.com) with the [R extension](https://marketplace.visualstudio.com/items?itemName=Ikuyadeu.r) can be used to edit and render the notebook files. The rendered `.nb.html` files can be viewed in any modern web browser.
The examples use the [Tidyverts](https://tidyverts.org) family of packages, which is a modern framework for time series analysis that builds on the widely-used [Tidyverse](https://tidyverse.org) family. The Tidyverts framework is still under active development, so it's recommended that you update your packages regularly to get the latest bug fixes and features.
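For example, a periodic refresh of the Tidyverts stack could look like the following sketch (adjust the package list to what you have installed):

```r
# Update the Tidyverts packages to pick up the latest fixes
update.packages(oldPkgs = c("tsibble", "feasts", "fable", "fabletools"), ask = FALSE)
```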
## Target Audience
Our target audience for this repository includes data scientists and machine learning engineers with varying levels of knowledge in forecasting as our content is source-only and targets custom machine learning modelling. The utilities and examples provided are intended to be solution accelerators for real-world forecasting problems.
To get started, navigate to the [Setup Guide](./docs/SETUP.md), which lists instructions on how to set up your environment and dependencies, download the data and run examples provided in the repository.
## Contributing
We hope that the open-source community will contribute to the content and bring in the latest SOTA algorithms. This project welcomes contributions and suggestions. Before contributing, please see our [Contributing Guide](./docs/CONTRIBUTING.md).
## Reference
The following is a list of related repositories that you may find helpful.
| | |
|------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------|
| [Deep Learning for Time Series Forecasting](https://github.com/Azure/DeepLearningForTimeSeriesForecasting) | A collection of examples for using deep neural networks for time series forecasting with Keras. |
| [Demand Forecasting and Price Optimization Solution](https://github.com/Azure/cortana-intelligence-price-optimization) | A Cortana Intelligence solution how-to guide for demand forecasting and price optimization. |
## Build Status
| Build | Branch | Status |
|---------------|---------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Linux CPU** | master | [![Build Status](https://dev.azure.com/best-practices/forecasting/_apis/build/status/cpu_unit_tests_linux?branchName=master)](https://dev.azure.com/best-practices/forecasting/_build/latest?definitionId=128&branchName=master) |
| **Linux CPU** | staging | [![Build Status](https://dev.azure.com/best-practices/forecasting/_apis/build/status/cpu_unit_tests_linux?branchName=staging)](https://dev.azure.com/best-practices/forecasting/_build/latest?definitionId=128&branchName=staging) |

View file

@@ -1,36 +0,0 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
#' Creates a local background cluster for parallel computations
#'
#' @param ncores The number of nodes (cores) for the cluster. The default is 2 less than the number of physical cores.
#' @param libs The packages to load on each node, as a character vector.
#' @param useXDR For most platforms, this can be left at its default `FALSE` value.
#' @return
#' A cluster object.
make_cluster <- function(ncores=NULL, libs=character(0), useXDR=FALSE)
{
if(is.null(ncores))
ncores <- max(2, parallel::detectCores(logical=FALSE) - 2)
cl <- parallel::makeCluster(ncores, type="PSOCK", useXDR=useXDR)
res <- try(parallel::clusterCall(
cl,
function(libs)
{
for(lib in libs) library(lib, character.only=TRUE)
},
libs
), silent=TRUE)
if(inherits(res, "try-error"))
parallel::stopCluster(cl)
else cl
}
#' Deletes a local background cluster
#'
#' @param cl The cluster object, as returned from `make_cluster`.
destroy_cluster <- function(cl)
{
try(parallel::stopCluster(cl), silent=TRUE)
}
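A minimal usage sketch for these helpers (a hypothetical session, assuming the listed packages are installed):

```r
# Spin up a cluster with tsibble and fable loaded on each node, use it, tear it down
cl <- make_cluster(libs = c("tsibble", "fable"))
if(!is.null(cl)) {
    res <- parallel::parLapply(cl, 1:4, function(i) i^2)
    destroy_cluster(cl)
}
```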

View file

@@ -1,50 +0,0 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
#' Computes forecast values on a dataset
#'
#' @param mable A mable (model table) as returned by `fabletools::model`.
#' @param newdata The dataset for which to compute forecasts.
#' @param ... Further arguments to `fabletools::forecast`.
#' @return
#' A tsibble, with one column per model type in `mable`, and one column named `.response` containing the response variable from `newdata`.
get_forecasts <- function(mable, newdata, ...)
{
fcast <- forecast(mable, new_data=newdata, ...)
keyvars <- key_vars(fcast)
keyvars <- keyvars[-length(keyvars)]
indexvar <- index_var(fcast)
fcastvar <- as.character(attr(fcast, "response")[[1]])
fcast <- fcast %>%
as_tibble() %>%
pivot_wider(
id_cols=all_of(c(keyvars, indexvar)),
names_from=.model,
values_from=all_of(fcastvar))
select(newdata, !!keyvars, !!indexvar, !!fcastvar) %>%
rename(.response=!!fcastvar) %>%
inner_join(fcast)
}
#' Evaluate quality of forecasts given a criterion
#'
#' @param fcast_df A tsibble as returned from `get_forecasts`.
#' @param gof A goodness-of-fit function. The default is to use `fabletools::MAPE`, which computes the mean absolute percentage error.
#' @return
#' A single-row data frame with the computed goodness-of-fit statistic for each model.
eval_forecasts <- function(fcast_df, gof=fabletools::MAPE)
{
if(!is.function(gof))
gof <- get(gof, mode="function")
resp <- fcast_df$.response
keyvars <- key_vars(fcast_df)
indexvar <- index_var(fcast_df)
fcast_df %>%
as_tibble() %>%
select(-all_of(c(keyvars, indexvar, ".response"))) %>%
summarise_all(
function(x, .actual) gof(x - .actual, .actual=.actual),
.actual=resp
)
}
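A sketch of how the two helpers compose, assuming `oj_modelset` and `oj_test` objects like those created in the notebooks (the names are illustrative):

```r
# Forecast with one fitted model table, then score every model column at once
fcast_df <- get_forecasts(oj_modelset[[1]], oj_test[[1]])
eval_forecasts(fcast_df)                # default goodness of fit: fabletools::MAPE
eval_forecasts(fcast_df, gof = "RMSE")  # any accuracy measure, by name or function
```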

View file

@@ -1,25 +0,0 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
#' Loads serialised objects relating to a given forecasting example into the current workspace
#'
#' @param example The particular forecasting example.
#' @param file The name of the file (with extension).
#' @return
#' This function is run for its side effect, namely loading the given file into the global environment.
load_objects <- function(example, file)
{
examp_dir <- here::here("examples", example, "R")
load(file.path(examp_dir, file), envir=globalenv())
}
#' Saves R objects for a forecasting example to a file
#'
#' @param ... Objects to save, as unquoted names.
#' @param example The particular forecasting example.
#' @param file The name of the file (with extension).
save_objects <- function(..., example, file)
{
examp_dir <- here::here("examples", example, "R")
save(..., file=file.path(examp_dir, file))
}
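A hypothetical usage sketch (the example name and file below are placeholders):

```r
# Save fitted models under examples/<example>/R, then load them back later
save_objects(oj_modelset, example = "grocery_sales", file = "oj_modelset.Rdata")
load_objects("grocery_sales", "oj_modelset.Rdata")
```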

File diff suppressed because it is too large

View file

@@ -1,140 +0,0 @@
#!/usr/bin/env python
# coding: utf-8
import csvtomd
import matplotlib.pyplot as plt
import pandas as pd
### Generating performance charts
#################################################
# Function to plot a performance chart
def plot_perf(x, y, df):
# extract submission name from submission URL
labels = df.apply(lambda x: x["Submission Name"][1:].split("]")[0], axis=1)
fig = plt.scatter(x=df[x], y=df[y], label=labels, s=150, alpha=0.5, c=["b", "g", "r", "c", "m", "y", "k"])
plt.xlabel(x)
plt.ylabel(y)
plt.title(y + " by " + x)
offset = (max(df[y]) - min(df[y])) / 50
for i, name in enumerate(labels):
ax = df[x][i]
ay = df[y][i] + offset * (-2.5 + i % 5)
plt.text(ax, ay, name, fontsize=10)
return fig
### Printing the Readme.md file
############################################
readmefile = "../../Readme.md"
# Write header
# print(file=open(readmefile))
print("# TSPerf\n", file=open(readmefile, "w"))
print(
"TSPerf is a collection of implementations of time-series forecasting algorithms in Azure cloud and comparison of their performance over benchmark datasets. \
Algorithm implementations are compared by model accuracy, training and scoring time and cost. Each implementation includes all the necessary \
instructions and tools that ensure its reproducibility.",
file=open(readmefile, "a"),
)
print("The following table summarizes benchmarks that are currently included in TSPerf.\n", file=open(readmefile, "a"))
# Read the benchmark table from the CSV file and convert it to a table in md format
with open("Benchmarks.csv", "r") as f:
table = csvtomd.csv_to_table(f, ",")
print(csvtomd.md_table(table), file=open(readmefile, "a"))
print("\n\n\n", file=open(readmefile, "a"))
print(
"A complete documentation of TSPerf, along with the instructions for submitting and reviewing implementations, \
can be found [here](./docs/tsperf_rules.md). The tables below show the performance of implementations that have been developed so far. Source code of \
implementations and instructions for reproducing their performance can be found in submission folders, which are linked in the first column.\n",
file=open(readmefile, "a"),
)
### Write the Energy section
# ============================
print("## Probabilistic energy forecasting performance board\n\n", file=open(readmefile, "a"))
print(
"The following table lists the current submision for the energy forecasting and their respective performances.\n\n",
file=open(readmefile, "a"),
)
# Read the energy performance board from the CSV file and convert it to a table in md format
with open("TSPerfBoard-Energy.csv", "r") as f:
table = csvtomd.csv_to_table(f, ",")
print(csvtomd.md_table(table), file=open(readmefile, "a"))
# Read Energy Performance Board CSV file
df = pd.read_csv("TSPerfBoard-Energy.csv", engine="python")
# df
# Plot 'Pinball Loss' by 'Training and Scoring Cost($)' chart
fig4 = plt.figure(figsize=(12, 8), dpi=80, facecolor="w", edgecolor="k") # this sets the plotting area size
fig4 = plot_perf("Training and Scoring Cost($)", "Pinball Loss", df)
plt.savefig("../../docs/images/Energy-Cost.png")
# inserting the performance charts
print(
"\n\nThe following chart compares the submissions performance on accuracy in Pinball Loss vs. Training and Scoring cost in $:\n\n ",
file=open(readmefile, "a"),
)
print("![EnergyPBLvsTime](./docs/images/Energy-Cost.png)", file=open(readmefile, "a"))
print("\n\n\n", file=open(readmefile, "a"))
# Print the retail sales forecasting section
# ========================================
print("## Retail sales forecasting performance board\n\n", file=open(readmefile, "a"))
print(
"The following table lists the current submision for the retail forecasting and their respective performances.\n\n",
file=open(readmefile, "a"),
)
# Read the retail performance board from the CSV file and convert it to a table in md format
with open("TSPerfBoard-Retail.csv", "r") as f:
table = csvtomd.csv_to_table(f, ",")
print(csvtomd.md_table(table), file=open(readmefile, "a"))
print("\n\n\n", file=open(readmefile, "a"))
# Read Retail Performance Board CSV file
df = pd.read_csv("TSPerfBoard-Retail.csv", engine="python")
# df
# Plot MAPE (%) by Training and Scoring Cost ($) chart
fig2 = plt.figure(figsize=(12, 8), dpi=80, facecolor="w", edgecolor="k") # this sets the plotting area size
fig2 = plot_perf("Training and Scoring Cost ($)", "MAPE (%)", df)
plt.savefig("../../docs/images/Retail-Cost.png")
# inserting the performance charts
print(
"\n\nThe following chart compares the submissions performance on accuracy in %MAPE vs. Training and Scoring cost in $:\n\n ",
file=open(readmefile, "a"),
)
print("![EnergyPBLvsTime](./docs/images/Retail-Cost.png)", file=open(readmefile, "a"))
print("\n\n\n", file=open(readmefile, "a"))
# inserting build status badges
print("## Build Status\n\n", file=open(readmefile, "a"))
print("| Build Type | Branch | Status | | Branch | Status |", file=open(readmefile, "a"))
print("| --- | --- | --- | --- | --- | --- |", file=open(readmefile, "a"))
print(
"| **Python Linux CPU** | master | [![Build Status](https://dev.azure.com/best-practices/forecasting/_apis/build/status/python_unit_tests_base?branchName=master)](https://dev.azure.com/best-practices/forecasting/_build/latest?definitionId=12&branchName=master) | | staging | [![Build Status](https://dev.azure.com/best-practices/forecasting/_apis/build/status/python_unit_tests_base?branchName=chenhui/python_test_pipeline)](https://dev.azure.com/best-practices/forecasting/_build/latest?definitionId=12&branchName=chenhui/python_test_pipeline) |",
file=open(readmefile, "a"),
)
print(
"| **R Linux CPU** | master | [![Build Status](https://dev.azure.com/best-practices/forecasting/_apis/build/status/Forecasting/r_unit_tests_prototype?branchName=master)](https://dev.azure.com/best-practices/forecasting/_build/latest?definitionId=9&branchName=master) | | staging | [![Build Status](https://dev.azure.com/best-practices/forecasting/_apis/build/status/Forecasting/r_unit_tests_prototype?branchName=zhouf/r_test_pipeline)](https://dev.azure.com/best-practices/forecasting/_build/latest?definitionId=9&branchName=zhouf/r_test_pipeline) |",
file=open(readmefile, "a"),
)
print("\n\n\n", file=open(readmefile, "a"))
print("A new Readme.md file has been generated successfully.")

View file

@@ -4,12 +4,9 @@ Please follow these instructions to read about the preferred compute environment
### Compute environment
The code in this repo has been developed and tested on an Azure Linux VM. Therefore, we recommend using an [Azure Data Science Virtual Machine (DSVM) for Linux (Ubuntu)](https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/dsvm-ubuntu-intro) to run the example notebooks and scripts. This VM will come installed with all the system requirements that are needed to create the conda environment described below and then run the notebooks in this repository. If you are using a Linux machine without conda installed, please install Miniconda by following the instructions in this [link](https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html).
You can also use a Windows machine to run the example notebooks and scripts. In this case, you may either work with a [Windows Server 2019 Data Science Virtual Machine on Azure](https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/provision-vm) or a local Windows machine. The Azure Windows VM comes with conda pre-installed. If conda is not installed on your machine, please follow the instructions in this [link](https://docs.conda.io/projects/conda/en/latest/user-guide/install/windows.html) to install Miniconda.
The code in this repo has been developed and tested on an Azure Linux VM. Therefore, we recommend using an [Azure Data Science Virtual Machine (DSVM) for Linux (Ubuntu)](https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/dsvm-ubuntu-intro) to run the example notebooks and scripts. This VM will come installed with all the system requirements that are needed to create the conda environment described below and then run the notebooks in this repository.
### Clone the repository
To clone the Forecasting repository to your local machine, please run:
```
@@ -17,51 +14,27 @@ git clone https://github.com/microsoft/forecasting.git
cd forecasting/
```
Next, follow the instruction below to install all dependencies required to run the examples provided in the repository. Follow [Automated environment setup](#automated-environment-setup) section to set up the environment automatically using a script. Alternatively, follow the [Manual environment setup](#manual-environment-setup) section for a step-by-step guide to setting up the environment.
Next, follow the instruction below to install all dependencies required to run the examples provided in the repository. Follow [Automated environment setup](#automated-environment-setup) section to setup the environment automatically using a script. Alternatively, follow the [Manual environment setup](#manual-environment-setup) section for a step-by-step guide to setting up the environment.
### Automated environment setup
We provide scripts to install all dependencies automatically on a Linux machine as well as on a Windows machine.
We provide a script to install all dependencies automatically on a Linux machine. To execute the script, please run:
#### Linux
If you are using a Linux machine, please run the following command to execute the shell script for Linux
```
./tools/environment_setup.sh
```
from the root of Forecasting repo.
from the root of Forecasting repo. If you have issues with running the setup script, please follow the [Manual environment setup](#manual-environment-setup) instructions below.
#### Windows
Similarly, if you are using a Windows machine, please run the batch script for Windows via
```
tools\environment_setup.bat
```
from the root of Forecasting repo. Note that you need to run the above command from Anaconda Prompt (a terminal with conda available), which can be started by opening the Windows Start menu and clicking `Anaconda Prompt (Miniconda3)` as follows
<p align="center">
<img src="https://user-images.githubusercontent.com/20047467/76897869-f2f22900-686a-11ea-9f67-b189c15df27a.png" width="210" height="395">
</p>
Once you've executed the setup script, please activate the newly created conda environment:
```
conda activate forecasting_env
```
> NOTE: If you have issues with running the setup script, please follow the [Manual environment setup](#manual-environment-setup) instructions below.
Next, navigate to [Starting the Jupyter Notebook Server](#starting-the-jupyter-notebook-server) section below to start the Jupyter server necessary for running the examples.
Once you've executed the setup script, you can run example notebooks under [examples/](./examples) directory.
### Manual environment setup
#### Conda environment
To install the package contained in this repository, navigate to the directory where you pulled the Forecasting repo and run:
```bash
conda update conda
conda env create -f tools/environment.yml
conda env create -f tools/environment.yaml
```
This will create the appropriate conda environment to run experiments. Next activate the installed environment:
```bash
@@ -90,24 +63,6 @@ In order to run the example notebooks, make sure to run the notebooks in the conda
python -m ipykernel install --user --name forecasting_env
```
### Starting the Jupyter Notebook Server
In order to run the example notebooks provided in this repository, you will have to start a Jupyter notebook server.
Once you've set up the environment, you can run example notebooks under [examples/](./examples) directory.
For running examples on your **local machine**, please open your terminal application and run the following command:
```
jupyter notebook
```
If you are working on a remote VM, you can start the notebook server with the following command:
```
jupyter notebook --no-browser --port=8889
```
and forward the port where the notebooks are running (e.g., 8889) to the local machine by running the following command from the local machine:
```
ssh -L localhost:8889:localhost:8889 <user-name>@<ip-address-of-the-vm>
```
To access the notebooks, type `localhost:8889/` in the browser on your local machine.
Now you're ready to run the examples provided in `examples/` by simply opening and executing the notebooks in the Jupyter server. Please also navigate to the [examples README file](../examples/README.md) to read about the available notebooks.

View file

@@ -164,16 +164,19 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Process training data"
"Our data preparation for the training and test set include the following steps:\n",
"\n",
"- The unit sales of orange juice are give in logarithmic scale. We will transfrom them back into the unit scale by applying `math.exp()`\n",
"- Our time series data is not complete, since we have missing sales for some stores/products and weeks. We will fill in those missing values by propagating the last valid observation forward to next available value.\n",
"\n",
"Note that our time series are grouped by `store` and `brand`, while `week` represents a time step, and `move` represents the value to predict."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our time series data is not complete, since we have missing sales for some stores/products and weeks. We will fill in those missing values by propagating the last valid observation forward to next available value. We will define functions for data frame processing, then use these functions within a loop that loops over each forecasting rounds.\n",
"\n",
"Note that our time series are grouped by `store` and `brand`, while `week` represents a time step, and `logmove` represents the value to predict."
"### Process training data"
]
},
{
@@ -474,7 +477,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's now process the test data. Note that the test data runs from `LAST_WEEK - HORIZON + 1` to `LAST_WEEK`. Note that, in addition to filling out missing values, we also convert unit sales from logarithmic scale to the counts. We will do model training on the log scale, due to improved performance, however, we will transfrom the test data back into the unit scale (counts) by applying `math.exp()`, so that we can evaluate the performance on the unit scale."
"Let's now process the test data. Note that the test data runs from `LAST_WEEK - HORIZON + 1` to `LAST_WEEK`. Note that we are converting unit sales below from logarithmic scale to the counts, as we will be using counts to calculate the evaluation metrics."
]
},
{

View file

@@ -0,0 +1,756 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<i>Copyright (c) Microsoft Corporation.</i>\n",
"\n",
"<i>Licensed under the MIT License.</i>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Automated Machine Learning (AutoML) on Azure for Retail Sales Forecasting\n",
"\n",
"This notebook demonstrates how to apply [AutoML in Azure Machine Learning services](https://docs.microsoft.com/en-us/azure/machine-learning/concept-automated-ml) to train and tune machine learning models for forecasting product sales in retail. We will use the Orange Juice dataset to illustrate the steps of utilizing AutoML as well as how to combine an AutoML model with a custom model for better performance.\n",
"\n",
"AutoML is a process of automating the tasks of machine learning model development. It helps data scientists and other practioners build machine learning models with high scalability and quality in less amount of time. AutoML in Azure Machine Learning allows you to train and tune a model using a target metric that you specify. This service iterates through machine learning algorithms and feature selection approaches, producing a score that measures the quality of each machine learning pipeline. The best model will then be selected based on the scores. For more technical details about Azure AutoML, please check [this paper](https://papers.nips.cc/paper/7595-probabilistic-matrix-factorization-for-automated-machine-learning.pdf).\n",
"\n",
"This notebook uses [Azure ML SDK](https://docs.microsoft.com/en-us/python/api/overview/azureml-sdk/?view=azure-ml-py) which is included in the `forecasting_env` conda environment. If you are running in Azure Notebooks or another Microsoft managed environment, the SDK is already installed. On the other hand, if you are running this notebook in your own environment, please follow [SDK installation instructions](https://docs.microsoft.com/azure/machine-learning/service/how-to-configure-environment) to install the SDK."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Global Settings and Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%load_ext autoreload\n",
"%autoreload 2\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import sys\n",
"import math\n",
"import datetime\n",
"import logging\n",
"import azureml.core\n",
"import azureml.automl\n",
"import pandas as pd\n",
"\n",
"from matplotlib import pyplot as plt\n",
"from fclib.common.utils import git_repo_path\n",
"from fclib.evaluation.evaluation_utils import MAPE\n",
"from fclib.dataset.ojdata import download_ojdata, FIRST_WEEK_START\n",
"from fclib.common.utils import align_outputs\n",
"from fclib.models.multiple_linear_regression import fit, predict\n",
"\n",
"from azureml.core import Workspace\n",
"from azureml.core.dataset import Dataset\n",
"from azureml.core.experiment import Experiment\n",
"from automl.client.core.common import constants\n",
"from azureml.train.automl import AutoMLConfig\n",
"from azureml.core.compute import ComputeTarget, AmlCompute\n",
"from azureml.core.compute_target import ComputeTargetException\n",
"from azureml.automl.core._vendor.automl.client.core.common import metrics\n",
"\n",
"print(\"System version: {}\".format(sys.version))\n",
"print(\"This notebook was created using version 1.0.85 of the Azure ML SDK\")\n",
"print(\"You are currently using version\", azureml.core.VERSION, \"of the Azure ML SDK\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Use False if you've already downloaded and split the data\n",
"DOWNLOAD_SPLIT_DATA = True\n",
"\n",
"# Data directory\n",
"DATA_DIR = os.path.join(git_repo_path(), \"ojdata\")\n",
"\n",
"# Forecasting settings\n",
"GAP = 2\n",
"LAST_WEEK = 138\n",
"\n",
"# Number of test periods\n",
"NUM_TEST_PERIODS = 3\n",
"\n",
"# Column names\n",
"time_column_name = \"week_start\"\n",
"target_column_name = \"move\"\n",
"grain_column_names = [\"store\", \"brand\"]\n",
"index_column_names = [time_column_name] + grain_column_names\n",
"\n",
"# Subset of stores used in the notebook\n",
"USE_STORES = [2, 5, 8]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Set up Azure Machine Learning Workspace\n",
"\n",
"An Azure ML workspace is an Azure resource that organizes and coordinates the actions of many other Azure resources to assist in executing and sharing machine learning workflows. In particular, an Azure ML workspace coordinates storage, databases, and compute resources providing added functionality for machine learning experimentation, deployment, inference, and the monitoring of deployed models. To create an Azure ML workspace, first you need access to an Azure subscription. An Azure subscription allows you to manage storage, compute, and other assets in the Azure cloud. You can [create a new subscription](https://azure.microsoft.com/en-us/free/) or access existing subscription information from the [Azure portal](https://portal.azure.com/). Given that you have access to your Azure subscription, you can further create an Azure ML workspace by following the instructions [here](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-manage-workspace). You can also do so [using Azure CLI](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-manage-workspace-cli) or the `Workspace.create()` method in Azure SDK.\n",
"\n",
"In the following cell, please replace the value of each parameter with the value of the corresponding attribute of your workspace."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"subscription_id = \"<my-subscription-id>\"\n",
"resource_group = \"<my-resource-group>\"\n",
"workspace_name = \"<my-workspace-name>\"\n",
"workspace_region = \"eastus2\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Access Azure ML Workspace\n",
"\n",
"In what follows, we use Azure ML SDK to attempt to load the workspace specified by your parameters. The cell can fail if the specified workspace doesn't exist or you don't have permissions to access it. Hence, you may need to log into your Azure account and change the default subscription to the one which the workspace belongs to using Azure CLI `az account set --subscription <name or id>`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"try:\n",
" ws = Workspace.create(subscription_id=subscription_id, resource_group=resource_group, \n",
" name=workspace_name, create_resource_group=True, exist_ok=True, \n",
" location=workspace_region)\n",
" # write the details of the workspace to a configuration file to the notebook library\n",
" ws.write_config()\n",
" print(\"Workspace configuration succeeded. Skip the workspace creation steps below\")\n",
"except ValueError:\n",
" raise Exception(\"Workspace not accessible. Change your parameters or create a new workspace below\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create compute resources for your experiments\n",
"\n",
"We run AutoML on a dynamically scalable compute cluster. To create a compute cluster, you need to specify a compute configuration that specifies the type of machine to be used and the scalability behaviors. Then you choose a name for the cluster that is unique within the workspace that can be used to address the cluster later."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Choose a name for your CPU cluster\n",
"cpu_cluster_name = \"cpu-cluster\"\n",
"\n",
"# Verify that cluster does not exist already\n",
"workspace_compute = ws.compute_targets\n",
"if cpu_cluster_name in workspace_compute:\n",
" print(\"Found existing cpu-cluster\")\n",
" cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)\n",
"else: \n",
" print(\"Creating new cpu-cluster\")\n",
"\n",
" # Specify the configuration for the new cluster\n",
" compute_config = AmlCompute.provisioning_configuration(vm_size=\"STANDARD_D2_V2\", min_nodes=4, max_nodes=4)\n",
"\n",
" # Create the cluster with the specified name and configuration\n",
" cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)\n",
"\n",
" # Wait for the cluster to complete, show the output log\n",
" cpu_cluster.wait_for_completion(show_output=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Define Experiment\n",
"\n",
"To run AutoML, you need to create an Experiment. An Experiment corresponds to a prediction problem you are trying to solve, while a Run corresponds to a specific approach to the problem."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# choose a name for the run history container in the workspace\n",
"experiment_name = \"automl-ojforecasting\"\n",
"\n",
"experiment = Experiment(ws, experiment_name)\n",
"\n",
"output = {}\n",
"output[\"SDK version\"] = azureml.core.VERSION\n",
"output[\"Workspace\"] = ws.name\n",
"output[\"SKU\"] = ws.sku\n",
"output[\"Resource Group\"] = ws.resource_group\n",
"output[\"Location\"] = ws.location\n",
"output[\"Run History Name\"] = experiment_name\n",
"pd.set_option(\"display.max_colwidth\", -1)\n",
"outputDf = pd.DataFrame(data=output, index=[\"\"])\n",
"outputDf.T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Preparation\n",
"\n",
"We need to download the Orange Juice data and split it into training and test sets. By default, the following cell will download and spit the data. If you've already done so, you may skip this part by switching `DOWNLOAD_SPLIT_DATA` to `False`.\n",
"\n",
"We store the training data and test data using dataframes. The training data includes `train_df` and `aux_df` with `train_df` containing the historical sales up to week 135 (the time we make forecasts) and `aux_df` containing price/promotion information up until week 138. We assume that future price and promotion information up to a certain number of weeks ahead is predetermined and known. The test data is stored in `test_df` which contains the sales of each product in week 137 and 138. Assuming the current week is week 135, our goal is to forecast the sales in week 137 and 138 using the training data. There is a one-week gap between the current week and the first target week of forecasting as we want to leave time for planning inventory in practice."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data download and split"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"if DOWNLOAD_SPLIT_DATA:\n",
" download_ojdata(DATA_DIR)\n",
" df = pd.read_csv(os.path.join(DATA_DIR, \"yx.csv\"))\n",
" df = df.loc[df.week <= LAST_WEEK]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Convert logarithm of the unit sales to unit sales\n",
"df[\"move\"] = df[\"logmove\"].apply(lambda x: round(math.exp(x)))\n",
"# Add timestamp column\n",
"df[\"week_start\"] = df[\"week\"].apply(lambda x: FIRST_WEEK_START + datetime.timedelta(days=(x - 1) * 7))\n",
"# Select a subset of stores for demo purpose\n",
"df_sub = df[df.store.isin(USE_STORES)]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Split data into training and test sets\n",
"def split_last_n_by_grain(df, n):\n",
" \"\"\"Group df by grain and split on last n rows for each group.\"\"\"\n",
" df_grouped = df.sort_values(time_column_name).groupby( # Sort by ascending time\n",
" grain_column_names, group_keys=False\n",
" )\n",
" df_head = df_grouped.apply(lambda dfg: dfg.iloc[:-n])\n",
" df_tail = df_grouped.apply(lambda dfg: dfg.iloc[-n:])\n",
" return df_head, df_tail\n",
"\n",
"\n",
"train_df, test_df = split_last_n_by_grain(df_sub, NUM_TEST_PERIODS)\n",
"train_df.reset_index(drop=True)\n",
"test_df.reset_index(drop=True)\n",
"\n",
"# Save data locally\n",
"local_data_pathes = [\n",
" os.path.join(DATA_DIR, \"train.csv\"),\n",
" os.path.join(DATA_DIR, \"test.csv\"),\n",
"]\n",
"\n",
"train_df.to_csv(local_data_pathes[0], index=None, header=True)\n",
"test_df.to_csv(local_data_pathes[1], index=None, header=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Upload data to datastore\n",
"\n",
"The [Machine Learning service workspace](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-workspace), is paired with the storage account, which contains the default data store. We will use it to upload the train and test data and create [tabular datasets](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) for training and testing. A tabular dataset defines a series of lazily-evaluated, immutable operations to load data from the data source into tabular representation.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"datastore = ws.get_default_datastore()\n",
"datastore.upload_files(files=local_data_pathes, target_path=\"dataset/\", overwrite=True, show_progress=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create dataset for training"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_dataset = Dataset.Tabular.from_delimited_files(path=datastore.path(\"dataset/train.csv\"))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_dataset.to_pandas_dataframe().tail()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Modeling\n",
"\n",
"For forecasting tasks, AutoML uses pre-processing and estimation steps that are specific to time-series. AutoML will undertake the following pre-processing steps:\n",
"* Detect time-series sample frequency (e.g. hourly, daily, weekly) and create new records for absent time points to make the series regular. A regular time series has a well-defined frequency and has a value at every sample point in a contiguous time span\n",
"* Impute missing values in the target (via forward-fill) and feature columns (using median column values)\n",
"* Create grain-based features to enable fixed effects across different series\n",
"* Create time-based features to assist in learning seasonal patterns\n",
"* Encode categorical variables to numeric quantities\n",
"\n",
"In this notebook, AutoML will train a single, regression-type model across all time-series in a given training set. This allows the model to generalize across related series. To create a training job, we use AutoML Config object to define the settings and data. Here is a summary of the meanings of the AutoMLConfig parameters:\n",
"\n",
"|Property|Description|\n",
"|-|-|\n",
"|**task**|forecasting|\n",
"|**primary_metric**|This is the metric that you want to optimize.<br> Forecasting supports the following primary metrics <br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>\n",
"|**experiment_timeout_hours**|Experimentation timeout in hours.|\n",
"|**enable_early_stopping**|If early stopping is on, training will stop when the primary metric is no longer improving.|\n",
"|**training_data**|Input dataset, containing both features and label column.|\n",
"|**label_column_name**|The name of the label column.|\n",
"|**compute_target**|The remote compute for training.|\n",
"|**n_cross_validations**|Number of cross-validation folds to use for model/pipeline selection|\n",
"|**enable_voting_ensemble**|Allow AutoML to create a Voting ensemble of the best performing models|\n",
"|**enable_stack_ensemble**|Allow AutoML to create a Stack ensemble of the best performing models|\n",
"|**debug_log**|Log file path for writing debugging information|\n",
"|**time_column_name**|Name of the datetime column in the input data|\n",
"|**grain_column_names**|Name(s) of the columns defining individual series in the input data|\n",
"|**drop_column_names**|Name(s) of columns to drop prior to modeling|\n",
"|**max_horizon**|Maximum desired forecast horizon in units of time-series frequency|"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Model training"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"time_series_settings = {\n",
" \"time_column_name\": time_column_name,\n",
" \"grain_column_names\": grain_column_names,\n",
" \"drop_column_names\": [\"logmove\"], # 'logmove' is a leaky feature, so we remove it.\n",
" \"max_horizon\": NUM_TEST_PERIODS,\n",
"}\n",
"\n",
"automl_config = AutoMLConfig(\n",
" task=\"forecasting\",\n",
" debug_log=\"automl_oj_sales_errors.log\",\n",
" primary_metric=\"normalized_mean_absolute_error\",\n",
" experiment_timeout_hours=0.6, # You may increase this number to improve model accuracy\n",
" training_data=train_dataset,\n",
" label_column_name=target_column_name,\n",
" compute_target=cpu_cluster,\n",
" enable_early_stopping=True,\n",
" n_cross_validations=3,\n",
" verbosity=logging.INFO,\n",
" **time_series_settings\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"remote_run = experiment.submit(automl_config, show_output=False)\n",
"remote_run"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"remote_run.wait_for_completion()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Retrieve the best model\n",
"\n",
"Each run within an Experiment stores serialized (i.e. pickled) pipelines from the AutoML iterations. After the training job is done, we can retrieve the pipeline with the best performance on the validation dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"best_run, fitted_model = remote_run.get_output()\n",
"print(fitted_model.steps)\n",
"model_name = best_run.properties[\"model_name\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Forecasting\n",
"\n",
"Now that we have retrieved the best model pipeline, we can apply it to generate forecasts for the target weeks. To do this, we first remove the target values from the test set"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Generate forecasts"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_test = test_df\n",
"y_test = X_test.pop(target_column_name).values"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_test.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The featurized data, aligned to y, will also be returned. It contains the assumptions\n",
"# that were made in the forecast and helps align the forecast to the original data.\n",
"y_predictions, X_trans = fitted_model.forecast(X_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We need to align the output explicitly to the input, as the count and order of the rows may have changed during transformations that span multiple rows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pred_automl = align_outputs(y_predictions, X_trans, X_test, y_test, target_column_name)\n",
"pred_automl.head()"
]
},
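{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, below is a minimal sketch of what an alignment helper like `align_outputs` may do; this is an illustrative assumption, not necessarily this repository's implementation. It assumes that `X_trans` exposes the time and grain columns after `reset_index()` and that the forecasts are stored in a `predicted` column."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical sketch of an alignment helper (not necessarily the repository's implementation)\n",
"def align_outputs_sketch(y_predictions, X_trans, X_test, y_test, target_column_name):\n",
"    key_cols = [time_column_name] + grain_column_names\n",
"    # Attach predictions to the time/grain keys of the transformed rows\n",
"    df_fcst = X_trans.reset_index()[key_cols].copy()\n",
"    df_fcst[\"predicted\"] = y_predictions\n",
"    # Attach the ground truth to the original test rows\n",
"    df_actual = X_test.copy()\n",
"    df_actual[target_column_name] = y_test\n",
"    # Join on time and grain so rows match even if their order or count changed\n",
"    return df_actual.merge(df_fcst, on=key_cols, how=\"left\")"
]
},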
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Results evaluation & visualization"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Use automl metrics module\n",
"scores = metrics.compute_metrics_regression(\n",
" pred_automl[\"predicted\"],\n",
" pred_automl[target_column_name],\n",
" list(constants.Metric.SCALAR_REGRESSION_SET),\n",
" None,\n",
" None,\n",
" None,\n",
")\n",
"\n",
"print(\"[Test data scores]\\n\")\n",
"for key, value in scores.items():\n",
" print(\"{}: {:.3f}\".format(key, value))\n",
"\n",
"# Plot outputs\n",
"%matplotlib inline\n",
"test_pred = plt.scatter(pred_automl[target_column_name], pred_automl[\"predicted\"], color=\"b\")\n",
"test_test = plt.scatter(pred_automl[target_column_name], pred_automl[target_column_name], color=\"g\")\n",
"plt.legend((test_pred, test_test), (\"prediction\", \"truth\"), loc=\"upper left\", fontsize=8)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We also compute MAPE of the forecasts in the last two weeks of the forecast period in order to be consistent with the evaluation period that is used in other quick start examples."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pred_automl_sub = pred_automl.loc[pred_automl.week >= max(test_df.week) - NUM_TEST_PERIODS + GAP]\n",
"mape_automl_sub = MAPE(pred_automl_sub[\"predicted\"], pred_automl_sub[\"move\"]) * 100\n",
"print(\"MAPE of forecasts obtained by AutoML in the last two weeks: \" + str(mape_automl_sub))"
]
},
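{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, here is a minimal sketch of the MAPE computation, assuming the `MAPE` helper used above follows the standard definition (the result is multiplied by 100 in this notebook to express it as a percentage):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"\n",
"def mape_sketch(predictions, actuals):\n",
"    \"\"\"Mean absolute percentage error: mean(|y_hat - y| / |y|).\"\"\"\n",
"    predictions, actuals = np.asarray(predictions), np.asarray(actuals)\n",
"    return np.mean(np.abs(predictions - actuals) / np.abs(actuals))"
]
},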
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Combine AutoML Model with a Custom Model\n",
"\n",
"So far we have demonstrated how we can quickly build a forecasting model with AutoML in Azure. Next, we further show a simple way to achieve more robust and accurate forecasts by combining the forecasts from AutoML and a custom model that the user may have. Here we assume that the user have also constructed a series of linear regression models with each model forecasts the sales of a specfic store-brand using `scikit-learn` package."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Multiple linear regression models"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create price features\n",
"df_sub[\"price\"] = df_sub.apply(lambda x: x.loc[\"price\" + str(int(x.loc[\"brand\"]))], axis=1)\n",
"price_cols = [\n",
" \"price1\",\n",
" \"price2\",\n",
" \"price3\",\n",
" \"price4\",\n",
" \"price5\",\n",
" \"price6\",\n",
" \"price7\",\n",
" \"price8\",\n",
" \"price9\",\n",
" \"price10\",\n",
" \"price11\",\n",
"]\n",
"df_sub[\"avg_price\"] = df_sub[price_cols].sum(axis=1).apply(lambda x: x / len(price_cols))\n",
"df_sub[\"price_ratio\"] = df_sub.apply(lambda x: x[\"price\"] / x[\"avg_price\"], axis=1)\n",
"\n",
"# Create lag features on unit sales\n",
"df_sub[\"move_lag1\"] = df_sub[\"move\"].shift(1)\n",
"df_sub[\"move_lag2\"] = df_sub[\"move\"].shift(2)\n",
"\n",
"# Drop rows with NaN values\n",
"df_sub.dropna(inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After splitting the data, we use `fit()` and `predit()` functions from `fclib.models.multiple_linear_regression` to train separate linear regression model for each invididual time series and generate forecasts for the sales during the test period."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Split data into training and test sets\n",
"train_df, test_df = split_last_n_by_grain(df_sub, NUM_TEST_PERIODS)\n",
"train_df.reset_index(drop=True)\n",
"test_df.reset_index(drop=True)\n",
"\n",
"# Train multiple linear regression models\n",
"fea_column_names = [\"move_lag1\", \"move_lag2\", \"price\", \"price_ratio\"]\n",
"lr_models = fit(train_df, grain_column_names, fea_column_names, target_column_name)\n",
"\n",
"# Generate forecasts with the trained models\n",
"pred_all = predict(test_df, lr_models, time_column_name, grain_column_names, fea_column_names)\n",
"\n",
"pred_lr = pd.merge(pred_all, test_df, on=index_column_names)\n",
"pred_lr.head()"
]
},
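{
"cell_type": "markdown",
"metadata": {},
"source": [
"For illustration, the following is a minimal sketch of how such per-series training could be implemented; the actual `fit()` in `fclib.models.multiple_linear_regression` may differ."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression\n",
"\n",
"\n",
"# Hypothetical sketch of per-series model fitting (not necessarily fclib's implementation)\n",
"def fit_sketch(df, grain_cols, feature_cols, target_col):\n",
"    models = {}\n",
"    for grain, group in df.groupby(grain_cols):\n",
"        model = LinearRegression()\n",
"        model.fit(group[feature_cols], group[target_col])\n",
"        models[grain] = model\n",
"    return models"
]
},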
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's check the accuracy of the predictions on the entire forecast period as well as in the last two weeks of the forecast period.\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mape_lr_entire = MAPE(pred_lr[\"prediction\"], pred_lr[\"move\"]) * 100\n",
"print(\"MAPE of forecasts obtained by multiple linear regression on entire test period: \" + str(mape_lr_entire))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pred_lr_sub = pred_lr.loc[pred_lr.week >= max(test_df.week) - NUM_TEST_PERIODS + GAP]\n",
"mape_lr_sub = MAPE(pred_lr_sub[\"prediction\"], pred_lr_sub[\"move\"]) * 100\n",
"print(\"MAPE of forecasts obtained by multiple linear regression in the last two weeks: \" + str(mape_lr_sub))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Combine forecasts from different methods\n",
"\n",
"We can combine the forecasts obtained by AutoML and multiple linear regression using weighted average and evaluate the final forecasts. Usually the combined forecasts will be more robust as a combination of two methods can reduce the chance of model overfitting. Here we use equal weights which can be further adjusted according to our confidence on each model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pred_final = pd.merge(\n",
" pred_automl[index_column_names + [\"predicted\", \"move\", \"week\"]],\n",
" pred_lr[index_column_names + [\"prediction\"]],\n",
" on=index_column_names,\n",
" how=\"left\",\n",
")\n",
"pred_final[\"combined_prediction\"] = pred_final[\"predicted\"] * 0.5 + pred_final[\"prediction\"] * 0.5"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mape_entire = MAPE(pred_final[\"combined_prediction\"], pred_final[\"move\"]) * 100\n",
"print(\"MAPE of forecasts obtained by the combined model on entire test period: \" + str(mape_entire))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pred_final_sub = pred_final.loc[pred_final.week >= max(test_df.week) - NUM_TEST_PERIODS + GAP]\n",
"mape_final_sub = MAPE(pred_final_sub[\"combined_prediction\"], pred_final_sub[\"move\"]) * 100\n",
"print(\"MAPE of forecasts obtained by the combined model in the last two weeks: \" + str(mape_final_sub))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Additional Reading\n",
"\n",
"\\[1\\] Nicolo Fusi, Rishit Sheth, and Melih Elibol. 2018. Probabilistic Matrix Factorization for Automated Machine Learning. In Advances in Neural Information Processing Systems. 3348-3357.<br>\n",
"\\[2\\] Azure AutoML Package Docs: https://docs.microsoft.com/en-us/python/api/azureml-train-automl/azureml.train.automl?view=azure-ml-py <br>\n",
"\\[3\\] Azure Automated Machine Learning Examples: https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/automated-machine-learning <br>\n",
"\n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "forecasting_env",
"language": "python",
"name": "forecasting_env"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}


@ -6,7 +6,7 @@
"source": [
"<i>Copyright (c) Microsoft Corporation.</i>\n",
"\n",
"<i>Licensed under the MIT License.</i> "
"<i>Licensed under the MIT License.</i>"
]
},
{
@ -104,11 +104,7 @@
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"tags": [
"parameters"
]
},
"metadata": {},
"outputs": [],
"source": [
"# Use False if you've already downloaded and split the data\n",
@ -591,10 +587,6 @@
}
],
"metadata": {
"author_info": {
"affiliation": "Microsoft",
"created_by": "Chenhui Hu"
},
"kernelspec": {
"display_name": "forecasting_env",
"language": "python",


@ -160,7 +160,7 @@
"For demonstration, this is what the time series split on the Orange Juice dataset looks like, for the parameters listed above.\n",
"For `HORIZON = 2` and `GAP = 2`, assuming the current week is week `153`, our goal is to forecast the sales in week `155` and `156` using the training data. As you can see, the first forecasting week is `two` weeks away from the current week, as we want to leave time for planning inventory in practice.\n",
"\n",
"![Single split](../../../../assets/time_series_split_singleround.jpg)\n",
"![Single split](../../assets/time_series_split_singleround.jpg)\n",
"\n",
"We also refer to splits as rounds, so for `N_SPLITS = 1`, we have single-round forecasting, and for `N_SPLITS > 1`, we have multi-round forecasting."
]
@ -1120,7 +1120,7 @@
"\n",
"For demonstration, this is what the time series splits would look like for `N_SPLITS = 5`, and using other settings as above:\n",
"\n",
"![Multi split](../../../../assets/time_series_split_multiround.jpg)\n"
"![Multi split](../../assets/time_series_split_multiround.jpg)\n"
]
},
{


@ -38,7 +38,8 @@
"metadata": {},
"outputs": [],
"source": [
"%load_ext tensorboard"
"%load_ext tensorboard\n",
"%load_ext blackcellmagic"
]
},
{
@ -104,15 +105,11 @@
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"tags": [
"parameters"
]
},
"metadata": {},
"outputs": [],
"source": [
"# Use False if you've already downloaded and split the data\n",
"DOWNLOAD_SPLIT_DATA = True\n",
"DOWNLOAD_SPLIT_DATA = False # True\n",
"\n",
"# Data directories\n",
"DATA_DIR = os.path.join(git_repo_path(), \"ojdata\")\n",
@ -244,7 +241,7 @@
" data_filled = pd.merge(data_grid, train_df, how=\"left\", on=[\"store\", \"brand\", \"week\"])\n",
"\n",
" # Get future price, deal, and advertisement info\n",
" aux_df = pd.read_csv(os.path.join(TRAIN_DIR, \"auxi_\" + str(pred_round) + \".csv\"))\n",
" aux_df = pd.read_csv(os.path.join(TRAIN_DIR, \"aux_\" + str(pred_round) + \".csv\"))\n",
" data_filled = pd.merge(data_filled, aux_df, how=\"left\", on=[\"store\", \"brand\", \"week\"])\n",
"\n",
" # Create relative price feature\n",
@ -941,10 +938,6 @@
}
],
"metadata": {
"author_info": {
"affiliation": "Microsoft",
"created_by": "Chenhui Hu"
},
"kernelspec": {
"display_name": "forecasting_env",
"language": "python",


@ -103,11 +103,7 @@
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"tags": [
"parameters"
]
},
"metadata": {},
"outputs": [],
"source": [
"# Use False if you've already downloaded and split the data\n",
@ -248,7 +244,7 @@
" data_filled = pd.merge(data_grid, train_df, how=\"left\", on=[\"store\", \"brand\", \"week\"])\n",
"\n",
" # Get future price, deal, and advertisement info\n",
" aux_df = pd.read_csv(os.path.join(train_dir, \"auxi_\" + str(pred_round) + \".csv\"))\n",
" aux_df = pd.read_csv(os.path.join(train_dir, \"aux_\" + str(pred_round) + \".csv\"))\n",
" data_filled = pd.merge(data_filled, aux_df, how=\"left\", on=[\"store\", \"brand\", \"week\"])\n",
"\n",
" # Create relative price feature\n",
@ -4133,10 +4129,6 @@
}
],
"metadata": {
"author_info": {
"affiliation": "Microsoft",
"created_by": "Chenhui Hu"
},
"kernelspec": {
"display_name": "forecasting_env",
"language": "python",
@ -4152,7 +4144,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.10"
"version": "3.5.6-final"
}
},
"nbformat": 4,


@ -1,16 +1,16 @@
# Forecasting examples
This folder contains Python and R examples for building forecasting solutions, presented in Python Jupyter notebooks and R Markdown files, respectively. The examples are organized by forecasting scenario, with each subdirectory under `examples/` named after the specific use case.
At the moment, the repository contains a single retail sales forecasting scenario utilizing [Dominick's OrangeJuice data set](https://www.chicagobooth.edu/research/kilts/datasets/dominicks). The name of the directory is `grocery_sales`.
This folder contains Python examples for building forecasting solutions. To run the notebooks, please execute `jupyter notebook` and select the Jupyter kernel `forecasting_env` if you are using a local machine. Otherwise, if you use a remote VM, you can start the notebooks via `jupyter notebook --no-browser` and forward the port where the notebooks are running (e.g., 8888) to the local machine via `ssh <user-name>@<ip-address-of-the-vm> -L 8888:localhost:8888`.
## Summary
The following table summarizes each forecasting scenario contained in the repository and links to the available content within that scenario.
| Directory | Content | Description |
|----------------------------------|----------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------|
| [grocery_sales](./grocery_sales) | [python/](./grocery_sales/python) <br> [R/](./grocery_sales/R) | Python and R examples for forecasting sales of orange juice in [Dominick's dataset](https://www.chicagobooth.edu/research/kilts/datasets/dominicks). |
The following summarizes each directory of the best practice notebooks.
| Directory | Content | Description |
| --- | --- | --- |
| [00_quick_start](./00_quick_start)| [auto_arima_forecasting.ipynb](./00_quick_start/auto_arima_forecasting.ipynb) <br>[azure_automl_forecast.ipynb](./00_quick_start/azure_automl_forecast.ipynb) <br> [lightgbm_point_forecast.ipynb](./00_quick_start/lightgbm_point_forecast.ipynb) | Quick start notebooks that demonstrate the workflow of developing a forecasting model using one-round training and testing data|
| [01_prepare_data](./01_prepare_data) | [ojdata_exploration_retail.ipynb](./01_prepare_data/ojdata_exploration_retail.ipynb) <br> [ojdata_preparation_retail.ipynb](./01_prepare_data/ojdata_preparation_retail.ipynb) | Data exploration and preparation notebooks|
| [02_model](./02_model) | [dilatedcnn_point_forecast_multiround.ipynb](./02_model/dilatedcnn_point_forecast_multiround.ipynb) <br> [lightgbm_point_forecast_multiround.ipynb](./02_model/lightgbm_point_forecast_multiround.ipynb) | Deep dive notebooks that perform multi-round training and testing of various classical and deep learning forecast algorithms|
| [03_model_select_deploy](03_model_select_deploy) | Example notebook to be added soon | Best practice notebook for model selection using Azure Machine Learning Service and for deploying the best model on Azure|


@ -1,96 +0,0 @@
---
title: Data preparation
output: html_notebook
---
_Copyright (c) Microsoft Corporation._<br/>
_Licensed under the MIT License._
In this notebook, we generate the datasets that will be used for model training and validation.
The orange juice dataset comes from the bayesm package, and gives pricing and sales figures over time for a variety of orange juice brands in several stores in Florida. Rather than installing the entire package (which is very complex), we download the dataset itself from the GitHub mirror of the CRAN repository.
```{r, results="hide", message=FALSE}
# download the data from the GitHub mirror of the bayesm package source
ojfile <- tempfile(fileext=".rda")
download.file("https://github.com/cran/bayesm/raw/master/data/orangeJuice.rda", ojfile)
load(ojfile)
file.remove(ojfile)
```
The dataset generation parameters are obtained from the file `ojdata_forecast_settings.yaml`; you can modify that file to vary the experimental setup. The settings are:
| Parameter | Description | Default |
|-----------|-------------|---------|
| `N_SPLITS` | The number of splits to make. | 10 |
| `HORIZON` | The forecast horizon for the test dataset for each split. | 2 |
| `GAP` | The gap in weeks from the end of the training period to the start of the testing period; see below. | 2 |
| `FIRST_WEEK` | The first week of data to use. | 40 |
| `LAST_WEEK` | The last week of data to use. | 156 |
| `START_DATE` | The actual calendar date for the start of the first week in the data. | `1989-09-14` |
A complicating factor is that the data does not include every possible combination of store, brand and date, so we have to pad out the missing rows with `complete`. In addition, one store/brand combination has no data beyond week 156; we therefore end the analysis at this week. We also do _not_ fill in the missing values in the data, as many of the modelling functions in the fable package can handle this innately.
```{r, results="hide", message=FALSE}
library(tidyr)
library(dplyr)
library(tsibble)
library(feasts)
library(fable)
settings <- yaml::read_yaml(here::here("examples/grocery_sales/R/forecast_settings.yaml"))
start_date <- as.Date(settings$START_DATE)
train_periods <- seq(to=settings$LAST_WEEK - settings$HORIZON - settings$GAP + 1,
by=settings$HORIZON,
length.out=settings$N_SPLITS)
oj_data <- orangeJuice$yx %>%
complete(store, brand, week) %>%
mutate(week=yearweek(start_date + week*7)) %>%
as_tsibble(index=week, key=c(store, brand))
```
Here are some glimpses of what the data looks like. The dependent variable is `logmove`, the logarithm of the total sales for a given brand and store, in a particular week.
```{r}
head(oj_data)
```
The time series plots for a small subset of brands and stores are shown below. We can make the following observations:
- There appears to be little seasonal variation in sales (probably because Florida does not have very distinct seasons). In any case, with less than 2 years of observations, the time series is not long enough for many model-fitting functions in the fable package to automatically estimate seasonal parameters.
- While some store/brand combinations show weak trends over time, this is far from universal.
- Different brands can exhibit very different behaviour, especially in terms of variation about the mean.
- Many of the time series have missing values, indicating that the dataset is incomplete.
```{r, fig.height=10}
library(ggplot2)
oj_data %>%
filter(store < 25, brand < 5) %>%
ggplot(aes(x=week, y=logmove)) +
geom_line() +
scale_x_date(labels=NULL) +
facet_grid(vars(store), vars(brand), labeller="label_both")
```
Finally, we split the dataset into separate samples for training and testing. The scheme used is broadly time series cross-validation, whereby we train a model on data up to time $t$, test it on data for times $t+1$ to $t+k$, then train on data up to time $t+k$, test it on data for times $t+k+1$ to $t+2k$, and so on. In this specific case study, however, we introduce a small extra piece of complexity based on discussions with domain experts. We train a model on data up to week $t$, then test it on weeks $t+2$ to $t+3$. Then we train on data up to week $t+2$, and test it on weeks $t+4$ to $t+5$, and so on. There is thus always a gap of one week between the training and test samples. The reason for this is that in reality, inventory planning always takes some time; the gap allows store managers to prepare the stock based on the forecasted demand.
```{r}
subset_oj_data <- function(start, end)
{
start <- yearweek(start_date + start*7)
end <- yearweek(start_date + end*7)
filter(oj_data, week >= start, week <= end)
}
oj_train <- lapply(train_periods, function(i) subset_oj_data(settings$FIRST_WEEK, i))
oj_test <- lapply(train_periods, function(i) subset_oj_data(i + settings$GAP, i + settings$GAP + settings$HORIZON - 1))
save(oj_train, oj_test, file=here::here("examples/grocery_sales/R/data.Rdata"))
head(oj_train[[1]])
head(oj_test[[1]])
```


@ -1,87 +0,0 @@
---
title: Basic models
output: html_notebook
---
_Copyright (c) Microsoft Corporation._<br/>
_Licensed under the MIT License._
```{r, echo=FALSE, results="hide", message=FALSE}
library(tidyr)
library(dplyr)
library(tsibble)
library(feasts)
library(fable)
```
We fit some simple models to the orange juice data for illustrative purposes. Here, each model is actually a _group_ of models, one for each combination of store and brand. This is the standard approach taken in statistical forecasting, and is supported out-of-the-box by the tidyverts framework.
- `mean`: This is just a simple mean.
- `naive`: A random walk model without any other components. This amounts to setting all forecast values to the last observed value.
- `drift`: This adjusts the `naive` model to incorporate a straight-line trend.
- `arima`: An ARIMA model with the parameter values estimated from the data.
Note that the model training process is embarrassingly parallel on 3 levels:
- We have multiple independent training datasets;
- For which we fit multiple independent models;
- Within which we have independent sub-models for each store and brand.
This lets us speed up the training significantly. While the `fable::model` function can fit multiple models in parallel, we will run it sequentially here and instead parallelise by dataset. This avoids contention for cores, and also results in the simplest code. As a guard against returning invalid results, we also specify the argument `.safely=FALSE`; this forces `model` to throw an error if a model algorithm fails.
```{r}
srcdir <- here::here("R_utils")
for(src in dir(srcdir, full.names=TRUE)) source(src)
load_objects("grocery_sales", "data.Rdata")
cl <- make_cluster(libs=c("tidyr", "dplyr", "fable", "tsibble", "feasts"))
oj_modelset_basic <- parallel::parLapply(cl, oj_train, function(df)
{
model(df,
mean=MEAN(logmove),
naive=NAIVE(logmove),
drift=RW(logmove ~ drift()),
arima=ARIMA(logmove ~ pdq() + PDQ(0, 0, 0)),
.safely=FALSE
)
})
oj_fcast_basic <- parallel::clusterMap(cl, get_forecasts, oj_modelset_basic, oj_test)
save_objects(oj_modelset_basic, oj_fcast_basic,
example="grocery_sales", file="model_basic.Rdata")
do.call(rbind, oj_fcast_basic) %>%
mutate_at(-(1:3), exp) %>%
eval_forecasts()
```
The ARIMA model does the best of the simple models, but not any better than a simple mean.
Having fit some basic models, we can also try an exponential smoothing model, fit using the `ETS` function. Unlike the others, `ETS` does not currently support time series with missing values; we therefore have to use one of the other models to impute missing values first via the `interpolate` function.
```{r}
oj_modelset_ets <- parallel::clusterMap(cl, function(df, basicmod)
{
df %>%
interpolate(object=select(basicmod, -c(mean, naive, drift))) %>%
model(
ets=ETS(logmove ~ error("A") + trend("A") + season("N")),
.safely=FALSE
)
}, oj_train, oj_modelset_basic)
oj_fcast_ets <- parallel::clusterMap(cl, get_forecasts, oj_modelset_ets, oj_test)
destroy_cluster(cl)
save_objects(oj_modelset_ets, oj_fcast_ets,
example="grocery_sales", file="model_ets.Rdata")
do.call(rbind, oj_fcast_ets) %>%
mutate_at(-(1:3), exp) %>%
eval_forecasts()
```
The ETS model does _worse_ than the ARIMA model, something that should not be a surprise given the lack of strong seasonality and trend in this dataset. We conclude that any simple univariate approach is unlikely to do well.


@ -1,86 +0,0 @@
---
title: ARIMA-Regression models
output: html_notebook
---
_Copyright (c) Microsoft Corporation._<br/>
_Licensed under the MIT License._
```{r, echo=FALSE, results="hide", message=FALSE}
library(tidyr)
library(dplyr)
library(tsibble)
library(feasts)
library(fable)
```
This notebook builds on the output from "Basic models" by including regressor variables in the ARIMA model(s). We fit the following model types:
- `ar_trend` includes only a linear trend over time.
- `ar_reg` allows stepwise selection of independent regressors.
- `ar_reg_price`: rather than allowing the algorithm to select from the 11 price variables, we use only the price relevant to each brand. This is to guard against possible overfitting, something that classical stepwise procedures are wont to do.
- `ar_reg_price_trend` is the same as `ar_reg_price`, but including a linear trend.
As part of the modelling, we also compute a new independent variable `maxpricediff`, the log-ratio of the price of this brand compared to the best competing price. A positive `maxpricediff` means this brand is cheaper than all the other brands, and a negative `maxpricediff` means it is more expensive.
```{r}
srcdir <- here::here("R_utils")
for(src in dir(srcdir, full.names=TRUE)) source(src)
load_objects("grocery_sales", "data.Rdata")
cl <- make_cluster(libs=c("tidyr", "dplyr", "fable", "tsibble", "feasts"))
# add extra regression variables to training and test datasets
add_regvars <- function(df)
{
df %>%
group_by(store, brand) %>%
group_modify(~ {
pricevars <- grep("price", names(.x), value=TRUE)
thispricevar <- unique(paste0("price", .y$brand))
best_other_price <- do.call(pmin, .x[setdiff(pricevars, thispricevar)])
.x$price <- .x[[thispricevar]]
.x$maxpricediff <- log(best_other_price/.x$price)
.x
}) %>%
ungroup() %>%
mutate(week=yearweek(week)) %>% # need to recreate this variable because of tsibble/vctrs issues
as_tsibble(week, key=c(store, brand))
}
oj_trainreg <- parallel::parLapply(cl, oj_train, add_regvars)
oj_testreg <- parallel::parLapply(cl, oj_test, add_regvars)
save_objects(oj_trainreg, oj_testreg,
example="grocery_sales", file="data_reg.Rdata")
oj_modelset_reg <- parallel::parLapply(cl, oj_trainreg, function(df)
{
model(df,
ar_trend=ARIMA(logmove ~ pdq() + PDQ(0, 0, 0) + trend()),
ar_reg=ARIMA(logmove ~ pdq() + PDQ(0, 0, 0) + deal + feat + maxpricediff +
price1 + price2 + price3 + price4 + price5 + price6 + price7 + price8 + price9 + price10 + price11),
ar_reg_price=ARIMA(logmove ~ pdq() + PDQ(0, 0, 0) + deal + feat + maxpricediff + price),
ar_reg_price_trend=ARIMA(logmove ~ pdq() + PDQ(0, 0, 0) + trend() + deal + feat + maxpricediff + price),
.safely=FALSE
)
})
oj_fcast_reg <- parallel::clusterMap(cl, get_forecasts, oj_modelset_reg, oj_testreg)
destroy_cluster(cl)
save_objects(oj_modelset_reg, oj_fcast_reg,
example="grocery_sales", file="model_reg.Rdata")
do.call(rbind, oj_fcast_reg) %>%
mutate_at(-(1:3), exp) %>%
eval_forecasts()
```
This shows that the models incorporating price are a significant improvement over the previous naive models. The model that uses stepwise selection to choose the best price variable does worse than the one where we choose the price beforehand, confirming the suspicion that stepwise leads to overfitting in this case.


@ -1,67 +0,0 @@
---
title: Prophet models
output: html_notebook
---
_Copyright (c) Microsoft Corporation._<br/>
_Licensed under the MIT License._
```{r, echo=FALSE, results="hide", message=FALSE}
library(tidyr)
library(dplyr)
library(tsibble)
library(feasts)
library(fable)
library(prophet)
library(fable.prophet)
```
This notebook builds a forecasting model using the [Prophet](https://facebook.github.io/prophet/) algorithm. Prophet is a time series model developed by Facebook that is designed to be simple for non-experts to use, yet flexible and powerful.
> Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.
Here, we will use the fable.prophet package which provides a tidyverts frontend to the prophet package itself. As with ETS, prophet does not support time series with missing values, so we again impute them using the ARIMA model forecasts.
```{r}
srcdir <- here::here("R_utils")
for(src in dir(srcdir, full.names=TRUE)) source(src)
load_objects("grocery_sales", "data_reg.Rdata")
load_objects("grocery_sales", "model_basic.Rdata")
cl <- make_cluster(libs=c("tidyr", "dplyr", "fable", "tsibble", "feasts", "prophet", "fable.prophet"))
oj_modelset_pr <- parallel::clusterMap(cl, function(df, basicmod)
{
df$logmove <- interpolate(select(basicmod, -c(mean, naive, drift)), df)$logmove
df %>%
group_by(store, brand) %>%
fill(deal:maxpricediff, .direction="downup") %>%
model(
pr=prophet(logmove ~ deal + feat + price + maxpricediff),
pr_tune=prophet(logmove ~ deal + feat + price + maxpricediff +
growth(n_changepoints=2) + season(period=52, order=5, prior_scale=2)),
.safely=FALSE
)
}, oj_trainreg, oj_modelset_basic)
oj_fcast_pr <- parallel::clusterMap(cl, function(mable, newdata, fcast_func)
{
newdata <- newdata %>%
fill(deal:maxpricediff, .direction="downup")
fcast_func(mable, newdata)
}, oj_modelset_pr, oj_testreg, MoreArgs=list(fcast_func=get_forecasts))
destroy_cluster(cl)
save_objects(oj_modelset_pr, oj_fcast_pr,
example="grocery_sales", file="model_pr.Rdata")
do.call(rbind, oj_fcast_pr) %>%
mutate_at(-(1:3), exp) %>%
eval_forecasts()
```
It appears that Prophet does _not_ do better than the simple ARIMA model with regression variables. This is possibly because the dataset does not have a strong time series nature: there is no seasonality, and only weak or nonexistent trends. These are features which the Prophet algorithm is designed to detect, and their absence means that there would be little advantage in using it.


@ -1,45 +0,0 @@
# Forecasting examples in R: orange juice retail sales
The Rmarkdown notebooks in this directory are as follows. Each notebook also has a corresponding HTML file, which is the rendered output from running the code.
- [`01_dataprep.Rmd`](01_dataprep.Rmd) creates the training and test datasets
- [`02_basic_models.Rmd`](02_basic_models.Rmd) fits a range of simple time series models to the data, including ARIMA and ETS.
- [`02a_reg_models.Rmd`](02a_reg_models.Rmd) adds independent variables as regressors to the ARIMA model.
- [`02b_prophet_models.Rmd`](02b_prophet_models.Rmd) fits some simple models using the Prophet algorithm.
If you want to run the code in the notebooks interactively, you must start from `01_dataprep.Rmd` and proceed in sequence, as the earlier notebooks will generate artifacts (datasets/model objects) that are used by later ones.
## Package installation
The following packages are needed to run the basic analysis notebooks in this directory:
- rmarkdown
- dplyr
- tidyr
- ggplot2
- tsibble
- fable
- feasts
- yaml
- here
It's likely that you will already have many of these (particularly the [Tidyverse](https://tidyverse.org) packages) installed, if you use R for data science tasks. The main exceptions are the packages in the [Tidyverts](https://tidyverts.org) family, which is a modern framework for time series analysis building on the Tidyverse.
```r
install.packages("tidyverse") # installs all tidyverse packages
install.packages("rmarkdown")
install.packages("here")
install.packages(c("tsibble", "fable", "feasts"))
```
The following packages are needed to run the Prophet analysis notebook:
- prophet
- fable.prophet
While prophet is available from CRAN, its frontend for the tidyverts framework, fable.prophet, is currently on GitHub only. You can install these packages with
```r
install.packages("prophet")
install.packages("https://github.com/mitchelloharawild/fable.prophet/archive/master.tar.gz", repos=NULL)
```


@ -1,6 +0,0 @@
N_SPLITS: 10
HORIZON: 2
GAP: 2
FIRST_WEEK: 40
LAST_WEEK: 156
START_DATE: "1989-09-14"


@ -1,26 +0,0 @@
# Forecasting examples
This folder contains Python and R examples for building forecasting solutions on the Orange Juice dataset, which is part of the [Dominick's dataset](https://www.chicagobooth.edu/research/kilts/datasets/dominicks). The examples are presented in Python Jupyter notebooks and R Markdown files, respectively.
## Orange Juice Dataset
In this scenario, we use the Orange Juice (OJ) dataset to forecast orange juice sales. The OJ dataset is from the R package [bayesm](https://cran.r-project.org/web/packages/bayesm/index.html) and is part of the [Dominick's dataset](https://www.chicagobooth.edu/research/kilts/datasets/dominicks).
This dataset contains the following two tables:
- **yx.csv** - Weekly sales of refrigerated orange juice at 83 stores. This table has 106139 rows and 19 columns. It includes weekly sales and prices of 11 orange juice brands as well as information about profit, deal, and advertisement for each brand. Note that the weekly sales are captured by a column named `logmove`, which corresponds to the natural logarithm of the number of units sold; to recover the number of units sold, apply an exponential transform to this column (see the sketch below).
- **storedemo.csv** - Demographic information on those stores. This table has 83 rows and 13 columns. For every store, the table describes demographic information of its consumers, distance to the nearest warehouse store, average distance to the nearest 5 supermarkets, ratio of its sales to the nearest warehouse store, and ratio of its sales to the average of the nearest 5 stores.
Note that the week number starts from 40 in this dataset, while the full Dominick's dataset has data starting from week 1 to week 400. According to [Dominick's Data Manual](https://www.chicagobooth.edu/-/media/enterprise/centers/kilts/datasets/dominicks-dataset/dominicks-manual-and-codebook_kiltscenter.aspx), week 1 starts on 09/14/1989. Please see pages 40 and 41 of the [bayesm reference manual](https://cran.r-project.org/web/packages/bayesm/bayesm.pdf) and the [Dominick's Data Manual](https://www.chicagobooth.edu/-/media/enterprise/centers/kilts/datasets/dominicks-dataset/dominicks-manual-and-codebook_kiltscenter.aspx) for more details about the data.
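As a quick illustration, the unit sales can be recovered from `logmove` with an exponential transform. Below is a minimal sketch in Python; the file path is an assumption and should be adjusted to wherever the data was downloaded.

```python
import numpy as np
import pandas as pd

# Hypothetical path; adjust to where yx.csv is stored
yx = pd.read_csv("ojdata/yx.csv")
yx["move"] = np.exp(yx["logmove"]).round()  # number of units sold
```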
## Summary
The following summarizes each directory of the forecasting examples.
| Directory | Content | Description |
| --- | --- | --- |
| [python](./python)| [00_quick_start/](./python/00_quick_start) <br>[01_prepare_data/](./python/01_prepare_data) <br> [02_model/](./python/02_model) <br> [03_model_tune_deploy/](./python/03_model_tune_deploy/) | <ul> <li> Quick start examples for single-round training </li> <li> Data exploration and preparation notebooks </li> <li> Multi-round training examples </li> <li> Model tuning and deployment example </li> </ul> |
| [R](./R) | [01_dataprep.Rmd](R/01_dataprep.Rmd) <br> [02_basic_models.Rmd](R/02_basic_models.Rmd) <br> [02a_reg_models.Rmd](R/02a_reg_models.Rmd) <br> [02b_prophet_models.Rmd](R/02b_prophet_models.Rmd) | <ul> <li>Data preparation</li> <li>Basic time series models</li> <li>ARIMA-regression models</li> <li>Prophet models</li> </ul> |

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long


@ -1,275 +0,0 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
"""
Perform cross validation of a LightGBM forecasting model on the training data of the 1st forecast round.
"""
import os
import math
import argparse
import datetime
import numpy as np
import pandas as pd
import lightgbm as lgb
from azureml.core import Run
from sklearn.model_selection import train_test_split
from fclib.feature_engineering.feature_utils import week_of_month, df_from_cartesian_product, combine_features
FIRST_WEEK = 40
GAP = 2
HORIZON = 2
FIRST_WEEK_START = pd.to_datetime("1989-09-14 00:00:00")
def create_features(pred_round, train_dir, lags, window_size, used_columns):
"""Create input features for model training and testing.
Args:
pred_round (int): Prediction round (1, 2, ...)
train_dir (str): Path of the training data directory
lags (np.array): Numpy array including all the lags
window_size (int): Maximum step for computing the moving average
used_columns (list[str]): A list of names of columns used in model training (including target variable)
Returns:
pd.Dataframe: Dataframe including all the input features and target variable
int: Last week of the training data
"""
# Load training data
default_train_file = os.path.join(train_dir, "train.csv")
if os.path.isfile(default_train_file):
train_df = pd.read_csv(default_train_file)
else:
train_df = pd.read_csv(os.path.join(train_dir, "train_" + str(pred_round) + ".csv"))
train_df["move"] = train_df["logmove"].apply(lambda x: round(math.exp(x)))
train_df = train_df[["store", "brand", "week", "move"]]
# Create a dataframe to hold all necessary data
store_list = train_df["store"].unique()
brand_list = train_df["brand"].unique()
train_end_week = train_df["week"].max()
week_list = range(FIRST_WEEK, train_end_week + GAP + HORIZON)
d = {"store": store_list, "brand": brand_list, "week": week_list}
data_grid = df_from_cartesian_product(d)
data_filled = pd.merge(data_grid, train_df, how="left", on=["store", "brand", "week"])
# Get future price, deal, and advertisement info
default_aux_file = os.path.join(train_dir, "auxi.csv")
if os.path.isfile(default_aux_file):
aux_df = pd.read_csv(default_aux_file)
else:
aux_df = pd.read_csv(os.path.join(train_dir, "auxi_" + str(pred_round) + ".csv"))
data_filled = pd.merge(data_filled, aux_df, how="left", on=["store", "brand", "week"])
# Create relative price feature
price_cols = [
"price1",
"price2",
"price3",
"price4",
"price5",
"price6",
"price7",
"price8",
"price9",
"price10",
"price11",
]
data_filled["price"] = data_filled.apply(lambda x: x.loc["price" + str(int(x.loc["brand"]))], axis=1)
data_filled["avg_price"] = data_filled[price_cols].sum(axis=1).apply(lambda x: x / len(price_cols))
data_filled["price_ratio"] = data_filled["price"] / data_filled["avg_price"]
data_filled.drop(price_cols, axis=1, inplace=True)
# Fill missing values
data_filled = data_filled.groupby(["store", "brand"]).apply(
lambda x: x.fillna(method="ffill").fillna(method="bfill")
)
# Create datetime features
data_filled["week_start"] = data_filled["week"].apply(
lambda x: FIRST_WEEK_START + datetime.timedelta(days=(x - 1) * 7)
)
data_filled["year"] = data_filled["week_start"].apply(lambda x: x.year)
data_filled["month"] = data_filled["week_start"].apply(lambda x: x.month)
data_filled["week_of_month"] = data_filled["week_start"].apply(lambda x: week_of_month(x))
data_filled["day"] = data_filled["week_start"].apply(lambda x: x.day)
data_filled.drop("week_start", axis=1, inplace=True)
# Create other features (lagged features, moving averages, etc.)
features = data_filled.groupby(["store", "brand"]).apply(
lambda x: combine_features(x, ["move"], lags, window_size, used_columns)
)
# Drop rows with NaN values
features.dropna(inplace=True)
return features, train_end_week
if __name__ == "__main__":
# Parse input arguments
parser = argparse.ArgumentParser()
parser.add_argument("--data-folder", type=str, dest="data_folder", default=".", help="data folder mounting point")
parser.add_argument("--num-leaves", type=int, dest="num_leaves", default=64, help="# of leaves of the tree")
parser.add_argument(
"--min-data-in-leaf", type=int, dest="min_data_in_leaf", default=50, help="minimum # of samples in each leaf"
)
parser.add_argument("--learning-rate", type=float, dest="learning_rate", default=0.001, help="learning rate")
parser.add_argument(
"--feature-fraction",
type=float,
dest="feature_fraction",
default=1.0,
help="ratio of features used in each iteration",
)
parser.add_argument(
"--bagging-fraction",
type=float,
dest="bagging_fraction",
default=1.0,
help="ratio of samples used in each iteration",
)
parser.add_argument("--bagging-freq", type=int, dest="bagging_freq", default=1, help="bagging frequency")
parser.add_argument("--max-rounds", type=int, dest="max_rounds", default=400, help="# of boosting iterations")
parser.add_argument("--max-lag", type=int, dest="max_lag", default=10, help="max lag of unit sales")
parser.add_argument(
"--window-size", type=int, dest="window_size", default=10, help="window size of moving average of unit sales"
)
args = parser.parse_args()
args.feature_fraction = round(args.feature_fraction, 2)
args.bagging_fraction = round(args.bagging_fraction, 2)
print(args)
# Start an Azure ML run
run = Run.get_context()
# Data paths
DATA_DIR = args.data_folder
TRAIN_DIR = os.path.join(DATA_DIR, "train")
# Data and forecast problem parameters
TRAIN_START_WEEK = 40
TRAIN_END_WEEK_LIST = list(range(135, 159, 2))
TEST_START_WEEK_LIST = list(range(137, 161, 2))
TEST_END_WEEK_LIST = list(range(138, 162, 2))
# The start datetime of the first week in the dataset
FIRST_WEEK_START = pd.to_datetime("1989-09-14 00:00:00")
# Parameters of GBM model
params = {
"objective": "mape",
"num_leaves": args.num_leaves,
"min_data_in_leaf": args.min_data_in_leaf,
"learning_rate": args.learning_rate,
"feature_fraction": args.feature_fraction,
"bagging_fraction": args.bagging_fraction,
"bagging_freq": args.bagging_freq,
"num_rounds": args.max_rounds,
"early_stopping_rounds": 125,
"num_threads": 16,
}
# Lags and used column names
lags = np.arange(2, args.max_lag + 1)
used_columns = ["store", "brand", "week", "week_of_month", "month", "deal", "feat", "move", "price", "price_ratio"]
categ_fea = ["store", "brand", "deal"]
# Train and validate the model using only the first round data
r = 0
print("---- Round " + str(r + 1) + " ----")
# Load training data
default_train_file = os.path.join(TRAIN_DIR, "train.csv")
if os.path.isfile(default_train_file):
train_df = pd.read_csv(default_train_file)
else:
train_df = pd.read_csv(os.path.join(TRAIN_DIR, "train_" + str(r + 1) + ".csv"))
train_df["move"] = train_df["logmove"].apply(lambda x: round(math.exp(x)))
train_df = train_df[["store", "brand", "week", "move"]]
# Create a dataframe to hold all necessary data
store_list = train_df["store"].unique()
brand_list = train_df["brand"].unique()
week_list = range(TRAIN_START_WEEK, TEST_END_WEEK_LIST[r] + 1)
d = {"store": store_list, "brand": brand_list, "week": week_list}
data_grid = df_from_cartesian_product(d)
data_filled = pd.merge(data_grid, train_df, how="left", on=["store", "brand", "week"])
# Get future price, deal, and advertisement info
default_aux_file = os.path.join(TRAIN_DIR, "auxi.csv")
if os.path.isfile(default_aux_file):
aux_df = pd.read_csv(default_aux_file)
else:
aux_df = pd.read_csv(os.path.join(TRAIN_DIR, "auxi_" + str(r + 1) + ".csv"))
data_filled = pd.merge(data_filled, aux_df, how="left", on=["store", "brand", "week"])
# Create relative price feature
price_cols = [
"price1",
"price2",
"price3",
"price4",
"price5",
"price6",
"price7",
"price8",
"price9",
"price10",
"price11",
]
data_filled["price"] = data_filled.apply(lambda x: x.loc["price" + str(int(x.loc["brand"]))], axis=1)
data_filled["avg_price"] = data_filled[price_cols].sum(axis=1).apply(lambda x: x / len(price_cols))
data_filled["price_ratio"] = data_filled["price"] / data_filled["avg_price"]
data_filled.drop(price_cols, axis=1, inplace=True)
# Fill missing values
data_filled = data_filled.groupby(["store", "brand"]).apply(
lambda x: x.fillna(method="ffill").fillna(method="bfill")
)
# Create datetime features
data_filled["week_start"] = data_filled["week"].apply(
lambda x: FIRST_WEEK_START + datetime.timedelta(days=(x - 1) * 7)
)
data_filled["year"] = data_filled["week_start"].apply(lambda x: x.year)
data_filled["month"] = data_filled["week_start"].apply(lambda x: x.month)
data_filled["week_of_month"] = data_filled["week_start"].apply(lambda x: week_of_month(x))
data_filled["day"] = data_filled["week_start"].apply(lambda x: x.day)
data_filled.drop("week_start", axis=1, inplace=True)
# Create other features (lagged features, moving averages, etc.)
features = data_filled.groupby(["store", "brand"]).apply(
lambda x: combine_features(x, ["move"], lags, args.window_size, used_columns)
)
train_fea = features[features.week <= TRAIN_END_WEEK_LIST[r]].reset_index(drop=True)
# Drop rows with NaN values
train_fea.dropna(inplace=True)
# Model training and validation
# Create a training/validation split
train_fea, valid_fea, train_label, valid_label = train_test_split(
train_fea.drop("move", axis=1, inplace=False), train_fea["move"], test_size=0.05, random_state=1
)
dtrain = lgb.Dataset(train_fea, train_label)
dvalid = lgb.Dataset(valid_fea, valid_label)
# A dictionary to record training results
evals_result = {}
# Train LightGBM model
bst = lgb.train(
params, dtrain, valid_sets=[dtrain, dvalid], categorical_feature=categ_fea, evals_result=evals_result
)
# Get final training loss & validation loss
train_loss = evals_result["training"]["mape"][-1]
valid_loss = evals_result["valid_1"]["mape"][-1]
print("Final training loss is {}".format(train_loss))
print("Final validation loss is {}".format(valid_loss))
# Log the validation loss (MAPE)
run.log("MAPE", np.float(valid_loss) * 100)
# Files saved in the "./outputs" folder are automatically uploaded into run history
os.makedirs("./outputs/model", exist_ok=True)
bst.save_model("./outputs/model/bst-model.txt")

File diff suppressed because one or more lines are too long


@ -1,16 +0,0 @@
# Forecasting examples in Python
This folder contains Jupyter notebooks with Python examples for building forecasting solutions. To run the notebooks, please ensure your environment is set up with the required dependencies by following the instructions in the [Setup guide](../../../docs/SETUP.md).
## Summary
The following summarizes each directory of the Python best practice notebooks.
| Directory | Content | Description |
| --- | --- | --- |
| [00_quick_start](./00_quick_start)| [autoarima_single_round.ipynb](./00_quick_start/autoarima_single_round.ipynb) <br>[azure_automl_single_round.ipynb](./00_quick_start/azure_automl_single_round.ipynb) <br> [lightgbm_single_round.ipynb](./00_quick_start/lightgbm_single_round.ipynb) | Quick start notebooks that demonstrate the workflow of developing a forecasting model using one-round training and testing data|
| [01_prepare_data](./01_prepare_data) | [ojdata_exploration.ipynb](./01_prepare_data/ojdata_exploration.ipynb) <br> [ojdata_preparation.ipynb](./01_prepare_data/ojdata_preparation.ipynb) | Data exploration and preparation notebooks|
| [02_model](./02_model) | [dilatedcnn_multi_round.ipynb](./02_model/dilatedcnn_multi_round.ipynb) <br> [lightgbm_multi_round.ipynb](./02_model/lightgbm_multi_round.ipynb) <br> [autoarima_multi_round.ipynb](./02_model/autoarima_multi_round.ipynb) | Deep dive notebooks that perform multi-round training and testing of various classical and deep learning forecast algorithms|
| [03_model_tune_deploy](./03_model_tune_deploy/) | [azure_hyperdrive_lightgbm.ipynb](./03_model_tune_deploy/azure_hyperdrive_lightgbm.ipynb) <br> [aml_scripts/](./03_model_tune_deploy/aml_scripts) | <ul><li> Example notebook for model tuning using Azure Machine Learning Service and deploying the best model on Azure </li></ul> <ul><li> Scripts for model training and validation </li></ul> |


@ -1,40 +1,11 @@
# Forecasting library
Building forecasting models can involve tedious tasks ranging from data loading and dataset understanding to model development, model evaluation, and deployment of trained models. To assist with these tasks, we developed a forecasting library - **fclib**. You'll see this library used widely in sample notebooks in [examples](../examples). The following provides a short description of the sub-modules. For more details about what functions/classes/utilities are available and how to use them, please review the doc-strings provided with the code and see the sample notebooks in the [examples](../examples) directory.
A set of utility functions for forecasting.
## Submodules
## Install
### [AzureML](fclib/azureml)
The AzureML submodule contains utilities to connect to an Azure Machine Learning workspace, train, tune and operationalize forecasting models at scale using AzureML.
### [Common](fclib/common)
This submodule contains high-level utilities that are commonly used in multiple algorithms as well as helper functions for visualizing forecasting predictions.
### [Dataset](fclib/dataset)
This submodule includes helper functions for interacting with the datasets used in the example notebooks, utility functions to process datasets for different modeling tasks, and utilities for splitting data for training/testing. For example, the [ojdata](fclib/dataset/ojdata.py) submodule will allow you to download and process the Orange Juice dataset, as well as split it into training and testing rounds.
```python
from fclib.dataset.ojdata import download_ojdata, split_train_test
download_ojdata(DATA_DIR)
train_df_list, test_df_list, _ = split_train_test(
DATA_DIR,
n_splits=N_SPLITS,
horizon=HORIZON,
gap=GAP,
first_week=FIRST_WEEK,
last_week=LAST_WEEK
)
```
```bash
pip install -e .
```
### [Evaluation](fclib/evaluation)
The evaluation module includes functionality for computing common forecasting evaluation metrics, specifically `MAPE`, `sMAPE`, and `pinball loss`.
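As an illustration, the metrics can be used along the following lines; this is a sketch, and the exact import path within fclib is an assumption.

```python
import numpy as np

# Hypothetical import path; check the fclib source for the actual module
from fclib.evaluation.evaluation_utils import MAPE

actuals = np.array([100.0, 120.0, 90.0])
preds = np.array([110.0, 115.0, 95.0])
print(MAPE(preds, actuals) * 100)  # MAPE expressed as a percentage
```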
### [Feature Engineering](fclib/feature_engineering)
The feature engineering module contains utilities to create various time series features, for example, week or day of month, lagged features, and moving average features. This module is used widely in machine-learning based approaches to forecasting, in which time series data is transformed into a tabular featurized dataset that becomes the input to a machine learning method. A short illustrative sketch follows.
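A minimal sketch of the kinds of features this module produces, written in plain pandas for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "week_start": pd.date_range("1989-09-14", periods=6, freq="7D"),
    "move": [10, 12, 9, 11, 13, 8],
})
df["month"] = df["week_start"].dt.month                   # datetime feature
df["move_lag1"] = df["move"].shift(1)                     # lagged sales
df["move_mavg3"] = df["move"].shift(1).rolling(3).mean()  # moving average of past sales
```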
### [Models](fclib/models)
The models module contains implementations of various algorithms that can be used in addition to external packages to evaluate and develop new forecasting solutions. Some submodules found here are: `lightgbm`, `dilated cnn`, etc. A more detailed description of which algorithms are used in our examples can be found in [this README](../examples/oj_retail/python/README.md).
This will install the package fclib.


@ -2,149 +2,6 @@
# Licensed under the MIT License.
"""
This file contains utility functions for interacting with Azure ML Resources.
Reused code from
https://github.com/microsoft/nlp-recipes/blob/master/utils_nlp/azureml/azureml_utils.py
This file contains utility functions for using AzureML SDK in the
development of forecasting solutions.
"""
import os
from azureml.core.authentication import AzureCliAuthentication
from azureml.core.authentication import InteractiveLoginAuthentication
from azureml.core.authentication import AuthenticationException
from azureml.core import Workspace
from azureml.exceptions import ProjectSystemException
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
def get_auth():
"""
Method to get the correct Azure ML Authentication type
Always start with CLI Authentication and if it fails, fall back
to interactive login
"""
try:
auth_type = AzureCliAuthentication()
auth_type.get_authentication_header()
except AuthenticationException:
auth_type = InteractiveLoginAuthentication()
return auth_type
def get_or_create_workspace(
config_path="./.azureml", subscription_id=None, resource_group=None, workspace_name=None, workspace_region=None,
):
"""
Method to get or create workspace.
Args:
config_path: optional directory to look for / store config.json file (defaults to current
directory)
subscription_id: Azure subscription id
resource_group: Azure resource group to create workspace and related resources
workspace_name: name of azure ml workspace
workspace_region: region for workspace
Returns:
obj: AzureML workspace if one exists already with the name otherwise creates a new one.
"""
config_file_path = "."
if config_path is not None:
config_dir, config_file_name = os.path.split(config_path)
if config_file_name != "config.json":
config_file_path = os.path.join(config_path, "config.json")
try:
# Get existing azure ml workspace
if os.path.isfile(config_file_path):
ws = Workspace.from_config(config_file_path, auth=get_auth())
else:
ws = Workspace.get(
name=workspace_name, subscription_id=subscription_id, resource_group=resource_group, auth=get_auth(),
)
except ProjectSystemException:
# This call might take a minute or two.
print("Creating new workspace")
ws = Workspace.create(
name=workspace_name,
subscription_id=subscription_id,
resource_group=resource_group,
create_resource_group=True,
location=workspace_region,
auth=get_auth(),
)
ws.write_config(path=config_path)
return ws
def get_or_create_amlcompute(
workspace, compute_name, vm_size="", min_nodes=0, max_nodes=None, idle_seconds_before_scaledown=None, verbose=False,
):
"""
Get or create AmlCompute as the compute target. If a cluster of the same name is found,
attach it and rescale accordingly. Otherwise, create a new cluster.
Args:
workspace (Workspace): workspace
compute_name (str): name
vm_size (str, optional): vm size
min_nodes (int, optional): minimum number of nodes in cluster
max_nodes (None, optional): maximum number of nodes in cluster
idle_seconds_before_scaledown (None, optional): how long to wait before the cluster
autoscales down
verbose (bool, optional): if true, print logs
Returns:
Compute target
"""
try:
if verbose:
print("Found compute target: {}".format(compute_name))
compute_target = ComputeTarget(workspace=workspace, name=compute_name)
if len(compute_target.list_nodes()) < max_nodes:
if verbose:
print("Rescaling to {} nodes".format(max_nodes))
compute_target.update(max_nodes=max_nodes)
compute_target.wait_for_completion(show_output=verbose)
except ComputeTargetException:
if verbose:
print("Creating new compute target: {}".format(compute_name))
compute_config = AmlCompute.provisioning_configuration(
vm_size=vm_size,
min_nodes=min_nodes,
max_nodes=max_nodes,
idle_seconds_before_scaledown=idle_seconds_before_scaledown,
)
compute_target = ComputeTarget.create(workspace, compute_name, compute_config)
compute_target.wait_for_completion(show_output=verbose)
return compute_target
def get_output_files(run, output_path, file_names=None):
"""
Method to get the output files from an AzureML output directory.
Args:
file_names(list): Names of the files to download.
run(azureml.core.run.Run): Run object of the run.
output_path(str): Path to download the output files.
Returns: None
"""
os.makedirs(output_path, exist_ok=True)
if file_names is None:
file_names = run.get_file_names()
for f in file_names:
dest = os.path.join(output_path, f.split("/")[-1])
print("Downloading file {} to {}...".format(f, dest))
run.download_file(f, dest)


@ -1,25 +1,20 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
# This script retrieves the orangeJuice dataset from the bayesm R package and saves the data as csv.
#
# Two arguments must be supplied to this script:
#
# RDA_PATH - path to the local .rda file containing the data
# DATA_DIR - destination directory for saving processed .csv files
# This script retrieves the orangeJuice dataset from the bayesm R package and saves the data as csv
args = commandArgs(trailingOnly=TRUE)
# Test if there are at least two arguments: if not, return an error
if (length(args)==2) {
RDA_PATH <- args[1]
DATA_DIR <- args[2]
} else {
stop("Two arguments must be supplied - path to .rda file and destination data directory).", call.=FALSE)
}
# test if there is at least one argument: if not, return an error
if (length(args)==0) {
stop("At least one argument must be supplied (data directory).", call.=FALSE)
} else if (length(args)==1) {
DATA_DIR <- args[1]
}
# Load the data from bayesm library
load(RDA_PATH)
library(bayesm)
data("orangeJuice")
yx <- orangeJuice[[1]]
storedemo <- orangeJuice[[2]]


@ -8,16 +8,11 @@ import pandas as pd
import math
import datetime
import itertools
import argparse
import logging
import requests
from tqdm import tqdm
from fclib.common.utils import git_repo_path
from fclib.feature_engineering.feature_utils import df_from_cartesian_product
DATA_FILE_LIST = ["yx.csv", "storedemo.csv"]
SCRIPT_NAME = "load_oj_data.R"
SCRIPT_NAME = "download_oj_data.R"
DEFAULT_TARGET_COL = "move"
DEFAULT_STATIC_FEA = None
@@ -26,54 +21,25 @@ DEFAULT_DYNAMIC_FEA = ["deal", "feat"]
# The start datetime of the first week in the record
FIRST_WEEK_START = pd.to_datetime("1989-09-14 00:00:00")
# Original data source
OJ_URL = "https://github.com/cran/bayesm/raw/master/data/orangeJuice.rda"
def download_ojdata(dest_dir):
"""Downloads Orange Juice dataset.
log = logging.getLogger(__name__)
def maybe_download(url, dest_directory, filename=None):
"""Download a file if it is not already downloaded.
Args:
dest_directory (str): Destination directory.
url (str): URL of the file to download.
filename (str): File name.
dest_dir (str): Directory path for the downloaded file
"""
maybe_download(dest_dir=dest_dir)
def maybe_download(dest_dir):
"""Download a file if it is not already downloaded.
Args:
dest_dir (str): Destination directory
Returns:
str: File path of the file downloaded.
"""
if filename is None:
filename = url.split("/")[-1]
os.makedirs(dest_directory, exist_ok=True)
filepath = os.path.join(dest_directory, filename)
if not os.path.exists(filepath):
r = requests.get(url, stream=True)
total_size = int(r.headers.get("content-length", 0))
block_size = 1024
num_iterables = math.ceil(total_size / block_size)
with open(filepath, "wb") as file:
for data in tqdm(r.iter_content(block_size), total=num_iterables, unit="KB", unit_scale=True,):
file.write(data)
else:
log.debug("File {} already downloaded".format(filepath))
return filepath
def download_ojdata(dest_dir="."):
"""Download orange juice dataset from the original source.
Args:
dest_dir (str): Directory path for the downloaded file
Returns:
str: Path of the downloaded file.
"""
url = OJ_URL
rda_path = maybe_download(url, dest_directory=dest_dir)
# Check if data files exist
data_exists = True
for f in DATA_FILE_LIST:
@@ -81,21 +47,13 @@ def download_ojdata(dest_dir="."):
data_exists = data_exists and os.path.exists(file_path)
if not data_exists:
# Call data loading script
repo_path = git_repo_path()
script_path = os.path.join(repo_path, "fclib", "fclib", "dataset", SCRIPT_NAME)
# Call data download script
print("Starting data download ...")
script_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), SCRIPT_NAME)
try:
print(f"Destination directory: {dest_dir}")
output = subprocess.run(
["Rscript", script_path, rda_path, dest_dir], stderr=subprocess.PIPE, stdout=subprocess.PIPE
)
print(output.stdout)
if output.returncode != 0:
raise Exception(f"Subprocess failed - {output.stderr}")
subprocess.call(["Rscript", script_path, dest_dir])
except subprocess.CalledProcessError as e:
raise e
print(e.output)
else:
print("Data already exists at the specified location.")
@@ -155,12 +113,12 @@ def split_train_test(data_dir, n_splits=1, horizon=2, gap=2, first_week=40, last
Note that train_*.csv files in /train folder contain all the features in the training period
and aux_*.csv files in /train folder contain all the features except 'logmove', 'constant',
'profit' up until the forecast period end week. Both train_*.csv and auxi_*.csv can be used for
'profit' up until the forecast period end week. Both train_*.csv and aux_*.csv can be used for
generating forecasts in each split. However, test_*.csv files in /test folder can only be used
for model performance evaluation.
Example:
data_dir = "/home/ojdata"
data_dir = "/home/vapaunic/forecasting/ojdata"
train, test, aux = split_train_test(data_dir=data_dir, n_splits=5, horizon=3, write_csv=True)
@@ -216,7 +174,7 @@ def split_train_test(data_dir, n_splits=1, horizon=2, gap=2, first_week=40, last
roundstr = "_" + str(i + 1) if n_splits > 1 else ""
train_df.to_csv(os.path.join(TRAIN_DATA_DIR, "train" + roundstr + ".csv"))
test_df.to_csv(os.path.join(TEST_DATA_DIR, "test" + roundstr + ".csv"))
aux_df.to_csv(os.path.join(TRAIN_DATA_DIR, "auxi" + roundstr + ".csv"))
aux_df.to_csv(os.path.join(TRAIN_DATA_DIR, "aux" + roundstr + ".csv"))
train_df_list.append(train_df)
test_df_list.append(test_df)
@@ -478,12 +436,9 @@ def specify_retail_data_schema(
if __name__ == "__main__":
data_dir = "/home/vapaunic/forecasting/ojdata"
parser = argparse.ArgumentParser()
parser.add_argument("--data-dir", help="Data download directory")
args = parser.parse_args()
download_ojdata(args.data_dir)
download_ojdata(data_dir)
# train, test, aux = split_train_test(data_dir=data_dir, n_splits=1, horizon=2, write_csv=True)
# print((test[0].week))

View file

View file

@@ -11,17 +11,10 @@ import calendar
import itertools
import pandas as pd
import numpy as np
import datetime
from datetime import timedelta
from sklearn.preprocessing import MinMaxScaler
from dateutil.relativedelta import relativedelta
ALLOWED_TIME_COLUMN_TYPES = [
pd.Timestamp,
pd.DatetimeIndex,
datetime.datetime,
datetime.date,
]
from fclib.feature_engineering.utils import is_datetime_like
# 0: Monday, 2: T/W/TR, 4: F, 5:SA, 6: S
WEEK_DAY_TYPE_MAP = {1: 2, 3: 2} # Map for converting Wednesday and
@@ -32,11 +25,6 @@ SEMI_HOLIDAY_CODE = 8 # days before and after a holiday
DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"
def is_datetime_like(x):
"""Function that checks if a data frame column x is of a datetime type."""
return any(isinstance(x, col_type) for col_type in ALLOWED_TIME_COLUMN_TYPES)
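# Quick illustration of the check (assumed behaviour, not from the original tests):
# is_datetime_like(pd.Timestamp("2020-03-23"))  # True
# is_datetime_like("2020-03-23")                # False: plain strings are not datetime-like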
def day_type(datetime_col, holiday_col=None, semi_holiday_offset=timedelta(days=1)):
"""
Convert datetime_col to 7 day types
@@ -1014,81 +1002,3 @@ def normalize_columns(df, seq_cols, scaler=MinMaxScaler()):
df_scaled = pd.DataFrame(scaler.fit_transform(df[seq_cols]), columns=seq_cols, index=df.index)
df_scaled = pd.concat([df[cols_fixed], df_scaled], axis=1)
return df_scaled, scaler
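# Hypothetical usage sketch: returns a scaled copy of the sequence columns plus the fitted scaler.
# df_scaled, fitted_scaler = normalize_columns(df, ["logmove", "price"])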
def get_datetime_col(df, datetime_colname):
"""
Helper function for extracting the datetime column as datetime type from
a data frame.
Args:
df: pandas DataFrame containing the column to convert
datetime_colname: name of the column to be converted
Returns:
pandas.Series: converted column
Raises:
Exception: if datetime_colname does not exist in the dateframe df.
Exception: if datetime_colname cannot be converted to datetime type.
"""
if datetime_colname in df.index.names:
datetime_col = df.index.get_level_values(datetime_colname)
elif datetime_colname in df.columns:
datetime_col = df[datetime_colname]
else:
raise Exception("Column or index {0} does not exist in the data " "frame".format(datetime_colname))
if not is_datetime_like(datetime_col):
datetime_col = pd.to_datetime(df[datetime_colname])
return datetime_col
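# Hypothetical usage sketch: works whether "week_start" is a regular column or an index level.
# week_col = get_datetime_col(df, "week_start")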
def get_month_day_range(date):
"""
Returns the first date and last date of the month of the given date.
"""
    # Replace the day in the original timestamp with day 1
    first_day = date + relativedelta(day=1)
    # Go to day 1, add one month to reach the first day of the next month,
    # then subtract one day to land on the last day of the current month
    last_day = date + relativedelta(day=1, months=1, days=-1, hours=23)
return first_day, last_day
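# Worked example (assumed, following the relativedelta arithmetic above):
# get_month_day_range(pd.Timestamp("1989-09-14"))
# -> (Timestamp("1989-09-01 00:00:00"), Timestamp("1989-09-30 23:00:00"))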
def add_datetime(input_datetime, unit, add_count):
"""
Function to add a specified units of time (years, months, weeks, days,
hours, or minutes) to the input datetime.
Args:
input_datetime: datatime to be added to
unit: unit of time, valid values: 'year', 'month', 'week',
'day', 'hour', 'minute'.
add_count: number of units to add
Returns:
New datetime after adding the time difference to input datetime.
Raises:
Exception: if invalid unit is provided. Valid units are:
'year', 'month', 'week', 'day', 'hour', 'minute'.
"""
if unit == "Y":
new_datetime = input_datetime + relativedelta(years=add_count)
elif unit == "M":
new_datetime = input_datetime + relativedelta(months=add_count)
elif unit == "W":
new_datetime = input_datetime + relativedelta(weeks=add_count)
elif unit == "D":
new_datetime = input_datetime + relativedelta(days=add_count)
elif unit == "h":
new_datetime = input_datetime + relativedelta(hours=add_count)
elif unit == "m":
new_datetime = input_datetime + relativedelta(minutes=add_count)
else:
raise Exception(
"Invalid backtest step unit, {}, provided. Valid " "step units are Y, M, W, D, h, " "and m".format(unit)
)
return new_datetime
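# Worked example (assumed): step a weekly backtest window forward by two weeks.
# add_datetime(pd.Timestamp("1989-09-14"), "W", 2)  # -> Timestamp("1989-09-28 00:00:00")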

View file

@@ -1,5 +1,4 @@
pandas
datetime
scikit_learn
numpy
requests
numpy

View file

@@ -1,49 +0,0 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
# Pull request against these branches will trigger this build
pr:
- master
- staging
# no CI trigger
trigger: none
jobs:
- job: Component_governance
timeoutInMinutes: 20 # how long to run the job before automatically cancelling
pool:
vmImage: 'ubuntu-16.04'
steps:
- bash: |
python tools/generate_requirements_txt.py
displayName: 'Generate requirements.txt file from generate_conda_file.py'
- task: ComponentGovernanceComponentDetection@0
inputs:
scanType: 'Register'
verbosity: 'Verbose'
alertWarningLevel: 'High'
- task: notice@0
inputs:
outputformat: 'text'
- bash: |
ls -la
cat NOTICE.txt
git status
result=$(git status | grep NOTICE.txt)
if [[ $result ]]; then
echo "Notice file modified: $result"
echo `git diff NOTICE.txt`
BRANCH=NOTICE/`date +%s`
git checkout -b $BRANCH
git add NOTICE.txt
git commit -m "Notice file modified."
git push origin $BRANCH
else
echo "Notice file not modified."
fi
displayName: 'Check in notice file if modified.'

View file

@@ -14,10 +14,10 @@ trigger:
jobs:
- job: cpu_integration_tests_linux
timeoutInMinutes: 60 # how long to run the job before automatically cancelling
timeoutInMinutes: 10 # how long to run the job before automatically cancelling
pool:
# vmImage: 'ubuntu-16.04' # hosted machine
name: $(Agent_Name)
name: ForecastingAgents
steps:
- bash: |

View file

@@ -17,7 +17,7 @@ jobs:
timeoutInMinutes: 10 # how long to run the job before automatically cancelling
pool:
# vmImage: 'ubuntu-16.04' # hosted machine
name: $(Agent_Name)
name: ForecastingAgents
steps:
- bash: |

View file

@@ -5,21 +5,10 @@ from fclib.common.utils import git_repo_path
@pytest.fixture(scope="module")
def notebooks():
"""Get paths of example notebooks.
Returns:
dict: Dictionary including paths of the example notebooks.
"""
repo_path = git_repo_path()
examples_path = os.path.join(repo_path, "examples")
usecase_path = os.path.join(examples_path, "grocery_sales", "python")
quick_start_path = os.path.join(usecase_path, "00_quick_start")
model_path = os.path.join(usecase_path, "02_model")
quick_start_path = os.path.join(examples_path, "00_quick_start")
# Path for the notebooks
paths = {
"lightgbm_quick_start": os.path.join(quick_start_path, "lightgbm_single_round.ipynb"),
"lightgbm_multi_round": os.path.join(model_path, "lightgbm_multi_round.ipynb"),
"dilatedcnn_multi_round": os.path.join(model_path, "dilatedcnn_multi_round.ipynb"),
}
paths = {"lightgbm_quick_start": os.path.join(quick_start_path, "lightgbm_point_forecast.ipynb")}
return paths

View file

@@ -2,6 +2,10 @@
# Licensed under the MIT License.
import os
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import pytest
import papermill as pm
import scrapbook as sb
@@ -19,31 +23,3 @@ def test_lightgbm_quick_start(notebooks):
assert df.shape[0] == 1
mape = df.loc[df.name == "MAPE"]["data"][0]
assert mape == pytest.approx(35.60, abs=ABS_TOL)
@pytest.mark.integration
def test_lightgbm_multi_round(notebooks):
notebook_path = notebooks["lightgbm_multi_round"]
output_notebook_path = os.path.join(os.path.dirname(notebook_path), "output.ipynb")
pm.execute_notebook(
notebook_path, output_notebook_path, kernel_name="forecast_cpu", parameters=dict(N_SPLITS=1),
)
nb = sb.read_notebook(output_notebook_path)
df = nb.scraps.dataframe
assert df.shape[0] == 1
mape = df.loc[df.name == "MAPE"]["data"][0]
assert mape == pytest.approx(36.0, abs=ABS_TOL)
@pytest.mark.integration
def test_dilatedcnn_multi_round(notebooks):
notebook_path = notebooks["dilatedcnn_multi_round"]
output_notebook_path = os.path.join(os.path.dirname(notebook_path), "output.ipynb")
pm.execute_notebook(
notebook_path, output_notebook_path, kernel_name="forecast_cpu", parameters=dict(N_SPLITS=2),
)
nb = sb.read_notebook(output_notebook_path)
df = nb.scraps.dataframe
assert df.shape[0] == 1
mape = df.loc[df.name == "MAPE"]["data"][0]
assert mape == pytest.approx(37.7, abs=ABS_TOL)

View file

@@ -3,7 +3,7 @@
# To create the conda environment:
# $ conda env create -f environment.yaml
#
#
# To update the conda environment:
# $ conda env update -f environment.yaml
#
@@ -16,32 +16,30 @@ channels:
- defaults
- conda-forge
dependencies:
- python=3.6.10
- pip>=19.0.3
- jupyter>=1.0.0
- ipykernel>=4.6.1
- jupyter_nbextensions_configurator=0.4.1
- scipy=1.1.0
- numpy=1.16.2
- python=3.6
- pip
- jupyter
- ipykernel
- scipy==1.1.0
- numpy==1.16.2
- pandas=0.23.4
- xlrd=1.1.0
- urllib3=1.21.1
- scikit-learn=0.20.3
- pytest>=3.6.4
- tqdm>=4.43.0
- pylint
- pytest
- papermill>=1.0.1
- matplotlib=3.1.2
- r-base>=3.3.0
- r-base
- r-bayesm
- pip:
- black>=18.6b4
- flake8>=3.3.0
- jupytext>=1.3.0
- black
- flake8
- jupytext==1.3.0
- lightgbm==2.3.0
- tensorflow==2.0
- tensorboard==2.1.0
- nteract-scrapbook==0.3.1
- gitpython==3.0.8
- azureml-sdk[explain,automl]==1.0.85
- statsmodels==0.11.1
- pmdarima==1.1.1
- gitpython==3.0.8

View file

@@ -1,24 +0,0 @@
REM Copyright (c) Microsoft Corporation.
REM Licensed under the MIT License.
REM Please follow instructions in this link
REM https://docs.conda.io/projects/conda/en/latest/user-guide/install/windows.html
REM to install Miniconda before running this script.
echo Update conda
call conda update conda --yes
echo Create conda environment
call conda env create -f tools/environment.yml
echo Activate conda environment
call conda activate forecasting_env
echo Install forecasting utility library
call pip install -e fclib
echo Register conda environment in Jupyter
call python -m ipykernel install --user --name forecasting_env
echo Environment setup is done!

View file

@@ -1,165 +0,0 @@
#!/usr/bin/python
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
# This script creates yaml files to build conda environments
# For generating a conda file for running only python code:
# $ python generate_conda_file.py
#
# For generating a conda file for running python gpu:
# $ python generate_conda_file.py --gpu
import argparse
import textwrap
from sys import platform
HELP_MSG = """
To create the conda environment:
$ conda env create -f {conda_env}.yaml
To update the conda environment:
$ conda env update -f {conda_env}.yaml
To register the conda environment in Jupyter:
$ conda activate {conda_env}
$ python -m ipykernel install --user --name {conda_env} \
--display-name "Python ({conda_env})"
"""
CHANNELS = ["defaults", "conda-forge"]
CONDA_BASE = {
"python": "python==3.6.10",
"pip": "pip>=19.1.1",
"ipykernel": "ipykernel>=4.6.1",
"jupyter": "jupyter>=1.0.0",
"jupyter_nbextensions_configurator": "jupyter_nbextensions_configurator>=0.4.1",
"numpy": "numpy>=1.16.2",
"pandas": "pandas>=0.23.4",
"pytest": "pytest>=3.6.4",
"scipy": "scipy>=1.1.0",
"xlrd": "xlrd>=1.1.0",
"urllib3": "urllib3>=1.21.1",
"scikit-learn": "scikit-learn>=0.20.3",
"tqdm": "tqdm>=4.43.0",
"pylint": "pylint>=2.4.4",
"matplotlib": "matplotlib>=3.1.2",
"r-base": "r-base>=3.3.0",
"papermill": "papermill>=1.0.1",
}
CONDA_GPU = {}
PIP_BASE = {
"azureml-sdk": "azureml-sdk[explain,automl]==1.0.85",
"black": "black>=18.6b4",
"nteract-scrapbook": "nteract-scrapbook>=0.3.1",
"pre-commit": "pre-commit>=1.14.4",
"tensorboard": "tensorboard==2.1.0",
"tensorflow": "tensorflow==2.0",
"flake8": "flake8>=3.3.0",
"jupytext": "jupytext>=1.3.0",
"lightgbm": "lightgbm==2.3.0",
"statsmodels": "statsmodels==0.11.1",
"pmdarima": "pmdarima==1.1.1",
"gitpython": "gitpython==3.0.8",
}
PIP_GPU = {}
PIP_DARWIN = {}
PIP_DARWIN_GPU = {}
PIP_LINUX = {}
PIP_LINUX_GPU = {}
PIP_WIN32 = {}
PIP_WIN32_GPU = {}
CONDA_DARWIN = {}
CONDA_DARWIN_GPU = {}
CONDA_LINUX = {}
CONDA_LINUX_GPU = {}
CONDA_WIN32 = {}
CONDA_WIN32_GPU = {}
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description=textwrap.dedent(
"""
This script generates a conda file for different environments.
Plain python is the default,
but flags can be used to support GPU functionality."""
),
epilog=HELP_MSG,
formatter_class=argparse.RawDescriptionHelpFormatter,
)
parser.add_argument("--name", help="specify name of conda environment")
parser.add_argument("--gpu", action="store_true", help="include packages for GPU support")
args = parser.parse_args()
# set name for environment and output yaml file
conda_env = "forecasting_cpu"
if args.gpu:
conda_env = "forecasting_gpu"
# overwrite environment name with user input
if args.name is not None:
conda_env = args.name
# add conda and pip base packages
conda_packages = CONDA_BASE
pip_packages = PIP_BASE
# update conda and pip packages based on flags provided
if args.gpu:
conda_packages.update(CONDA_GPU)
pip_packages.update(PIP_GPU)
# update conda and pip packages based on os platform support
if platform == "darwin":
conda_packages.update(CONDA_DARWIN)
pip_packages.update(PIP_DARWIN)
if args.gpu:
conda_packages.update(CONDA_DARWIN_GPU)
pip_packages.update(PIP_DARWIN_GPU)
elif platform.startswith("linux"):
conda_packages.update(CONDA_LINUX)
pip_packages.update(PIP_LINUX)
if args.gpu:
conda_packages.update(CONDA_LINUX_GPU)
pip_packages.update(PIP_LINUX_GPU)
elif platform == "win32":
conda_packages.update(CONDA_WIN32)
pip_packages.update(PIP_WIN32)
if args.gpu:
conda_packages.update(CONDA_WIN32_GPU)
pip_packages.update(PIP_WIN32_GPU)
else:
raise Exception("Unsupported platform. Must be Windows, Linux, or macOS")
# write out yaml file
conda_file = "{}.yaml".format(conda_env)
with open(conda_file, "w") as f:
for line in HELP_MSG.format(conda_env=conda_env).split("\n"):
f.write("# {}\n".format(line))
f.write("name: {}\n".format(conda_env))
f.write("channels:\n")
for channel in CHANNELS:
f.write("- {}\n".format(channel))
f.write("dependencies:\n")
for conda_package in conda_packages.values():
f.write("- {}\n".format(conda_package))
f.write("- pip:\n")
for pip_package in pip_packages.values():
f.write(" - {}\n".format(pip_package))
print("Generated conda file: {}".format(conda_file))
print(HELP_MSG.format(conda_env=conda_env))

View file

@@ -1,43 +0,0 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
# This file outputs a requirements.txt based on the libraries defined in generate_conda_file.py
from generate_conda_file import (
CONDA_BASE,
CONDA_GPU,
PIP_BASE,
PIP_GPU,
PIP_DARWIN,
PIP_LINUX,
PIP_WIN32,
CONDA_DARWIN,
CONDA_LINUX,
CONDA_WIN32,
PIP_DARWIN_GPU,
PIP_LINUX_GPU,
PIP_WIN32_GPU,
CONDA_DARWIN_GPU,
CONDA_LINUX_GPU,
CONDA_WIN32_GPU,
)
if __name__ == "__main__":
deps = list(CONDA_BASE.values())
deps += list(CONDA_GPU.values())
deps += list(PIP_BASE.values())
deps += list(PIP_GPU.values())
deps += list(PIP_DARWIN.values())
deps += list(PIP_LINUX.values())
deps += list(PIP_WIN32.values())
deps += list(CONDA_DARWIN.values())
deps += list(CONDA_LINUX.values())
deps += list(CONDA_WIN32.values())
deps += list(PIP_DARWIN_GPU.values())
deps += list(PIP_LINUX_GPU.values())
deps += list(PIP_WIN32_GPU.values())
deps += list(CONDA_DARWIN_GPU.values())
deps += list(CONDA_LINUX_GPU.values())
deps += list(CONDA_WIN32_GPU.values())
with open("requirements.txt", "w") as f:
f.write("\n".join(set(deps)))

View file

@@ -0,0 +1,119 @@
#!/usr/bin/env python
# coding: utf-8
import csvtomd
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
### Generating performance charts
#################################################
#Function to plot a performance chart
def plot_perf(x,y,df):
# extract submission name from submission URL
labels = df.apply(lambda x: x['Submission Name'][1:].split(']')[0], axis=1)
fig = plt.scatter(x=df[x],y=df[y], label=labels, s=150, alpha = 0.5,
c= ['b', 'g', 'r', 'c', 'm', 'y', 'k'])
plt.xlabel(x)
plt.ylabel(y)
plt.title(y + ' by ' + x)
offset = (max(df[y]) - min(df[y]))/50
for i,name in enumerate(labels):
ax = df[x][i]
ay = df[y][i] + offset * (-2.5 + i % 5)
plt.text(ax, ay, name, fontsize=10)
return(fig)
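# Hypothetical usage sketch: df needs the two metric columns plus a 'Submission Name' column
# formatted as '[name](url)' so the label extraction above can strip the brackets.
# fig = plot_perf('Training and Scoring Cost($)', 'Pinball Loss', df)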
### Printing the Readme.md file
############################################
readmefile = '../../Readme.md'
#Write header
#print(file=open(readmefile))
print('# TSPerf\n', file=open(readmefile, "w"))
print('TSPerf is a collection of implementations of time-series forecasting algorithms in Azure cloud and comparison of their performance over benchmark datasets. \
Algorithm implementations are compared by model accuracy, training and scoring time and cost. Each implementation includes all the necessary \
instructions and tools that ensure its reproducibility.', file=open(readmefile, "a"))
print('The following table summarizes benchmarks that are currently included in TSPerf.\n', file=open(readmefile, "a"))
#Read the benchmark table from the CSV file and convert it to a table in md format
with open('Benchmarks.csv', 'r') as f:
table = csvtomd.csv_to_table(f, ',')
print(csvtomd.md_table(table), file=open(readmefile, "a"))
print('\n\n\n',file=open(readmefile, "a"))
print('A complete documentation of TSPerf, along with the instructions for submitting and reviewing implementations, \
can be found [here](./docs/tsperf_rules.md). The tables below show performance of implementations that are developed so far. Source code of \
implementations and instructions for reproducing their performance can be found in submission folders, which are linked in the first column.\n', file=open(readmefile, "a"))
### Write the Energy section
#============================
print('## Probabilistic energy forecasting performance board\n\n', file=open(readmefile, "a"))
print('The following table lists the current submissions for energy forecasting and their respective performances.\n\n', file=open(readmefile, "a"))
#Read the energy performance board from the CSV file and convert it to a table in md format
with open('TSPerfBoard-Energy.csv', 'r') as f:
table = csvtomd.csv_to_table(f, ',')
print(csvtomd.md_table(table), file=open(readmefile, "a"))
#Read Energy Performance Board CSV file
df = pd.read_csv('TSPerfBoard-Energy.csv', engine='python')
#df
#Plot ,'Pinball Loss' by 'Training and Scoring Cost($)' chart
fig4 = plt.figure(figsize=(12, 8), dpi= 80, facecolor='w', edgecolor='k') #this sets the plotting area size
fig4 = plot_perf('Training and Scoring Cost($)','Pinball Loss',df)
plt.savefig('../../docs/images/Energy-Cost.png')
#inserting the performance charts
print('\n\nThe following chart compares the submissions performance on accuracy in Pinball Loss vs. Training and Scoring cost in $:\n\n ', file=open(readmefile, "a"))
print('![EnergyPBLvsTime](./docs/images/Energy-Cost.png)' ,file=open(readmefile, "a"))
print('\n\n\n',file=open(readmefile, "a"))
#print the retail sales forecasting section
#========================================
print('## Retail sales forecasting performance board\n\n', file=open(readmefile, "a"))
print('The following table lists the current submissions for retail forecasting and their respective performances.\n\n', file=open(readmefile, "a"))
#Read the retail performance board from the CSV file and convert it to a table in md format
with open('TSPerfBoard-Retail.csv', 'r') as f:
table = csvtomd.csv_to_table(f, ',')
print(csvtomd.md_table(table), file=open(readmefile, "a"))
print('\n\n\n',file=open(readmefile, "a"))
#Read Retail Performance Board CSV file
df = pd.read_csv('TSPerfBoard-Retail.csv', engine='python')
#df
#Plot MAPE (%) by Training and Scoring Cost ($) chart
fig2 = plt.figure(figsize=(12, 8), dpi= 80, facecolor='w', edgecolor='k') #this sets the plotting area size
fig2 = plot_perf('Training and Scoring Cost ($)','MAPE (%)',df)
plt.savefig('../../docs/images/Retail-Cost.png')
#inserting the performance charts
print('\n\nThe following chart compares the submissions performance on accuracy in %MAPE vs. Training and Scoring cost in $:\n\n ', file=open(readmefile, "a"))
print('![RetailMAPEvsCost](./docs/images/Retail-Cost.png)' ,file=open(readmefile, "a"))
print('\n\n\n',file=open(readmefile, "a"))
#inserting build status badge
print('## Build Status\n\n', file=open(readmefile, "a"))
print('| Build Type | Branch | Status | | Branch | Status |' ,file=open(readmefile, "a"))
print('| --- | --- | --- | --- | --- | --- |' ,file=open(readmefile, "a"))
print('| **Python Linux CPU** | master | [![Build Status](https://dev.azure.com/best-practices/forecasting/_apis/build/status/python_unit_tests_base?branchName=master)](https://dev.azure.com/best-practices/forecasting/_build/latest?definitionId=12&branchName=master) | | staging | [![Build Status](https://dev.azure.com/best-practices/forecasting/_apis/build/status/python_unit_tests_base?branchName=chenhui/python_test_pipeline)](https://dev.azure.com/best-practices/forecasting/_build/latest?definitionId=12&branchName=chenhui/python_test_pipeline) |' ,file=open(readmefile, "a"))
print('| **R Linux CPU** | master | [![Build Status](https://dev.azure.com/best-practices/forecasting/_apis/build/status/Forecasting/r_unit_tests_prototype?branchName=master)](https://dev.azure.com/best-practices/forecasting/_build/latest?definitionId=9&branchName=master) | | staging | [![Build Status](https://dev.azure.com/best-practices/forecasting/_apis/build/status/Forecasting/r_unit_tests_prototype?branchName=zhouf/r_test_pipeline)](https://dev.azure.com/best-practices/forecasting/_build/latest?definitionId=9&branchName=zhouf/r_test_pipeline) |' ,file=open(readmefile, "a"))
print('\n\n\n',file=open(readmefile, "a"))
print('A new Readme.md file has been generated successfully.')

View file

@@ -0,0 +1 @@
placeholder