Add retail_turnover example, cleanup contrib folder (#203)

* Removing tsperf files from contrib/ (#199)

* removed tsperf from contrib directory

* modified contrib/readme

* address PR comments

* Hongooi/fable intro (#200)

Adds an introductory example that goes over the basics of time series analysis, using the tsibbledata::aus_retail dataset. Includes discussion on forecasting in general. Closes #57

* Tidyverts update (#202)

Updates the R code in the examples for the latest tidyverts package versions on CRAN.

Co-authored-by: vapaunic <15053814+vapaunic@users.noreply.github.com>
Former-commit-id: b98487f42c701e956e0301b33ec87827645b727d
Hong Ooi 2020-06-20 06:58:54 +10:00 committed by GitHub
Parent 6f8242a0f5
Commit b62c56cfc3
164 changed files: 1939 additions and 44547 deletions


@@ -14,13 +14,13 @@ get_forecasts <- function(mable, newdata, ...)
keyvars <- key_vars(fcast)
keyvars <- keyvars[-length(keyvars)]
indexvar <- index_var(fcast)
fcastvar <- as.character(attr(fcast, "response")[[1]])
fcastvar <- names(fcast)[length(keyvars) + 3]
fcast <- fcast %>%
as_tibble() %>%
pivot_wider(
id_cols=all_of(c(keyvars, indexvar)),
names_from=.model,
values_from=all_of(fcastvar))
values_from=.mean)
select(newdata, !!keyvars, !!indexvar, !!fcastvar) %>%
rename(.response=!!fcastvar) %>%
inner_join(fcast)


@@ -1,3 +1,11 @@
# Contrib
Independent or incubating algorithms and utilities are candidates for the `contrib` folder. This folder will house contributions which may not easily fit into the core repository, or which need time for code refactoring and the addition of necessary tests.
The contrib directory contains code which is not part of the main Forecasting repository, but which is deemed to be of interest and aligned with the goals of this repository. The code does not need to follow our strict coding guidelines, and is typically not tested via our DevOps pipeline.
Each project should live in its own subdirectory `/contrib/<project>` and contain a README.md file with a detailed description of what the project does and how to use it. In addition, when adding a new project, a brief description should be added to the table below.
## Projects
| Directory | Project description |
|---|---|
| [\<project name and link\>] | \<short project description\> |


@@ -1,57 +0,0 @@
## Download base image
FROM rocker/r-base
ADD ./conda_dependencies.yml /tmp
ADD ./install_R_dependencies.R /tmp
WORKDIR /tmp
## Install basic packages
RUN apt-get update
RUN apt-get install -y --no-install-recommends \
wget \
zlib1g-dev \
libssl-dev \
libssh2-1-dev \
libreadline-gplv2-dev \
libncursesw5-dev \
libsqlite3-dev \
libc6-dev \
libbz2-dev \
libffi-dev \
bzip2 \
build-essential \
checkinstall \
ca-certificates \
lsb-release \
apt-utils \
python3-pip \
vim
# Install miniconda
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
RUN bash ~/miniconda.sh -b -p $HOME/miniconda
ENV PATH="/root/miniconda/bin:${PATH}"
## Create conda environment
RUN conda update -y conda
RUN conda env create --file conda_dependencies.yml
# Install prerequisites of R packages
RUN apt-get install -y \
gfortran \
liblapack-dev \
liblapack3 \
libopenblas-base \
libopenblas-dev \
g++
## Install R dependencies from the dependency file copied into the container
# Use an MRAN snapshot URL to download packages archived on a specific date
RUN echo 'options(repos = list(CRAN = "http://mran.revolutionanalytics.com/snapshot/2018-09-01/"))' >> /etc/R/Rprofile.site
RUN Rscript install_R_dependencies.R
RUN rm install_R_dependencies.R
RUN rm conda_dependencies.yml
RUN mkdir /Forecasting
WORKDIR /Forecasting
ENTRYPOINT ["/bin/bash"]


@@ -1,178 +0,0 @@
# Implementation submission form
## Submission information
**Submission date:** 11/14/2018
**Benchmark name:** GEFCom2017_D_Prob_MT_hourly
**Submitter(s):** Vanja Paunic
**Submitter(s) email:** vanja.paunic@microsoft.com
**Submission name:** GBM
**Submission path:** benchmarks/GEFCom2017_D_Prob_MT_hourly/GBM
## Implementation description
### Modelling approach
In this submission, we implement a simple Gradient Boosting Machine model for the quantile regression task using the `gbm` package in R.
### Feature engineering
The following features are used:
**LoadLag**: Average load based on the same-day and same-hour load values of the same week, the week before the same week, and the week after the same week of the previous three years, i.e. 9 values are averaged to compute this feature.
**DryBulbLag**: Average DryBulb temperature based on the same-hour DryBulb values of the same day, the day before the same day, and the day after the same day of the previous three years, i.e. 9 values are averaged to compute this feature.
**Weekly Fourier Series**: weekly_sin_1, weekly_cos_1, weekly_sin_2, weekly_cos_2, weekly_sin_3, weekly_cos_3
**Annual Fourier Series**: annual_sin_1, annual_cos_1, annual_sin_2, annual_cos_2, annual_sin_3, annual_cos_3
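The Fourier-series features above depend only on the timestamp. Below is a minimal Python sketch of how such terms can be computed, assuming hourly timestamps and the column-name pattern listed above (`weekly_sin_1`, ..., `annual_cos_3`); the actual features are generated by `compute_features.py`.
```python
import numpy as np
import pandas as pd

def fourier_features(index, period_hours, n_harmonics, prefix):
    """Sin/cos Fourier terms for one seasonal period (weekly, annual, ...)."""
    # Hours elapsed since the epoch, derived from nanosecond timestamps.
    hours = index.astype("int64") / (3600 * 10**9)
    feats = {}
    for k in range(1, n_harmonics + 1):
        angle = 2 * np.pi * k * hours / period_hours
        feats["{}_sin_{}".format(prefix, k)] = np.sin(angle)
        feats["{}_cos_{}".format(prefix, k)] = np.cos(angle)
    return pd.DataFrame(feats, index=index)

idx = pd.date_range("2016-01-01", periods=24 * 7, freq="H")
weekly = fourier_features(idx, 24 * 7, 3, "weekly")       # weekly_sin_1 ... weekly_cos_3
annual = fourier_features(idx, 24 * 365.25, 3, "annual")  # annual_sin_1 ... annual_cos_3
```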
### Model tuning
The data from January to April 2016 were used as a validation dataset for some minor model tuning. Based on the model performance on this validation dataset, a larger feature set was narrowed down to the features described above. No parameter tuning was done.
### Description of implementation scripts
* `compute_features.py`: Python script for computing features and generating feature files.
* `train_predict.R`: R script that trains a Gradient Boosting Machine model for the quantile regression task and predicts on each round of test data.
* `train_score_vm.sh`: Bash script that runs `compute_features.py` and `train_predict.R` five times to generate five submission files and measure model running time.
### Steps to reproduce results
1. Follow the instructions [here](#resource-deployment-instructions) to provision a Linux virtual machine and log into the provisioned
VM.
2. Clone the Forecasting repository to the home directory of your machine
```bash
cd ~
git clone https://github.com/Microsoft/Forecasting.git
```
Use one of the following options to securely connect to the Git repo:
* [Personal Access Tokens](https://help.github.com/articles/creating-a-personal-access-token-for-the-command-line/)
For this method, the clone command becomes
```bash
git clone https://<username>:<personal access token>@github.com/Microsoft/Forecasting.git
```
* [Git Credential Managers](https://github.com/Microsoft/Git-Credential-Manager-for-Windows)
* [Authenticate with SSH](https://help.github.com/articles/connecting-to-github-with-ssh/)
3. Create a conda environment for running the scripts of data downloading, data preparation, and result evaluation.
To do this, first check whether conda is installed by running the command `conda -V`. If it is installed, you will see the conda version in the terminal. Otherwise, please follow the instructions [here](https://conda.io/docs/user-guide/install/linux.html) to install conda.
From the `~/Forecasting` directory on the VM, create a conda environment named `tsperf` by running:
```bash
conda env create --file tsperf/benchmarking/conda_dependencies.yml
```
4. Download and extract data **on the VM**.
```bash
source activate tsperf
python tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/download_data.py
python tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/extract_data.py
```
5. Prepare Docker container for model training and predicting.
> NOTE: To execute Docker commands without sudo as a non-root user, you need to create a Unix group and add users to it by following the instructions [here](https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user). Otherwise, simply prefix all Docker commands with `sudo`.
5.1 Make sure Docker is installed
You can check if Docker is installed on your VM by running
```bash
sudo docker -v
```
You will see the Docker version if Docker is installed. If not, you can install it by following the instructions [here](https://docs.docker.com/install/linux/docker-ce/ubuntu/).
5.2 Build a local Docker image
```bash
sudo docker build -t gbm_image benchmarks/GEFCom2017_D_Prob_MT_hourly/GBM
```
6. Train and predict **within Docker container**
6.1 Start a Docker container from the image
```bash
sudo docker run -it -v ~/Forecasting:/Forecasting --name gbm_container gbm_image
```
Note that option `-v ~/Forecasting:/Forecasting` mounts the `~/Forecasting` folder (the one you cloned) to the container so that you can access the code and data on your VM within the container.
6.2 Train and predict
```
source activate tsperf
cd /Forecasting
bash benchmarks/GEFCom2017_D_Prob_MT_hourly/GBM/train_score_vm.sh > out.txt &
```
After generating the forecast results, you can exit the Docker container with the `exit` command.
7. Model evaluation **on the VM**
```bash
source activate tsperf
cd ~/Forecasting
bash tsperf/benchmarking/evaluate GBM tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly
```
## Implementation resources
**Platform:** Azure Cloud
**Resource location:** East US region
**Hardware:** Standard D8s v3 (8 vcpus, 32 GB memory) Ubuntu Linux VM
**Data storage:** Premium SSD
**Dockerfile:** [energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/GBM/Dockerfile](https://github.com/Microsoft/Forecasting/blob/master/energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/GBM/Dockerfile)
**Key packages/dependencies:**
* Python
- python==3.7
* R
- r-base==3.5.3
- gbm==2.1.3
- data.table==1.11.4
## Resource deployment instructions
Please follow the instructions below to deploy the Linux DSVM.
- Create an Azure account and log into the [Azure portal](https://portal.azure.com/)
- Refer to the steps [here](https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/dsvm-ubuntu-intro) to deploy a *Data Science Virtual Machine for Linux (Ubuntu)*. Select *D8s_v3* as the virtual machine size.
## Implementation evaluation
**Quality:**
* Pinball loss run 1: 78.85
* Pinball loss run 2: 78.84
* Pinball loss run 3: 78.86
* Pinball loss run 4: 78.76
* Pinball loss run 5: 78.82
Median Pinball loss: **78.84**
**Time:**
* Run time 1: 268 seconds
* Run time 2: 269 seconds
* Run time 3: 269 seconds
* Run time 4: 269 seconds
* Run time 5: 266 seconds
Median run time: **269 seconds**
**Cost:**
The hourly cost of the Standard D8s Ubuntu Linux VM in East US Azure region is 0.3840 USD, based on the price at the submission date. Thus, the total cost is `269/3600 * 0.3840 = $0.0287`.
**Average relative improvement (in %) over GEFCom2017 benchmark model** (measured over the first run)
Round 1: 9.55
Round 2: 18.24
Round 3: 17.90
Round 4: 8.27
Round 5: 7.22
Round 6: 6.80
**Ranking in the qualifying round of GEFCom2017 competition**
4


@@ -1,66 +0,0 @@
"""
This script uses
tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/feature_engineering.py to
compute a list of features needed by the Gradient Boosting Machines model.
"""
import os
import sys
import getopt
import localpath
from tsperf.benchmarking.GEFCom2017_D_Prob_MT_hourly.feature_engineering import compute_features
SUBMISSIONS_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
DATA_DIR = os.path.join(SUBMISSIONS_DIR, "data")
print("Data directory used: {}".format(DATA_DIR))
OUTPUT_DIR = os.path.join(DATA_DIR, "features")
TRAIN_DATA_DIR = os.path.join(DATA_DIR, "train")
TEST_DATA_DIR = os.path.join(DATA_DIR, "test")
DF_CONFIG = {
"time_col_name": "Datetime",
"ts_id_col_names": "Zone",
"target_col_name": "DEMAND",
"frequency": "H",
"time_format": "%Y-%m-%d %H:%M:%S",
}
# Feature configuration list used to specify the features to be computed by
# compute_features.
# Each feature configuration is a tuple in the format of (feature_name,
# featurizer_args)
# feature_name is used to determine the featurizer to use, see FEATURE_MAP in
# tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/feature_engineering.py
# featurizer_args is a dictionary of arguments passed to the
# featurizer
feature_config_list = [
("temporal", {"feature_list": ["hour_of_day", "month_of_year"]}),
("annual_fourier", {"n_harmonics": 3}),
("weekly_fourier", {"n_harmonics": 3}),
("previous_year_load_lag", {"input_col_names": "DEMAND", "round_agg_result": True},),
("previous_year_temp_lag", {"input_col_names": "DryBulb", "round_agg_result": True},),
]
if __name__ == "__main__":
opts, args = getopt.getopt(sys.argv[1:], "", ["submission="])
for opt, arg in opts:
if opt == "--submission":
submission_folder = arg
output_data_dir = os.path.join(SUBMISSIONS_DIR, submission_folder, "data")
if not os.path.isdir(output_data_dir):
os.mkdir(output_data_dir)
OUTPUT_DIR = os.path.join(output_data_dir, "features")
if not os.path.isdir(OUTPUT_DIR):
os.mkdir(OUTPUT_DIR)
compute_features(
TRAIN_DATA_DIR,
TEST_DATA_DIR,
OUTPUT_DIR,
DF_CONFIG,
feature_config_list,
filter_by_month=True,
compute_load_ratio=True,
)


@@ -1,10 +0,0 @@
name: tsperf
channels:
- defaults
dependencies:
- python=3.6
- numpy=1.15.1
- pandas=0.23.4
- xlrd=1.1.0
- urllib3=1.21.1
- scikit-learn=0.20.3


@@ -1,7 +0,0 @@
pkgs <- c(
'data.table',
'gbm',
'doParallel'
)
install.packages(pkgs)


@@ -1,11 +0,0 @@
"""
This script inserts the TSPerf directory into sys.path, so that scripts can import all the modules in TSPerf. Each submission folder needs its own localpath.py file.
"""
import os, sys
_CURR_DIR = os.path.dirname(os.path.abspath(__file__))
TSPERF_DIR = os.path.dirname(os.path.dirname(os.path.dirname(_CURR_DIR)))
if TSPERF_DIR not in sys.path:
sys.path.insert(0, TSPERF_DIR)


@@ -1,101 +0,0 @@
args = commandArgs(trailingOnly=TRUE)
seed_value = args[1]
library('data.table')
library('gbm')
library('doParallel')
n_cores = detectCores()
cl <- parallel::makeCluster(n_cores)
parallel::clusterEvalQ(cl, lapply(c("gbm", "data.table"), library, character.only = TRUE))
registerDoParallel(cl)
data_dir = 'benchmarks/GEFCom2017_D_Prob_MT_hourly/GBM/data/features'
train_dir = file.path(data_dir, 'train')
test_dir = file.path(data_dir, 'test')
train_file_prefix = 'train_round_'
test_file_prefix = 'test_round_'
output_file = file.path(paste('benchmarks/GEFCom2017_D_Prob_MT_hourly/GBM/submission_seed_', seed_value, '.csv', sep=""))
normalize_columns = list( 'DEMAND_same_woy_lag', 'DryBulb_same_doy_lag')
quantiles = seq(0.1, 0.9, by = 0.1)
result_all = list()
N_ROUNDS = 6
for (iR in 1:N_ROUNDS){
print(paste('Round', iR))
train_file = file.path(train_dir, paste(train_file_prefix, iR, '.csv', sep=''))
test_file = file.path(test_dir, paste(test_file_prefix, iR, '.csv', sep=''))
train_df = fread(train_file)
test_df = fread(test_file)
for (c in normalize_columns){
min_c = min(train_df[, ..c])
max_c = max(train_df[, ..c])
train_df[, c] = (train_df[, ..c] - min_c)/(max_c - min_c)
test_df[, c] = (test_df[, ..c] - min_c)/(max_c - min_c)
}
zones = unique(train_df[, Zone])
hours = unique(train_df[, hour_of_day])
all_zones_hours = expand.grid(zones, hours)
colnames(all_zones_hours) = c('Zone', 'hour_of_day')
test_df$average_load_ratio = rowMeans(test_df[,c('recent_load_ratio_10', 'recent_load_ratio_11', 'recent_load_ratio_12',
'recent_load_ratio_13', 'recent_load_ratio_14', 'recent_load_ratio_15', 'recent_load_ratio_16')], na.rm=TRUE)
test_df[, load_ratio:=mean(average_load_ratio), by=list(hour_of_day, month_of_year)]
ntrees = 1000
shrinkage = 0.005
result_all_zones_hours = foreach(i = 1:nrow(all_zones_hours), .combine = rbind) %dopar%{
set.seed(seed_value)
z = all_zones_hours[i, 'Zone']
h = all_zones_hours[i, 'hour_of_day']
train_df_sub = train_df[Zone == z & hour_of_day == h]
test_df_sub = test_df[Zone == z & hour_of_day == h]
result_all_quantiles = list()
q_counter = 1
for (tau in quantiles) {
result = data.table(Zone=test_df_sub$Zone, Datetime = test_df_sub$Datetime, Round=iR)
gbmModel = gbm(formula = DEMAND ~ DEMAND_same_woy_lag + DryBulb_same_doy_lag +
annual_sin_1 + annual_cos_1 + annual_sin_2 + annual_cos_2 + annual_sin_3 + annual_cos_3 +
weekly_sin_1 + weekly_cos_1 + weekly_sin_2 + weekly_cos_2 + weekly_sin_3 + weekly_cos_3,
distribution = list(name = "quantile", alpha = tau),
data = train_df_sub,
n.trees = ntrees,
shrinkage = shrinkage)
gbmPredictions = predict(object = gbmModel,
newdata = test_df_sub,
n.trees = ntrees,
type = "response") * test_df_sub$load_ratio
result$Prediction = gbmPredictions
result$q = tau
result_all_quantiles[[q_counter]] = result
q_counter = q_counter + 1
}
rbindlist(result_all_quantiles)
}
result_all[[iR]] = result_all_zones_hours
}
result_final = rbindlist(result_all)
# Sort the predictions within each (Zone, Datetime, Round) group and re-assign
# the quantile levels in increasing order, to prevent quantile crossing
result_final = result_final[order(Prediction), q:=quantiles, by=c('Zone', 'Datetime', 'Round')]
result_final$Prediction = round(result_final$Prediction)
fwrite(result_final, output_file)
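The quantile-sorting step above is a monotone rearrangement: within each (Zone, Datetime, Round) group, the nine predictions are sorted and re-labelled with increasing quantile levels, which removes quantile crossing. A small Python sketch of the same idea, using made-up numbers:
```python
import numpy as np

quantiles = np.arange(0.1, 1.0, 0.1)
# Hypothetical predictions for one (Zone, Datetime, Round) group; note the crossings.
preds = np.array([300.0, 295.0, 310.0, 305.0, 320.0, 318.0, 330.0, 340.0, 335.0])
rearranged = np.sort(preds)  # now non-decreasing in the quantile level
for q, p in zip(quantiles, rearranged):
    print("q={:.1f}: {:.0f}".format(q, p))
```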


@@ -1,15 +0,0 @@
#!/bin/bash
path=benchmarks/GEFCom2017_D_Prob_MT_hourly
for i in `seq 1 5`;
do
echo "Run $i"
start=`date +%s`
echo 'Creating features...'
python $path/GBM/compute_features.py --submission GBM
echo 'Training and predicting...'
Rscript $path/GBM/train_predict.R $i
end=`date +%s`
echo 'Running time '$((end-start))' seconds'
done


@@ -1,111 +0,0 @@
# Problem
Probabilistic load forecasting (PLF) has become increasingly important in
power systems planning and operations in recent years. The applications of PLF
include energy production planning, reliability analysis, probabilistic price
forecasting, etc.
The task of this benchmark is to generate probabilistic forecasts of
electricity load on the GEFCom2017 competition qualifying match data. The
forecast horizon is 1 to 2 months ahead and the granularity is hourly; see [Training
and test separation](#training-and-test-data-separation) for details. The
forecasts should be in the form of 9 quantiles, i.e. the 10th, 20th, ..., 90th
percentiles, following the format of the provided template file. There are 10
time series (zones) to forecast: the 8 ISO New England zones, the
Massachusetts zone (the sum of the three Massachusetts zones), and the total (the
sum of the first 8 zones).
The table below summarizes the benchmark problem definition:
|||
| ----------------------------------- | ---- |
| **Number of time series** | 10 |
| **Forecast frequency** | twice a month, at mid-month and end of month |
| **Forecast granularity** | hourly |
| **Forecast type** | probabilistic, 9 quantiles: 10th, 20th, ...90th percentiles|
A template of the submission file can be found [here](https://github.com/Microsoft/Forecasting/blob/master/benchmarks/GEFCom2017_D_Prob_MT_hourly/sample_submission.csv)
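For illustration, the snippet below sketches the row layout implied by the training scripts in this commit (columns `Zone`, `Datetime`, `Round`, `Prediction`, `q`, one row per quantile level); the linked template remains the authoritative format, and the numbers here are made up.
```python
import pandas as pd

# One zone and timestamp, nine rows: one per quantile level 0.1 ... 0.9.
rows = [
    {"Zone": "CT", "Datetime": "2017-01-01 01:00:00", "Round": 1,
     "Prediction": 2800.0 + 50 * i, "q": round(0.1 * i, 1)}
    for i in range(1, 10)
]
print(pd.DataFrame(rows))
```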
# Data
### Dataset attribution
[ISO New England](https://www.iso-ne.com/isoexpress/web/reports/load-and-demand/-/tree/zone-info)
### Dataset description
1. The data files can be downloaded from ISO New England website via the
[zonal information page of the energy, load and demand reports](https://www.iso-ne.com/isoexpress/web/reports/load-and-demand/-/tree/zone-info). If you
are outside the United States, you may need a VPN to access the data. Use columns
A, B, D, M and N in the worksheets of "YYYY SMD Hourly Data" files, where YYYY
represents the year. Detailed information of each column can be found in the
"Notes" sheet of the data files.
The script energy_load/GEFCom2017_D_Prob_MT_hourly/common/download_data.py downloads the load data to energy_load/GEFCom2017_D_Prob_MT_hourly/data/.
2. US Federal Holidays as published via [US Office of Personnel Management](https://www.opm.gov/policy-data-oversight/snow-dismissal-procedures/federal-holidays/).
This data can be found [here](https://github.com/Microsoft/Forecasting/blob/master/common/us_holidays.csv).
### Data preprocessing
The script energy_load/GEFCom2017_D_Prob_MT_hourly/common/extract_data.py
parses the excel files and creates training and testing csv load files. The
following preprocessing steps are performed by this script:
* Map the holiday names to integers and join holiday data with load data.
* When the `--preprocess` argument is True, zero load values are filled with
the values of the same hour of the previous day, and outliers caused by the end of
Daylight Saving Time are divided by 2.
* In addition to the eight zones in the Excel files, 'SEMA', 'WCMA', and 'NEMA'
are aggregated to generate the MA_TOTAL zone, and all eight zones are aggregated
to generate the TOTAL zone.
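A minimal pandas sketch of the two cleaning rules and the zone aggregation described above, assuming a long-format frame with `Datetime`, `Zone`, and `DEMAND` columns and complete hourly records; the authoritative logic lives in `extract_data.py`.
```python
import pandas as pd

def clean_and_aggregate(df, is_dst_end=None):
    """Illustrative only: fill zero loads, halve DST outliers, add aggregate zones."""
    df = df.sort_values(["Zone", "Datetime"]).reset_index(drop=True)
    # Fill zero load values with the value of the same hour on the previous day.
    prev_day = df.groupby("Zone")["DEMAND"].shift(24)
    zero = df["DEMAND"] == 0
    df.loc[zero, "DEMAND"] = prev_day[zero]
    # Halve the duplicated-hour outliers at the end of Daylight Saving Time;
    # is_dst_end is an assumed boolean mask derived from the calendar.
    if is_dst_end is not None:
        df.loc[is_dst_end, "DEMAND"] /= 2
    # MA_TOTAL aggregates the three Massachusetts zones; TOTAL aggregates all eight.
    ma = df[df["Zone"].isin(["SEMA", "WCMA", "NEMA"])]
    ma_total = ma.groupby("Datetime", as_index=False)["DEMAND"].sum().assign(Zone="MA_TOTAL")
    total = df.groupby("Datetime", as_index=False)["DEMAND"].sum().assign(Zone="TOTAL")
    return pd.concat([df, ma_total, total], ignore_index=True)
```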
### Training and test data separation
For this problem, you are provided successive folds of training data. The goal
is to generate forecasts for the forecast periods listed in the table below,
using the available training data:
| **Round** | **Train period start** | **Train period end** | **Forecast period start** | **Forecast period end** |
| -------- | --------------- | ------------------ | ------------------------- | ----------------------- |
| 1 | 2011-01-01 01:00:00 | 2016-11-30 00:00:00 | 2017-01-01 01:00:00 | 2017-01-31 00:00:00 |
| 2 | 2011-01-01 01:00:00 | 2016-11-30 00:00:00 | 2017-02-01 01:00:00 | 2017-02-28 00:00:00 |
| 3 | 2011-01-01 01:00:00 | 2016-12-31 00:00:00 | 2017-02-01 01:00:00 | 2017-02-28 00:00:00 |
| 4 | 2011-01-01 01:00:00 | 2016-12-31 00:00:00 | 2017-03-01 01:00:00 | 2017-03-31 00:00:00 |
| 5 | 2011-01-01 01:00:00 | 2017-01-31 00:00:00 | 2017-03-01 01:00:00 | 2017-03-31 00:00:00 |
| 6 | 2011-01-01 01:00:00 | 2017-01-31 00:00:00 | 2017-04-01 01:00:00 | 2017-04-30 00:00:00 |
### Feature engineering
A common feature engineering script, common/feature_engineering.py, is provided to be used by individual submissions.
Below is an example of using this script.
The feature configuration list is used to specify the features to be computed by the compute_features function.
Each feature configuration is a tuple in the format of (feature_name, featurizer_args).
* feature_name is used to determine the featurizer to use, see FEATURE_MAP in
common/feature_engineering.py.
* featurizer_args is a dictionary of arguments passed to the featurizer.
```python
from energy_load.GEFCom2017_D_Prob_MT_hourly.common.feature_engineering\
import compute_features
DF_CONFIG = {
'time_col_name': 'Datetime',
'grain_col_name': 'Zone',
'value_col_name': 'DEMAND',
'frequency': 'hourly',
'time_format': '%Y-%m-%d %H:%M:%S'
}
feature_config_list = \
[('temporal', {'feature_list': ['hour_of_day', 'month_of_year']}),
('annual_fourier', {'n_harmonics': 3}),
('weekly_fourier', {'n_harmonics': 3}),
('previous_year_load_lag',
{'input_col_name': 'DEMAND', 'output_col_name': 'load_lag'}),
('previous_year_dry_bulb_lag',
{'input_col_name': 'DryBulb', 'output_col_name': 'dry_bulb_lag'})]
TRAIN_DATA_DIR = './data/train'
TEST_DATA_DIR = './data/test'
OUTPUT_DIR = './data/features'
compute_features(TRAIN_DATA_DIR, TEST_DATA_DIR, OUTPUT_DIR, DF_CONFIG,
feature_config_list,
filter_by_month=True)
```
# Model Evaluation
**Evaluation metric**: Pinball loss
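For reference, below is a self-contained Python sketch of the pinball loss at quantile level `tau`, averaged over the nine levels used in this benchmark; the numbers are made up, and the repository's `evaluate` script is the authoritative implementation.
```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Pinball loss for one quantile level tau in (0, 1)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return np.mean(np.maximum(tau * diff, (tau - 1) * diff))

# Toy example: a flat set of quantile forecasts for three observations.
y_true = np.array([100.0, 120.0, 90.0])
forecasts = {round(0.1 * k, 1): y_true * (0.80 + 0.05 * k) for k in range(1, 10)}
avg = np.mean([pinball_loss(y_true, f, q) for q, f in forecasts.items()])
print("average pinball loss: {:.2f}".format(avg))
```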


@@ -1,57 +0,0 @@
## Download base image
FROM rocker/r-base
ADD ./conda_dependencies.yml /tmp
ADD ./install_R_dependencies.R /tmp
WORKDIR /tmp
## Install basic packages
RUN apt-get update
RUN apt-get install -y --no-install-recommends \
wget \
zlib1g-dev \
libssl-dev \
libssh2-1-dev \
libreadline-gplv2-dev \
libncursesw5-dev \
libsqlite3-dev \
libc6-dev \
libbz2-dev \
libffi-dev \
bzip2 \
build-essential \
checkinstall \
ca-certificates \
lsb-release \
apt-utils \
python3-pip \
vim
# Install miniconda
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
RUN bash ~/miniconda.sh -b -p $HOME/miniconda
ENV PATH="/root/miniconda/bin:${PATH}"
## Create conda environment
RUN conda update -y conda
RUN conda env create --file conda_dependencies.yml
# Install prerequisites of R packages
RUN apt-get install -y \
gfortran \
liblapack-dev \
liblapack3 \
libopenblas-base \
libopenblas-dev \
g++
## Install R dependencies from the dependency file copied into the container
# Use an MRAN snapshot URL to download packages archived on a specific date
RUN echo 'options(repos = list(CRAN = "http://mran.revolutionanalytics.com/snapshot/2018-09-01/"))' >> /etc/R/Rprofile.site
RUN Rscript install_R_dependencies.R
RUN rm install_R_dependencies.R
RUN rm conda_dependencies.yml
RUN mkdir /Forecasting
WORKDIR /Forecasting
ENTRYPOINT ["/bin/bash"]


@@ -1,188 +0,0 @@
# Implementation submission form
## Submission information
**Submission date:** 09/14/2018
**Benchmark name:** GEFCom2017_D_Prob_MT_hourly
**Submitter(s):** Hong Lu
**Submitter(s) email:** honglu@microsoft.com
**Submission name:** baseline
**Submission path:** benchmarks/GEFCom2017_D_Prob_MT_hourly/baseline
## Implementation description
### Modelling approach
In this submission, we implement a simple quantile regression model using the `quantreg` package in R.
### Feature engineering
The following features are used:
**LoadLag**: Average load based on the same-day and same-hour load values of the same week, the week before the same week, and the week after the same week of the previous three years, i.e. 9 values are averaged to compute this feature.
**DryBulbLag**: Average DryBulb temperature based on the same-hour DryBulb values of the same day, the day before the same day, and the day after the same day of the previous three years, i.e. 9 values are averaged to compute this feature.
**Weekly Fourier Series**: weekly_sin_1, weekly_cos_1, weekly_sin_2, weekly_cos_2, weekly_sin_3, weekly_cos_3
**Annual Fourier Series**: annual_sin_1, annual_cos_1, annual_sin_2, annual_cos_2, annual_sin_3, annual_cos_3
### Model tuning
The data from January to April 2016 were used as a validation dataset for some minor model tuning. Based on the model performance on this validation dataset, a larger feature set was narrowed down to the features described above.
No parameter tuning was done.
### Description of implementation scripts
* `compute_features.py`: Python script for computing features and generating feature files.
* `train_predict.R`: R script that trains Quantile Regression models and predicts on each round of test data.
* `train_score_vm.sh`: Bash script that runs `compute_features.py` and `train_predict.R` five times to generate five submission files and measure model running time.
### Steps to reproduce results
1. Follow the instructions [here](#resource-deployment-instructions) to provision a Linux virtual machine and log into the provisioned
VM.
2. Clone the Forecasting repo to the home directory of your machine
```bash
cd ~
git clone https://github.com/Microsoft/Forecasting.git
```
Use one of the following options to securely connect to the Git repo:
* [Personal Access Tokens](https://help.github.com/articles/creating-a-personal-access-token-for-the-command-line/)
For this method, the clone command becomes
```bash
git clone https://<username>:<personal access token>@github.com/Microsoft/Forecasting.git
```
* [Git Credential Managers](https://github.com/Microsoft/Git-Credential-Manager-for-Windows)
* [Authenticate with SSH](https://help.github.com/articles/connecting-to-github-with-ssh/)
3. Create a conda environment for running the scripts of data downloading, data preparation, and result evaluation.
To do this, first check whether conda is installed by running the command `conda -V`. If it is installed, you will see the conda version in the terminal. Otherwise, please follow the instructions [here](https://conda.io/docs/user-guide/install/linux.html) to install conda.
Then, you can go to the `~/Forecasting` directory in the VM and create a conda environment named `tsperf` by running
```bash
cd ~/Forecasting
conda env create --file tsperf/benchmarking/conda_dependencies.yml
```
4. Download and extract data **on the VM**.
```bash
source activate tsperf
python tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/download_data.py
python tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/extract_data.py
```
5. Prepare Docker container for model training and predicting.
5.1 Make sure Docker is installed
You can check if Docker is installed on your VM by running
```bash
sudo docker -v
```
You will see the Docker version if Docker is installed. If not, you can install it by following the instructions [here](https://docs.docker.com/install/linux/docker-ce/ubuntu/). Note that if you want to execute Docker commands without sudo as a non-root user, you need to create a Unix group and add users to it by following the instructions [here](https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user).
5.2 Build a local Docker image
```bash
sudo docker build -t baseline_image benchmarks/GEFCom2017_D_Prob_MT_hourly/baseline
```
6. Train and predict **within Docker container**
6.1 Start a Docker container from the image
```bash
sudo docker run -it -v ~/Forecasting:/Forecasting --name baseline_container baseline_image
```
Note that option `-v ~/Forecasting:/Forecasting` mounts the `~/Forecasting` folder (the one you cloned) to the container so that you can access the code and data on your VM within the container.
6.2 Train and predict
```
source activate tsperf
cd /Forecasting
bash benchmarks/GEFCom2017_D_Prob_MT_hourly/baseline/train_score_vm.sh
```
After generating the forecast results, you can exit the Docker container with the `exit` command.
7. Model evaluation **on the VM**
```bash
source activate tsperf
cd ~/Forecasting
bash tsperf/benchmarking/evaluate baseline tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly
```
## Implementation resources
**Platform:** Azure Cloud
**Resource location:** East US region
**Hardware:** Standard D8s v3 (8 vcpus, 32 GB memory) Ubuntu Linux VM
**Data storage:** Premium SSD
**Dockerfile:** [energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/baseline/Dockerfile](https://github.com/Microsoft/Forecasting/blob/master/energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/baseline/Dockerfile)
**Key packages/dependencies:**
* Python
- python==3.7
* R
- r-base==3.5.3
- quantreg==5.34
- data.table==1.10.4.3
## Resource deployment instructions
Please follow the instructions below to deploy the Linux DSVM.
- Create an Azure account and log into the [Azure portal](https://portal.azure.com/)
- Refer to the steps [here](https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/dsvm-ubuntu-intro) to deploy a *Data Science Virtual Machine for Linux (Ubuntu)*. Select *D8s_v3* as the virtual machine size.
## Implementation evaluation
**Quality:**
Note there is no randomness in this baseline model, so the model quality is the same for all five runs.
* Pinball loss run 1: 84.12
* Pinball loss run 2: 84.12
* Pinball loss run 3: 84.12
* Pinball loss run 4: 84.12
* Pinball loss run 5: 84.12
* Median Pinball loss: 84.12
**Time:**
* Run time 1: 188 seconds
* Run time 2: 185 seconds
* Run time 3: 185 seconds
* Run time 4: 189 seconds
* Run time 5: 189 seconds
* Median run time: **188 seconds**
**Cost:**
The hourly cost of the Standard D8s Ubuntu Linux VM in East US Azure region is 0.3840 USD, based on the price at the submission date.
Thus, the total cost is 188/3600 * 0.3840 = $0.0201.
**Average relative improvement (in %) over GEFCom2017 benchmark model** (measured over the first run)
Round 1: -6.67
Round 2: 20.26
Round 3: 20.05
Round 4: -5.61
Round 5: -6.45
Round 6: 11.21
**Ranking in the qualifying round of GEFCom2017 competition**
10


@@ -1,67 +0,0 @@
"""
This script uses
tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/feature_engineering.py to
compute a list of features needed by the Quantile Regression model.
"""
import os
import sys
import getopt
import localpath
from tsperf.benchmarking.GEFCom2017_D_Prob_MT_hourly.feature_engineering import compute_features
SUBMISSIONS_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
DATA_DIR = os.path.join(SUBMISSIONS_DIR, "data")
print("Data directory used: {}".format(DATA_DIR))
OUTPUT_DIR = os.path.join(DATA_DIR, "features")
TRAIN_DATA_DIR = os.path.join(DATA_DIR, "train")
TEST_DATA_DIR = os.path.join(DATA_DIR, "test")
DF_CONFIG = {
"time_col_name": "Datetime",
"ts_id_col_names": "Zone",
"target_col_name": "DEMAND",
"frequency": "H",
"time_format": "%Y-%m-%d %H:%M:%S",
}
# Feature configuration list used to specify the features to be computed by
# compute_features.
# Each feature configuration is a tuple in the format of (feature_name,
# featurizer_args)
# feature_name is used to determine the featurizer to use, see FEATURE_MAP in
# tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/feature_engineering.py
# featurizer_args is a dictionary of arguments passed to the
# featurizer
feature_config_list = [
("temporal", {"feature_list": ["hour_of_day", "month_of_year"]}),
("annual_fourier", {"n_harmonics": 3}),
("weekly_fourier", {"n_harmonics": 3}),
("previous_year_load_lag", {"input_col_names": "DEMAND", "round_agg_result": True},),
("previous_year_temp_lag", {"input_col_names": "DryBulb", "round_agg_result": True},),
]
if __name__ == "__main__":
opts, args = getopt.getopt(sys.argv[1:], "", ["submission="])
for opt, arg in opts:
if opt == "--submission":
submission_folder = arg
output_data_dir = os.path.join(SUBMISSIONS_DIR, submission_folder, "data")
if not os.path.isdir(output_data_dir):
os.mkdir(output_data_dir)
OUTPUT_DIR = os.path.join(output_data_dir, "features")
if not os.path.isdir(OUTPUT_DIR):
os.mkdir(OUTPUT_DIR)
compute_features(
TRAIN_DATA_DIR,
TEST_DATA_DIR,
OUTPUT_DIR,
DF_CONFIG,
feature_config_list,
filter_by_month=True,
compute_load_ratio=True,
)


@@ -1,10 +0,0 @@
name: tsperf
channels:
- defaults
dependencies:
- python=3.6
- numpy=1.15.1
- pandas=0.23.4
- xlrd=1.1.0
- urllib3=1.21.1
- scikit-learn=0.20.3


@@ -1,7 +0,0 @@
pkgs <- c(
'data.table',
'quantreg',
'doParallel'
)
install.packages(pkgs)


@@ -1,13 +0,0 @@
"""
This script inserts the TSPerf directory into sys.path, so that scripts can
import all the modules in TSPerf. Each submission folder needs its own
localpath.py file.
"""
import os, sys
_CURR_DIR = os.path.dirname(os.path.abspath(__file__))
TSPERF_DIR = os.path.dirname(os.path.dirname(os.path.dirname(_CURR_DIR)))
if TSPERF_DIR not in sys.path:
sys.path.insert(0, TSPERF_DIR)


@@ -1,87 +0,0 @@
args = commandArgs(trailingOnly=TRUE)
seed_value = args[1]
library('data.table')
library('quantreg')
library('doParallel')
n_cores = detectCores()
cl <- parallel::makeCluster(n_cores)
parallel::clusterEvalQ(cl, lapply(c("quantreg", "data.table"), library, character.only = TRUE))
registerDoParallel(cl)
data_dir = 'benchmarks/GEFCom2017_D_Prob_MT_hourly/baseline/data/features'
train_dir = file.path(data_dir, 'train')
test_dir = file.path(data_dir, 'test')
train_file_prefix = 'train_round_'
test_file_prefix = 'test_round_'
output_file = file.path(paste('benchmarks/GEFCom2017_D_Prob_MT_hourly/baseline/submission_seed_', seed_value, '.csv', sep=""))
normalize_columns = list('DEMAND_same_woy_lag', 'DryBulb_same_doy_lag')
quantiles = seq(0.1, 0.9, by = 0.1)
result_all = list()
for (iR in 1:6){
print(paste('Round', iR))
train_file = file.path(train_dir, paste(train_file_prefix, iR, '.csv', sep=''))
test_file = file.path(test_dir, paste(test_file_prefix, iR, '.csv', sep=''))
train_df = fread(train_file)
test_df = fread(test_file)
for (c in normalize_columns){
min_c = min(train_df[, ..c])
max_c = max(train_df[, ..c])
train_df[, c] = (train_df[, ..c] - min_c)/(max_c - min_c)
test_df[, c] = (test_df[, ..c] - min_c)/(max_c - min_c)
}
test_df$average_load_ratio = rowMeans(test_df[,c('recent_load_ratio_10', 'recent_load_ratio_11', 'recent_load_ratio_12',
'recent_load_ratio_13', 'recent_load_ratio_14', 'recent_load_ratio_15', 'recent_load_ratio_16')], na.rm=TRUE)
test_df[, load_ratio:=mean(average_load_ratio), by=list(hour_of_day, month_of_year)]
zones = unique(train_df[, Zone])
hours = unique(train_df[, hour_of_day])
all_zones_hours = expand.grid(zones, hours)
colnames(all_zones_hours) = c('Zone', 'hour_of_day')
result_all_zones_hours = foreach(i = 1:nrow(all_zones_hours), .combine = rbind) %dopar%{
z = all_zones_hours[i, 'Zone']
h = all_zones_hours[i, 'hour_of_day']
train_df_sub = train_df[Zone == z & hour_of_day == h]
test_df_sub = test_df[Zone == z & hour_of_day == h]
result_all_quantiles = list()
q_counter = 1
for (tau in quantiles){
result = data.table(Zone=test_df_sub$Zone, Datetime = test_df_sub$Datetime, Round=iR)
model = rq(DEMAND ~ DEMAND_same_woy_lag + DryBulb_same_doy_lag +
annual_sin_1 + annual_cos_1 + annual_sin_2 + annual_cos_2 + annual_sin_3 + annual_cos_3 +
weekly_sin_1 + weekly_cos_1 + weekly_sin_2 + weekly_cos_2 + weekly_sin_3 + weekly_cos_3,
data=train_df_sub, tau = tau)
result$Prediction = predict(model, test_df_sub) * test_df_sub$load_ratio
result$q = tau
result_all_quantiles[[q_counter]] = result
q_counter = q_counter + 1
}
rbindlist(result_all_quantiles)
}
result_all[[iR]] = result_all_zones_hours
}
result_final = rbindlist(result_all)
# Sort the predictions within each (Zone, Datetime, Round) group and re-assign
# the quantile levels in increasing order, to prevent quantile crossing
result_final = result_final[order(Prediction), q:=quantiles, by=c('Zone', 'Datetime', 'Round')]
result_final$Prediction = round(result_final$Prediction)
fwrite(result_final, output_file)


@@ -1,15 +0,0 @@
#!/bin/bash
path=benchmarks/GEFCom2017_D_Prob_MT_hourly
for i in `seq 1 5`;
do
echo "Run $i"
start=`date +%s`
echo 'Creating features...'
python $path/baseline/compute_features.py --submission baseline
echo 'Training and predicting...'
Rscript $path/baseline/train_predict.R $i
end=`date +%s`
echo 'Running time '$((end-start))' seconds'
done


@@ -1,5 +0,0 @@
# Include the placeholder data directory in the repository.
# Ignore all files in this directory except the .gitignore file.
*
!.gitignore


@@ -1,426 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"# GEFCOM2017 Data Exploration Notebook"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Set up an environment"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To run this notebook, please download GEFCom 2017 dataset by executing these commands from the root folder of TSPerf:\n",
" \n",
" conda env create --file ./common/conda_dependencies.yml\n",
" source activate tsperf\n",
" python energy_load/GEFCom2017_D_Prob_MT_hourly/common/download_data.py\n",
" python energy_load/GEFCom2017_D_Prob_MT_hourly/common/extract_data.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Install dependencies"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: patsy in c:\\miniconda3\\envs\\myenv\\lib\\site-packages (0.5.1)\n",
"Requirement already satisfied: numpy>=1.4 in c:\\miniconda3\\envs\\myenv\\lib\\site-packages (from patsy) (1.15.4)\n",
"Requirement already satisfied: six in c:\\miniconda3\\envs\\myenv\\lib\\site-packages (from patsy) (1.11.0)\n",
"Requirement already satisfied: statsmodels in c:\\miniconda3\\envs\\myenv\\lib\\site-packages (0.9.0)\n"
]
}
],
"source": [
"!pip install patsy\n",
"!pip install statsmodels"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load training data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import warnings\n",
"warnings.simplefilter(\"ignore\")\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"from statsmodels.graphics.tsaplots import plot_acf, plot_pacf\n",
"from statsmodels.tsa.stattools import pacf\n",
"%matplotlib inline\n",
"plt.rcParams['figure.figsize'] = [15, 5]\n",
"\n",
"train_data_dir = '../data/train'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Combine train_base.csv and train_round_6.csv to get the entire training dataset. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_base = pd.read_csv(os.path.join(train_data_dir, 'train_base.csv'),parse_dates=['Datetime'])\n",
"train_round_6 = pd.read_csv(os.path.join(train_data_dir, 'train_round_6.csv'),parse_dates=['Datetime'])\n",
"train_all = pd.concat([train_base, train_round_6]).reset_index(drop=True)\n",
"train_all"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Check missing values and feature ranges"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check if there are missing values in any of the columns"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"Number of missing values: {}\".format(train_all.isna().sum().sum()))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Summary of the distribution of values of numeric columns"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_all.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show all distinct zones and their timespans"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_all.groupby('Zone')['Datetime'].agg([np.min, np.max]).reset_index().\\\n",
" rename(columns={'amin':'min time', 'amax':'max time'})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show summary of the distribution of DEMAND values across zones"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_all.groupby('Zone')['DEMAND'].agg([np.mean, np.min, np.max]).\\\n",
" rename(columns={'mean':'mean demand', 'amin':'min demand', 'amax':'max demand'}).\\\n",
" sort_values(by='mean demand').reset_index()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Compute correlations between different features"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_all[['DEMAND','DewPnt','DryBulb','Holiday']].corr()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This table shows that DewPnt and DryBulb features are highly correlated. Note that these temperature features can not be used directly in forecasting, because they are not available at forecasting time. However, lagged temperatures from the available training data can be used. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Visualize seasonalities in energy demand"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this section we show that DEMAND data has multiple seasonalities"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mean_demand = train_all.groupby('Datetime')['DEMAND'].mean()\n",
"mean_demand.plot(title=\"Mean demand over 6.5 years\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Daily seasonality"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following graph shows that mean energy consumption has daily seasonality. Energy consumption peaks around noon and then around 6pm. Also energy consumption drops significantly at night."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mean_demand[:24*3].plot(title=\"Mean demand over 3 days\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Weekly seasonality"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following graph shows that mean energy consumption has weekly seasonality. Energy consumption is higher at week days (January 3-7, January 10-14, January 17-22) and lower during weekend (January 1-2, January 8-9, January 15-16)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mean_demand[:24*21].plot(title=\"Mean demand over 21 days\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mean_total_daily_demand = mean_demand.resample('24h').sum()\n",
"weekday_mean_total_demand = mean_total_daily_demand[mean_total_daily_demand.index.dayofweek<5].mean()\n",
"weekend_mean_total_demand = mean_total_daily_demand[mean_total_daily_demand.index.dayofweek>=5].mean()\n",
"print('Total demand during weekday: {0:.2f} (averaged over all zones and weekdays)'.format(weekday_mean_total_demand))\n",
"print('Total demand during weekend day: {0:.2f} (averaged over all zones and weekend days)'.format(weekend_mean_total_demand))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Annual seasonality"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following graph shows that mean energy consumption has annual seasonality. Energy consumption increases in winter and summer and decreases in spring and fall."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mean_demand.resample('1m').sum().plot(title=\"Total monthly demand (averaged over all zones)\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Compute partial autocorrelation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following plot shows partial autocorrelation with of the lags up to 24 hours * 14 days = 2 weeks"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plot_pacf(mean_demand, lags=24*14)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This graph shows that most of the lags have very small correlation. In the next cell we find 20 lags with the largest partial autocorrelation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pacf_values, pacf_conf_intervals = pacf(mean_demand, nlags=24*14, alpha=0.05)\n",
"top20_lags = np.argsort(np.abs(pacf_values))[-2::-1][:20]\n",
"print(top20_lags)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pacf_values[top20_lags]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The lags with the highest correlation are from today (lags 1,2,13-19), from about a day ago (lags 22, 24, 25, 27), from 3 days ago (lag 73), from 6 days ago (lags 144, 145, 147) and from 7 days ago (lags 168, 169)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"95% confidence intervals of 20 lags with the largest partial autocorrelation:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pacf_conf_intervals[top20_lags]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The 95% confidence intervals of partial correlations of these lags do not contain zeros. Hence all these lags have statistically significant partial autocorrelation."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This analysis suggests to use these lags when developing feature sets of energy demand forecasting models. However, in this benchmark, the forecast horizon is 1 to 2 months ahead and most recent lags cannot be used as features. But features from the same hour, same day of week, and same week of year could be useful."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 1
}


@@ -1,57 +0,0 @@
## Download base image
FROM rocker/r-base
ADD ./conda_dependencies.yml /tmp
ADD ./install_R_dependencies.R /tmp
WORKDIR /tmp
## Install basic packages
RUN apt-get update
RUN apt-get install -y --no-install-recommends \
wget \
zlib1g-dev \
libssl-dev \
libssh2-1-dev \
libreadline-gplv2-dev \
libncursesw5-dev \
libsqlite3-dev \
libc6-dev \
libbz2-dev \
libffi-dev \
bzip2 \
build-essential \
checkinstall \
ca-certificates \
lsb-release \
apt-utils \
python3-pip \
vim
# Install miniconda
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
RUN bash ~/miniconda.sh -b -p $HOME/miniconda
ENV PATH="/root/miniconda/bin:${PATH}"
## Create conda environment
RUN conda update -y conda
RUN conda env create --file conda_dependencies.yml
# Install prerequisites of R packages
RUN apt-get install -y \
gfortran \
liblapack-dev \
liblapack3 \
libopenblas-base \
libopenblas-dev \
g++
## Install R dependencies from the dependency file copied into the container
# Use an MRAN snapshot URL to download packages archived on a specific date
RUN echo 'options(repos = list(CRAN = "http://mran.revolutionanalytics.com/snapshot/2018-09-01/"))' >> /etc/R/Rprofile.site
RUN Rscript install_R_dependencies.R
RUN rm install_R_dependencies.R
RUN rm conda_dependencies.yml
RUN mkdir /Forecasting
WORKDIR /Forecasting
ENTRYPOINT ["/bin/bash"]


@@ -1,243 +0,0 @@
# Implementation submission form
## Submission information
**Submission date:** 10/26/2018
**Benchmark name:** GEFCom2017_D_Prob_MT_hourly
**Submitter(s):** Fang Zhou
**Submitter(s) email:** zhouf@microsoft.com
**Submission name:** Quantile Regression Neural Network
**Submission path:** benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn
## Implementation description
### Modelling approach
In this submission, we implement a quantile regression neural network model using the `qrnn` package in R.
### Feature engineering
The following features are used:
**LoadLag**: Average load based on the same-day and same-hour load values of the same week, the week before the same week, and the week after the same week of the previous three years, i.e. 9 values are averaged to compute this feature.
**DryBulbLag**: Average DryBulb temperature based on the same-hour DryBulb values of the same day, the day before the same day, and the day after the same day of the previous three years, i.e. 9 values are averaged to compute this feature.
**Weekly Fourier Series**: weekly_sin_1, weekly_cos_1, weekly_sin_2, weekly_cos_2, weekly_sin_3, weekly_cos_3
**Annual Fourier Series**: annual_sin_1, annual_cos_1, annual_sin_2, annual_cos_2, annual_sin_3, annual_cos_3
### Model tuning
The data from January to April 2016 were used as a validation dataset for some minor model tuning. Based on the model performance on this validation dataset, a larger feature set was narrowed down to the features described above. Hyperparameter tuning is done on the data of the 6 training rounds, using 4 cross-validation folds with 6 forecasting rounds in each fold. The set of hyperparameters that yields the best cross-validation pinball loss is then used to train models and forecast energy load across all 6 forecast rounds.
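A hedged Python sketch of the grid search described above; the hyperparameter names mirror the arguments of `train_validate_aml.R` (`n_hidden_1`, `n_hidden_2`, `iter_max`, `penalty`), while the grid values and the `cv_pinball_loss` placeholder are illustrative stand-ins for the actual R training and validation runs.
```python
from itertools import product

# Hypothetical grid; the values are illustrative, not the ones used in this submission.
grid = {
    "n_hidden_1": [2, 4, 8],
    "n_hidden_2": [0, 2, 4],
    "iter_max": [50, 100],
    "penalty": [0.0, 0.001],
}

def cv_pinball_loss(params):
    # Placeholder: the real pipeline trains qrnn models on each of the 4 CV folds
    # (6 forecast rounds per fold) and averages the pinball loss over all of them.
    return sum(abs(v) for v in params.values())

candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
best = min(candidates, key=cv_pinball_loss)
print("best hyperparameters:", best)
```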
### Description of implementation scripts
Train and Predict:
* `compute_features.py`: Python script for computing features and generating feature files.
* `train_predict.R`: R script that trains Quantile Regression Neural Network models and predicts on each round of test data.
* `train_score_vm.sh`: Bash script that runs `compute_features.py` and `train_predict.R` five times to generate five submission files and measure model running time.
Tune hyperparameters using R:
* `cv_settings.json`: JSON script that sets cross validation folds.
* `train_validate.R`: R script that trains Quantile Regression Neural Network models, evaluates the loss on the validation data of each cross-validation round and forecast round for a given set of hyperparameters, and calculates the average loss. This script is used for grid search on the VM.
* `train_validate_vm.sh`: Bash script that runs `compute_features.py` and `train_validate.R` multiple times to generate cross validation result files and measure model tuning time.
Tune hyperparameters using AzureML HyperDrive:
* `cv_settings.json`: JSON script that sets cross validation folds.
* `train_validate_aml.R`: R script that trains Quantile Regression Neural Network models, evaluates the loss on the validation data of each cross-validation round and forecast round for a given set of hyperparameters, and calculates the average loss. This script is used as the entry script for HyperDrive.
* `aml_estimator.py`: Python script that passes the inputs and outputs between HyperDrive and the entry script `train_validate_aml.R`.
* `hyperparameter_tuning.ipynb`: Jupyter notebook that does hyperparameter tuning with AzureML HyperDrive.
### Steps to reproduce results
1. Follow the instructions [here](#resource-deployment-instructions) to provision a Linux virtual machine and log into the provisioned
VM.
2. Clone the Forecasting repo to the home directory of your machine
```bash
cd ~
git clone https://github.com/Microsoft/Forecasting.git
cd Forecasting
```
Use one of the following options to securely connect to the Git repo:
* [Personal Access Tokens](https://docs.microsoft.com/en-us/vsts/organizations/accounts/use-personal-access-tokens-to-authenticate?view=vsts)
For this method, the clone command becomes
```bash
git clone https://<username>:<personal access token>@github.com/Microsoft/Forecasting.git
```
* [Git Credential Managers](https://docs.microsoft.com/en-us/vsts/repos/git/set-up-credential-managers?view=vsts)
* [Authenticate with SSH](https://docs.microsoft.com/en-us/vsts/repos/git/use-ssh-keys-to-authenticate?view=vsts)
3. Create a conda environment for running the scripts of data downloading, data preparation, and result evaluation.
To do this, first check whether conda is installed by running the command `conda -V`. If it is installed, you will see the conda version in the terminal. Otherwise, please follow the instructions [here](https://conda.io/docs/user-guide/install/linux.html) to install conda.
Then, you can go to the `~/Forecasting` directory in the VM and create a conda environment named `tsperf` by running
```bash
cd ~/Forecasting
conda env create --file tsperf/benchmarking/conda_dependencies.yml
```
4. Download and extract data **on the VM**.
```bash
source activate tsperf
python tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/download_data.py
python tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/extract_data.py
```
5. Prepare Docker container for model training and predicting.
> NOTE: To execute Docker commands without sudo as a non-root user, you need to create a Unix group and add users to it by following the instructions
[here](https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user). Otherwise, simply prefix all Docker commands with `sudo`.
5.1 Make sure Docker is installed
You can check if Docker is installed on your VM by running
```bash
sudo docker -v
```
You will see the Docker version if Docker is installed. If not, you can install it by following the instructions [here](https://docs.docker.com/install/linux/docker-ce/ubuntu/).
5.2 Build a local Docker image
```bash
sudo docker build -t fnn_image benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn
```
6. Tune hyperparameters **within Docker container** or **with AzureML HyperDrive**.
6.1.1 Start a Docker container from the image
```bash
sudo docker run -it -v ~/Forecasting:/Forecasting --name fnn_cv_container fnn_image
```
Note that option `-v ~/Forecasting:/Forecasting` mounts the `~/Forecasting` folder (the one you cloned) to the container so that you can access the code and data on your VM within the container.
6.1.2 Train and validate
```
source activate tsperf
cd /Forecasting
nohup bash benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn/train_validate_vm.sh >& cv_out.txt &
```
After generating the cross-validation results, you can exit the Docker container with the `exit` command.
6.2 Do hyperparameter tuning with AzureML HyperDrive
To tune hyperparameters with AzureML HyperDrive, you don't need to create a local Docker container. You can do feature engineering on the VM with the command
```
cd ~/Forecasting
source activate tsperf
python benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn/compute_features.py
```
and then run through the Jupyter notebook `hyperparameter_tuning.ipynb` on the VM, with the conda env `tsperf` as the Jupyter kernel.
Based on the average pinball loss obtained for each set of hyperparameters, you can choose the best set and use it in the R script `train_predict.R`.
7. Train and predict **within Docker container**.
7.1 Start a Docker container from the image
```bash
sudo docker run -it -v ~/Forecasting:/Forecasting --name fnn_container fnn_image
```
Note that option `-v ~/Forecasting:/Forecasting` mounts the `~/Forecasting` folder (the one you cloned) to the container so that you can access the code and data on your VM within the container.
7.2 Train and predict
```
source activate tsperf
cd /Forecasting
nohup bash benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn/train_score_vm.sh >& out.txt &
```
The last command will take about 7 hours to complete. You can monitor its progress by checking the out.txt file. You can also disconnect from the VM during the run; after reconnecting, use the commands
```
sudo docker exec -it fnn_container /bin/bash
tail out.txt
```
to connect to the running container and check the status of the run.
After generating the forecast results, you can exit the Docker container with the `exit` command.
8. Model evaluation **on the VM**.
```bash
source activate tsperf
cd ~/Forecasting
bash tsperf/benchmarking/evaluate fnn tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly
```
## Implementation resources
**Platform:** Azure Cloud
**Resource location:** East US region
**Hardware:** Standard D8s v3 (8 vcpus, 32 GB memory) Ubuntu Linux VM
**Data storage:** Premium SSD
**Dockerfile:** [energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/fnn/Dockerfile](https://github.com/Microsoft/Forecasting/blob/master/energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/fnn/Dockerfile)
**Key packages/dependencies:**
* Python
- python==3.7
* R
- r-base==3.5.3
- qrnn==2.0.2
- data.table==1.10.4.3
- rjson==0.2.20 (optional for cv)
- doParallel==1.0.14 (optional for cv)
## Resource deployment instructions
Please follow the instructions below to deploy the Linux DSVM.
- Create an Azure account and log into the [Azure portal](https://portal.azure.com/)
- Refer to the steps [here](https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/dsvm-ubuntu-intro) to deploy a *Data Science Virtual Machine for Linux (Ubuntu)*. Select *D8s_v3* as the virtual machine size.
## Implementation evaluation
**Quality:**
* Pinball loss run 1: 79.54
* Pinball loss run 2: 78.32
* Pinball loss run 3: 80.06
* Pinball loss run 4: 80.12
* Pinball loss run 5: 80.13
* Median Pinball loss: 80.06
**Time:**
* Run time 1: 1092 seconds
* Run time 2: 1085 seconds
* Run time 3: 1062 seconds
* Run time 4: 1083 seconds
* Run time 5: 1110 seconds
* Median run time: 1085 seconds
**Cost:**
The hourly cost of the Standard D8s v3 Ubuntu Linux VM in the East US Azure region is 0.3840 USD, based on the price at the submission date.
Thus, the total cost is 1085/3600 * 0.3840 = 0.1157 USD.
**Average relative improvement (in %) over GEFCom2017 benchmark model** (measured over the first run)
* Round 1: 6.13
* Round 2: 19.20
* Round 3: 18.86
* Round 4: 3.84
* Round 5: 2.76
* Round 6: 11.10
**Ranking in the qualifying round of GEFCom2017 competition**
4
@ -1,70 +0,0 @@
"""
This script passes the input arguments of AzureML job to the R script train_validate_aml.R,
and then passes the output of train_validate_aml.R back to AzureML.
"""
import subprocess
import os
import sys
import getopt
import pandas as pd
from datetime import datetime
from azureml.core import Run
import time
start_time = time.time()
run = Run.get_submitted_run()  # handle to the current AzureML run, used for logging metrics
base_command = "Rscript train_validate_aml.R"
if __name__ == "__main__":
opts, args = getopt.getopt(
sys.argv[1:], "", ["path=", "cv_path=", "n_hidden_1=", "n_hidden_2=", "iter_max=", "penalty="]
)
for opt, arg in opts:
if opt == "--path":
path = arg
elif opt == "--cv_path":
cv_path = arg
elif opt == "--n_hidden_1":
n_hidden_1 = arg
elif opt == "--n_hidden_2":
n_hidden_2 = arg
elif opt == "--iter_max":
iter_max = arg
elif opt == "--penalty":
penalty = arg
time_stamp = datetime.now().strftime("%Y%m%d%H%M%S")
task = " ".join(
[
base_command,
"--path",
path,
"--cv_path",
cv_path,
"--n_hidden_1",
n_hidden_1,
"--n_hidden_2",
n_hidden_2,
"--iter_max",
iter_max,
"--penalty",
penalty,
"--time_stamp",
time_stamp,
]
)
# subprocess.call blocks until the R script finishes and returns its exit code
process = subprocess.call(task, shell=True)
output_file_name = "cv_output_" + time_stamp + ".csv"
result = pd.read_csv(os.path.join(cv_path, output_file_name))
APL = result["loss"].mean()
print(APL)
print("--- %s seconds ---" % (time.time() - start_time))
run.log("average pinball loss", APL)
@ -1,68 +0,0 @@
"""
This script uses
energy_load/GEFCom2017_D_Prob_MT_hourly/common/feature_engineering.py to
compute a list of features needed by the Feed-forward Neural Network model.
"""
import os
import sys
import getopt
import localpath
from tsperf.benchmarking.GEFCom2017_D_Prob_MT_hourly.feature_engineering import compute_features
SUBMISSIONS_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
DATA_DIR = os.path.join(SUBMISSIONS_DIR, "data")
print("Data directory used: {}".format(DATA_DIR))
OUTPUT_DIR = os.path.join(DATA_DIR, "features")
TRAIN_DATA_DIR = os.path.join(DATA_DIR, "train")
TEST_DATA_DIR = os.path.join(DATA_DIR, "test")
DF_CONFIG = {
"time_col_name": "Datetime",
"ts_id_col_names": "Zone",
"target_col_name": "DEMAND",
"frequency": "H",
"time_format": "%Y-%m-%d %H:%M:%S",
}
# Feature configuration list used to specify the features to be computed by
# compute_features.
# Each feature configuration is a tuple in the format of (feature_name,
# featurizer_args)
# feature_name is used to determine the featurizer to use, see FEATURE_MAP in
# energy_load/GEFCom2017_D_Prob_MT_hourly/common/feature_engineering.py
# featurizer_args is a dictionary of arguments passed to the
# featurizer
feature_config_list = [
("temporal", {"feature_list": ["hour_of_day", "month_of_year"]}),
("annual_fourier", {"n_harmonics": 3}),
("weekly_fourier", {"n_harmonics": 3}),
("previous_year_load_lag", {"input_col_names": "DEMAND", "round_agg_result": True},),
("previous_year_temp_lag", {"input_col_names": "DryBulb", "round_agg_result": True},),
]
if __name__ == "__main__":
opts, args = getopt.getopt(sys.argv[1:], "", ["submission="])
for opt, arg in opts:
if opt == "--submission":
submission_folder = arg
output_data_dir = os.path.join(SUBMISSIONS_DIR, submission_folder, "data")
if not os.path.isdir(output_data_dir):
os.mkdir(output_data_dir)
OUTPUT_DIR = os.path.join(output_data_dir, "features")
if not os.path.isdir(OUTPUT_DIR):
os.mkdir(OUTPUT_DIR)
compute_features(
TRAIN_DATA_DIR,
TEST_DATA_DIR,
OUTPUT_DIR,
DF_CONFIG,
feature_config_list,
filter_by_month=True,
compute_load_ratio=True,
)
@ -1,10 +0,0 @@
name: tsperf
channels:
- defaults
dependencies:
- python=3.6
- numpy=1.15.1
- pandas=0.23.4
- xlrd=1.1.0
- urllib3=1.21.1
- scikit-learn=0.20.3
@ -1,250 +0,0 @@
{
"cv_round_1": {
"1": {
"train_range": [
"2012-01-01 00:00:00",
"2012-11-30 23:00:00"
],
"validation_range": [
"2013-01-01 00:00:00",
"2013-01-31 23:00:00"
]
},
"2": {
"train_range": [
"2012-01-01 00:00:00",
"2012-11-30 23:00:00"
],
"validation_range": [
"2013-02-01 00:00:00",
"2013-02-28 23:00:00"
]
},
"3": {
"train_range": [
"2012-01-01 00:00:00",
"2012-12-31 23:00:00"
],
"validation_range": [
"2013-02-01 00:00:00",
"2013-02-28 23:00:00"
]
},
"4": {
"train_range": [
"2012-01-01 00:00:00",
"2012-12-31 23:00:00"
],
"validation_range": [
"2013-03-01 00:00:00",
"2013-03-31 23:00:00"
]
},
"5": {
"train_range": [
"2012-01-01 00:00:00",
"2013-01-31 23:00:00"
],
"validation_range": [
"2013-03-01 00:00:00",
"2013-03-31 23:00:00"
]
},
"6": {
"train_range": [
"2012-01-01 00:00:00",
"2013-01-31 23:00:00"
],
"validation_range": [
"2013-04-01 00:00:00",
"2013-04-30 23:00:00"
]
}
},
"cv_round_2": {
"1": {
"train_range": [
"2012-01-01 00:00:00",
"2013-11-30 23:00:00"
],
"validation_range": [
"2014-01-01 00:00:00",
"2014-01-31 23:00:00"
]
},
"2": {
"train_range": [
"2012-01-01 00:00:00",
"2013-11-30 23:00:00"
],
"validation_range": [
"2014-02-01 00:00:00",
"2014-02-28 23:00:00"
]
},
"3": {
"train_range": [
"2012-01-01 00:00:00",
"2013-12-31 23:00:00"
],
"validation_range": [
"2014-02-01 00:00:00",
"2014-02-28 23:00:00"
]
},
"4": {
"train_range": [
"2012-01-01 00:00:00",
"2013-12-31 23:00:00"
],
"validation_range": [
"2014-03-01 00:00:00",
"2014-03-31 23:00:00"
]
},
"5": {
"train_range": [
"2012-01-01 00:00:00",
"2014-01-31 23:00:00"
],
"validation_range": [
"2014-03-01 00:00:00",
"2014-03-31 23:00:00"
]
},
"6": {
"train_range": [
"2012-01-01 00:00:00",
"2014-01-31 23:00:00"
],
"validation_range": [
"2014-04-01 00:00:00",
"2014-04-30 23:00:00"
]
}
},
"cv_round_3": {
"1": {
"train_range": [
"2012-01-01 00:00:00",
"2014-11-30 23:00:00"
],
"validation_range": [
"2015-01-01 00:00:00",
"2015-01-31 23:00:00"
]
},
"2": {
"train_range": [
"2012-01-01 00:00:00",
"2014-11-30 23:00:00"
],
"validation_range": [
"2015-02-01 00:00:00",
"2015-02-28 23:00:00"
]
},
"3": {
"train_range": [
"2012-01-01 00:00:00",
"2014-12-31 23:00:00"
],
"validation_range": [
"2015-02-01 00:00:00",
"2015-02-28 23:00:00"
]
},
"4": {
"train_range": [
"2012-01-01 00:00:00",
"2014-12-31 23:00:00"
],
"validation_range": [
"2015-03-01 00:00:00",
"2015-03-31 23:00:00"
]
},
"5": {
"train_range": [
"2012-01-01 00:00:00",
"2015-01-31 23:00:00"
],
"validation_range": [
"2015-03-01 00:00:00",
"2015-03-31 23:00:00"
]
},
"6": {
"train_range": [
"2012-01-01 00:00:00",
"2015-01-31 23:00:00"
],
"validation_range": [
"2015-04-01 00:00:00",
"2015-04-30 23:00:00"
]
}
},
"cv_round_4": {
"1": {
"train_range": [
"2012-01-01 00:00:00",
"2015-11-30 23:00:00"
],
"validation_range": [
"2016-01-01 00:00:00",
"2016-01-31 23:00:00"
]
},
"2": {
"train_range": [
"2012-01-01 00:00:00",
"2015-11-30 23:00:00"
],
"validation_range": [
"2016-02-01 00:00:00",
"2016-02-29 23:00:00"
]
},
"3": {
"train_range": [
"2012-01-01 00:00:00",
"2015-12-31 23:00:00"
],
"validation_range": [
"2016-02-01 00:00:00",
"2016-02-29 23:00:00"
]
},
"4": {
"train_range": [
"2012-01-01 00:00:00",
"2015-12-31 23:00:00"
],
"validation_range": [
"2016-03-01 00:00:00",
"2016-03-31 23:00:00"
]
},
"5": {
"train_range": [
"2012-01-01 00:00:00",
"2016-01-31 23:00:00"
],
"validation_range": [
"2016-03-01 00:00:00",
"2016-03-31 23:00:00"
]
},
"6": {
"train_range": [
"2012-01-01 00:00:00",
"2016-01-31 23:00:00"
],
"validation_range": [
"2016-04-01 00:00:00",
"2016-04-30 23:00:00"
]
}
}
}
Diff not shown because of its large size.
@ -1,7 +0,0 @@
pkgs <- c(
'data.table',
'qrnn',
'doParallel'
)
install.packages(pkgs)
@ -1,12 +0,0 @@
"""
This script inserts the TSPerf directory into sys.path, so that scripts can import
all the modules in TSPerf. Each submission folder needs its own localpath.py file.
"""
import os, sys
_CURR_DIR = os.path.dirname(os.path.abspath(__file__))
TSPERF_DIR = os.path.dirname(os.path.dirname(os.path.dirname(_CURR_DIR)))
if TSPERF_DIR not in sys.path:
sys.path.insert(0, TSPERF_DIR)
@ -1,107 +0,0 @@
#!/usr/bin/Rscript
#
# This script trains the Quantile Regression Neural Network model and predicts on each data
# partition per zone and hour at each quantile point.
args = commandArgs(trailingOnly=TRUE)
seed_value = args[1]
library('data.table')
library('qrnn')
library('doParallel')
n_cores = detectCores()
cl <- parallel::makeCluster(n_cores)
parallel::clusterEvalQ(cl, lapply(c("qrnn", "data.table"), library, character.only = TRUE))
registerDoParallel(cl)
# Specify data directory
data_dir = 'benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn/data/features'
train_dir = file.path(data_dir, 'train')
test_dir = file.path(data_dir, 'test')
train_file_prefix = 'train_round_'
test_file_prefix = 'test_round_'
output_file = file.path(paste('benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn/submission_seed_', seed_value, '.csv', sep=""))
# Data and forecast parameters
normalize_columns = list('DEMAND_same_woy_lag', 'DryBulb_same_doy_lag')
quantiles = seq(0.1, 0.9, by = 0.1)
# Train and predict
result_all = list()
for (iR in 1:6){
print(paste('Round', iR))
train_file = file.path(train_dir, paste(train_file_prefix, iR, '.csv', sep=''))
test_file = file.path(test_dir, paste(test_file_prefix, iR, '.csv', sep=''))
train_df = fread(train_file)
test_df = fread(test_file)
for (c in normalize_columns){
min_c = min(train_df[, ..c])
max_c = max(train_df[, ..c])
train_df[, c] = (train_df[, ..c] - min_c)/(max_c - min_c)
test_df[, c] = (test_df[, ..c] - min_c)/(max_c - min_c)
}
zones = unique(train_df[, Zone])
hours = unique(train_df[, hour_of_day])
all_zones_hours = expand.grid(zones, hours)
colnames(all_zones_hours) = c('Zone', 'hour_of_day')
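# Load-growth adjustment: average the recent_load_ratio_10..16 features, then
# take the mean by (hour_of_day, month_of_year); the resulting load_ratio is
# used below to rescale the raw QRNN predictions.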
test_df$average_load_ratio = rowMeans(test_df[,c('recent_load_ratio_10', 'recent_load_ratio_11', 'recent_load_ratio_12',
'recent_load_ratio_13', 'recent_load_ratio_14', 'recent_load_ratio_15', 'recent_load_ratio_16')], na.rm=TRUE)
test_df[, load_ratio:=mean(average_load_ratio), by=list(hour_of_day, month_of_year)]
result_all_zones_hours = foreach(i = 1:nrow(all_zones_hours), .combine = rbind) %dopar%{
set.seed(seed_value)
z = all_zones_hours[i, 'Zone']
h = all_zones_hours[i, 'hour_of_day']
train_df_sub = train_df[Zone == z & hour_of_day == h]
test_df_sub = test_df[Zone == z & hour_of_day == h]
train_x <- as.matrix(train_df_sub[, c('DEMAND_same_woy_lag', 'DryBulb_same_doy_lag',
'annual_sin_1', 'annual_cos_1', 'annual_sin_2', 'annual_cos_2', 'annual_sin_3', 'annual_cos_3',
'weekly_sin_1', 'weekly_cos_1', 'weekly_sin_2', 'weekly_cos_2', 'weekly_sin_3', 'weekly_cos_3'),
drop=FALSE])
train_y <- as.matrix(train_df_sub[, c('DEMAND'), drop=FALSE])
test_x <- as.matrix(test_df_sub[, c('DEMAND_same_woy_lag', 'DryBulb_same_doy_lag',
'annual_sin_1', 'annual_cos_1', 'annual_sin_2', 'annual_cos_2', 'annual_sin_3', 'annual_cos_3',
'weekly_sin_1', 'weekly_cos_1', 'weekly_sin_2', 'weekly_cos_2', 'weekly_sin_3', 'weekly_cos_3'),
drop=FALSE])
result_all_quantiles = list()
q_counter = 1
for (tau in quantiles){
result = data.table(Zone=test_df_sub$Zone, Datetime = test_df_sub$Datetime, Round=iR)
model = qrnn2.fit(x=train_x, y=train_y,
n.hidden=8, n.hidden2=4,
tau=tau, Th=tanh,
iter.max=1,
penalty=0)
result$Prediction = qrnn2.predict(model, x=test_x) * test_df_sub$load_ratio
result$q = tau
result_all_quantiles[[q_counter]] = result
q_counter = q_counter + 1
}
rbindlist(result_all_quantiles)
}
result_all[[iR]] = result_all_zones_hours
}
result_final = rbindlist(result_all)
# Re-assign quantile labels to the sorted predictions within each (Zone, Datetime,
# Round) group, so forecasts are non-decreasing across quantiles (no quantile crossing)
result_final = result_final[order(Prediction), q:=quantiles, by=c('Zone', 'Datetime', 'Round')]
result_final$Prediction = round(result_final$Prediction)
fwrite(result_final, output_file)
@ -1,15 +0,0 @@
#!/bin/bash
path=benchmarks/GEFCom2017_D_Prob_MT_hourly
for i in `seq 1 5`;
do
echo "Run $i"
start=`date +%s`
echo 'Creating features...'
python $path/fnn/compute_features.py --submission fnn
echo 'Training and predicting...'
Rscript $path/fnn/train_predict.R $i
end=`date +%s`
echo 'Running time '$((end-start))' seconds'
done
@ -1,172 +0,0 @@
#!/usr/bin/Rscript
#
# This script trains Quantile Regression Neural Network models and evaluate the loss
# on validation data of each cross validation round and forecast round with a set of
# hyperparameters and calculate the average loss.
# This script is used for grid search on vm.
args = commandArgs(trailingOnly=TRUE)
parameter_set = args[1]
install.packages('rjson', repo="http://cran.r-project.org/")
install.packages('doParallel', repo="http://cran.r-project.org/")
library('data.table')
library('qrnn')
library('rjson')
library('doParallel')
cl <- parallel::makeCluster(4)
parallel::clusterEvalQ(cl, lapply(c("qrnn", "data.table"), library, character.only = TRUE))
registerDoParallel(cl)
# Specify data directory
data_dir = 'benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn/data/features'
train_dir = file.path(data_dir, 'train')
train_file_prefix = 'train_round_'
# Define parameter grid
n.hidden_choice = c(4, 8)
n.hidden2_choice = c(4, 8)
iter.max_choice = c(1, 2, 4, 6, 8)
penalty_choice = c(0, 0.001)
param_grid = expand.grid(n.hidden_choice,
n.hidden2_choice,
iter.max_choice,
penalty_choice)
colnames(param_grid) = c('n.hidden', 'n.hidden2', 'iter.max', 'penalty')
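# expand.grid enumerates all 2 x 2 x 5 x 2 = 40 hyperparameter combinations;
# the parameter_set command-line argument (1..40) selects a single row of this grid.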
parameter_names = colnames(param_grid)
parameter_values = param_grid[parameter_set, ]
output_file_name = 'cv_output'
for (j in 1:length(parameter_names)){
output_file_name = paste(output_file_name, parameter_names[j], parameter_values[j], sep="_")
}
output_file = file.path(paste('benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn/', output_file_name, sep=""))
# Define cross validation split settings
cv_file = file.path(paste('benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn/', 'cv_settings.json', sep=""))
cv_settings = fromJSON(file=cv_file)
# Parameters of model
n.hidden = as.integer(param_grid[parameter_set, 'n.hidden'])
n.hidden2 = as.integer(param_grid[parameter_set, 'n.hidden2'])
iter.max = as.integer(param_grid[parameter_set, 'iter.max'])
penalty = as.numeric(param_grid[parameter_set, 'penalty'])  # penalty can be fractional (e.g. 0.001), so don't coerce to integer
# Data and forecast parameters
features = c('DEMAND_same_woy_lag', 'DryBulb_same_doy_lag',
'annual_sin_1', 'annual_cos_1', 'annual_sin_2',
'annual_cos_2', 'annual_sin_3', 'annual_cos_3',
'weekly_sin_1', 'weekly_cos_1', 'weekly_sin_2',
'weekly_cos_2', 'weekly_sin_3', 'weekly_cos_3')
normalize_columns = list('DEMAND_same_woy_lag', 'DryBulb_same_doy_lag')
quantiles = seq(0.1, 0.9, by = 0.1)
subset_columns_train = c(features, 'DEMAND')
subset_columns_validation = c(features, 'DEMAND', 'Zone', 'Datetime', 'LoadRatio')
# Utility function
pinball_loss <- function(q, y, f) {
L = ifelse(y>=f, q * (y-f), (1-q) * (f-y))
return(L)
}
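# Example (illustrative): for q = 0.9, y = 100 and f = 90 the loss is
# 0.9 * (100 - 90) = 9, while over-forecasting by the same amount (f = 110)
# costs only (1 - 0.9) * 10 = 1, so high quantiles penalize under-forecasting.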
# Cross Validation
result_all = list()
counter = 1
for (i in 1:length(cv_settings)){
round = paste("cv_round_", i, sep='')
cv_settings_round = cv_settings[[round]]
print(round)
for (iR in 1:6){
print(iR)
train_file = file.path(train_dir, paste(train_file_prefix, as.character(iR), '.csv', sep=''))
cvdata_df = fread(train_file)
cv_settings_cur = cv_settings_round[[as.character(iR)]]
train_range = cv_settings_cur$train_range
validation_range = cv_settings_cur$validation_range
train_data = cvdata_df[Datetime >=train_range[1] & Datetime <= train_range[2]]
validation_data = cvdata_df[Datetime >= validation_range[1] & Datetime <= validation_range[2]]
zones = unique(validation_data$Zone)
hours = unique(validation_data$hour_of_day)
for (c in normalize_columns){
min_c = min(train_data[, ..c])
max_c = max(train_data[, ..c])
train_data[, c] = (train_data[, ..c] - min_c)/(max_c - min_c)
validation_data[, c] = (validation_data[, ..c] - min_c)/(max_c - min_c)
}
validation_data$AverageLoadRatio = rowMeans(validation_data[,c('recent_load_ratio_10', 'recent_load_ratio_11', 'recent_load_ratio_12',
'recent_load_ratio_13', 'recent_load_ratio_14', 'recent_load_ratio_15', 'recent_load_ratio_16')], na.rm=TRUE)
validation_data[, LoadRatio:=mean(AverageLoadRatio), by=list(hour_of_day, month_of_year)]
result_all_zones = foreach(z = zones, .combine = rbind) %dopar% {
print(paste('Zone', z))
result_all_hours = list()
hour_counter = 1
for (h in hours){
train_df_sub = train_data[Zone == z & hour_of_day == h, ..subset_columns_train]
validation_df_sub = validation_data[Zone == z & hour_of_day == h, ..subset_columns_validation]
result = data.table(Zone=validation_df_sub$Zone, Datetime=validation_df_sub$Datetime, Round=iR, CVRound=i)
train_x <- as.matrix(train_df_sub[, ..features, drop=FALSE])
train_y <- as.matrix(train_df_sub[, c('DEMAND'), drop=FALSE])
validation_x <- as.matrix(validation_df_sub[, ..features, drop=FALSE])
result_all_quantiles = list()
quantile_counter = 1
for (tau in quantiles){
model = qrnn2.fit(x=train_x, y=train_y,
n.hidden=n.hidden, n.hidden2=n.hidden2,
tau=tau, Th=tanh,
iter.max=iter.max,
penalty=penalty)
result$Prediction = qrnn2.predict(model, x=validation_x) * validation_df_sub$LoadRatio
result$DEMAND = validation_df_sub$DEMAND
result$loss = pinball_loss(tau, validation_df_sub$DEMAND, result$Prediction)
result$q = tau
result_all_quantiles[[quantile_counter]] = result
quantile_counter = quantile_counter + 1
}
result_all_hours[[hour_counter]] = rbindlist(result_all_quantiles)
hour_counter = hour_counter + 1
}
rbindlist(result_all_hours)
}
result_all[[counter]] = result_all_zones
counter = counter + 1
}
}
result_final = rbindlist(result_all)
average_PL = round(colMeans(result_final[, 'loss'], na.rm = TRUE), 2)
print(paste('Average Pinball Loss:', average_PL))
output_file_name = paste(output_file_name, 'APL', average_PL, sep="_")
output_file_name = paste(output_file_name, '.csv', sep="")
output_file = file.path(paste('benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn/', output_file_name, sep=""))
fwrite(result_final, output_file)
parallel::stopCluster(cl)
@ -1,171 +0,0 @@
#!/usr/bin/Rscript
#
# This script trains Quantile Regression Neural Network models and evaluate the loss
# on validation data of each cross validation round and forecast round with a set of
# hyperparameters and calculate the average loss.
# This script is used as the entry script for azureml hyperdrive.
args = commandArgs(trailingOnly=TRUE)
install.packages('qrnn', repo="http://cran.rstudio.com/")
install.packages('optparse', repo="http://cran.rstudio.com/")
library('data.table')
library('qrnn')
library("optparse")
library("rjson")
library('doParallel')
cl <- parallel::makeCluster(4)
parallel::clusterEvalQ(cl, lapply(c("qrnn", "data.table"), library, character.only = TRUE))
registerDoParallel(cl)
option_list = list(
make_option(c("-d", "--path"), type="character", default=NULL,
help="Path to the data files"),
make_option(c("-c", "--cv_path"), type="character", default=NULL,
help="Path to the cv setting files"),
make_option(c("-n", "--n_hidden_1"), type="integer", default=NULL,
help="Number of neurons in layer 1"),
make_option(c("-m", "--n_hidden_2"), type="integer", default=NULL,
help="Number of neurons in layer 2"),
make_option(c("-i", "--iter_max"), type="integer", default=NULL,
help="Number of maximum iterations"),
make_option(c("-p", "--penalty"), type="integer", default=NULL,
help="Penalty"),
make_option(c("-t", "--time_stamp"), type="character", default=NULL,
help="Timestamp")
);
opt_parser = OptionParser(option_list=option_list);
opt = parse_args(opt_parser)
path = opt$path
cvpath = opt$cv_path
n.hidden = opt$n_hidden_1
n.hidden2= opt$n_hidden_2
iter.max = opt$iter_max
penalty = opt$penalty
ts = opt$time_stamp
# Data directory
train_dir = path
train_file_prefix = 'train_round_'
# Define cross validation split settings
cv_file = file.path(cvpath, 'cv_settings.json')
cv_settings = fromJSON(file=cv_file)
# Data and forecast parameters
normalize_columns = list('DEMAND_same_woy_lag', 'DryBulb_same_doy_lag')
quantiles = seq(0.1, 0.9, by = 0.1)
# Utility function
pinball_loss <- function(q, y, f) {
L = ifelse(y>=f, q * (y-f), (1-q) * (f-y))
return(L)
}
# Cross Validation
result_all = list()
counter = 1
for (i in 1:length(cv_settings)){
round = paste("cv_round_", i, sep='')
cv_settings_round = cv_settings[[round]]
print(round)
for (iR in 1:6){
print(iR)
train_file = file.path(train_dir, paste(train_file_prefix, as.character(iR), '.csv', sep=''))
cvdata_df = fread(train_file)
cv_settings_cur = cv_settings_round[[as.character(iR)]]
train_range = cv_settings_cur$train_range
validation_range = cv_settings_cur$validation_range
train_data = cvdata_df[Datetime >=train_range[1] & Datetime <= train_range[2]]
validation_data = cvdata_df[Datetime >= validation_range[1] & Datetime <= validation_range[2]]
zones = unique(validation_data$Zone)
hours = unique(validation_data$hour_of_day)
for (c in normalize_columns){
min_c = min(train_data[, ..c])
max_c = max(train_data[, ..c])
train_data[, c] = (train_data[, ..c] - min_c)/(max_c - min_c)
validation_data[, c] = (validation_data[, ..c] - min_c)/(max_c - min_c)
}
validation_data$average_load_ratio = rowMeans(validation_data[, c('recent_load_ratio_10', 'recent_load_ratio_11', 'recent_load_ratio_12',
'recent_load_ratio_13', 'recent_load_ratio_14', 'recent_load_ratio_15', 'recent_load_ratio_16')], na.rm=TRUE)
validation_data[, load_ratio:=mean(average_load_ratio), by=list(hour_of_day, month_of_year)]
result_all_zones = foreach(z = zones, .combine = rbind) %dopar% {
print(paste('Zone', z))
features = c('DEMAND_same_woy_lag', 'DryBulb_same_doy_lag',
'annual_sin_1', 'annual_cos_1', 'annual_sin_2',
'annual_cos_2', 'annual_sin_3', 'annual_cos_3',
'weekly_sin_1', 'weekly_cos_1', 'weekly_sin_2',
'weekly_cos_2', 'weekly_sin_3', 'weekly_cos_3')
subset_columns_train = c(features, 'DEMAND')
subset_columns_validation = c(features, 'DEMAND', 'Zone', 'Datetime', 'load_ratio')
result_all_hours = list()
hour_counter = 1
for (h in hours){
train_df_sub = train_data[Zone == z & hour_of_day == h, ..subset_columns_train]
validation_df_sub = validation_data[Zone == z & hour_of_day == h, ..subset_columns_validation]
result = data.table(Zone=validation_df_sub$Zone, Datetime=validation_df_sub$Datetime, Round=iR, CVRound=i)
train_x <- as.matrix(train_df_sub[, ..features, drop=FALSE])
train_y <- as.matrix(train_df_sub[, c('DEMAND'), drop=FALSE])
validation_x <- as.matrix(validation_df_sub[, ..features, drop=FALSE])
result_all_quantiles = list()
quantile_counter = 1
for (tau in quantiles){
model = qrnn2.fit(x=train_x, y=train_y,
n.hidden=n.hidden, n.hidden2=n.hidden2,
tau=tau, Th=tanh,
iter.max=iter.max,
penalty=penalty)
result$Prediction = qrnn2.predict(model, x=validation_x) * validation_df_sub$load_ratio
result$DEMAND = validation_df_sub$DEMAND
result$loss = pinball_loss(tau, validation_df_sub$DEMAND, result$Prediction)
result$q = tau
result_all_quantiles[[quantile_counter]] = result
quantile_counter = quantile_counter + 1
}
result_all_hours[[hour_counter]] = rbindlist(result_all_quantiles)
hour_counter = hour_counter + 1
}
rbindlist(result_all_hours)
}
result_all[[counter]] = result_all_zones
counter = counter + 1
}
}
result_final = rbindlist(result_all)
output_file_name = paste("cv_output_", ts, ".csv", sep = "")
output_file = file.path(paste(cvpath, '/', output_file_name, sep=""))
fwrite(result_final, output_file)
parallel::stopCluster(cl)
@ -1,15 +0,0 @@
#!/bin/bash
path=benchmarks/GEFCom2017_D_Prob_MT_hourly
for i in `seq 1 40`;
do
echo "Parameter Set $i"
start=`date +%s`
echo 'Creating features...'
python $path/fnn/compute_features.py --submission fnn
echo 'Training and validation...'
Rscript $path/fnn/train_validate.R $i
end=`date +%s`
echo 'Running time '$((end-start))' seconds'
done
@ -1,40 +0,0 @@
## Download base image
FROM continuumio/anaconda3:5.3.0
ADD ./conda_dependencies.yml /tmp
WORKDIR /tmp
## Install basic packages
RUN apt-get update
RUN apt-get install -y --no-install-recommends \
wget \
zlib1g-dev \
libssl-dev \
libssh2-1-dev \
libcurl4-openssl-dev \
libreadline-gplv2-dev \
libncursesw5-dev \
libsqlite3-dev \
libgdbm-dev \
libc6-dev \
libbz2-dev \
libffi-dev \
bzip2 \
build-essential \
checkinstall \
ca-certificates \
curl \
lsb-release \
apt-utils \
python3-pip \
vim
## Create and activate conda environment
RUN conda update -y conda
RUN conda env create --file conda_dependencies.yml
RUN rm conda_dependencies.yml
RUN mkdir /Forecasting
WORKDIR /Forecasting
ENTRYPOINT ["/bin/bash"]
@ -1,201 +0,0 @@
# Implementation submission form
## Submission information
**Submission date**: 01/14/2018
**Benchmark name:** GEFCom2017_D_Prob_MT_hourly
**Submitter(s):** Dmitry Pechyoni
**Submitter(s) email:** dmpechyo@microsoft.com
**Submission name:** Quantile Random Forest
**Submission path:** benchmarks/GEFCom2017_D_Prob_MT_hourly/qrf
## Implementation description
### Modelling approach
In this submission, we implement a quantile random forest model using the `scikit-garden` package in Python.
### Feature engineering
The following features are used:
**Basic temporal features**: hour of day, day of week, day of month, time of the year (normalized to range [0,1]), week of the year, month of the year
**RecentLoad**: moving average of load values at the same day of
week and same hour of day, computed over a window of 4 weeks. We use 8 moving windows, the first one at weeks 10-13 before the forecasting week and the last one at weeks 17-20 before the forecasting week. Each window generates a separate RecentLoad feature.
**RecentDryBulb**: moving average of Dry Bulb values at the same day of
week and same hour of day, computed over a window of 4 weeks. We use 8 moving windows, the first one at weeks 9-12 before the forecasting week and the last one at weeks 16-19 before the forecasting week. Each window generates a separate RecentDryBulb feature.
**RecentDewPnt**: moving average of Dew Point values at the same day of
week and same hour of day, computed over a window of 4 weeks. We use 8 windows, the first one at weeks 9-12 before the forecasting week and the last one at weeks 16-19 before the forecasting week. Each window generates a separate RecentDewPnt feature.
**Daily Fourier Series features**: sine and cosine of the hour of the day, with harmonics 1 and 2. Altogether we generate 4 such features.
**Weekly Fourier Series features**: sine and cosine of the day of the week, with harmonics 1, 2 and 3. Altogether we generate 6 such features.
**Annual Fourier Series features**: sine and cosine of the day of the year, with harmonics 1, 2 and 3. Altogether we generate 6 such features.
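To make the Fourier features concrete, here is a minimal illustrative sketch of how such annual harmonics can be computed (not the repo's featurizer, which lives in the common feature engineering module):
```r
# Annual Fourier terms: one sine/cosine pair per harmonic of the day of year
annual_fourier <- function(datetimes, n_harmonics = 3) {
  doy <- as.numeric(strftime(datetimes, format = "%j"))  # day of year, 1-366
  theta <- 2 * pi * doy / 365.25
  out <- list()
  for (k in seq_len(n_harmonics)) {
    out[[paste0("annual_sin_", k)]] <- sin(k * theta)
    out[[paste0("annual_cos_", k)]] <- cos(k * theta)
  }
  as.data.frame(out)
}
```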
### Model tuning
We chose the hyperparameter values that minimize the average pinball loss over the validation folds.
We used 2 validation time frames, the first one in January-April 2015 and the second one in the same months of 2016. Each validation time frame was partitioned into 6 folds, each one spanning an entire month. The training set of each fold ends one or two months before the first date of the validation fold.
### Description of implementation scripts
* `compute_features.py`: Python script for computing features and generating feature files.
* `train_score.py`: Python script that trains Quantile Random Forest models and predicts on each round of test data.
* `train_score_vm.sh`: Bash script that runs `compute_features.py` and `train_score.py` five times to generate five submission files and measure model running time.
### Steps to reproduce results
1. Follow the instructions [here](#resource-deployment-instructions) to provision a Linux Data Science Virtual Machine and log into it.
2. Clone the Forecasting repo to the home directory of your machine
```bash
cd ~
git clone https://github.com/Microsoft/Forecasting.git
```
Use one of the following options to securely connect to the Git repo:
* [Personal Access Tokens](https://help.github.com/articles/creating-a-personal-access-token-for-the-command-line/)
For this method, the clone command becomes
```bash
git clone https://<username>:<personal access token>@github.com/Microsoft/Forecasting.git
```
* [Git Credential Managers](https://github.com/Microsoft/Git-Credential-Manager-for-Windows)
* [Authenticate with SSH](https://help.github.com/articles/connecting-to-github-with-ssh/)
3. Create a conda environment for running the scripts of data downloading, data preparation, and result evaluation.
To do this, first check whether conda is installed by running the command `conda -V`. If it is installed, you will see the conda version in the terminal. Otherwise, follow the instructions [here](https://conda.io/docs/user-guide/install/linux.html) to install conda.
Then, go to the `~/Forecasting` directory on the VM and create a conda environment named `tsperf` by running
```bash
cd ~/Forecasting
conda env create --file tsperf/benchmarking/conda_dependencies.yml
```
4. Download and extract data **on the VM**.
```bash
source activate tsperf
python tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/download_data.py
python tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/extract_data.py
```
5. Prepare Docker container for model training and predicting.
5.1 Make sure Docker is installed
You can check if Docker is installed on your VM by running
```bash
sudo docker -v
```
You will see the Docker version if Docker is installed. If not, you can install it by following the instructions [here](https://docs.docker.com/install/linux/docker-ce/ubuntu/). Note that if you want to execute Docker commands without sudo as a non-root user, you need to create a Unix group and add users to it by following the instructions [here](https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user).
5.2 Build a local Docker image
```bash
sudo docker build -t qrf_image benchmarks/GEFCom2017_D_Prob_MT_hourly/qrf
```
6. Train and predict **within Docker container**
6.1 Start a Docker container from the image
```bash
sudo docker run -it -v ~/Forecasting:/Forecasting --name qrf_container qrf_image
```
Note that option `-v ~/Forecasting:/Forecasting` mounts the `~/Forecasting` folder (the one you cloned) to the container so that you can access the code and data on your VM within the container.
6.2 Train and predict
```
source activate tsperf
cd /Forecasting
nohup bash benchmarks/GEFCom2017_D_Prob_MT_hourly/qrf/train_score_vm.sh >& out.txt &
```
The last command takes about 31 hours to complete. You can monitor its progress by checking the `out.txt` file. You can also disconnect from the VM during the run. After reconnecting to the VM, use the commands
```
sudo docker exec -it qrf_container /bin/bash
tail out.txt
```
to connect to the running container and check the status of the run.
After generating the forecast results, you can exit the Docker container with the command `exit`.
7. Model evaluation **on the VM**
```bash
source activate tsperf
cd ~/Forecasting
bash tsperf/benchmarking/evaluate qrf tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly
```
## Implementation resources
**Platform:** Azure Cloud
**Resource location:** East US region
**Hardware:** F72s v2 (72 vcpus, 144 GB memory) Ubuntu Linux VM
**Data storage:** Standard SSD
**Dockerfile:** [energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/qrf/Dockerfile](https://github.com/Microsoft/Forecasting/blob/master/energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/qrf/Dockerfile)
**Key packages/dependencies:**
* Python
- python==3.6
- numpy=1.15.1
- pandas=0.23.4
- xlrd=1.1.0
- urllib3=1.21.1
- scikit-garden=0.1.3
- joblib=0.12.5
## Resource deployment instructions
Please follow the instructions below to deploy the Linux DSVM.
- Create an Azure account and log into the [Azure portal](https://portal.azure.com/)
- Refer to the steps [here](https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/dsvm-ubuntu-intro) to deploy a *Data Science Virtual Machine for Linux (Ubuntu)*. Select *F72s_v2* as the virtual machine size, matching the hardware listed above.
## Implementation evaluation
**Quality:**
* Pinball loss run 1: 76.29
* Pinball loss run 2: 76.29
* Pinball loss run 3: 76.18
* Pinball loss run 4: 76.23
* Pinball loss run 5: 76.38
* Median Pinball loss: 76.29
**Time:**
* Run time 1: 20119 seconds
* Run time 2: 20489 seconds
* Run time 3: 20616 seconds
* Run time 4: 20297 seconds
* Run time 5: 20322 seconds
* Median run time: 20322 seconds (5.65 hours)
**Cost:**
The hourly cost of the F72s v2 Ubuntu Linux VM in the East US Azure region is 3.045 USD, based on the price at the submission date.
Thus, the total cost is 20322/3600 * 3.045 = 17.19 USD.
**Average relative improvement (in %) over GEFCom2017 benchmark model** (measured over the first run)
* Round 1: 16.89
* Round 2: 14.93
* Round 3: 12.34
* Round 4: 14.95
* Round 5: 16.19
* Round 6: -0.32
**Ranking in the qualifying round of GEFCom2017 competition**
3
@ -1,94 +0,0 @@
"""
This script uses
energy_load/GEFCom2017_D_Prob_MT_hourly/common/feature_engineering.py to
compute a list of features needed by the Quantile Regression model.
"""
import os
import sys
import getopt
import localpath
from tsperf.benchmarking.GEFCom2017_D_Prob_MT_hourly.feature_engineering import compute_features
SUBMISSIONS_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
DATA_DIR = os.path.join(SUBMISSIONS_DIR, "data")
print("Data directory used: {}".format(DATA_DIR))
OUTPUT_DIR = os.path.join(DATA_DIR, "features")
TRAIN_DATA_DIR = os.path.join(DATA_DIR, "train")
TEST_DATA_DIR = os.path.join(DATA_DIR, "test")
DF_CONFIG = {
"time_col_name": "Datetime",
"ts_id_col_names": "Zone",
"target_col_name": "DEMAND",
"frequency": "H",
"time_format": "%Y-%m-%d %H:%M:%S",
}
HOLIDAY_COLNAME = "Holiday"
# Feature configuration list used to specify the features to be computed by
# compute_features.
# Each feature configuration is a tuple in the format of (feature_name,
# featurizer_args)
# feature_name is used to determine the featurizer to use, see FEATURE_MAP in
# energy_load/GEFCom2017_D_Prob_MT_hourly/common/feature_engineering.py
# featurizer_args is a dictionary of arguments passed to the
# featurizer
feature_config_list = [
(
"temporal",
{
"feature_list": [
"hour_of_day",
"day_of_week",
"day_of_month",
"normalized_hour_of_year",
"week_of_year",
"month_of_year",
]
},
),
("annual_fourier", {"n_harmonics": 3}),
("weekly_fourier", {"n_harmonics": 3}),
("daily_fourier", {"n_harmonics": 2}),
("normalized_date", {}),
("normalized_datehour", {}),
("normalized_year", {}),
("day_type", {"holiday_col_name": HOLIDAY_COLNAME}),
("previous_year_load_lag", {"input_col_names": "DEMAND", "round_agg_result": True},),
("previous_year_temp_lag", {"input_col_names": ["DryBulb", "DewPnt"], "round_agg_result": True},),
(
"recent_load_lag",
{"input_col_names": "DEMAND", "start_week": 10, "window_size": 4, "agg_count": 8, "round_agg_result": True,},
),
(
"recent_temp_lag",
{
"input_col_names": ["DryBulb", "DewPnt"],
"start_week": 10,
"window_size": 4,
"agg_count": 8,
"round_agg_result": True,
},
),
]
if __name__ == "__main__":
opts, args = getopt.getopt(sys.argv[1:], "", ["submission="])
for opt, arg in opts:
if opt == "--submission":
submission_folder = arg
output_data_dir = os.path.join(SUBMISSIONS_DIR, submission_folder, "data")
if not os.path.isdir(output_data_dir):
os.mkdir(output_data_dir)
OUTPUT_DIR = os.path.join(output_data_dir, "features")
if not os.path.isdir(OUTPUT_DIR):
os.mkdir(OUTPUT_DIR)
compute_features(
TRAIN_DATA_DIR, TEST_DATA_DIR, OUTPUT_DIR, DF_CONFIG, feature_config_list, filter_by_month=False,
)
@ -1,12 +0,0 @@
name: tsperf
channels:
- conda-forge
dependencies:
- python=3.6
- numpy=1.15.1
- pandas=0.23.4
- xlrd=1.1.0
- urllib3=1.21.1
- scikit-garden=0.1.3
- joblib=0.12.5
- scikit-learn=0.20.3
@ -1,300 +0,0 @@
# this file replaces the quantile/ensemble.py file of the scikit-garden package
# unlike the original code, this version makes use of all available cores when scoring
# also unlike the original, the predict() function can generate predictions for multiple quantiles
import numpy as np
from numpy import ma
from sklearn.ensemble.forest import ForestRegressor
from sklearn.utils import check_array
from sklearn.utils import check_random_state
from sklearn.utils import check_X_y
from joblib import Parallel, delayed
from skgarden.quantile.tree import DecisionTreeQuantileRegressor
from skgarden.quantile.ensemble import generate_sample_indices
from ensemble_parallel_utils import weighted_percentile_vectorized
class BaseForestQuantileRegressor(ForestRegressor):
"""Training and scoring of Quantile Regression Random Forest
Training code is the same as in scikit-garden package. Scoring code uses all cores, unlike the original
code in scikit-garden.
Attributes:
y_train_ : array-like, shape=(n_samples,)
Cache the target values at fit time.
y_weights_ : array-like, shape=(n_estimators, n_samples)
y_weights_[i, j] is the weight given to sample ``j`` while
estimator ``i`` is fit. If bootstrap is set to True, this
reduces to a 2-D array of ones.
y_train_leaves_ : array-like, shape=(n_estimators, n_samples)
y_train_leaves_[i, j] provides the leaf node that y_train_[i]
ends up when estimator j is fit. If y_train_[i] is given
a weight of zero when estimator j is fit, then the value is -1.
"""
def fit(self, X, y):
"""Builds a forest from the training set (X, y).
Args:
X : array-like or sparse matrix, shape = [n_samples, n_features]
The training input samples. Internally, it will be converted to
``dtype=np.float32`` and if a sparse matrix is provided
to a sparse ``csc_matrix``.
y : array-like, shape = [n_samples] or [n_samples, n_outputs]
The target values (class labels) as integers or strings.
sample_weight : array-like, shape = [n_samples] or None
Sample weights. If None, then samples are equally weighted. Splits
that would create child nodes with net zero or negative weight are
ignored while searching for a split in each node. Splits are also
ignored if they would result in any single class carrying a
negative weight in either child node.
check_input : boolean, (default=True)
Allow to bypass several input checking.
Don't use this parameter unless you know what you do.
X_idx_sorted : array-like, shape = [n_samples, n_features], optional
The indexes of the sorted training input samples. If many tree
are grown on the same dataset, this allows the ordering to be
cached between trees. If None, the data will be sorted here.
Don't use this parameter unless you know what to do.
Returns:
self : object
Returns self.
"""
# apply method requires X to be of dtype np.float32
X, y = check_X_y(X, y, accept_sparse="csc", dtype=np.float32, multi_output=False)
super(BaseForestQuantileRegressor, self).fit(X, y)
self.y_train_ = y
self.y_train_leaves_ = -np.ones((self.n_estimators, len(y)), dtype=np.int32)
self.y_weights_ = np.zeros_like((self.y_train_leaves_), dtype=np.float32)
for i, est in enumerate(self.estimators_):
if self.bootstrap:
bootstrap_indices = generate_sample_indices(est.random_state, len(y))
else:
bootstrap_indices = np.arange(len(y))
est_weights = np.bincount(bootstrap_indices, minlength=len(y))
y_train_leaves = est.y_train_leaves_
for curr_leaf in np.unique(y_train_leaves):
y_ind = y_train_leaves == curr_leaf
self.y_weights_[i, y_ind] = est_weights[y_ind] / np.sum(est_weights[y_ind])
self.y_train_leaves_[i, bootstrap_indices] = y_train_leaves[bootstrap_indices]
return self
def _compute_percentiles(self, x_leaf, quantiles, sorter):
mask = self.y_train_leaves_ != np.expand_dims(x_leaf, 1)
x_weights = ma.masked_array(self.y_weights_, mask)
weights = x_weights.sum(axis=0)
return weighted_percentile_vectorized(self.y_train_, quantiles, weights, sorter)
def predict(self, X, quantiles=None):
"""Predict regression value for X.
Args:
X : array-like or sparse matrix of shape = [n_samples, n_features]
The input samples. Internally, it will be converted to
``dtype=np.float32`` and if a sparse matrix is provided
to a sparse ``csr_matrix``.
quantiles : list of ints, optional
list of values ranging from 0 to 100. By default, the mean is returned.
check_input : boolean, (default=True)
Allow to bypass several input checking.
Don't use this parameter unless you know what you do.
Returns:
y : array of shape = [n_samples]
If quantile is set to None, then return E(Y | X). Else return
y such that F(Y=y | x) = quantile.
"""
# apply method requires X to be of dtype np.float32
X = check_array(X, dtype=np.float32, accept_sparse="csc")
if quantiles is None:
return super(BaseForestQuantileRegressor, self).predict(X)
sorter = np.argsort(self.y_train_)
X_leaves = self.apply(X)
with Parallel(n_jobs=-1, backend="multiprocessing", batch_size=10) as p:
percentiles = p(delayed(self._compute_percentiles)(x_leaf, quantiles, sorter) for x_leaf in X_leaves)
return np.array(percentiles)
class RandomForestQuantileRegressor(BaseForestQuantileRegressor):
"""A random forest regressor that provides quantile estimates.
A random forest is a meta estimator that fits a number of classifying
decision trees on various sub-samples of the dataset and use averaging
to improve the predictive accuracy and control over-fitting.
The sub-sample size is always the same as the original
input sample size but the samples are drawn with replacement if
`bootstrap=True` (default).
References:
Nicolai Meinshausen, Quantile Regression Forests
http://www.jmlr.org/papers/volume7/meinshausen06a/meinshausen06a.pdf
Attributes:
estimators_ : list of DecisionTreeQuantileRegressor
The collection of fitted sub-estimators.
feature_importances_ : array of shape = [n_features]
The feature importances (the higher, the more important the feature).
n_features_ : int
The number of features when ``fit`` is performed.
n_outputs_ : int
The number of outputs when ``fit`` is performed.
oob_score_ : float
Score of the training dataset obtained using an out-of-bag estimate.
oob_prediction_ : array of shape = [n_samples]
Prediction computed with out-of-bag estimate on the training set.
"""
def __init__(
self,
n_estimators=10,
criterion="mse",
max_depth=None,
min_samples_split=2,
min_samples_leaf=1,
min_weight_fraction_leaf=0.0,
max_features="auto",
max_leaf_nodes=None,
bootstrap=True,
oob_score=False,
n_jobs=1,
random_state=None,
verbose=0,
warm_start=False,
):
"""Initialize RandomForestQuantileRegressor class
Args:
n_estimators : integer, optional (default=10)
The number of trees in the forest.
criterion : string, optional (default="mse")
The function to measure the quality of a split. Supported criteria
are "mse" for the mean squared error, which is equal to variance
reduction as feature selection criterion, and "mae" for the mean
absolute error.
.. versionadded:: 0.18
Mean Absolute Error (MAE) criterion.
max_features : int, float, string or None, optional (default="auto")
The number of features to consider when looking for the best split:
- If int, then consider `max_features` features at each split.
- If float, then `max_features` is a percentage and
`int(max_features * n_features)` features are considered at each
split.
- If "auto", then `max_features=n_features`.
- If "sqrt", then `max_features=sqrt(n_features)`.
- If "log2", then `max_features=log2(n_features)`.
- If None, then `max_features=n_features`.
Note: the search for a split does not stop until at least one
valid partition of the node samples is found, even if it requires to
effectively inspect more than ``max_features`` features.
max_depth : integer or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until
all leaves are pure or until all leaves contain less than
min_samples_split samples.
min_samples_split : int, float, optional (default=2)
The minimum number of samples required to split an internal node:
- If int, then consider `min_samples_split` as the minimum number.
- If float, then `min_samples_split` is a percentage and
`ceil(min_samples_split * n_samples)` are the minimum
number of samples for each split.
.. versionchanged:: 0.18
Added float values for percentages.
min_samples_leaf : int, float, optional (default=1)
The minimum number of samples required to be at a leaf node:
- If int, then consider `min_samples_leaf` as the minimum number.
- If float, then `min_samples_leaf` is a percentage and
`ceil(min_samples_leaf * n_samples)` are the minimum
number of samples for each node.
.. versionchanged:: 0.18
Added float values for percentages.
min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the sum total of weights (of all
the input samples) required to be at a leaf node. Samples have
equal weight when sample_weight is not provided.
max_leaf_nodes : int or None, optional (default=None)
Grow trees with ``max_leaf_nodes`` in best-first fashion.
Best nodes are defined as relative reduction in impurity.
If None then unlimited number of leaf nodes.
bootstrap : boolean, optional (default=True)
Whether bootstrap samples are used when building trees.
oob_score : bool, optional (default=False)
whether to use out-of-bag samples to estimate
the R^2 on unseen data.
n_jobs : integer, optional (default=1)
The number of jobs to run in parallel for both `fit` and `predict`.
If -1, then the number of jobs is set to the number of cores.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used
by `np.random`.
verbose : int, optional (default=0)
Controls the verbosity of the tree building process.
warm_start : bool, optional (default=False)
When set to ``True``, reuse the solution of the previous call to fit
and add more estimators to the ensemble, otherwise, just fit a whole
new forest.
"""
super(RandomForestQuantileRegressor, self).__init__(
base_estimator=DecisionTreeQuantileRegressor(),
n_estimators=n_estimators,
estimator_params=(
"criterion",
"max_depth",
"min_samples_split",
"min_samples_leaf",
"min_weight_fraction_leaf",
"max_features",
"max_leaf_nodes",
"random_state",
),
bootstrap=bootstrap,
oob_score=oob_score,
n_jobs=n_jobs,
random_state=random_state,
verbose=verbose,
warm_start=warm_start,
)
self.criterion = criterion
self.max_depth = max_depth
self.min_samples_split = min_samples_split
self.min_samples_leaf = min_samples_leaf
self.min_weight_fraction_leaf = min_weight_fraction_leaf
self.max_features = max_features
self.max_leaf_nodes = max_leaf_nodes
@ -1,85 +0,0 @@
# this file replaces the quantile/utils.py file in the scikit-garden package
# the new weighted_percentile_vectorized() function, unlike the original weighted_percentile() function, can compute percentiles for multiple quantiles
import numpy as np
def weighted_percentile_vectorized(a, quantiles, weights=None, sorter=None):
"""Returns the weighted percentile of a at q given weights.
Note that weighted_percentile(a, q) is not equivalent to
np.percentile(a, q). This is because in np.percentile
sorted(a)[i] is assumed to be at quantile 0.0, while here we assume
sorted(a)[i] is given a weight of 1.0 / len(a), hence it is at the
1.0 / len(a)th quantile.
References:
https://en.wikipedia.org/wiki/Percentile#The_Weighted_Percentile_method
Args:
a: array-like, shape=(n_samples,)
samples at which the quantile is computed.
quantiles: array of ints
list of quantiles.
weights: array-like, shape=(n_samples,)
weights[i] is the weight given to point a[i] while computing the
quantile. If weights[i] is zero, a[i] is simply ignored during the
percentile computation.
sorter: array-like, shape=(n_samples,)
If provided, assume that a[sorter] is sorted.
Returns:
percentiles: array of floats
Weighted percentile of a at each of quantiles.
"""
if weights is None:
weights = np.ones_like(a)
a = np.asarray(a, dtype=np.float32)
weights = np.asarray(weights, dtype=np.float32)
if len(a) != len(weights):
raise ValueError("a and weights should have the same length.")
if sorter is not None:
a = a[sorter]
weights = weights[sorter]
nz = weights != 0
a = a[nz]
weights = weights[nz]
if sorter is None:
sorted_indices = np.argsort(a)
sorted_a = a[sorted_indices]
sorted_weights = weights[sorted_indices]
else:
sorted_a = a
sorted_weights = weights
# Step 1
sorted_cum_weights = np.cumsum(sorted_weights)
total = sorted_cum_weights[-1]
# Step 2
partial_sum = 100.0 / total * (sorted_cum_weights - sorted_weights / 2.0)
percentiles = np.zeros_like(quantiles)
for i, q in enumerate(quantiles):
if q > 100 or q < 0:
raise ValueError("q should be in-between 0 and 100, " "got %d" % q)
start = np.searchsorted(partial_sum, q) - 1
if start == len(sorted_cum_weights) - 1:
percentiles[i] = sorted_a[-1]
elif start == -1:
percentiles[i] = sorted_a[0]
else:
# Step 3.
fraction = (q - partial_sum[start]) / (partial_sum[start + 1] - partial_sum[start])
percentiles[i] = sorted_a[start] + fraction * (sorted_a[start + 1] - sorted_a[start])
return percentiles
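# Example (illustrative): with equal (default) weights, the weighted median of
# [1.0, 2.0, 3.0] is weighted_percentile_vectorized(np.array([1., 2., 3.]),
# np.array([50.0])), which returns array([2.]).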
@ -1,12 +0,0 @@
"""
This script inserts the TSPerf directory into sys.path, so that scripts can import
all the modules in TSPerf. Each submission folder needs its own localpath.py file.
"""
import os, sys
_CURR_DIR = os.path.dirname(os.path.abspath(__file__))
TSPERF_DIR = os.path.dirname(os.path.dirname(os.path.dirname(_CURR_DIR)))
if TSPERF_DIR not in sys.path:
sys.path.insert(0, TSPERF_DIR)
@ -1,80 +0,0 @@
# This script performs training and scoring with Quantile Random Forest model
from os.path import join
import argparse
import pandas as pd
from numpy import arange
from ensemble_parallel import RandomForestQuantileRegressor
# get seed value
parser = argparse.ArgumentParser()
parser.add_argument(
"--data-folder", type=str, dest="data_folder", help="data folder mounting point",
)
parser.add_argument(
"--output-folder", type=str, dest="output_folder", help="output folder mounting point",
)
parser.add_argument("--seed", type=int, dest="seed", help="random seed")
args = parser.parse_args()
# initialize location of input and output files
data_dir = join(args.data_folder, "features")
train_dir = join(data_dir, "train")
test_dir = join(data_dir, "test")
output_file = join(args.output_folder, "submission_seed_{}.csv".format(args.seed))
# do 6 rounds of forecasting, at each round output 9 quantiles
n_rounds = 6
quantiles = arange(0.1, 1, 0.1)
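# arange excludes the endpoint, so this yields the nine quantiles 0.1, 0.2, ..., 0.9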
# schema of the output
y_test = pd.DataFrame(columns=["Datetime", "Zone", "Round", "q", "Prediction"])
for i in range(1, n_rounds + 1):
print("Round {}".format(i))
# read training and test files for the current round
train_file = join(train_dir, "train_round_{}.csv".format(i))
train_df = pd.read_csv(train_file)
test_file = join(test_dir, "test_round_{}.csv".format(i))
test_df = pd.read_csv(test_file)
# train and test for each hour separately
for hour in arange(0, 24):
print(hour)
# select training sets
train_df_hour = train_df[(train_df["hour_of_day"] == hour)]
# create one-hot encoding of Zone
# (scikit-garden works only with numerical columns)
train_df_hour = pd.get_dummies(train_df_hour, columns=["Zone"])
# remove columns that are not useful (Datetime) or not
# available in the test set (DEMAND, DryBulb, DewPnt)
X_train = train_df_hour.drop(columns=["Datetime", "DEMAND", "DryBulb", "DewPnt"]).values
y_train = train_df_hour["DEMAND"].values
# train a model
rfqr = RandomForestQuantileRegressor(
random_state=args.seed, n_jobs=-1, n_estimators=1000, max_features="sqrt", max_depth=12,
)
rfqr.fit(X_train, y_train)
# select test set
test_df_hour = test_df[test_df["hour_of_day"] == hour]
y_test_baseline = test_df_hour[["Datetime", "Zone"]]
test_df_cat = pd.get_dummies(test_df_hour, columns=["Zone"])
X_test = test_df_cat.drop(columns=["Datetime"]).values
# generate forecast for each quantile
percentiles = rfqr.predict(X_test, quantiles * 100)
for j, quantile in enumerate(quantiles):
y_test_round_quantile = y_test_baseline.copy(deep=True)
y_test_round_quantile["Round"] = i
y_test_round_quantile["q"] = quantile
y_test_round_quantile["Prediction"] = percentiles[:, j]
y_test = pd.concat([y_test, y_test_round_quantile])
# store forecasts
y_test.to_csv(output_file, index=False)
@ -1,15 +0,0 @@
path=benchmarks/GEFCom2017_D_Prob_MT_hourly
for i in `seq 1 5`;
do
echo "Run $i"
start=`date +%s`
echo 'Creating features...'
python $path/qrf/compute_features.py --submission qrf
echo 'Training and predicting...'
python $path/qrf/train_score.py --data-folder $path/qrf/data --output-folder $path/qrf --seed $i
end=`date +%s`
echo 'Running time '$((end-start))' seconds'
done
echo 'Training and scoring are completed'
@ -1,51 +0,0 @@
## Download base image
FROM ubuntu:16.04
WORKDIR /tmp
## Install basic packages
RUN apt-get update && apt-get install -y --no-install-recommends \
wget \
zlib1g-dev \
libssl-dev \
libssh2-1-dev \
libcurl4-openssl-dev \
libreadline-gplv2-dev \
libncursesw5-dev \
libsqlite3-dev \
tk-dev \
libgdbm-dev \
libc6-dev \
libbz2-dev \
libffi-dev \
bzip2 \
build-essential \
checkinstall \
ca-certificates \
curl \
lsb-release \
apt-utils \
python3-pip \
vim
## Install R
ENV R_BASE_VERSION 3.5.1
RUN sh -c 'echo "deb http://cloud.r-project.org/bin/linux/ubuntu xenial-cran35/" >> /etc/apt/sources.list' \
&& gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9 \
&& gpg -a --export E084DAB9 | apt-key add -
RUN apt-get update && apt-get install -y --no-install-recommends r-base=${R_BASE_VERSION}-* \
&& echo 'options(repos = c(CRAN = "https://cloud.r-project.org"))' >> /etc/R/Rprofile.site
## Mount R dependency file into the docker container and install dependencies
# Install prerequisites of 'forecast' package
RUN apt-get update && apt-get install -y \
gfortran \
libblas-dev \
liblapack-dev
# Use a MRAN snapshot URL to download packages archived on a specific date
RUN echo 'options(repos = list(CRAN = "http://mran.revolutionanalytics.com/snapshot/2018-08-27/"))' >> /etc/R/Rprofile.site
ADD ./install_R_dependencies.r /tmp
RUN Rscript install_R_dependencies.r
RUN rm ./install_R_dependencies.r
WORKDIR /
ENTRYPOINT ["/bin/bash"]
@ -1,195 +0,0 @@
# Implementation submission form
## Submission details
**Submission date**: 09/01/2018
**Benchmark name:** OrangeJuice_Pt_3Weeks_Weekly
**Submitter(s):** Chenhui Hu
**Submitter(s) email:** chenhhu@microsoft.com
**Submission name:** ARIMA
**Submission path:** retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/ARIMA
## Implementation description
### Modelling approach
In this submission, we implement the ARIMA model for the retail sales forecasting benchmark OrangeJuice_Pt_3Weeks_Weekly using the R package
`forecast`.
### Feature engineering
Only the weekly sales of each orange juice have been used in the implementation of the forecast method.
### Hyperparameter tuning
Default hyperparameters of the forecasting algorithm are used. Additionally, the frequency of the weekly sales time series is set to 52,
since there are approximately 52 weeks in a year.
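For illustration, a minimal sketch of this setup using the `forecast` package (assuming `sales` is a numeric vector of weekly sales for a single series; this is not the repo's `train_score.r`):
```r
library(forecast)
y <- ts(sales, frequency = 52)  # treat the series as weekly with yearly seasonality
fit <- auto.arima(y)            # default hyperparameters
fc <- forecast(fit, h = 3)      # e.g. a 3-week forecast horizon
```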
### Description of implementation scripts
* `train_score.r`: R script that trains the models and generates forecasts
* `model_selection.r` (optional): R script that selects the best ARIMA model for each time series
* `arima.Rmd` (optional): R markdown that trains the models and visualizes the results
* `arima.nb.html` (optional): Html file associated with the R markdown file
### Steps to reproduce results
0. Follow the instructions [here](#resource-deployment-instructions) to provision a Linux virtual machine and log into the provisioned
VM.
1. Clone the Forecasting repo to the home directory of your machine
```bash
cd ~
git clone https://github.com/Microsoft/Forecasting.git
```
Use one of the following options to securely connect to the Git repo:
* [Personal Access Tokens](https://help.github.com/articles/creating-a-personal-access-token-for-the-command-line/)
For this method, the clone command becomes
```bash
git clone https://<username>:<personal access token>@github.com/Microsoft/Forecasting.git
```
* [Git Credential Managers](https://github.com/Microsoft/Git-Credential-Manager-for-Windows)
* [Authenticate with SSH](https://help.github.com/articles/connecting-to-github-with-ssh/)
2. Create a conda environment for running the scripts of data downloading, data preparation, and result evaluation. To do this, you need
to check if conda has been installed by running the command `conda -V`. If it is installed, you will see the conda version in the terminal. Otherwise, please follow the instructions [here](https://conda.io/docs/user-guide/install/linux.html) to install conda. Then, you can go to the `~/Forecasting` directory in the VM and create a conda environment named `tsperf` by
```bash
conda env create --file ./common/conda_dependencies.yml
```
This will create a conda environment with the Python and R packages listed in `conda_dependencies.yml` being installed. The conda
environment name is also defined in the yml file.
3. Activate the conda environment and download the Orange Juice dataset. Use command `source activate tsperf` to activate the conda environment. Then, download the Orange Juice dataset by running the following command from `~/Forecasting` directory
```bash
Rscript ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/common/download_data.r
```
This will create a data directory `./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/data` and store the dataset in this directory. The dataset has two csv files - `yx.csv` and `storedemo.csv` which contain the sales information and store demographic information, respectively.
4. From `~/Forecasting` directory, run the following command to generate the training data and testing data for each forecast period:
```bash
python ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/common/serve_folds.py --test --save
```
This will generate 12 csv files named `train_round_#.csv` and 12 csv files named `test_round_#.csv` in two subfolders `/train` and
`/test` under the data directory, respectively. After running the above command, you can deactivate the conda environment by running
`source deactivate`.
5. Make sure Docker is installed
You can check if Docker is installed on your VM by running
```bash
sudo docker -v
```
You will see the Docker version if Docker is installed. If not, you can install it by following the instructions [here](https://docs.docker.com/install/linux/docker-ce/ubuntu/). Note that if you want to execute Docker commands without sudo as a non-root user, you need to create a Unix group and add users to it by following the instructions [here](https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user).
6. Build a local Docker image by running the following command from `~/Forecasting` directory
```bash
sudo docker build -t baseline_image:v1 ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/ARIMA
```
7. Choose a name for a new Docker container (e.g. arima_container) and create it using command:
```bash
sudo docker run -it -v ~/Forecasting:/Forecasting --name arima_container baseline_image:v1
```
Note that the option `-v ~/Forecasting:/Forecasting` allows you to mount the `~/Forecasting` folder (the one you cloned) into the container so that you have
access to the source code inside the container.
8. Train the model and make predictions from `/Forecasting` folder by running
```bash
cd /Forecasting
source ./common/train_score_vm ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/ARIMA R
```
This will generate 5 `submission_seed_<seed number>.csv` files in the submission directory, where \<seed number\>
is between 1 and 5. This command will also output 5 running times of train_score.r. The median of the times
reported in rows starting with 'real' should be compared against the wall-clock time declared in the benchmark
submission. After generating the forecast results, you can exit the Docker container with the command `exit`.
9. Activate conda environment again by `source activate tsperf`. Then, evaluate the benchmark quality by running
```bash
source ./common/evaluate ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/ARIMA ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly
```
This command will output 5 benchmark quality values (MAPEs). Their median should be compared against the
benchmark quality declared in the benchmark submission.
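Here, MAPE is the mean absolute percentage error between predicted and true sales. A minimal sketch of the computation (illustrative code, not the benchmark's evaluation script):
```python
import numpy as np

def mape(prediction, truth):
    """Mean absolute percentage error, in percent."""
    prediction, truth = np.asarray(prediction, float), np.asarray(truth, float)
    return np.mean(np.abs((prediction - truth) / truth)) * 100

print(mape([110, 95], [100, 100]))  # 7.5
```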
## Implementation resources
**Platform:** Azure Cloud
**Resource location:** East US
**Hardware:** Standard D2s v3 (2 vcpus, 8 GB memory, 16 GB temporary storage) Ubuntu Linux VM
**Data storage:** Premium SSD
**Dockerfile:** [retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/ARIMA/Dockerfile](https://github.com/Microsoft/Forecasting/blob/master/retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/ARIMA/Dockerfile)
**Key packages/dependencies:**
* R
- r-base==3.5.1
- forecast==8.1
## Resource deployment instructions
We use Azure Linux VM to develop the baseline methods. Please follow the instructions below to deploy the resource.
* Azure Linux VM deployment
- Create an Azure account and log into the [Azure portal](https://portal.azure.com/)
- Refer to the steps [here](https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/dsvm-ubuntu-intro) to deploy a Data
Science Virtual Machine for Linux (Ubuntu). Select *D2s_v3* as the virtual machine size.
## Implementation evaluation
**Quality:**
*MAPE run 1: 70.80%*
*MAPE run 2: 70.80%*
*MAPE run 3: 70.80%*
*MAPE run 4: 70.80%*
*MAPE run 5: 70.80%*
*median MAPE: 70.80%*
**Time:**
*run time 1: 259.60 seconds*
*run time 2: 266.96 seconds*
*run time 3: 265.94 seconds*
*run time 4: 264.22 seconds*
*run time 5: 267.89 seconds*
*median run time: 265.94 seconds*
**Cost:** The hourly cost of the D2s v3 Ubuntu Linux VM in East US Azure region is 0.096 USD, based on the price at the submission date. Thus, the total cost is 265.94/3600 $\times$ 0.096 = $0.0071.
Note that there is no randomness in the forecasts obtained by the above method. Thus, quality values do not change over
different runs.

View file

@ -1,194 +0,0 @@
---
title: "Auto ARIMA Method for Retail Forecasting Benchmark - OrangeJuice_Pt_3Weeks_Weekly"
output: html_notebook
---
```{r}
# Import packages
library(dplyr)
library(tidyr)
library(forecast)
library(MLmetrics)
# Define parameters
NUM_ROUNDS <- 12
TRAIN_START_WEEK <- 40
TRAIN_END_WEEK_LIST <- seq(135, 157, 2)
TEST_START_WEEK_LIST <- seq(137, 159, 2)
TEST_END_WEEK_LIST <- seq(138, 160, 2)
# Get the path of the current script and paths of data directories
SUBMISSION_DIR <- dirname(rstudioapi::getSourceEditorContext()$path)
TRAIN_DIR <- file.path(dirname(dirname(SUBMISSION_DIR)), 'data', 'train')
TEST_DIR <- file.path(dirname(dirname(SUBMISSION_DIR)), 'data', 'test')
```
```{r}
#### Test auto.arima method on a subset of the data ####
# Import training data
r <- 1
train_df <- read.csv(file.path(TRAIN_DIR, paste0('train_round_', as.character(r), '.csv')))
#head(train_df)
# Create a dataframe to hold all necessary data
store_list <- unique(train_df$store)
brand_list <- unique(train_df$brand)
week_list <- TRAIN_START_WEEK:TRAIN_END_WEEK_LIST[r]
data_grid <- expand.grid(store = store_list,
brand = brand_list,
week = week_list)
train_filled <- merge(data_grid, train_df,
by = c('store', 'brand', 'week'),
all.x = TRUE)
train_filled <- train_filled[, c('store','brand','week','logmove')]
head(train_filled)
print('Number of rows with missing values:')
print(sum(!complete.cases(train_filled)))
# Fill missing logmove
train_filled <-
train_filled %>%
group_by(store, brand) %>%
arrange(week) %>%
fill(logmove) %>%
fill(logmove, .direction = 'up')
head(train_filled)
print('Number of rows with missing values after filling:')
print(sum(!complete.cases(train_filled)))
# Auto ARIMA method
train_sub <- filter(train_filled, store=='2', brand=='1')
train_ts <- ts(train_sub[c('logmove')], frequency = 52)
horizon <- TEST_END_WEEK_LIST[r] - TRAIN_END_WEEK_LIST[r]
fit_arima <- auto.arima(train_ts)
pred_arima <- forecast(fit_arima, h=horizon)
print('Auto ARIMA forecasts')
pred_arima$mean[2:horizon]
plot(pred_arima, main='Auto ARIMA')
```
```{r}
#### Implement auto.arima method on all the data ####
basic_method <- 'arima'
pred_basic_all <- list()
print(paste0('Using ', basic_method))
# Basic methods
apply_basic_methods <- function(train_sub, method, r) {
# Trains a basic model to forecast sales of each store-brand in a certain round.
#
# Args:
# train_sub (Dataframe): Training data of a certain store-brand
# method (String): Name of the basic method which can be 'naive', 'snaive',
# 'meanf', 'ets', or 'arima'
# r (Integer): Index of the forecast round
#
# Returns:
# pred_basic_df (Dataframe): Predicted sales of the current store-brand
cur_store <- train_sub$store[1]
cur_brand <- train_sub$brand[1]
train_ts <- ts(train_sub[c('logmove')], frequency = 52)
if (method == 'naive'){
# Naive method
pred_basic <- naive(train_ts, h=pred_horizon)
} else if (method == 'snaive'){
# Seasonal naive
pred_basic <- snaive(train_ts, h=pred_horizon)
} else if (method == 'meanf'){
# Mean forecast
pred_basic <- meanf(train_ts, h=pred_horizon)
} else if (method == 'ets') {
# ETS
fit_ets <- ets(train_ts)
pred_basic <- forecast(fit_ets, h=pred_horizon)
} else if (method == 'arima'){
# Auto ARIMA
fit_arima <- auto.arima(train_ts)
pred_basic <- forecast(fit_arima, h=pred_horizon)
}
pred_basic_df <- data.frame(round = rep(r, pred_steps),
store = rep(cur_store, pred_steps),
brand = rep(cur_brand, pred_steps),
week = pred_weeks,
weeks_ahead = pred_weeks_ahead,
prediction = round(exp(pred_basic$mean[2:pred_horizon])))
}
for (r in 1:NUM_ROUNDS) {
print(paste0('---- Round ', r, ' ----'))
pred_horizon <- TEST_END_WEEK_LIST[r] - TRAIN_END_WEEK_LIST[r]
pred_steps <- TEST_END_WEEK_LIST[r] - TEST_START_WEEK_LIST[r] + 1
pred_weeks <- TEST_START_WEEK_LIST[r]:TEST_END_WEEK_LIST[r]
pred_weeks_ahead <- pred_weeks - TRAIN_END_WEEK_LIST[r]
# Import training data
train_df <- read.csv(file.path(TRAIN_DIR, paste0('train_round_', as.character(r), '.csv')))
# Create a dataframe to hold all necessary data
store_list <- unique(train_df$store)
brand_list <- unique(train_df$brand)
week_list <- TRAIN_START_WEEK:TRAIN_END_WEEK_LIST[r]
data_grid <- expand.grid(store = store_list,
brand = brand_list,
week = week_list)
train_filled <- merge(data_grid, train_df,
by = c('store', 'brand', 'week'),
all.x = TRUE)
train_filled <- train_filled[, c('store','brand','week','logmove')]
head(train_filled)
print('Number of rows with missing values:')
print(sum(!complete.cases(train_filled)))
# Fill missing logmove
train_filled <-
train_filled %>%
group_by(store, brand) %>%
arrange(week) %>%
fill(logmove) %>%
fill(logmove, .direction = 'up')
head(train_filled)
print('Number of rows with missing values after filling:')
print(sum(!complete.cases(train_filled)))
# Apply basic method
pred_basic_all[[paste0('Round', r)]] <-
train_filled %>%
group_by(store, brand) %>%
do(apply_basic_methods(., basic_method, r))
}
# Combine and save forecast results
pred_basic_all <- do.call(rbind, pred_basic_all)
write.csv(pred_basic_all, file.path(SUBMISSION_DIR, 'submission.csv'), row.names = FALSE)
# Evaluate forecast performance
# Get the true value dataframe
true_sales_all <- list()
for (r in 1:NUM_ROUNDS){
test_df <- read.csv(file.path(TEST_DIR, paste0('test_round_', as.character(r), '.csv')))
true_sales_all[[paste0('Round', r)]] <-
data.frame(round = rep(r, dim(test_df)[1]),
store = test_df$store,
brand = test_df$brand,
week = test_df$week,
truth = round(exp(test_df$logmove)))
}
true_sales_all <- do.call(rbind, true_sales_all)
# Merge prediction and true sales
merged_df <- merge(pred_basic_all, true_sales_all,
by = c('round', 'store', 'brand', 'week'),
all.y = TRUE)
# Compute MAPE values
print('MAPE')
print(MAPE(merged_df$prediction, merged_df$truth)*100)
print('MedianAPE')
print(MedianAPE(merged_df$prediction, merged_df$truth)*100)
```

File diff hidden because one or more lines are too long

View file

@ -1,11 +0,0 @@
pkgs <- c(
'optparse',
'dplyr',
'tidyr',
'forecast',
'MLmetrics'
)
install.packages(pkgs)

View file

@ -1,87 +0,0 @@
#!/usr/bin/Rscript
#
# Select the best ARIMA model for Retail Forecasting Benchmark - OrangeJuice_Pt_3Weeks_Weekly
#
# This script can be executed with the following command from TSPerf directory
# Rscript <submission dir>/model_selection.r
# It outputs a csv file containing the orders of the best ARIMA models selected by auto.arima
# function in R package forecast.
# Import packages
library(dplyr)
library(tidyr)
library(forecast)
# Define parameters
NUM_ROUNDS <- 12
TRAIN_START_WEEK <- 40
TRAIN_END_WEEK_LIST <- seq(135, 157, 2)
# Paths of the training data and submission folder
DATA_DIR <- './retail_sales/OrangeJuice_Pt_3Weeks_Weekly/data'
TRAIN_DIR <- file.path(DATA_DIR, 'train')
SUBMISSION_DIR <- file.path(dirname(DATA_DIR), 'submissions', 'ARIMA')
#### Select ARIMA model for every store-brand based on 1st-round training data ####
print('Selecting ARIMA models')
arima_model_all <- list()
select_arima_model <- function(train_sub) {
# Selects the best ARIMA model for the time series of each store-brand.
#
# Args:
# train_sub (Dataframe): Training data of a certain store-brand
#
# Returns:
# arima_order_df (Dataframe): Configuration of the best ARIMA model
cur_store <- train_sub$store[1]
cur_brand <- train_sub$brand[1]
train_ts <- ts(train_sub[c('logmove')], frequency = 52)
fit_arima <- auto.arima(train_ts)
arima_order <- arimaorder(fit_arima)
arima_order_df <- data.frame(store = cur_store,
brand = cur_brand,
seasonal = length(arima_order) > 3,
p = arima_order['p'],
d = arima_order['d'],
q = arima_order['q'],
P = arima_order['P'],
D = arima_order['D'],
Q = arima_order['Q'],
m = arima_order['Frequency'])
}
r = 1
print(paste0('---- Round ', r, ' ----'))
# Import training data
train_df <- read.csv(file.path(TRAIN_DIR, paste0('train_round_', as.character(r), '.csv')))
# Create a dataframe to hold all necessary data
store_list <- unique(train_df$store)
brand_list <- unique(train_df$brand)
week_list <- TRAIN_START_WEEK:TRAIN_END_WEEK_LIST[r]
data_grid <- expand.grid(store = store_list,
brand = brand_list,
week = week_list)
train_filled <- merge(data_grid, train_df,
by = c('store', 'brand', 'week'),
all.x = TRUE)
train_filled <- train_filled[, c('store','brand','week','logmove')]
# Fill missing logmove
train_filled <-
train_filled %>%
group_by(store, brand) %>%
arrange(week) %>%
fill(logmove) %>%
fill(logmove, .direction = 'up')
# Select ARIMA models
arima_model_all <-
train_filled %>%
group_by(store, brand) %>%
do(select_arima_model(.))
# Combine and save model selection results
write.csv(arima_model_all, file.path(SUBMISSION_DIR, 'hparams.csv'), row.names = FALSE)

View file

@ -1,122 +0,0 @@
#!/usr/bin/Rscript
#
# ARIMA Method for Retail Forecasting Benchmark - OrangeJuice_Pt_3Weeks_Weekly
#
# Note that we first select the best ARIMA model for each time series using Auto ARIMA in
# model_selection.r script. Then, we simply train the best ARIMA model in this script to
# exclude the model selection time and achieve a fair comparison with other methods.
#
# This script can be executed with the following command from TSPerf directory
# Rscript <submission dir>/train_score.r -seed <seed value>
# where <seed value> is a random seed value from 1 to 5 (here since the forecast method
# is deterministic, this value will be simply used as a suffix of the output file name).
# Import packages
library(optparse)
library(dplyr)
library(tidyr)
library(forecast)
# Define parameters
NUM_ROUNDS <- 12
TRAIN_START_WEEK <- 40
TRAIN_END_WEEK_LIST <- seq(135, 157, 2)
TEST_START_WEEK_LIST <- seq(137, 159, 2)
TEST_END_WEEK_LIST <- seq(138, 160, 2)
# Parse input argument
option_list <- list(
make_option(c('-s', '--seed'), type='integer', default=NULL,
help='random seed value from 1 to 5', metavar='integer')
)
opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)
# Paths of the training data and submission folder
DATA_DIR <- './retail_sales/OrangeJuice_Pt_3Weeks_Weekly/data'
TRAIN_DIR <- file.path(DATA_DIR, 'train')
SUBMISSION_DIR <- file.path(dirname(DATA_DIR), 'submissions', 'ARIMA')
# Generate submission file name
if (is.null(opt$seed)){
output_file_name <- file.path(SUBMISSION_DIR, 'submission.csv')
print('Random seed is not specified. Output file name will be submission.csv.')
} else{
output_file_name <- file.path(SUBMISSION_DIR, paste0('submission_seed_', as.character(opt$seed), '.csv'))
print(paste0('Random seed is specified. Output file name will be submission_seed_',
as.character(opt$seed) , '.csv.'))
}
#### Implement ARIMA model for every store-brand ####
print('Using ARIMA Method')
pred_arima_all <- list()
# Load hyperparameters
hparams <- read.csv(file.path(SUBMISSION_DIR, 'hparams.csv'))
apply_arima_method <- function(train_sub, r) {
# Trains ARIMA model to forecast sales of each store-brand in a certain round.
#
# Args:
# train_sub (Dataframe): Training data of a certain store-brand
# r (Integer): Index of the forecast round
#
# Returns:
# pred_arima_df (Dataframe): Predicted sales of the current store-brand
cur_store <- train_sub$store[1]
cur_brand <- train_sub$brand[1]
train_ts <- ts(train_sub[c('logmove')], frequency = 52)
# Retrieve the best ARIMA model selected before
arima_order <- hparams[which(hparams$store==cur_store &
hparams$brand==cur_brand), ]
fit_arima <- Arima(y=train_ts, order=unlist(arima_order[c('p','d','q')], use.names=FALSE))
pred_arima <- forecast(fit_arima, h=pred_horizon)
pred_arima_df <- data.frame(round = rep(r, pred_steps),
store = rep(cur_store, pred_steps),
brand = rep(cur_brand, pred_steps),
week = pred_weeks,
weeks_ahead = pred_weeks_ahead,
prediction = round(exp(pred_arima$mean[2:pred_horizon])))
}
for (r in 1:NUM_ROUNDS) {
print(paste0('---- Round ', r, ' ----'))
pred_horizon <- TEST_END_WEEK_LIST[r] - TRAIN_END_WEEK_LIST[r]
pred_steps <- TEST_END_WEEK_LIST[r] - TEST_START_WEEK_LIST[r] + 1
pred_weeks <- TEST_START_WEEK_LIST[r]:TEST_END_WEEK_LIST[r]
pred_weeks_ahead <- pred_weeks - TRAIN_END_WEEK_LIST[r]
# Import training data
train_df <- read.csv(file.path(TRAIN_DIR, paste0('train_round_', as.character(r), '.csv')))
# Create a dataframe to hold all necessary data
store_list <- unique(train_df$store)
brand_list <- unique(train_df$brand)
week_list <- TRAIN_START_WEEK:TRAIN_END_WEEK_LIST[r]
data_grid <- expand.grid(store = store_list,
brand = brand_list,
week = week_list)
train_filled <- merge(data_grid, train_df,
by = c('store', 'brand', 'week'),
all.x = TRUE)
train_filled <- train_filled[, c('store','brand','week','logmove')]
# Fill missing logmove
train_filled <-
train_filled %>%
group_by(store, brand) %>%
arrange(week) %>%
fill(logmove) %>%
fill(logmove, .direction = 'up')
# Apply ARIMA method
pred_arima_all[[paste0('Round', r)]] <-
train_filled %>%
group_by(store, brand) %>%
do(apply_arima_method(., r))
}
# Combine and save forecast results
pred_arima_all <- do.call(rbind, pred_arima_all)
write.csv(pred_arima_all, output_file_name, row.names = FALSE)

View file

@ -1,63 +0,0 @@
## Download base image
FROM nvidia/cuda:9.0-cudnn7-runtime-ubuntu16.04
WORKDIR /tmp
## Install basic packages
RUN apt-get update && apt-get install -y --no-install-recommends --allow-downgrades --allow-change-held-packages \
wget \
zlib1g-dev \
libssl-dev \
libssh2-1-dev \
libcurl4-openssl-dev \
libreadline-gplv2-dev \
libncursesw5-dev \
libsqlite3-dev \
tk-dev \
libgdbm-dev \
libc6-dev \
libbz2-dev \
libffi-dev \
bzip2 \
build-essential \
checkinstall \
ca-certificates \
curl \
lsb-release \
apt-utils \
python3-pip \
vim \
cuda-command-line-tools-9-0 \
cuda-cublas-9-0 \
cuda-cufft-9-0 \
cuda-curand-9-0 \
cuda-cusolver-9-0 \
cuda-cusparse-9-0 \
libcudnn7=7.2.1.38-1+cuda9.0 \
libnccl2=2.2.13-1+cuda9.0 \
libfreetype6-dev \
libhdf5-serial-dev \
libpng12-dev \
libzmq3-dev
RUN apt-get update && \
    apt-get install -y nvinfer-runtime-trt-repo-ubuntu1604-4.0.1-ga-cuda9.0 && \
    apt-get update && \
    apt-get install -y libnvinfer4=4.1.2-1+cuda9.0
# Update pip and setuptools
RUN pip3 install --upgrade pip
RUN pip3 install --upgrade setuptools
## Mount Python dependency file into the docker container and install dependencies
WORKDIR /tmp
ADD ./python_dependencies.txt /tmp
RUN pip3 install -r python_dependencies.txt
# Fix the symlink issue of tensorflow (https://github.com/tensorflow/tensorflow/issues/10776)
RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1
ENV LD_LIBRARY_PATH='/usr/local/cuda/lib64/stubs/:${LD_LIBRARY_PATH}'
RUN rm /usr/local/cuda/lib64/stubs/libcuda.so.1
WORKDIR /
RUN rm -rf tmp
ENTRYPOINT ["/bin/bash"]

View file

@ -1,201 +0,0 @@
# Implementation submission form
## Submission details
**Submission date**: 11/29/2018
**Benchmark name:** OrangeJuice_Pt_3Weeks_Weekly
**Submitter(s):** Chenhui Hu
**Submitter(s) email:** chenhhu@microsoft.com
**Submission name:** DilatedCNN
**Submission path:** retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/DilatedCNN
## Implementation description
### Modelling approach
In this submission, we implement a Dilated Convolutional Neural Network (CNN) model using the Keras package. The dilated CNN is a class of CNN that was initially
proposed to improve audio waveform generation in [this paper](https://arxiv.org/abs/1609.03499) by van den Oord et al. in 2016. This model has since shown strong
performance on time series forecasting problems in several recent machine learning competitions.
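As a minimal sketch of the idea (not the submission model itself, which is defined in `train_score.py`; `tensorflow.keras` is assumed here), stacking causal convolutions with exponentially growing dilation rates widens the receptive field without adding many parameters:
```python
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Conv1D, Dense, Flatten

seq_len, n_series = 15, 7  # illustrative input sizes
seq_in = Input(shape=(seq_len, n_series))
# Each layer doubles the dilation rate, so the receptive field grows exponentially
x = Conv1D(3, 2, dilation_rate=1, padding="causal", activation="relu")(seq_in)
x = Conv1D(3, 2, dilation_rate=2, padding="causal", activation="relu")(x)
x = Conv1D(3, 2, dilation_rate=4, padding="causal", activation="relu")(x)
out = Dense(2)(Flatten()(x))  # 2-step-ahead point forecasts
model = Model(seq_in, out)
model.compile(loss="mape", optimizer="adam")
model.summary()
```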
### Feature engineering
The following features have been used in the implementation of the forecast method:
- datetime features including week of the month and month number
- weekly sales of each orange juice brand in recent weeks
- other dynamic features including deal information (*deal* column), feature advertisement information (*feat* column), price, and relative price (see the sketch after this list)
- static features including store index and brand index
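A small sketch of the relative price feature from the list above (toy dataframe; the actual computation lives in `make_features.py`):
```python
import pandas as pd

# Toy frame with one price column per brand, as in the OJ dataset
df = pd.DataFrame({"brand": [1, 2], "price1": [2.5, 2.5], "price2": [3.0, 3.0]})
price_cols = ["price1", "price2"]

df["price"] = df.apply(lambda r: r["price" + str(int(r["brand"]))], axis=1)
df["avg_price"] = df[price_cols].mean(axis=1)      # average price across brands
df["price_ratio"] = df["price"] / df["avg_price"]  # own price relative to that average
print(df[["brand", "price", "price_ratio"]])
```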
### Hyperparameter tuning
We tune the hyperparameters of the model with HyperDrive, which is accessible through the Azure ML SDK. A remote compute cluster with GPU support is created
to distribute the computation. The hyperparameters tuned with HyperDrive and their ranges can be found in `hyperparameter_tuning.ipynb`.
### Description of implementation scripts
* `utils.py`: Python script including utility functions for building the Dilated CNN model
* `train_score.py`: Python script that trains the model and generates forecast results for each round
* `train_score.ipynb` (optional): Jupyter notebook that trains the model and visualizes the results
* `train_validate.py` (optional): Python script that does training and validation with the 1st round training data
* `hyperparameter_tuning.ipynb` (optional): Jupyter notebook that tries different model configurations and selects the best model by running
`train_validate.py` script in a remote compute cluster with different sets of hyperparameters
### Steps to reproduce results
0. Follow the instructions [here](#resource-deployment-instructions) to provision a Linux virtual machine and log into the provisioned
VM.
1. Clone the Forecasting repo to the home directory of your machine
```bash
cd ~
git clone https://github.com/Microsoft/Forecasting.git
```
Use one of the following options to securely connect to the Git repo:
* [Personal Access Tokens](https://help.github.com/articles/creating-a-personal-access-token-for-the-command-line/)
For this method, the clone command becomes
```bash
git clone https://<username>:<personal access token>@github.com/Microsoft/Forecasting.git
```
* [Git Credential Managers](https://github.com/Microsoft/Git-Credential-Manager-for-Windows)
* [Authenticate with SSH](https://help.github.com/articles/connecting-to-github-with-ssh/)
2. Create a conda environment for running the scripts of data downloading, data preparation, and result evaluation. To do this, you need
to check if conda has been installed by running the command `conda -V`. If it is installed, you will see the conda version in the terminal. Otherwise, please follow the instructions [here](https://conda.io/docs/user-guide/install/linux.html) to install conda. Then, you can go to the `~/Forecasting` directory in the VM and create a conda environment named `tsperf` by
```bash
conda env create --file ./common/conda_dependencies.yml
```
This will create a conda environment with the Python and R packages listed in `conda_dependencies.yml` being installed. The conda
environment name is also defined in the yml file.
3. Activate the conda environment and download the Orange Juice dataset. Use command `source activate tsperf` to activate the conda environment. Then, download the Orange Juice dataset by running the following command from `~/Forecasting` directory
```bash
Rscript ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/common/download_data.r
```
This will create a data directory `./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/data` and store the dataset in this directory. The dataset has two csv files - `yx.csv` and `storedemo.csv` which contain the sales information and store demographic information, respectively.
4. From `~/Forecasting` directory, run the following command to generate the training data and testing data for each forecast period:
```bash
python ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/common/serve_folds.py --test --save
```
This will generate 12 csv files named `train_round_#.csv` and 12 csv files named `test_round_#.csv` in two subfolders `/train` and
`/test` under the data directory, respectively. After running the above command, you can deactivate the conda environment by running
`source deactivate`.
5. Make sure Docker is installed
You can check if Docker is installed on your VM by running
```bash
sudo docker -v
```
You will see the Docker version if Docker is installed. If not, you can install it by following the instructions [here](https://docs.docker.com/install/linux/docker-ce/ubuntu/). Note that if you want to execute Docker commands without sudo as a non-root user, you need to create a Unix group and add users to it by following the instructions [here](https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user).
6. Build a local Docker image by running the following command from `~/Forecasting` directory
```bash
sudo docker build -t dcnn_image:v1 ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/DilatedCNN
```
7. Choose a name for a new Docker container (e.g. dcnn_container) and create it using command:
```bash
sudo docker run -it -v ~/Forecasting:/Forecasting --runtime=nvidia --name dcnn_container dcnn_image:v1
```
Note that the option `-v ~/Forecasting:/Forecasting` allows you to mount the `~/Forecasting` folder (the one you cloned) into the container so that you have
access to the source code inside the container.
8. Train the model and make predictions from `/Forecasting` folder by running
```bash
cd /Forecasting
source ./common/train_score_vm ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/DilatedCNN Python3
```
This will generate 5 `submission_seed_<seed number>.csv` files in the submission directory, where \<seed number\>
is between 1 and 5. This command will also output 5 running times of train_score.py. The median of the times
reported in rows starting with 'real' should be compared against the wall-clock time declared in the benchmark
submission. After generating the forecast results, you can exit the Docker container with the command `exit`.
9. Activate conda environment again by `source activate tsperf`. Then, evaluate the benchmark quality by running
```bash
source ./common/evaluate ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/DilatedCNN ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly
```
This command will output 5 benchmark quality values (MAPEs). Their median should be compared against the
benchmark quality declared in the benchmark submission.
## Implementation resources
**Platform:** Azure Cloud
**Resource location:** East US
**Hardware:** Standard NC6 (1 GPU, 6 vCPUs, 56 GB memory) Ubuntu Linux VM
**Data storage:** Standard HDD
**Dockerfile:** [retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/DilatedCNN/Dockerfile](https://github.com/Microsoft/Forecasting/blob/master/retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/DilatedCNN/Dockerfile)
**Key packages/dependencies:**
* Python
- pandas==0.23.1
- scikit-learn==0.19.1
- tensorflow-gpu==1.12.0
- keras==2.2.4
## Resource deployment instructions
We use Azure Linux VM to develop the baseline methods. Please follow the instructions below to deploy the resource.
* Azure Linux VM deployment
- Create an Azure account and log into the [Azure portal](https://portal.azure.com/)
- Refer to the steps [here](https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/dsvm-ubuntu-intro) to deploy a Data
Science Virtual Machine for Linux (Ubuntu). Select *NC6* as the virtual machine size.
## Implementation evaluation
**Quality:**
*MAPE run 1: 36.76%*
*MAPE run 2: 39.01%*
*MAPE run 3: 37.76%*
*MAPE run 4: 36.36%*
*MAPE run 5: 37.09%*
*median MAPE: 37.09%*
**Time:**
*run time 1: 412.89 seconds*
*run time 2: 412.73 seconds*
*run time 3: 419.58 seconds*
*run time 4: 412.71 seconds*
*run time 5: 414.30 seconds*
*median run time: 412.89 seconds*
**Cost:** The hourly cost of the NC6 Ubuntu Linux VM in the East US Azure region is 0.90 USD, based on the price at the submission date. Thus, the total cost is 412.89/3600 $\times$ 0.90 = $0.1032.

View file

@ -1,445 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tuning Hyperparameters of Dilated CNN Model with AML SDK and HyperDrive\n",
"\n",
"This notebook performs hyperparameter tuning of Dilated CNN model with AML SDK and HyperDrive. It selects the best model by cross validation using the training data in the first forecast round. Specifically, it splits the training data into sub-training data and validation data. Then, it trains Dilated CNN models with different sets of hyperparameters using the sub-training data and evaluate the accuracy of each model with the validation data. The set of hyperparameters which yield the best validation accuracy will be used to train models and forecast sales across all 12 forecast rounds.\n",
"\n",
"## Prerequisites\n",
"To run this notebook, you need to install AML SDK and its widget extension in your environment by running the following commands in a terminal. Before running the commands, you need to activate your environment by executing `source activate <your env>` in a Linux VM. \n",
"`pip3 install --upgrade azureml-sdk[notebooks,automl]` \n",
"`jupyter nbextension install --py --user azureml.widgets` \n",
"`jupyter nbextension enable --py --user azureml.widgets` \n",
"\n",
"To add the environment to your Jupyter kernels, you can do `python3 -m ipykernel install --name <your env>`. Besides, you need to create an Azure ML workspace and download its configuration file (`config.json`) by following the [configuration.ipynb](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml\n",
"from azureml.core import Workspace, Run\n",
"\n",
"# Check core SDK version number\n",
"print(\"Azure ML SDK Version: \", azureml.core.VERSION)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.telemetry import set_diagnostics_collection\n",
"\n",
"# Opt-in diagnostics for better experience of future releases\n",
"set_diagnostics_collection(send_diagnostics=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialize Workspace & Create an Azure ML Experiment\n",
"\n",
"Initialize a [Machine Learning Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the workspace you created in the Prerequisites step. `Workspace.from_config()` below creates a workspace object from the details stored in `config.json` that you have downloaded."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.workspace import Workspace\n",
"\n",
"ws = Workspace.from_config()\n",
"print('Workspace name: ' + ws.name, \n",
" 'Azure region: ' + ws.location, \n",
" 'Resource group: ' + ws.resource_group, sep = '\\n')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Experiment\n",
"\n",
"exp = Experiment(workspace=ws, name='tune_dcnn')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Validate Script Locally"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.runconfig import RunConfiguration\n",
"\n",
"# Configure local, user managed environment\n",
"run_config_user_managed = RunConfiguration()\n",
"run_config_user_managed.environment.python.user_managed_dependencies = True\n",
"run_config_user_managed.environment.python.interpreter_path = '/usr/bin/python3.5'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import ScriptRunConfig\n",
"\n",
"# Please update data-folder argument before submitting the job\n",
"src = ScriptRunConfig(source_directory='./', \n",
" script='train_validate.py', \n",
" arguments=['--data-folder', \n",
" '/home/chenhui/TSPerf/retail_sales/OrangeJuice_Pt_3Weeks_Weekly/data/', \n",
" '--dropout-rate', '0.2'],\n",
" run_config=run_config_user_managed)\n",
"run_local = exp.submit(src)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check job status\n",
"run_local.get_status()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check results\n",
"while(run_local.get_status() != 'Completed'): {}\n",
"run_local.get_details()\n",
"run_local.get_metrics()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Run Script on Remote Compute Target"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a GPU cluster as compute target"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.compute import ComputeTarget, AmlCompute\n",
"from azureml.core.compute_target import ComputeTargetException\n",
"\n",
"# Choose a name for your cluster\n",
"cluster_name = \"gpucluster\"\n",
"\n",
"try:\n",
" # Look for the existing cluster by name\n",
" compute_target = ComputeTarget(workspace=ws, name=cluster_name)\n",
" if type(compute_target) is AmlCompute:\n",
" print('Found existing compute target {}.'.format(cluster_name))\n",
" else:\n",
" print('{} exists but it is not an AML Compute target. Please choose a different name.'.format(cluster_name))\n",
"except ComputeTargetException:\n",
" print('Creating a new compute target...')\n",
" compute_config = AmlCompute.provisioning_configuration(vm_size=\"STANDARD_NC6\", # GPU-based VM\n",
" #vm_priority='lowpriority', # optional\n",
" min_nodes=0, \n",
" max_nodes=4,\n",
" idle_seconds_before_scaledown=3600)\n",
" # Create the cluster\n",
" compute_target = ComputeTarget.create(ws, cluster_name, compute_config)\n",
" # Can poll for a minimum number of nodes and for a specific timeout. \n",
" # if no min node count is provided it uses the scale settings for the cluster\n",
" compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)\n",
" # Get a detailed status for the current cluster. \n",
" print(compute_target.serialize())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# If you have created the compute target, you should see one entry named 'gpucluster' of type AmlCompute \n",
"# in the workspace's compute_targets property.\n",
"compute_targets = ws.compute_targets\n",
"for name, ct in compute_targets.items():\n",
" print(name, ct.type, ct.provisioning_state)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure Docker environment"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.runconfig import EnvironmentDefinition\n",
"from azureml.core.conda_dependencies import CondaDependencies\n",
"\n",
"env = EnvironmentDefinition()\n",
"env.python.user_managed_dependencies = False\n",
"env.python.conda_dependencies = CondaDependencies.create(conda_packages=['pandas', 'numpy', 'scipy', 'scikit-learn', 'tensorflow-gpu', 'keras', 'joblib'],\n",
" python_version='3.6.2')\n",
"env.python.conda_dependencies.add_channel('conda-forge')\n",
"env.docker.enabled=True"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Upload data to default datastore\n",
"\n",
"Upload the Orange Juice dataset to the workspace's default datastore, which will later be mounted on the cluster for model training and validation. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ds = ws.get_default_datastore()\n",
"print(ds.datastore_type, ds.account_name, ds.container_name)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"path_on_datastore = 'data'\n",
"ds.upload(src_dir='../../data', target_path=path_on_datastore, overwrite=True, show_progress=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Get data reference object for the data path\n",
"ds_data = ds.path(path_on_datastore)\n",
"print(ds_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create estimator\n",
"Next, we will check if the remote compute target is successfully created by submitting a job to the target. This compute target will be used by HyperDrive to tune the hyperparameters later. You may skip this part of code and directly jump into [Tune Hyperparameters using HyperDrive](#tune-hyperparameters-using-hyperdrive)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.runconfig import EnvironmentDefinition\n",
"from azureml.train.estimator import Estimator\n",
"\n",
"script_folder = './'\n",
"script_params = {\n",
" '--data-folder': ds_data.as_mount(),\n",
" '--dropout-rate': 0.2\n",
"}\n",
"est = Estimator(source_directory=script_folder,\n",
" script_params=script_params,\n",
" compute_target=compute_target,\n",
" use_docker=True,\n",
" entry_script='train_validate.py',\n",
" environment_definition=env)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit job"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Submit job to compute target\n",
"run_remote = exp.submit(config=est)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Check job status"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.widgets import RunDetails\n",
"\n",
"RunDetails(run_remote).show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"run_remote.get_details()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Get metric value after the job finishes \n",
"while(run_remote.get_status() != 'Completed'): {}\n",
"run_remote.get_metrics()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='tune-hyperparameters-using-hyperdrive'></a>\n",
"## Tune Hyperparameters using HyperDrive"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.train.hyperdrive import *\n",
"\n",
"script_folder = './'\n",
"script_params = {\n",
" '--data-folder': ds_data.as_mount() \n",
"}\n",
"est = Estimator(source_directory=script_folder,\n",
" script_params=script_params,\n",
" compute_target=compute_target,\n",
" use_docker=True,\n",
" entry_script='train_validate.py',\n",
" environment_definition=env)\n",
"ps = BayesianParameterSampling({\n",
" '--seq-len': quniform(5, 40, 1),\n",
" '--dropout-rate': uniform(0, 0.4),\n",
" '--batch-size': choice(32, 64),\n",
" '--learning-rate': choice(1e-4, 1e-3, 5e-3, 1e-2, 1.5e-2, 2e-2, 3e-2, 5e-2, 1e-1),\n",
" '--epochs': quniform(2, 80, 1)\n",
"})\n",
"htc = HyperDriveRunConfig(estimator=est, \n",
" hyperparameter_sampling=ps, \n",
" primary_metric_name='MAPE', \n",
" primary_metric_goal=PrimaryMetricGoal.MINIMIZE, \n",
" max_total_runs=200,\n",
" max_concurrent_runs=4)\n",
"htr = exp.submit(config=htc)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"RunDetails(htr).show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"while(htr.get_status() != 'Completed'): {}\n",
"htr.get_metrics()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"best_run = htr.get_best_run_by_primary_metric()\n",
"parameter_values = best_run.get_details()['runDefinition']['Arguments']\n",
"print(parameter_values)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View file

@ -1,88 +0,0 @@
# coding: utf-8
# Create input features for the Dilated Convolutional Neural Network (CNN) model.
import os
import sys
import math
import datetime
import numpy as np
import pandas as pd
# Append TSPerf path to sys.path
tsperf_dir = "."
if tsperf_dir not in sys.path:
sys.path.append(tsperf_dir)
# Import TSPerf components
from utils import *
import retail_sales.OrangeJuice_Pt_3Weeks_Weekly.common.benchmark_settings as bs
def make_features(pred_round, train_dir, pred_steps, offset, store_list, brand_list):
"""Create a dataframe of the input features.
Args:
pred_round (Integer): Prediction round
train_dir (String): Path of the training data directory
pred_steps (Integer): Number of prediction steps
offset (Integer): Length of training data skipped in the retraining
store_list (Numpy Array): List of all the store IDs
brand_list (Numpy Array): List of all the brand IDs
Returns:
data_filled (Dataframe): Dataframe including the input features
data_scaled (Dataframe): Dataframe including the normalized features
"""
# Load training data
train_df = pd.read_csv(os.path.join(train_dir, "train_round_" + str(pred_round + 1) + ".csv"))
train_df["move"] = train_df["logmove"].apply(lambda x: round(math.exp(x)))
train_df = train_df[["store", "brand", "week", "move"]]
# Create a dataframe to hold all necessary data
week_list = range(bs.TRAIN_START_WEEK + offset, bs.TEST_END_WEEK_LIST[pred_round] + 1)
d = {"store": store_list, "brand": brand_list, "week": week_list}
data_grid = df_from_cartesian_product(d)
data_filled = pd.merge(data_grid, train_df, how="left", on=["store", "brand", "week"])
# Get future price, deal, and advertisement info
aux_df = pd.read_csv(os.path.join(train_dir, "aux_round_" + str(pred_round + 1) + ".csv"))
data_filled = pd.merge(data_filled, aux_df, how="left", on=["store", "brand", "week"])
# Create relative price feature
price_cols = [
"price1",
"price2",
"price3",
"price4",
"price5",
"price6",
"price7",
"price8",
"price9",
"price10",
"price11",
]
data_filled["price"] = data_filled.apply(lambda x: x.loc["price" + str(int(x.loc["brand"]))], axis=1)
data_filled["avg_price"] = data_filled[price_cols].sum(axis=1).apply(lambda x: x / len(price_cols))
data_filled["price_ratio"] = data_filled["price"] / data_filled["avg_price"]
data_filled.drop(price_cols, axis=1, inplace=True)
# Fill missing values
data_filled = data_filled.groupby(["store", "brand"]).apply(
lambda x: x.fillna(method="ffill").fillna(method="bfill")
)
# Create datetime features
data_filled["week_start"] = data_filled["week"].apply(
lambda x: bs.FIRST_WEEK_START + datetime.timedelta(days=(x - 1) * 7)
)
data_filled["month"] = data_filled["week_start"].apply(lambda x: x.month)
data_filled["week_of_month"] = data_filled["week_start"].apply(lambda x: week_of_month(x))
data_filled.drop("week_start", axis=1, inplace=True)
# Normalize the dataframe of features
cols_normalize = data_filled.columns.difference(["store", "brand", "week"])
data_scaled, min_max_scaler = normalize_dataframe(data_filled, cols_normalize)
return data_filled, data_scaled

View file

@ -1,6 +0,0 @@
numpy==1.14.2
scipy==1.0.1
pandas==0.23.1
scikit-learn==0.19.1
tensorflow-gpu==1.12.0
keras==2.2.4

File diff hidden because one or more lines are too long

View file

@ -1,223 +0,0 @@
# coding: utf-8
# Train and score a Dilated Convolutional Neural Network (CNN) model using Keras package with TensorFlow backend.
import os
import sys
import keras
import random
import argparse
import numpy as np
import pandas as pd
import tensorflow as tf
from keras import optimizers
from keras.layers import *
from keras.models import Model, load_model
from keras.callbacks import ModelCheckpoint
# Append TSPerf path to sys.path (assume we run the script from TSPerf directory)
tsperf_dir = "."
if tsperf_dir not in sys.path:
sys.path.append(tsperf_dir)
# Import TSPerf components
from utils import *
from make_features import make_features
import retail_sales.OrangeJuice_Pt_3Weeks_Weekly.common.benchmark_settings as bs
# Model definition
def create_dcnn_model(seq_len, kernel_size=2, n_filters=3, n_input_series=1, n_outputs=1):
"""Create a Dilated CNN model.
Args:
seq_len (Integer): Input sequence length
kernel_size (Integer): Kernel size of each convolutional layer
n_filters (Integer): Number of filters in each convolutional layer
n_outputs (Integer): Number of outputs in the last layer
Returns:
Keras Model object
"""
# Sequential input
seq_in = Input(shape=(seq_len, n_input_series))
# Categorical input
cat_fea_in = Input(shape=(2,), dtype="uint8")
store_id = Lambda(lambda x: x[:, 0, None])(cat_fea_in)
brand_id = Lambda(lambda x: x[:, 1, None])(cat_fea_in)
store_embed = Embedding(MAX_STORE_ID + 1, 7, input_length=1)(store_id)
brand_embed = Embedding(MAX_BRAND_ID + 1, 4, input_length=1)(brand_id)
# Dilated convolutional layers
c1 = Conv1D(filters=n_filters, kernel_size=kernel_size, dilation_rate=1, padding="causal", activation="relu")(
seq_in
)
c2 = Conv1D(filters=n_filters, kernel_size=kernel_size, dilation_rate=2, padding="causal", activation="relu")(c1)
c3 = Conv1D(filters=n_filters, kernel_size=kernel_size, dilation_rate=4, padding="causal", activation="relu")(c2)
# Skip connections
c4 = concatenate([c1, c3])
# Output of convolutional layers
conv_out = Conv1D(8, 1, activation="relu")(c4)
conv_out = Dropout(args.dropout_rate)(conv_out)
conv_out = Flatten()(conv_out)
# Concatenate with categorical features
x = concatenate([conv_out, Flatten()(store_embed), Flatten()(brand_embed)])
x = Dense(16, activation="relu")(x)
output = Dense(n_outputs, activation="linear")(x)
# Define model interface, loss function, and optimizer
model = Model(inputs=[seq_in, cat_fea_in], outputs=output)
return model
if __name__ == "__main__":
# Parse input arguments
parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=int, dest="seed", default=1, help="random seed")
parser.add_argument("--seq-len", type=int, dest="seq_len", default=15, help="length of the input sequence")
parser.add_argument("--dropout-rate", type=float, dest="dropout_rate", default=0.01, help="dropout ratio")
parser.add_argument("--batch-size", type=int, dest="batch_size", default=64, help="mini batch size for training")
parser.add_argument("--learning-rate", type=float, dest="learning_rate", default=0.015, help="learning rate")
parser.add_argument("--epochs", type=int, dest="epochs", default=25, help="# of epochs")
args = parser.parse_args()
# Fix random seeds
np.random.seed(args.seed)
random.seed(args.seed)
tf.set_random_seed(args.seed)
# Data paths
DATA_DIR = os.path.join(tsperf_dir, "retail_sales", "OrangeJuice_Pt_3Weeks_Weekly", "data")
SUBMISSION_DIR = os.path.join(
tsperf_dir, "retail_sales", "OrangeJuice_Pt_3Weeks_Weekly", "submissions", "DilatedCNN"
)
TRAIN_DIR = os.path.join(DATA_DIR, "train")
# Dataset parameters
MAX_STORE_ID = 137
MAX_BRAND_ID = 11
# Parameters of the model
PRED_HORIZON = 3
PRED_STEPS = 2
SEQ_LEN = args.seq_len
DYNAMIC_FEATURES = ["deal", "feat", "month", "week_of_month", "price", "price_ratio"]
STATIC_FEATURES = ["store", "brand"]
# Get unique stores and brands
train_df = pd.read_csv(os.path.join(TRAIN_DIR, "train_round_1.csv"))
store_list = train_df["store"].unique()
brand_list = train_df["brand"].unique()
store_brand = [(x, y) for x in store_list for y in brand_list]
# Train and predict for all forecast rounds
pred_all = []
file_name = os.path.join(SUBMISSION_DIR, "dcnn_model.h5")
for r in range(bs.NUM_ROUNDS):
print("---- Round " + str(r + 1) + " ----")
offset = 0 if r == 0 else 40 + r * PRED_STEPS
# Create features
data_filled, data_scaled = make_features(r, TRAIN_DIR, PRED_STEPS, offset, store_list, brand_list)
# Create sequence array for 'move'
start_timestep = 0
end_timestep = bs.TRAIN_END_WEEK_LIST[r] - bs.TRAIN_START_WEEK - PRED_HORIZON
train_input1 = gen_sequence_array(
data_scaled, store_brand, SEQ_LEN, ["move"], start_timestep, end_timestep - offset
)
# Create sequence array for other dynamic features
start_timestep = PRED_HORIZON
end_timestep = bs.TRAIN_END_WEEK_LIST[r] - bs.TRAIN_START_WEEK
train_input2 = gen_sequence_array(
data_scaled, store_brand, SEQ_LEN, DYNAMIC_FEATURES, start_timestep, end_timestep - offset
)
seq_in = np.concatenate([train_input1, train_input2], axis=2)
# Create array of static features
total_timesteps = bs.TRAIN_END_WEEK_LIST[r] - bs.TRAIN_START_WEEK - SEQ_LEN - PRED_HORIZON + 2
cat_fea_in = static_feature_array(data_filled, total_timesteps - offset, STATIC_FEATURES)
# Create training output
start_timestep = SEQ_LEN + PRED_HORIZON - PRED_STEPS
end_timestep = bs.TRAIN_END_WEEK_LIST[r] - bs.TRAIN_START_WEEK
train_output = gen_sequence_array(
data_filled, store_brand, PRED_STEPS, ["move"], start_timestep, end_timestep - offset
)
train_output = np.squeeze(train_output)
# Create and train model
if r == 0:
model = create_dcnn_model(
seq_len=SEQ_LEN, n_filters=2, n_input_series=1 + len(DYNAMIC_FEATURES), n_outputs=PRED_STEPS
)
adam = optimizers.Adam(lr=args.learning_rate)
model.compile(loss="mape", optimizer=adam, metrics=["mape"])
# Define checkpoint and fit model
checkpoint = ModelCheckpoint(file_name, monitor="loss", save_best_only=True, mode="min", verbose=0)
callbacks_list = [checkpoint]
history = model.fit(
[seq_in, cat_fea_in],
train_output,
epochs=args.epochs,
batch_size=args.batch_size,
callbacks=callbacks_list,
verbose=0,
)
else:
model = load_model(file_name)
checkpoint = ModelCheckpoint(file_name, monitor="loss", save_best_only=True, mode="min", verbose=0)
callbacks_list = [checkpoint]
history = model.fit(
[seq_in, cat_fea_in],
train_output,
epochs=1,
batch_size=args.batch_size,
callbacks=callbacks_list,
verbose=0,
)
# Get inputs for prediction
start_timestep = bs.TEST_START_WEEK_LIST[r] - bs.TRAIN_START_WEEK - SEQ_LEN - PRED_HORIZON + PRED_STEPS
end_timestep = bs.TEST_START_WEEK_LIST[r] - bs.TRAIN_START_WEEK + PRED_STEPS - 1 - PRED_HORIZON
test_input1 = gen_sequence_array(
data_scaled, store_brand, SEQ_LEN, ["move"], start_timestep - offset, end_timestep - offset
)
start_timestep = bs.TEST_END_WEEK_LIST[r] - bs.TRAIN_START_WEEK - SEQ_LEN + 1
end_timestep = bs.TEST_END_WEEK_LIST[r] - bs.TRAIN_START_WEEK
test_input2 = gen_sequence_array(
data_scaled, store_brand, SEQ_LEN, DYNAMIC_FEATURES, start_timestep - offset, end_timestep - offset
)
seq_in = np.concatenate([test_input1, test_input2], axis=2)
total_timesteps = 1
cat_fea_in = static_feature_array(data_filled, total_timesteps, STATIC_FEATURES)
# Make prediction
pred = np.round(model.predict([seq_in, cat_fea_in]))
# Create dataframe for submission
exp_output = data_filled[data_filled.week >= bs.TEST_START_WEEK_LIST[r]].reset_index(drop=True)
exp_output = exp_output[["store", "brand", "week"]]
pred_df = (
exp_output.sort_values(["store", "brand", "week"]).loc[:, ["store", "brand", "week"]].reset_index(drop=True)
)
pred_df["weeks_ahead"] = pred_df["week"] - bs.TRAIN_END_WEEK_LIST[r]
pred_df["round"] = r + 1
pred_df["prediction"] = np.reshape(pred, (pred.size, 1))
pred_all.append(pred_df)
# Generate submission
submission = pd.concat(pred_all, axis=0).reset_index(drop=True)
submission = submission[["round", "store", "brand", "week", "weeks_ahead", "prediction"]]
filename = "submission_seed_" + str(args.seed) + ".csv"
submission.to_csv(os.path.join(SUBMISSION_DIR, filename), index=False)
print("Done")

View file

@ -1,212 +0,0 @@
# coding: utf-8
# Perform cross validation of a Dilated Convolutional Neural Network (CNN) model on the training data of the 1st forecast round.
import os
import sys
import math
import keras
import argparse
import datetime
import numpy as np
import pandas as pd
from utils import *
from keras.layers import *
from keras.models import Model
from keras import optimizers
from keras.utils import multi_gpu_model
from azureml.core import Run
# Model definition
def create_dcnn_model(seq_len, kernel_size=2, n_filters=3, n_input_series=1, n_outputs=1):
"""Create a Dilated CNN model.
Args:
seq_len (Integer): Input sequence length
kernel_size (Integer): Kernel size of each convolutional layer
n_filters (Integer): Number of filters in each convolutional layer
n_outputs (Integer): Number of outputs in the last layer
Returns:
Keras Model object
"""
# Sequential input
seq_in = Input(shape=(seq_len, n_input_series))
# Categorical input
cat_fea_in = Input(shape=(2,), dtype="uint8")
store_id = Lambda(lambda x: x[:, 0, None])(cat_fea_in)
brand_id = Lambda(lambda x: x[:, 1, None])(cat_fea_in)
store_embed = Embedding(MAX_STORE_ID + 1, 7, input_length=1)(store_id)
brand_embed = Embedding(MAX_BRAND_ID + 1, 4, input_length=1)(brand_id)
# Dilated convolutional layers
c1 = Conv1D(filters=n_filters, kernel_size=kernel_size, dilation_rate=1, padding="causal", activation="relu")(
seq_in
)
c2 = Conv1D(filters=n_filters, kernel_size=kernel_size, dilation_rate=2, padding="causal", activation="relu")(c1)
c3 = Conv1D(filters=n_filters, kernel_size=kernel_size, dilation_rate=4, padding="causal", activation="relu")(c2)
# Skip connections
c4 = concatenate([c1, c3])
# Output of convolutional layers
conv_out = Conv1D(8, 1, activation="relu")(c4)
conv_out = Dropout(args.dropout_rate)(conv_out)
conv_out = Flatten()(conv_out)
# Concatenate with categorical features
x = concatenate([conv_out, Flatten()(store_embed), Flatten()(brand_embed)])
x = Dense(16, activation="relu")(x)
output = Dense(n_outputs, activation="linear")(x)
# Define model interface, loss function, and optimizer
model = Model(inputs=[seq_in, cat_fea_in], outputs=output)
return model
if __name__ == "__main__":
# Parse input arguments
parser = argparse.ArgumentParser()
parser.add_argument("--data-folder", type=str, dest="data_folder", help="data folder mounting point")
parser.add_argument("--seq-len", type=int, dest="seq_len", default=20, help="length of the input sequence")
parser.add_argument("--batch-size", type=int, dest="batch_size", default=64, help="mini batch size for training")
parser.add_argument("--dropout-rate", type=float, dest="dropout_rate", default=0.10, help="dropout ratio")
parser.add_argument("--learning-rate", type=float, dest="learning_rate", default=0.01, help="learning rate")
parser.add_argument("--epochs", type=int, dest="epochs", default=30, help="# of epochs")
args = parser.parse_args()
args.dropout_rate = round(args.dropout_rate, 2)
print(args)
# Start an Azure ML run
run = Run.get_context()
# Data paths
DATA_DIR = args.data_folder
TRAIN_DIR = os.path.join(DATA_DIR, "train")
# Data and forecast problem parameters
MAX_STORE_ID = 137
MAX_BRAND_ID = 11
PRED_HORIZON = 3
PRED_STEPS = 2
TRAIN_START_WEEK = 40
TRAIN_END_WEEK_LIST = list(range(135, 159, 2))
TEST_START_WEEK_LIST = list(range(137, 161, 2))
TEST_END_WEEK_LIST = list(range(138, 162, 2))
# The start datetime of the first week in the record
FIRST_WEEK_START = pd.to_datetime("1989-09-14 00:00:00")
# Input sequence length and feature names
SEQ_LEN = args.seq_len
DYNAMIC_FEATURES = ["deal", "feat", "month", "week_of_month", "price", "price_ratio"]
STATIC_FEATURES = ["store", "brand"]
# Get unique stores and brands
train_df = pd.read_csv(os.path.join(TRAIN_DIR, "train_round_1.csv"))
store_list = train_df["store"].unique()
brand_list = train_df["brand"].unique()
store_brand = [(x, y) for x in store_list for y in brand_list]
# Train and validate the model using only the first round data
r = 0
print("---- Round " + str(r + 1) + " ----")
# Load training data
train_df = pd.read_csv(os.path.join(TRAIN_DIR, "train_round_" + str(r + 1) + ".csv"))
train_df["move"] = train_df["logmove"].apply(lambda x: round(math.exp(x)))
train_df = train_df[["store", "brand", "week", "move"]]
# Create a dataframe to hold all necessary data
week_list = range(TRAIN_START_WEEK, TEST_END_WEEK_LIST[r] + 1)
d = {"store": store_list, "brand": brand_list, "week": week_list}
data_grid = df_from_cartesian_product(d)
data_filled = pd.merge(data_grid, train_df, how="left", on=["store", "brand", "week"])
# Get future price, deal, and advertisement info
aux_df = pd.read_csv(os.path.join(TRAIN_DIR, "aux_round_" + str(r + 1) + ".csv"))
data_filled = pd.merge(data_filled, aux_df, how="left", on=["store", "brand", "week"])
# Create relative price feature
price_cols = [
"price1",
"price2",
"price3",
"price4",
"price5",
"price6",
"price7",
"price8",
"price9",
"price10",
"price11",
]
data_filled["price"] = data_filled.apply(lambda x: x.loc["price" + str(int(x.loc["brand"]))], axis=1)
data_filled["avg_price"] = data_filled[price_cols].sum(axis=1).apply(lambda x: x / len(price_cols))
data_filled["price_ratio"] = data_filled.apply(lambda x: x["price"] / x["avg_price"], axis=1)
# Fill missing values
data_filled = data_filled.groupby(["store", "brand"]).apply(
lambda x: x.fillna(method="ffill").fillna(method="bfill")
)
# Create datetime features
data_filled["week_start"] = data_filled["week"].apply(
lambda x: FIRST_WEEK_START + datetime.timedelta(days=(x - 1) * 7)
)
data_filled["day"] = data_filled["week_start"].apply(lambda x: x.day)
data_filled["week_of_month"] = data_filled["week_start"].apply(lambda x: week_of_month(x))
data_filled["month"] = data_filled["week_start"].apply(lambda x: x.month)
data_filled.drop("week_start", axis=1, inplace=True)
# Normalize the dataframe of features
cols_normalize = data_filled.columns.difference(["store", "brand", "week"])
data_scaled, min_max_scaler = normalize_dataframe(data_filled, cols_normalize)
# Create sequence array for 'move'
start_timestep = 0
end_timestep = TRAIN_END_WEEK_LIST[r] - TRAIN_START_WEEK - PRED_HORIZON
train_input1 = gen_sequence_array(data_scaled, store_brand, SEQ_LEN, ["move"], start_timestep, end_timestep)
# Create sequence array for other dynamic features
start_timestep = PRED_HORIZON
end_timestep = TRAIN_END_WEEK_LIST[r] - TRAIN_START_WEEK
train_input2 = gen_sequence_array(data_scaled, store_brand, SEQ_LEN, DYNAMIC_FEATURES, start_timestep, end_timestep)
seq_in = np.concatenate((train_input1, train_input2), axis=2)
# Create array of static features
total_timesteps = TRAIN_END_WEEK_LIST[r] - TRAIN_START_WEEK - SEQ_LEN - PRED_HORIZON + 2
cat_fea_in = static_feature_array(data_filled, total_timesteps, STATIC_FEATURES)
# Create training output
start_timestep = SEQ_LEN + PRED_HORIZON - PRED_STEPS
end_timestep = TRAIN_END_WEEK_LIST[r] - TRAIN_START_WEEK
train_output = gen_sequence_array(data_filled, store_brand, PRED_STEPS, ["move"], start_timestep, end_timestep)
train_output = np.squeeze(train_output)
# Create model
model = create_dcnn_model(
seq_len=SEQ_LEN, n_filters=2, n_input_series=1 + len(DYNAMIC_FEATURES), n_outputs=PRED_STEPS
)
# Convert to GPU model
try:
model = multi_gpu_model(model)
print("Training using multiple GPUs...")
except Exception:
# Fall back when multiple GPUs are not available
print("Training using single GPU or CPU...")
adam = optimizers.Adam(lr=args.learning_rate)
model.compile(loss="mape", optimizer=adam, metrics=["mape", "mae"])
# Model training and validation
history = model.fit(
[seq_in, cat_fea_in], train_output, epochs=args.epochs, batch_size=args.batch_size, validation_split=0.05
)
val_loss = history.history["val_loss"][-1]
print("Validation loss is {}".format(val_loss))
# Log the validation loss/MAPE
run.log("MAPE", float(val_loss))

View file

@ -1,127 +0,0 @@
# coding: utf-8
# Utility functions for building the Dilated Convolutional Neural Network (CNN) model.
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
def week_of_month(dt):
"""Get the week of the month for the specified date.
Args:
dt (Datetime): Input date
Returns:
wom (Integer): Week of the month of the input date
"""
from math import ceil
first_day = dt.replace(day=1)
dom = dt.day
adjusted_dom = dom + first_day.weekday()
wom = int(ceil(adjusted_dom / 7.0))
return wom
def df_from_cartesian_product(dict_in):
"""Generate a Pandas dataframe from Cartesian product of lists.
Args:
dict_in (Dictionary): Dictionary containing multiple lists
Returns:
df (Dataframe): Dataframe corresponding to the Cartesian product of the lists
"""
from collections import OrderedDict
from itertools import product
od = OrderedDict(sorted(dict_in.items()))
cart = list(product(*od.values()))
df = pd.DataFrame(cart, columns=od.keys())
return df
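# Example: df_from_cartesian_product({"store": [2, 5], "brand": [1, 2], "week": [40]})
# returns a 4-row dataframe with one row per (brand, store, week) combination.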
def gen_sequence(df, seq_len, seq_cols, start_timestep=0, end_timestep=None):
"""Reshape features into an array of dimension (time steps, features).
Args:
df (Dataframe): Time series data of a specific (store, brand) combination
seq_len (Integer): The number of previous time series values to use as input features
seq_cols (List): A list of names of the feature columns
start_timestep (Integer): First time step you can use to create feature sequences
end_timestep (Integer): Last time step you can use to create feature sequences
Returns:
A generator object for iterating all the feature sequences
"""
data_array = df[seq_cols].values
if end_timestep is None:
end_timestep = df.shape[0] - 1  # index of the last (inclusive) time step
for start, stop in zip(
range(start_timestep, end_timestep - seq_len + 2), range(start_timestep + seq_len, end_timestep + 2)
):
yield data_array[start:stop, :]
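# Example: with start_timestep=0, end_timestep=5 and seq_len=3, the generator
# yields the row slices [0:3], [1:4], [2:5] and [3:6] of the feature array,
# i.e. every length-3 window whose last row is at most time step 5.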
def gen_sequence_array(df_all, store_brand, seq_len, seq_cols, start_timestep=0, end_timestep=None):
"""Combine feature sequences for all the combinations of (store, brand) into an 3d array.
Args:
df_all (Dataframe): Time series data of all stores and brands
store_brand (List): All the (store, brand) combinations to iterate over
seq_len (Integer): The number of previous time series values to use as input features
seq_cols (List): A list of names of the feature columns
start_timestep (Integer): First time step you can use to create feature sequences
end_timestep (Integer): Last time step you can use to create feature sequences
Returns:
seq_array (Numpy Array): An array of the feature sequences of all stores and brands
"""
seq_gen = (
list(
gen_sequence(
df_all[(df_all["store"] == cur_store) & (df_all["brand"] == cur_brand)],
seq_len,
seq_cols,
start_timestep,
end_timestep,
)
)
for cur_store, cur_brand in store_brand
)
seq_array = np.concatenate(list(seq_gen)).astype(np.float32)
return seq_array
def static_feature_array(df_all, total_timesteps, seq_cols):
"""Generate an array which encodes all the static features.
Args:
df_all (Dataframe): Time series data of all stores and brands
total_timesteps (Integer): Total number of training samples for each store and brand
seq_cols (List): A list of names of the static feature columns (e.g., store index)
Returns:
fea_array (Numpy Array): An array of static features of all stores and brands
"""
fea_df = df_all.groupby(["store", "brand"]).apply(lambda x: x.iloc[:total_timesteps, :]).reset_index(drop=True)
fea_array = fea_df[seq_cols].values
return fea_array
def normalize_dataframe(df, seq_cols, scaler=None):
"""Normalize a subset of columns of a dataframe.
Args:
df (Dataframe): Input dataframe
seq_cols (List): A list of names of columns to be normalized
scaler (Scaler): A scikit-learn scaler object (a fresh MinMaxScaler by default)
Returns:
df_scaled (Dataframe): Normalized dataframe
"""
if scaler is None:  # avoid a mutable default argument shared across calls
scaler = MinMaxScaler()
cols_fixed = df.columns.difference(seq_cols)
df_scaled = pd.DataFrame(scaler.fit_transform(df[seq_cols]), columns=seq_cols, index=df.index)
df_scaled = pd.concat([df[cols_fixed], df_scaled], axis=1)
return df_scaled, scaler

View file

@ -1,51 +0,0 @@
## Download base image
FROM ubuntu:16.04
WORKDIR /tmp
## Install basic packages
RUN apt-get update && apt-get install -y --no-install-recommends \
wget \
zlib1g-dev \
libssl-dev \
libssh2-1-dev \
libcurl4-openssl-dev \
libreadline-gplv2-dev \
libncursesw5-dev \
libsqlite3-dev \
tk-dev \
libgdbm-dev \
libc6-dev \
libbz2-dev \
libffi-dev \
bzip2 \
build-essential \
checkinstall \
ca-certificates \
curl \
lsb-release \
apt-utils \
python3-pip \
vim
## Install R
ENV R_BASE_VERSION 3.5.1
RUN sh -c 'echo "deb http://cloud.r-project.org/bin/linux/ubuntu xenial-cran35/" >> /etc/apt/sources.list' \
&& gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9 \
&& gpg -a --export E084DAB9 | apt-key add -
RUN apt-get update && apt-get install -y --no-install-recommends r-base=${R_BASE_VERSION}-* \
&& echo 'options(repos = c(CRAN = "https://cloud.r-project.org"))' >> /etc/R/Rprofile.site
# Install prerequisites of 'forecast' package
RUN apt-get update && apt-get install -y \
gfortran \
libblas-dev \
liblapack-dev
# Use a MRAN snapshot URL to download packages archived on a specific date
RUN echo 'options(repos = list(CRAN = "http://mran.revolutionanalytics.com/snapshot/2018-08-27/"))' >> /etc/R/Rprofile.site
## Mount R dependency file into the docker container and install dependencies
ADD ./install_R_dependencies.r /tmp
RUN Rscript install_R_dependencies.r
RUN rm ./install_R_dependencies.r
WORKDIR /
ENTRYPOINT ["/bin/bash"]

View file

@ -1,192 +0,0 @@
# Implementation submission form
## Submission details
**Submission date**: 09/01/2018
**Benchmark name:** OrangeJuice_Pt_3Weeks_Weekly
**Submitter(s):** Chenhui Hu
**Submitter(s) email:** chenhhu@microsoft.com
**Submission name:** ETS
**Submission path:** retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/ETS
## Implementation description
### Modelling approach
In this submission, we implement the ETS method using the R package `forecast`.
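As a minimal sketch (the full benchmark script is `train_score.r`), fitting and forecasting a single store-brand series with this package looks as follows, assuming `train_sub` holds the filled `logmove` series of one store-brand:
```r
library(forecast)
# Weekly series with yearly seasonality (frequency 52)
train_ts <- ts(train_sub$logmove, frequency = 52)
fit_ets <- ets(train_ts)              # fit ETS with default hyperparameters
pred_ets <- forecast(fit_ets, h = 3)  # forecast 3 weeks ahead
```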
### Feature engineering
Only the weekly sales of each orange juice have been used in the implementation of the forecast method.
### Hyperparameter tuning
Default hyperparameters of the forecasting algorithm are used. Additionally, the frequency of the weekly sales time series is set to 52,
since there are approximately 52 weeks in a year.
### Description of implementation scripts
* `train_score.r`: R script that trains the model and evaluates its performance
* `ets.Rmd` (optional): R markdown that trains the model and visualizes the results
* `ets.nb.html` (optional): HTML file associated with the R markdown file
### Steps to reproduce results
0. Follow the instructions [here](#resource-deployment-instructions) to provision a Linux virtual machine and log into the provisioned
VM.
1. Clone the Forecasting repo to the home directory of your machine
```bash
cd ~
git clone https://github.com/Microsoft/Forecasting.git
```
Use one of the following options to securely connect to the Git repo:
* [Personal Access Tokens](https://help.github.com/articles/creating-a-personal-access-token-for-the-command-line/)
For this method, the clone command becomes
```bash
git clone https://<username>:<personal access token>@github.com/Microsoft/Forecasting.git
```
* [Git Credential Managers](https://github.com/Microsoft/Git-Credential-Manager-for-Windows)
* [Authenticate with SSH](https://help.github.com/articles/connecting-to-github-with-ssh/)
2. Create a conda environment for running the scripts of data downloading, data preparation, and result evaluation. To do this, you need
to check if conda has been installed by running the command `conda -V`. If it is installed, you will see the conda version in the terminal. Otherwise, please follow the instructions [here](https://conda.io/docs/user-guide/install/linux.html) to install conda. Then, you can go to the `~/Forecasting` directory in the VM and create a conda environment named `tsperf` by
```bash
conda env create --file ./common/conda_dependencies.yml
```
This will create a conda environment with the Python and R packages listed in `conda_dependencies.yml` installed. The conda
environment name is also defined in the yml file.
3. Activate the conda environment and download the Orange Juice dataset. Use command `source activate tsperf` to activate the conda environment. Then, download the Orange Juice dataset by running the following command from `~/Forecasting` directory
```bash
Rscript ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/common/download_data.r
```
This will create a data directory `./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/data` and store the dataset in this directory. The dataset has two csv files - `yx.csv` and `storedemo.csv` which contain the sales information and store demographic information, respectively.
4. From `~/Forecasting` directory, run the following command to generate the training data and testing data for each forecast period:
```bash
python ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/common/serve_folds.py --test --save
```
This will generate 12 csv files named `train_round_#.csv` and 12 csv files named `test_round_#.csv` in two subfolders `/train` and
`/test` under the data directory, respectively. After running the above command, you can deactivate the conda environment by running
`source deactivate`.
5. Make sure Docker is installed
You can check if Docker is installed on your VM by running
```bash
sudo docker -v
```
You will see the Docker version if Docker is installed. If not, you can install it by following the instructions [here](https://docs.docker.com/install/linux/docker-ce/ubuntu/). Note that if you want to execute Docker commands without sudo as a non-root user, you need to create a Unix group and add users to it by following the instructions [here](https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user).
6. Build a local Docker image by running the following command from `~/Forecasting` directory
```bash
sudo docker build -t baseline_image:v1 ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/ETS
```
7. Choose a name for a new Docker container (e.g. ets_container) and create it using command:
```bash
sudo docker run -it -v ~/Forecasting:/Forecasting --name ets_container baseline_image:v1
```
Note that the option `-v ~/Forecasting:/Forecasting` mounts the `~/Forecasting` folder (the one you cloned) into the container so that you will have
access to the source code in the container.
8. Inside `/Forecasting` folder, train the model and make predictions by running
```bash
cd /Forecasting
source ./common/train_score_vm ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/ETS R
```
This will generate 5 `submission_seed_<seed number>.csv` files in the submission directory, where \<seed number\>
is between 1 and 5. This command will also output 5 running times of `train_score.r`. The median of the times
reported in rows starting with 'real' should be compared against the wallclock time declared in the benchmark
submission. After generating the forecast results, you can exit the Docker container with the command `exit`.
9. Activate conda environment again by `source activate tsperf`. Then, evaluate the benchmark quality by running
```bash
source ./common/evaluate ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/ETS ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly
```
This command will output 5 benchmark quality values (MAPEs). Their median should be compared against the
benchmark quality declared in the benchmark submission.
## Implementation resources
**Platform:** Azure Cloud
**Resource location:** East US
**Hardware:** Standard D2s v3 (2 vcpus, 8 GB memory, 16 GB temporary storage) Ubuntu Linux VM
**Data storage:** Premium SSD
**Dockerfile:** [retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/ETS/Dockerfile](https://github.com/Microsoft/Forecasting/blob/master/retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/ETS/Dockerfile)
**Key packages/dependencies:**
* R
- r-base==3.5.1
- forecast==8.1
## Resource deployment instructions
We use an Azure Linux VM to develop the baseline methods. Please follow the instructions below to deploy the resource.
* Azure Linux VM deployment
- Create an Azure account and log into the [Azure portal](https://portal.azure.com/)
- Refer to the steps [here](https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/dsvm-ubuntu-intro) to deploy a Data
Science Virtual Machine for Linux (Ubuntu). Select *D2s_v3* as the virtual machine size.
## Implementation evaluation
**Quality:**
*MAPE run 1: 70.99%*
*MAPE run 2: 70.99%*
*MAPE run 3: 70.99%*
*MAPE run 4: 70.99%*
*MAPE run 5: 70.99%*
*median MAPE: 70.99%*
**Time:**
*run time 1: 277.03 seconds*
*run time 2: 277.00 seconds*
*run time 3: 277.75 seconds*
*run time 4: 277.01 seconds*
*run time 5: 274.50 seconds*
*median run time: 277.01 seconds*
**Cost:** The hourly cost of the D2s v3 Ubuntu Linux VM in East US Azure region is 0.096 USD, based on the price at the submission date. Thus, the total cost is 277.01/3600 × 0.096 = $0.0074.
Note that there is no randomness in the forecasts obtained by the above method. Thus, quality values do not change over
different runs.

View file

@ -1,185 +0,0 @@
---
title: "ETS Method for Retail Forecasting Benchmark - OrangeJuice_Pt_3Weeks_Weekly"
output: html_notebook
---
```{r}
## Import packages
library(dplyr)
library(tidyr)
library(forecast)
library(MLmetrics)
## Define parameters
NUM_ROUNDS <- 12
TRAIN_START_WEEK <- 40
TRAIN_END_WEEK_LIST <- seq(135, 157, 2)
TEST_START_WEEK_LIST <- seq(137, 159, 2)
TEST_END_WEEK_LIST <- seq(138, 160, 2)
# Get the path of the current script and paths of data directories
SCRIPT_PATH <- dirname(rstudioapi::getSourceEditorContext()$path)
TRAIN_DIR <- file.path(dirname(dirname(SCRIPT_PATH)), 'data', 'train')
TEST_DIR <- file.path(dirname(dirname(SCRIPT_PATH)), 'data', 'test')
```
```{r}
#### Test ets method on a subset of the data ####
## Import data
r <- 1
train_df <- read.csv(file.path(TRAIN_DIR, paste0('train_round_', as.character(r), '.csv')))
#head(train_df)
## Fill missing values
store_list <- unique(train_df$store)
brand_list <- unique(train_df$brand)
week_list <- TRAIN_START_WEEK:TRAIN_END_WEEK_LIST[r]
data_grid <- expand.grid(store = store_list,
brand = brand_list,
week = week_list)
train_filled <- merge(data_grid, train_df,
by = c('store', 'brand', 'week'),
all.x = TRUE)
train_filled <- train_filled[,c('store','brand','week','logmove')]
head(train_filled)
print('Number of rows with missing values:')
print(sum(!complete.cases(train_filled)))
# Fill missing logmove
train_filled <-
train_filled %>%
group_by(store, brand) %>%
arrange(week) %>%
fill(logmove) %>%
fill(logmove, .direction = 'up')
head(train_filled)
print('Number of rows with missing values after filling:')
print(sum(!complete.cases(train_filled)))
## ETS method
train_sub <- filter(train_filled, store=='2', brand=='1')
train_ts <- ts(train_sub[c('logmove')], frequency = 52)
horizon <- TEST_END_WEEK_LIST[r] - TRAIN_END_WEEK_LIST[r]
fit_ets <- ets(train_ts)
pred_ets <- forecast(fit_ets, h=horizon)
print('ETS forecasts:')
pred_ets$mean[2:horizon]
plot(pred_ets, main='ETS')
```
```{r}
#### Implement ets method on all the data ####
basic_method <- 'ets'
pred_basic_all <- list()
print(paste0('Using ', basic_method))
## Basic methods
apply_basic_methods <- function(train_sub, method, r) {
# Trains a basic model to forecast sales of each store-brand in a certain round.
#
# Args:
# train_sub (Dataframe): Training data of a certain store-brand
# method (String): Name of the basic method which can be 'naive', 'snaive',
# 'meanf', 'ets', or 'arima'
# r (Integer): Index of the forecast round
#
# Returns:
# pred_basic_df (Dataframe): Predicted sales of the current store-brand
cur_store <- train_sub$store[1]
cur_brand <- train_sub$brand[1]
train_ts <- ts(train_sub[c('logmove')], frequency = 52)
if (method == 'naive'){
pred_basic <- naive(train_ts, h=pred_horizon)
} else if (method == 'snaive'){
pred_basic <- snaive(train_ts, h=pred_horizon)
} else if (method == 'meanf'){
pred_basic <- meanf(train_ts, h=pred_horizon)
} else if (method == 'ets') {
fit_ets <- ets(train_ts)
pred_basic <- forecast(fit_ets, h=pred_horizon)
} else if (method == 'arima'){
fit_arima <- auto.arima(train_ts)
pred_basic <- forecast(fit_arima, h=pred_horizon)
}
pred_basic_df <- data.frame(round = rep(r, pred_steps),
store = rep(cur_store, pred_steps),
brand = rep(cur_brand, pred_steps),
week = pred_weeks,
weeks_ahead = pred_weeks_ahead,
prediction = round(exp(pred_basic$mean[2:pred_horizon])))
pred_basic_df
}
for (r in 1:NUM_ROUNDS) {
print(paste0('---- Round ', r, ' ----'))
pred_horizon <- TEST_END_WEEK_LIST[r] - TRAIN_END_WEEK_LIST[r]
pred_steps <- TEST_END_WEEK_LIST[r] - TEST_START_WEEK_LIST[r] + 1
pred_weeks <- TEST_START_WEEK_LIST[r]:TEST_END_WEEK_LIST[r]
pred_weeks_ahead <- pred_weeks - TRAIN_END_WEEK_LIST[r]
## Import training data
train_df <- read.csv(file.path(TRAIN_DIR, paste0('train_round_', as.character(r), '.csv')))
## Fill missing values
store_list <- unique(train_df$store)
brand_list <- unique(train_df$brand)
week_list <- TRAIN_START_WEEK:TRAIN_END_WEEK_LIST[r]
data_grid <- expand.grid(store = store_list,
brand = brand_list,
week = week_list)
train_filled <- merge(data_grid, train_df,
by = c('store', 'brand', 'week'),
all.x = TRUE)
train_filled <- train_filled[,c('store','brand','week','logmove')]
head(train_filled)
print('Number of rows with missing values:')
print(sum(!complete.cases(train_filled)))
# Fill missing logmove
train_filled <-
train_filled %>%
group_by(store, brand) %>%
arrange(week) %>%
fill(logmove) %>%
fill(logmove, .direction = 'up')
head(train_filled)
print('Number of rows with missing values after filling:')
print(sum(!complete.cases(train_filled)))
# Apply basic method
pred_basic_all[[paste0('Round', r)]] <-
train_filled %>%
group_by(store, brand) %>%
do(apply_basic_methods(., basic_method, r))
}
pred_basic_all <- do.call(rbind, pred_basic_all)
# Save forecast results
write.csv(pred_basic_all, file.path(SCRIPT_PATH, 'submission.csv'), row.names = FALSE)
## Evaluate forecast performance
# Get the true value dataframe
true_sales_all <- list()
for (r in 1:NUM_ROUNDS){
test_df <- read.csv(file.path(TEST_DIR, paste0('test_round_', as.character(r), '.csv')))
true_sales_all[[paste0('Round', r)]] <-
data.frame(round = rep(r, dim(test_df)[1]),
store = test_df$store,
brand = test_df$brand,
week = test_df$week,
truth = round(exp(test_df$logmove)))
}
true_sales_all <- do.call(rbind, true_sales_all)
# Merge prediction and true sales
merged_df <- merge(pred_basic_all, true_sales_all,
by = c('round', 'store', 'brand', 'week'),
all.y = TRUE)
print('MAPE')
print(MAPE(merged_df$prediction, merged_df$truth)*100)
print('MedianAPE')
print(MedianAPE(merged_df$prediction, merged_df$truth)*100)
```

View file

@ -1 +0,0 @@
f23245f9ad589e409afc6082e3464552bb32b596

View file

@ -1,11 +0,0 @@
pkgs <- c(
'optparse',
'dplyr',
'tidyr',
'forecast',
'MLmetrics'
)
install.packages(pkgs)

View file

@ -1,110 +0,0 @@
#!/usr/bin/Rscript
#
# ETS Method for Retail Forecasting Benchmark - OrangeJuice_Pt_3Weeks_Weekly
#
# This script can be executed with the following command
# Rscript <submission folder>/train_score.r --seed <seed value>
# where <seed value> is the random seed value from 1 to 5 (since the forecast method
# is deterministic, this value is simply used as a suffix of the output file name).
## Import packages
library(optparse)
library(dplyr)
library(tidyr)
library(forecast)
library(MLmetrics)
## Define parameters
NUM_ROUNDS <- 12
TRAIN_START_WEEK <- 40
TRAIN_END_WEEK_LIST <- seq(135, 157, 2)
TEST_START_WEEK_LIST <- seq(137, 159, 2)
TEST_END_WEEK_LIST <- seq(138, 160, 2)
# Parse input argument
option_list <- list(
make_option(c('-s', '--seed'), type='integer', default=NULL,
help='random seed value from 1 to 5', metavar='integer')
)
opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)
# Paths of the training data and submission folder
DATA_DIR <- './retail_sales/OrangeJuice_Pt_3Weeks_Weekly/data'
TRAIN_DIR <- file.path(DATA_DIR, 'train')
SUBMISSION_DIR <- file.path(dirname(DATA_DIR), 'submissions', 'ETS')
# Generate submission file name
if (is.null(opt$seed)){
output_file_name <- file.path(SUBMISSION_DIR, 'submission.csv')
print('Random seed is not specified. Output file name will be submission.csv.')
} else{
output_file_name <- file.path(SUBMISSION_DIR, paste0('submission_seed_', as.character(opt$seed), '.csv'))
print(paste0('Random seed is specified. Output file name will be submission_seed_',
as.character(opt$seed) , '.csv.'))
}
#### Implement ets method for every store-brand ####
print('Using ETS Method')
pred_ets_all <- list()
## ets method
apply_ets_method <- function(train_sub, r) {
# Trains ETS model to forecast sales of each store-brand in a certain round.
#
# Args:
# train_sub (Dataframe): Training data of a certain store-brand
# r (Integer): Index of the forecast round
#
# Returns:
# pred_ets_df (Dataframe): Predicted sales of the current store-brand
cur_store <- train_sub$store[1]
cur_brand <- train_sub$brand[1]
train_ts <- ts(train_sub[c('logmove')], frequency = 52)
fit_ets <- ets(train_ts)
pred_ets <- forecast(fit_ets, h=pred_horizon)
pred_ets_df <- data.frame(round = rep(r, pred_steps),
store = rep(cur_store, pred_steps),
brand = rep(cur_brand, pred_steps),
week = pred_weeks,
weeks_ahead = pred_weeks_ahead,
prediction = round(exp(pred_ets$mean[2:pred_horizon])))
pred_ets_df
}
for (r in 1:NUM_ROUNDS) {
print(paste0('---- Round ', r, ' ----'))
pred_horizon <- TEST_END_WEEK_LIST[r] - TRAIN_END_WEEK_LIST[r]
pred_steps <- TEST_END_WEEK_LIST[r] - TEST_START_WEEK_LIST[r] + 1
pred_weeks <- TEST_START_WEEK_LIST[r]:TEST_END_WEEK_LIST[r]
pred_weeks_ahead <- pred_weeks - TRAIN_END_WEEK_LIST[r]
## Import training data
train_df <- read.csv(file.path(TRAIN_DIR, paste0('train_round_', as.character(r), '.csv')))
## Fill missing values
store_list <- unique(train_df$store)
brand_list <- unique(train_df$brand)
week_list <- TRAIN_START_WEEK:TRAIN_END_WEEK_LIST[r]
data_grid <- expand.grid(store = store_list,
brand = brand_list,
week = week_list)
train_filled <- merge(data_grid, train_df,
by = c('store', 'brand', 'week'),
all.x = TRUE)
train_filled <- train_filled[,c('store','brand','week','logmove')]
# Fill missing logmove
train_filled <-
train_filled %>%
group_by(store, brand) %>%
arrange(week) %>%
fill(logmove) %>%
fill(logmove, .direction = 'up')
# Apply ets method
pred_ets_all[[paste0('Round', r)]] <-
train_filled %>%
group_by(store, brand) %>%
do(apply_ets_method(., r))
}
# Combine and save forecast results
pred_ets_all <- do.call(rbind, pred_ets_all)
write.csv(pred_ets_all, output_file_name, row.names = FALSE)

View file

@ -1,43 +0,0 @@
## Download base image
FROM ubuntu:16.04
WORKDIR /tmp
## Install basic packages
RUN apt-get update && apt-get install -y --no-install-recommends \
wget \
zlib1g-dev \
libssl-dev \
libssh2-1-dev \
libcurl4-openssl-dev \
libreadline-gplv2-dev \
libncursesw5-dev \
libsqlite3-dev \
tk-dev \
libgdbm-dev \
libc6-dev \
libbz2-dev \
libffi-dev \
bzip2 \
build-essential \
checkinstall \
ca-certificates \
curl \
lsb-release \
apt-utils \
python3-pip \
vim
# Update pip
RUN pip3 install --upgrade pip
## Mount Python dependency file into the docker container and install dependencies
ADD ./python_dependencies.txt /tmp
RUN pip3 install -r python_dependencies.txt
WORKDIR /
RUN rm -rf tmp
ENTRYPOINT ["/bin/bash"]

View file

@ -1,201 +0,0 @@
# Implementation submission form
## Submission details
**Submission date**: 10/22/2018
**Benchmark name:** OrangeJuice_Pt_3Weeks_Weekly
**Submitter(s):** Chenhui Hu
**Submitter(s) email:** chenhhu@microsoft.com
**Submission name:** LightGBM
**Submission path:** retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/LightGBM
## Implementation description
### Modelling approach
In this submission, we implement a boosted decision tree model using the Python package `lightgbm`, which is a fast, distributed, high-performance
gradient boosting framework based on decision tree algorithms.
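As a minimal sketch of the training step (the full pipeline is in `train_score.py`; the parameter values below are placeholders rather than the tuned settings), assuming `X_train` and `y_train` are the engineered features and target produced by `make_features.py`:
```python
import lightgbm as lgb

# Placeholder parameters; the tuned values are set in train_score.py
params = {"objective": "regression", "num_leaves": 64, "learning_rate": 0.05}
# X_train must be a pandas DataFrame for named categorical features to work
dtrain = lgb.Dataset(X_train, label=y_train, categorical_feature=["store", "brand"])
model = lgb.train(params, dtrain, num_boost_round=200)
forecasts = model.predict(X_test)  # X_test: features of the forecast weeks
```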
### Feature engineering
The following features have been used in the implementation of the forecast method (a rough pandas sketch of the lag and moving-average features follows the list):
- datetime features including week, week of the month, and month
- weekly sales of each orange juice in recent weeks
- average sales of each orange juice in recent weeks
- other features including *store*, *brand*, *deal*, *feat* columns and price features
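As a rough illustration of the lag and moving-average features (see `make_features.py` for the actual implementation), assuming `df` holds one store-brand's weekly sales in a `move` column sorted by week:
```python
import pandas as pd

# Lagged sales: the value observed 2, 3, ... weeks before the current week
for lag in range(2, 5):
    df[f"move_lag{lag}"] = df["move"].shift(lag)

# Moving average of past sales, shifted so that no future values leak in
df["move_mean"] = df["move"].shift(2).rolling(window=10, min_periods=1).mean()
```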
### Hyperparameter tuning
We tune the hyperparameters of the model with HyperDrive, which is accessible through the Azure ML SDK. A remote compute cluster with 16 CPU cores is created to distribute the computation. The hyperparameters tuned with HyperDrive and their ranges can be found in `hyperparameter_tuning.ipynb`; a condensed sketch of the search configuration follows.
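In outline (condensed from `hyperparameter_tuning.ipynb`, which contains the full search space; `est` is the Estimator wrapping `train_validate.py` and `exp` the Azure ML Experiment defined in that notebook):
```python
from azureml.train.hyperdrive import (
    BayesianParameterSampling, HyperDriveRunConfig, PrimaryMetricGoal,
    choice, quniform, uniform,
)

# A few of the tuned hyperparameters and their ranges
ps = BayesianParameterSampling({
    "--num-leaves": quniform(8, 128, 1),
    "--learning-rate": choice(1e-4, 1e-3, 1e-2, 1e-1),
    "--bagging-fraction": uniform(0.1, 1),
})
htc = HyperDriveRunConfig(
    estimator=est,
    hyperparameter_sampling=ps,
    primary_metric_name="MAPE",
    primary_metric_goal=PrimaryMetricGoal.MINIMIZE,
    max_total_runs=200,
    max_concurrent_runs=4,
)
htr = exp.submit(config=htc)
```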
### Description of implementation scripts
* `utils.py`: Python script including utility functions for building the model
* `train_score.py`: Python script that trains the model and generates forecast results for each round
* `train_score.ipynb` (optional): Jupyter notebook that trains the model and visualizes the results
* `train_validate.py` (optional): Python script that does training and validation with the 1st round training data
* `hyperparameter_tuning.ipynb` (optional): Jupyter notebook that tries different model configurations and selects the best model by running
`train_validate.py` script in a remote compute cluster with different sets of hyperparameters
### Steps to reproduce results
0. Follow the instructions [here](#resource-deployment-instructions) to provision a Linux virtual
machine and log into the provisioned VM.
1. Clone the Forecasting repo to the home directory of your machine
```bash
cd ~
git clone https://github.com/Microsoft/Forecasting.git
```
Use one of the following options to securely connect to the Git repo:
* [Personal Access Tokens](https://help.github.com/articles/creating-a-personal-access-token-for-the-command-line/)
For this method, the clone command becomes
```bash
git clone https://<username>:<personal access token>@github.com/Microsoft/Forecasting.git
```
* [Git Credential Managers](https://github.com/Microsoft/Git-Credential-Manager-for-Windows)
* [Authenticate with SSH](https://help.github.com/articles/connecting-to-github-with-ssh/)
2. Create a conda environment for running the scripts of data downloading, data preparation, and result evaluation. To do this, you need
to check if conda has been installed by running the command `conda -V`. If it is installed, you will see the conda version in the terminal. Otherwise, please follow the instructions [here](https://conda.io/docs/user-guide/install/linux.html) to install conda. Then, you can go to the `~/Forecasting` directory in the VM and create a conda environment named `tsperf` by
```bash
conda env create --file ./common/conda_dependencies.yml
```
This will create a conda environment with the Python and R packages listed in `conda_dependencies.yml` installed. The conda
environment name is also defined in the yml file.
3. Activate the conda environment and download the Orange Juice dataset. Use command `source activate tsperf` to activate the conda environment. Then, download the Orange Juice dataset by running the following command from `~/Forecasting` directory
```bash
Rscript ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/common/download_data.r
```
This will create a data directory `./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/data` and store the dataset in this directory. The dataset has two csv files - `yx.csv` and `storedemo.csv` which contain the sales information and store demographic information, respectively.
4. From `~/Forecasting` directory, run the following command to generate the training data and testing data for each forecast period:
```bash
python ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/common/serve_folds.py --test --save
```
This will generate 12 csv files named `train_round_#.csv` and 12 csv files named `test_round_#.csv` in two subfolders `/train` and
`/test` under the data directory, respectively. After running the above command, you can deactivate the conda environment by running
`source deactivate`.
5. Make sure Docker is installed
You can check if Docker is installed on your VM by running
```bash
sudo docker -v
```
You will see the Docker version if Docker is installed. If not, you can install it by following the instructions [here](https://docs.docker.com/install/linux/docker-ce/ubuntu/). Note that if you want to execute Docker commands without sudo as a non-root user, you need to create a Unix group and add users to it by following the instructions [here](https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user).
6. Build a local Docker image by running the following command from `~/Forecasting` directory
```bash
sudo docker build -t lightgbm_image:v1 ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/LightGBM
```
7. Choose a name for a new Docker container (e.g. lightgbm_container) and create it using command:
```bash
cd ~/Forecasting
sudo docker run -it -v ~/Forecasting:/Forecasting --name lightgbm_container lightgbm_image:v1
```
Note that the option `-v ~/Forecasting:/Forecasting` mounts the `~/Forecasting` folder (the one you cloned) into the container so that you will have
access to the source code in the container.
8. Train the model and make predictions from `/Forecasting` folder by running
```bash
cd /Forecasting
source ./common/train_score_vm ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/LightGBM Python3
```
This will generate 5 `submission_seed_<seed number>.csv` files in the submission directory, where \<seed number\>
is between 1 and 5. This command will also output 5 running times of `train_score.py`. The median of the times
reported in rows starting with 'real' should be compared against the wallclock time declared in the benchmark
submission. After generating the forecast results, you can exit the Docker container with the command `exit`.
9. Activate conda environment again by `source activate tsperf`. Then, evaluate the benchmark quality by running
```bash
source ./common/evaluate ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/LightGBM ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly
```
This command will output 5 benchmark quality values (MAPEs). Their median should be compared against the
benchmark quality declared in the benchmark submission.
## Implementation resources
**Platform:** Azure Cloud
**Resource location:** East US
**Hardware:** Standard D2s v3 (2 vcpus, 8 GB memory, 16 GB temporary storage) Ubuntu Linux VM
**Data storage:** Premium SSD
**Dockerfile:** [retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/LightGBM/Dockerfile](https://github.com/Microsoft/Forecasting/blob/master/retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/LightGBM/Dockerfile)
**Key packages/dependencies:**
* Python
- pandas==0.23.1
- scikit-learn==0.19.1
- lightgbm==2.1.2
## Resource deployment instructions
We use an Azure Linux VM to develop the baseline methods. Please follow the instructions below to deploy the resource.
* Azure Linux VM deployment
- Create an Azure account and log into the [Azure portal](https://portal.azure.com/)
- Refer to the steps [here](https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/dsvm-ubuntu-intro) to deploy a Data
Science Virtual Machine for Linux (Ubuntu). Select *D2s_v3* as the virtual machine size.
## Implementation evaluation
**Quality:**
*MAPE run 1: 35.91%*
*MAPE run 2: 36.28%*
*MAPE run 3: 35.99%*
*MAPE run 4: 36.49%*
*MAPE run 5: 36.57%*
*median MAPE: 36.28%*
**Time:**
*run time 1: 613.33 seconds*
*run time 2: 619.37 seconds*
*run time 3: 655.50 seconds*
*run time 4: 625.10 seconds*
*run time 5: 647.46 seconds*
*median run time: 625.10 seconds*
**Cost:** The hourly cost of the D2s v3 Ubuntu Linux VM in East US Azure region is 0.096 USD, based on the price at the submission date. Thus, the total cost is 625.10/3600 × 0.096 = $0.0167.

View file

@ -1,5 +0,0 @@
{
"subscription_id": "<subscription id placeholder>",
"resource_group": "tsperf",
"workspace_name": "chhws"
}

View file

@ -1,449 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tuning Hyperparameters of LightGBM Model with AML SDK and HyperDrive\n",
"\n",
"This notebook performs hyperparameter tuning of LightGBM model with AML SDK and HyperDrive. It selects the best model by cross validation using the training data in the first forecast round. Specifically, it splits the training data into sub-training data and validation data. Then, it trains LightGBM models with different sets of hyperparameters using the sub-training data and evaluate the accuracy of each model with the validation data. The set of hyperparameters which yield the best validation accuracy will be used to train models and forecast sales across all 12 forecast rounds.\n",
"\n",
"## Prerequisites\n",
"To run this notebook, you need to install AML SDK and its widget extension in your environment by running the following commands in a terminal. Before running the commands, you need to activate your environment by executing `source activate <your env>` in a Linux VM. \n",
"`pip3 install --upgrade azureml-sdk[notebooks,automl]` \n",
"`jupyter nbextension install --py --user azureml.widgets` \n",
"`jupyter nbextension enable --py --user azureml.widgets` \n",
"\n",
"To add the environment to your Jupyter kernels, you can do `python3 -m ipykernel install --name <your env>`. Besides, you need to create an Azure ML workspace and download its configuration file (`config.json`) by following the [configuration.ipynb](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml\n",
"from azureml.core import Workspace, Run\n",
"\n",
"# Check core SDK version number\n",
"print(\"Azure ML SDK Version: \", azureml.core.VERSION)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.telemetry import set_diagnostics_collection\n",
"\n",
"# Opt-in diagnostics for better experience of future releases\n",
"set_diagnostics_collection(send_diagnostics=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialize Workspace & Create an Azure ML Experiment\n",
"\n",
"Initialize a [Machine Learning Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the workspace you created in the Prerequisites step. `Workspace.from_config()` below creates a workspace object from the details stored in `config.json` that you have downloaded."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.workspace import Workspace\n",
"\n",
"ws = Workspace.from_config()\n",
"print('Workspace name: ' + ws.name, \n",
" 'Azure region: ' + ws.location, \n",
" 'Resource group: ' + ws.resource_group, sep = '\\n')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Experiment\n",
"\n",
"exp = Experiment(workspace=ws, name='tune_lgbm')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Validate Script Locally"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.runconfig import RunConfiguration\n",
"\n",
"# Configure local, user managed environment\n",
"run_config_user_managed = RunConfiguration()\n",
"run_config_user_managed.environment.python.user_managed_dependencies = True\n",
"run_config_user_managed.environment.python.interpreter_path = '/usr/bin/python3.5'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import ScriptRunConfig\n",
"\n",
"# Please update data-folder argument before submitting the job\n",
"src = ScriptRunConfig(source_directory='./', \n",
" script='train_validate.py', \n",
" arguments=['--data-folder', \n",
" '/home/chenhui/TSPerf/retail_sales/OrangeJuice_Pt_3Weeks_Weekly/data/', \n",
" '--bagging-fraction', '0.8'],\n",
" run_config=run_config_user_managed)\n",
"run_local = exp.submit(src)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check job status\n",
"run_local.get_status()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check results\n",
"while(run_local.get_status() != 'Completed'): {}\n",
"run_local.get_details()\n",
"run_local.get_metrics()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Run Script on Remote Compute Target"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a CPU cluster as compute target"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.compute import ComputeTarget, AmlCompute\n",
"from azureml.core.compute_target import ComputeTargetException\n",
"\n",
"# Choose a name for your cluster\n",
"cluster_name = \"cpucluster\"\n",
"\n",
"try:\n",
" # Look for the existing cluster by name\n",
" compute_target = ComputeTarget(workspace=ws, name=cluster_name)\n",
" if type(compute_target) is AmlCompute:\n",
" print('Found existing compute target {}.'.format(cluster_name))\n",
" else:\n",
" print('{} exists but it is not an AML Compute target. Please choose a different name.'.format(cluster_name))\n",
"except ComputeTargetException:\n",
" print('Creating a new compute target...')\n",
" compute_config = AmlCompute.provisioning_configuration(vm_size=\"STANDARD_D14_v2\", # CPU-based VM\n",
" #vm_priority='lowpriority', # optional\n",
" min_nodes=0, \n",
" max_nodes=4,\n",
" idle_seconds_before_scaledown=3600)\n",
" # Create the cluster\n",
" compute_target = ComputeTarget.create(ws, cluster_name, compute_config)\n",
" # Can poll for a minimum number of nodes and for a specific timeout. \n",
" # if no min node count is provided it uses the scale settings for the cluster\n",
" compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)\n",
" # Get a detailed status for the current cluster. \n",
" print(compute_target.serialize())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# If you have created the compute target, you should see one entry named 'cpucluster' of type AmlCompute \n",
"# in the workspace's compute_targets property.\n",
"compute_targets = ws.compute_targets\n",
"for name, ct in compute_targets.items():\n",
" print(name, ct.type, ct.provisioning_state)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure Docker environment"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.runconfig import EnvironmentDefinition\n",
"from azureml.core.conda_dependencies import CondaDependencies\n",
"\n",
"env = EnvironmentDefinition()\n",
"env.python.user_managed_dependencies = False\n",
"env.python.conda_dependencies = CondaDependencies.create(conda_packages=['pandas', 'numpy', 'scipy', 'scikit-learn', 'lightgbm', 'joblib'],\n",
" python_version='3.6.2')\n",
"env.python.conda_dependencies.add_channel('conda-forge')\n",
"env.docker.enabled=True"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Upload data to default datastore\n",
"\n",
"Upload the Orange Juice dataset to the workspace's default datastore, which will later be mounted on the cluster for model training and validation. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ds = ws.get_default_datastore()\n",
"print(ds.datastore_type, ds.account_name, ds.container_name)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"path_on_datastore = 'data'\n",
"ds.upload(src_dir='../../data', target_path=path_on_datastore, overwrite=True, show_progress=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Get data reference object for the data path\n",
"ds_data = ds.path(path_on_datastore)\n",
"print(ds_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create estimator\n",
"Next, we will check if the remote compute target is successfully created by submitting a job to the target. This compute target will be used by HyperDrive to tune the hyperparameters later. You may skip this part of code and directly jump into [Tune Hyperparameters using HyperDrive](#tune-hyperparameters-using-hyperdrive)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.runconfig import EnvironmentDefinition\n",
"from azureml.train.estimator import Estimator\n",
"\n",
"script_folder = './'\n",
"script_params = {\n",
" '--data-folder': ds_data.as_mount(),\n",
" '--bagging-fraction': 0.8\n",
"}\n",
"est = Estimator(source_directory=script_folder,\n",
" script_params=script_params,\n",
" compute_target=compute_target,\n",
" use_docker=True,\n",
" entry_script='train_validate.py',\n",
" environment_definition=env)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit job"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Submit job to compute target\n",
"run_remote = exp.submit(config=est)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Check job status"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.widgets import RunDetails\n",
"\n",
"RunDetails(run_remote).show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"run_remote.get_details()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Get metric value after the job finishes \n",
"while(run_remote.get_status() != 'Completed'): {}\n",
"run_remote.get_metrics()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='tune-hyperparameters-using-hyperdrive'></a>\n",
"## Tune Hyperparameters using HyperDrive"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.train.hyperdrive import *\n",
"\n",
"script_folder = './'\n",
"script_params = {\n",
" '--data-folder': ds_data.as_mount() \n",
"}\n",
"est = Estimator(source_directory=script_folder,\n",
" script_params=script_params,\n",
" compute_target=compute_target,\n",
" use_docker=True,\n",
" entry_script='train_validate.py',\n",
" environment_definition=env)\n",
"ps = BayesianParameterSampling({\n",
" '--num-leaves': quniform(8, 128, 1),\n",
" '--min-data-in-leaf': quniform(20, 500, 10),\n",
" '--learning-rate': choice(1e-4, 1e-3, 5e-3, 1e-2, 1.5e-2, 2e-2, 3e-2, 5e-2, 1e-1),\n",
" '--feature-fraction': uniform(0.2, 1), \n",
" '--bagging-fraction': uniform(0.1, 1), \n",
" '--bagging-freq': quniform(1, 20, 1), \n",
" '--max-rounds': quniform(50, 2000, 10),\n",
" '--max-lag': quniform(3, 40, 1), \n",
" '--window-size': quniform(3, 40, 1), \n",
"})\n",
"htc = HyperDriveRunConfig(estimator=est, \n",
" hyperparameter_sampling=ps, \n",
" primary_metric_name='MAPE', \n",
" primary_metric_goal=PrimaryMetricGoal.MINIMIZE, \n",
" max_total_runs=200,\n",
" max_concurrent_runs=4)\n",
"htr = exp.submit(config=htc)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"RunDetails(htr).show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"while(htr.get_status() != 'Completed'): {}\n",
"htr.get_metrics()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"best_run = htr.get_best_run_by_primary_metric()\n",
"parameter_values = best_run.get_details()['runDefinition']['Arguments']\n",
"print(parameter_values)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View file

@ -1,150 +0,0 @@
# coding: utf-8
# Create input features for the boosted decision tree model.
import os
import sys
import math
import itertools
import datetime
import numpy as np
import pandas as pd
import lightgbm as lgb
# Append TSPerf path to sys.path
tsperf_dir = os.getcwd()
if tsperf_dir not in sys.path:
sys.path.append(tsperf_dir)
# Import TSPerf components
from utils import *
import retail_sales.OrangeJuice_Pt_3Weeks_Weekly.common.benchmark_settings as bs
def lagged_features(df, lags):
"""Create lagged features based on time series data.
Args:
df (Dataframe): Input time series data sorted by time
lags (List): Lag lengths
Returns:
fea (Dataframe): Lagged features
"""
df_list = []
for lag in lags:
df_shifted = df.shift(lag)
df_shifted.columns = [x + "_lag" + str(lag) for x in df_shifted.columns]
df_list.append(df_shifted)
fea = pd.concat(df_list, axis=1)
return fea
def moving_averages(df, start_step, window_size=None):
"""Compute averages of every feature over moving time windows.
Args:
df (Dataframe): Input features as a dataframe
start_step (Integer): Starting time step of rolling mean
window_size (Integer): Windows size of rolling mean
Returns:
fea (Dataframe): Dataframe consisting of the moving averages
"""
if window_size is None:  # Use a large window to compute average over all historical data
window_size = df.shape[0]
fea = df.shift(start_step).rolling(min_periods=1, center=False, window=window_size).mean()
fea.columns = fea.columns + "_mean"
return fea
def combine_features(df, lag_fea, lags, window_size, used_columns):
"""Combine different features for a certain store-brand.
Args:
df (Dataframe): Time series data of a certain store-brand
lag_fea (List): A list of column names for creating lagged features
lags (Numpy Array): Numpy array including all the lags
window_size (Integer): Windows size of rolling mean
used_columns (List): A list of names of columns used in model training (including target variable)
Returns:
fea_all (Dataframe): Dataframe including all features for the specific store-brand
"""
lagged_fea = lagged_features(df[lag_fea], lags)
moving_avg = moving_averages(df[lag_fea], 2, window_size)
fea_all = pd.concat([df[used_columns], lagged_fea, moving_avg], axis=1)
return fea_all
def make_features(pred_round, train_dir, lags, window_size, offset, used_columns, store_list, brand_list):
"""Create a dataframe of the input features.
Args:
pred_round (Integer): Prediction round
train_dir (String): Path of the training data directory
lags (Numpy Array): Numpy array including all the lags
window_size (Integer): Maximum step for computing the moving average
offset (Integer): Length of training data skipped in the retraining
used_columns (List): A list of names of columns used in model training (including target variable)
store_list (Numpy Array): List of all the store IDs
brand_list (Numpy Array): List of all the brand IDs
Returns:
features (Dataframe): Dataframe including all the input features and target variable
"""
# Load training data
train_df = pd.read_csv(os.path.join(train_dir, "train_round_" + str(pred_round + 1) + ".csv"))
train_df["move"] = train_df["logmove"].apply(lambda x: round(math.exp(x)))
train_df = train_df[["store", "brand", "week", "move"]]
# Create a dataframe to hold all necessary data
week_list = range(bs.TRAIN_START_WEEK + offset, bs.TEST_END_WEEK_LIST[pred_round] + 1)
d = {"store": store_list, "brand": brand_list, "week": week_list}
data_grid = df_from_cartesian_product(d)
data_filled = pd.merge(data_grid, train_df, how="left", on=["store", "brand", "week"])
# Get future price, deal, and advertisement info
aux_df = pd.read_csv(os.path.join(train_dir, "aux_round_" + str(pred_round + 1) + ".csv"))
data_filled = pd.merge(data_filled, aux_df, how="left", on=["store", "brand", "week"])
# Create relative price feature
price_cols = [
"price1",
"price2",
"price3",
"price4",
"price5",
"price6",
"price7",
"price8",
"price9",
"price10",
"price11",
]
data_filled["price"] = data_filled.apply(lambda x: x.loc["price" + str(int(x.loc["brand"]))], axis=1)
data_filled["avg_price"] = data_filled[price_cols].sum(axis=1).apply(lambda x: x / len(price_cols))
data_filled["price_ratio"] = data_filled["price"] / data_filled["avg_price"]
data_filled.drop(price_cols, axis=1, inplace=True)
# Fill missing values
data_filled = data_filled.groupby(["store", "brand"]).apply(
lambda x: x.fillna(method="ffill").fillna(method="bfill")
)
# Create datetime features
data_filled["week_start"] = data_filled["week"].apply(
lambda x: bs.FIRST_WEEK_START + datetime.timedelta(days=(x - 1) * 7)
)
data_filled["year"] = data_filled["week_start"].apply(lambda x: x.year)
data_filled["month"] = data_filled["week_start"].apply(lambda x: x.month)
data_filled["week_of_month"] = data_filled["week_start"].apply(lambda x: week_of_month(x))
data_filled["day"] = data_filled["week_start"].apply(lambda x: x.day)
data_filled.drop("week_start", axis=1, inplace=True)
# Create other features (lagged features, moving averages, etc.)
features = data_filled.groupby(["store", "brand"]).apply(
lambda x: combine_features(x, ["move"], lags, window_size, used_columns)
)
return features

View file

@ -1,201 +0,0 @@
# coding: utf-8
# Create input features for the boosted decision tree model.
import os
import sys
import math
import datetime
import pandas as pd
from sklearn.pipeline import Pipeline
from common.features.lag import LagFeaturizer
from common.features.rolling_window import RollingWindowFeaturizer
from common.features.stats import PopularityFeaturizer
from common.features.temporal import TemporalFeaturizer
# Append TSPerf path to sys.path
tsperf_dir = os.getcwd()
if tsperf_dir not in sys.path:
sys.path.append(tsperf_dir)
# Import TSPerf components
from utils import df_from_cartesian_product
import retail_sales.OrangeJuice_Pt_3Weeks_Weekly.common.benchmark_settings as bs
pd.set_option("display.max_columns", None)
def oj_preprocess(df, aux_df, week_list, store_list, brand_list, train_df=None):
df["move"] = df["logmove"].apply(lambda x: round(math.exp(x)))
df = df[["store", "brand", "week", "move"]].copy()
# Create a dataframe to hold all necessary data
d = {"store": store_list, "brand": brand_list, "week": week_list}
data_grid = df_from_cartesian_product(d)
data_filled = pd.merge(data_grid, df, how="left", on=["store", "brand", "week"])
# Get future price, deal, and advertisement info
data_filled = pd.merge(data_filled, aux_df, how="left", on=["store", "brand", "week"])
# Fill missing values
if train_df is not None:
data_filled = pd.concat([train_df, data_filled])
forecast_creation_time = train_df["week_start"].max()
data_filled = data_filled.groupby(["store", "brand"]).apply(
lambda x: x.fillna(method="ffill").fillna(method="bfill")
)
data_filled["week_start"] = data_filled["week"].apply(
lambda x: bs.FIRST_WEEK_START + datetime.timedelta(days=(x - 1) * 7)
)
if train_df is not None:
data_filled = data_filled.loc[data_filled["week_start"] > forecast_creation_time].copy()
return data_filled
def make_features(
pred_round, train_dir, lags, window_size, offset, used_columns, store_list, brand_list,
):
"""Create a dataframe of the input features.
Args:
pred_round (Integer): Prediction round
train_dir (String): Path of the training data directory
lags (Numpy Array): Numpy array including all the lags
window_size (Integer): Maximum step for computing the moving average
offset (Integer): Length of training data skipped in the retraining
used_columns (List): A list of names of columns used in model training
(including target variable)
store_list (Numpy Array): List of all the store IDs
brand_list (Numpy Array): List of all the brand IDs
Returns:
features (Dataframe): Dataframe including all the input features and
target variable
"""
# Load training data
train_df = pd.read_csv(os.path.join(train_dir, "train_round_" + str(pred_round + 1) + ".csv"))
aux_df = pd.read_csv(os.path.join(train_dir, "aux_round_" + str(pred_round + 1) + ".csv"))
week_list = range(bs.TRAIN_START_WEEK + offset, bs.TEST_END_WEEK_LIST[pred_round] + 1)
train_df_preprocessed = oj_preprocess(train_df, aux_df, week_list, store_list, brand_list)
df_config = {
"time_col_name": "week_start",
"ts_id_col_names": ["brand", "store"],
"target_col_name": "move",
"frequency": "W",
"time_format": "%Y-%m-%d",
}
temporal_featurizer = TemporalFeaturizer(df_config=df_config, feature_list=["month_of_year", "week_of_month"])
popularity_featurizer = PopularityFeaturizer(
df_config=df_config,
id_col_name="brand",
data_format="wide",
feature_col_name="price",
wide_col_names=[
"price1",
"price2",
"price3",
"price4",
"price5",
"price6",
"price7",
"price8",
"price9",
"price10",
"price11",
],
output_col_name="price_ratio",
return_feature_col=True,
)
lag_featurizer = LagFeaturizer(df_config=df_config, input_col_names="move", lags=lags, future_value_available=True,)
moving_average_featurizer = RollingWindowFeaturizer(
df_config=df_config,
input_col_names="move",
window_size=window_size,
window_args={"min_periods": 1, "center": False},
future_value_available=True,
rolling_gap=2,
)
feature_engineering_pipeline = Pipeline(
[
("temporal", temporal_featurizer),
("popularity", popularity_featurizer),
("lag", lag_featurizer),
("moving_average", moving_average_featurizer),
]
)
features = feature_engineering_pipeline.transform(train_df_preprocessed)
# Temporary code for result verification
features.rename(
mapper={
"move_lag_2": "move_lag2",
"move_lag_3": "move_lag3",
"move_lag_4": "move_lag4",
"move_lag_5": "move_lag5",
"move_lag_6": "move_lag6",
"move_lag_7": "move_lag7",
"move_lag_8": "move_lag8",
"move_lag_9": "move_lag9",
"move_lag_10": "move_lag10",
"move_lag_11": "move_lag11",
"move_lag_12": "move_lag12",
"move_lag_13": "move_lag13",
"move_lag_14": "move_lag14",
"move_lag_15": "move_lag15",
"move_lag_16": "move_lag16",
"move_lag_17": "move_lag17",
"move_lag_18": "move_lag18",
"move_lag_19": "move_lag19",
"month_of_year": "month",
},
axis=1,
inplace=True,
)
features = features[
[
"store",
"brand",
"week",
"week_of_month",
"month",
"deal",
"feat",
"move",
"price",
"price_ratio",
"move_lag2",
"move_lag3",
"move_lag4",
"move_lag5",
"move_lag6",
"move_lag7",
"move_lag8",
"move_lag9",
"move_lag10",
"move_lag11",
"move_lag12",
"move_lag13",
"move_lag14",
"move_lag15",
"move_lag16",
"move_lag17",
"move_lag18",
"move_lag19",
"move_mean",
]
]
return features

View file

@@ -1,5 +0,0 @@
numpy==1.14.2
scipy==1.0.1
pandas==0.23.1
scikit-learn==0.19.1
lightgbm==2.1.2

File diff is hidden because one or more lines are too long

View file

@@ -1,137 +0,0 @@
# coding: utf-8
# Train and score a boosted decision tree model using [LightGBM Python package](https://github.com/Microsoft/LightGBM) from Microsoft,
# which is a fast, distributed, high performance gradient boosting framework based on decision tree algorithms.
import os
import sys
import argparse
import numpy as np
import pandas as pd
import lightgbm as lgb
import warnings
warnings.filterwarnings("ignore")
# Append TSPerf path to sys.path
tsperf_dir = os.getcwd()
if tsperf_dir not in sys.path:
sys.path.append(tsperf_dir)
from make_features import make_features
import retail_sales.OrangeJuice_Pt_3Weeks_Weekly.common.benchmark_settings as bs
def make_predictions(df, model):
"""Predict sales with the trained GBM model.
Args:
df (Dataframe): Dataframe including all needed features
model (Model): Trained GBM model
Returns:
Dataframe including the predicted sales of every store-brand
"""
predictions = pd.DataFrame({"move": model.predict(df.drop("move", axis=1))})
predictions["move"] = predictions["move"].apply(lambda x: round(x))
return pd.concat([df[["brand", "store", "week"]].reset_index(drop=True), predictions], axis=1)
if __name__ == "__main__":
# Parse input arguments
parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=int, dest="seed", default=1, help="Random seed of GBM model")
parser.add_argument("--num-leaves", type=int, dest="num_leaves", default=124, help="# of leaves of the tree")
parser.add_argument(
"--min-data-in-leaf", type=int, dest="min_data_in_leaf", default=340, help="minimum # of samples in each leaf"
)
parser.add_argument("--learning-rate", type=float, dest="learning_rate", default=0.1, help="learning rate")
parser.add_argument(
"--feature-fraction",
type=float,
dest="feature_fraction",
default=0.65,
help="ratio of features used in each iteration",
)
parser.add_argument(
"--bagging-fraction",
type=float,
dest="bagging_fraction",
default=0.87,
help="ratio of samples used in each iteration",
)
parser.add_argument("--bagging-freq", type=int, dest="bagging_freq", default=19, help="bagging frequency")
parser.add_argument("--max-rounds", type=int, dest="max_rounds", default=940, help="# of boosting iterations")
parser.add_argument("--max-lag", type=int, dest="max_lag", default=19, help="max lag of unit sales")
parser.add_argument(
"--window-size", type=int, dest="window_size", default=40, help="window size of moving average of unit sales"
)
args = parser.parse_args()
print(args)
# Data paths
DATA_DIR = os.path.join(tsperf_dir, "retail_sales", "OrangeJuice_Pt_3Weeks_Weekly", "data")
SUBMISSION_DIR = os.path.join(tsperf_dir, "retail_sales", "OrangeJuice_Pt_3Weeks_Weekly", "submissions", "LightGBM")
TRAIN_DIR = os.path.join(DATA_DIR, "train")
# Parameters of GBM model
params = {
"objective": "mape",
"num_leaves": args.num_leaves,
"min_data_in_leaf": args.min_data_in_leaf,
"learning_rate": args.learning_rate,
"feature_fraction": args.feature_fraction,
"bagging_fraction": args.bagging_fraction,
"bagging_freq": args.bagging_freq,
"num_rounds": args.max_rounds,
"early_stopping_rounds": 125,
"num_threads": 4,
"seed": args.seed,
}
# Lags and categorical features
lags = np.arange(2, args.max_lag + 1)
used_columns = ["store", "brand", "week", "week_of_month", "month", "deal", "feat", "move", "price", "price_ratio"]
categ_fea = ["store", "brand", "deal"]
# Get unique stores and brands
train_df = pd.read_csv(os.path.join(TRAIN_DIR, "train_round_1.csv"))
store_list = train_df["store"].unique()
brand_list = train_df["brand"].unique()
# Train and predict for all forecast rounds
pred_all = []
metric_all = []
for r in range(bs.NUM_ROUNDS):
print("---- Round " + str(r + 1) + " ----")
# Create features
features = make_features(r, TRAIN_DIR, lags, args.window_size, 0, used_columns, store_list, brand_list)
train_fea = features[features.week <= bs.TRAIN_END_WEEK_LIST[r]].reset_index(drop=True)
# Drop rows with NaN values
train_fea.dropna(inplace=True)
# Create training set
dtrain = lgb.Dataset(train_fea.drop("move", axis=1, inplace=False), label=train_fea["move"])
if r % 3 == 0:
# Train GBM model
print("Training model...")
bst = lgb.train(params, dtrain, valid_sets=[dtrain], categorical_feature=categ_fea, verbose_eval=False)
# Generate forecasts
print("Making predictions...")
test_fea = features[features.week >= bs.TEST_START_WEEK_LIST[r]].reset_index(drop=True)
pred = make_predictions(test_fea, bst).sort_values(by=["store", "brand", "week"]).reset_index(drop=True)
# Additional columns required by the submission format
pred["round"] = r + 1
pred["weeks_ahead"] = pred["week"] - bs.TRAIN_END_WEEK_LIST[r]
# Keep the predictions
pred_all.append(pred)
# Generate submission
submission = pd.concat(pred_all, axis=0)
submission.rename(columns={"move": "prediction"}, inplace=True)
submission = submission[["round", "store", "brand", "week", "weeks_ahead", "prediction"]]
filename = "submission_seed_" + str(args.seed) + ".csv"
submission.to_csv(os.path.join(SUBMISSION_DIR, filename), index=False)

View file

@@ -1,241 +0,0 @@
# coding: utf-8
# Perform cross validation of a boosted decision tree model on the training data of the 1st forecast round.
import os
import sys
import math
import argparse
import datetime
import itertools
import numpy as np
import pandas as pd
import lightgbm as lgb
from azureml.core import Run
from sklearn.model_selection import train_test_split
from utils import week_of_month, df_from_cartesian_product
def lagged_features(df, lags):
"""Create lagged features based on time series data.
Args:
df (Dataframe): Input time series data sorted by time
lags (List): Lag lengths
Returns:
fea (Dataframe): Lagged features
"""
df_list = []
for lag in lags:
df_shifted = df.shift(lag)
df_shifted.columns = [x + "_lag" + str(lag) for x in df_shifted.columns]
df_list.append(df_shifted)
fea = pd.concat(df_list, axis=1)
return fea
def moving_averages(df, start_step, window_size=None):
"""Compute averages of every feature over moving time windows.
Args:
df (Dataframe): Input features as a dataframe
start_step (Integer): Starting time step of rolling mean
window_size (Integer): Window size of rolling mean
Returns:
fea (Dataframe): Dataframe consisting of the moving averages
"""
if window_size is None:  # Use a large window to compute average over all historical data
window_size = df.shape[0]
fea = df.shift(start_step).rolling(min_periods=1, center=False, window=window_size).mean()
fea.columns = fea.columns + "_mean"
return fea
def combine_features(df, lag_fea, lags, window_size, used_columns):
"""Combine different features for a certain store-brand.
Args:
df (Dataframe): Time series data of a certain store-brand
lag_fea (List): A list of column names for creating lagged features
lags (Numpy Array): Numpy array including all the lags
window_size (Integer): Window size of rolling mean
used_columns (List): A list of names of columns used in model training (including target variable)
Returns:
fea_all (Dataframe): Dataframe including all features for the specific store-brand
"""
lagged_fea = lagged_features(df[lag_fea], lags)
moving_avg = moving_averages(df[lag_fea], 2, window_size)
fea_all = pd.concat([df[used_columns], lagged_fea, moving_avg], axis=1)
return fea_all
def make_predictions(df, model):
"""Predict sales with the trained GBM model.
Args:
df (Dataframe): Dataframe including all needed features
model (Model): Trained GBM model
Returns:
Dataframe including the predicted sales of a certain store-brand
"""
predictions = pd.DataFrame({"move": model.predict(df.drop("move", axis=1))})
predictions["move"] = predictions["move"].apply(lambda x: round(x))
return pd.concat([df[["brand", "store", "week"]].reset_index(drop=True), predictions], axis=1)
if __name__ == "__main__":
# Parse input arguments
parser = argparse.ArgumentParser()
parser.add_argument("--data-folder", type=str, dest="data_folder", default=".", help="data folder mounting point")
parser.add_argument("--num-leaves", type=int, dest="num_leaves", default=64, help="# of leaves of the tree")
parser.add_argument(
"--min-data-in-leaf", type=int, dest="min_data_in_leaf", default=50, help="minimum # of samples in each leaf"
)
parser.add_argument("--learning-rate", type=float, dest="learning_rate", default=0.001, help="learning rate")
parser.add_argument(
"--feature-fraction",
type=float,
dest="feature_fraction",
default=1.0,
help="ratio of features used in each iteration",
)
parser.add_argument(
"--bagging-fraction",
type=float,
dest="bagging_fraction",
default=1.0,
help="ratio of samples used in each iteration",
)
parser.add_argument("--bagging-freq", type=int, dest="bagging_freq", default=1, help="bagging frequency")
parser.add_argument("--max-rounds", type=int, dest="max_rounds", default=400, help="# of boosting iterations")
parser.add_argument("--max-lag", type=int, dest="max_lag", default=10, help="max lag of unit sales")
parser.add_argument(
"--window-size", type=int, dest="window_size", default=10, help="window size of moving average of unit sales"
)
args = parser.parse_args()
args.feature_fraction = round(args.feature_fraction, 2)
args.bagging_fraction = round(args.bagging_fraction, 2)
print(args)
# Start an Azure ML run
run = Run.get_context()
# Data paths
DATA_DIR = args.data_folder
TRAIN_DIR = os.path.join(DATA_DIR, "train")
# Data and forecast problem parameters
TRAIN_START_WEEK = 40
TRAIN_END_WEEK_LIST = list(range(135, 159, 2))
TEST_START_WEEK_LIST = list(range(137, 161, 2))
TEST_END_WEEK_LIST = list(range(138, 162, 2))
# The start datetime of the first week in the record
FIRST_WEEK_START = pd.to_datetime("1989-09-14 00:00:00")
# Parameters of GBM model
params = {
"objective": "mape",
"num_leaves": args.num_leaves,
"min_data_in_leaf": args.min_data_in_leaf,
"learning_rate": args.learning_rate,
"feature_fraction": args.feature_fraction,
"bagging_fraction": args.bagging_fraction,
"bagging_freq": args.bagging_freq,
"num_rounds": args.max_rounds,
"early_stopping_rounds": 125,
"num_threads": 16,
}
# Lags and used column names
lags = np.arange(2, args.max_lag + 1)
used_columns = ["store", "brand", "week", "week_of_month", "month", "deal", "feat", "move", "price", "price_ratio"]
categ_fea = ["store", "brand", "deal"]
# Train and validate the model using only the first round data
r = 0
print("---- Round " + str(r + 1) + " ----")
# Load training data
train_df = pd.read_csv(os.path.join(TRAIN_DIR, "train_round_" + str(r + 1) + ".csv"))
train_df["move"] = train_df["logmove"].apply(lambda x: round(math.exp(x)))
train_df = train_df[["store", "brand", "week", "move"]]
# Create a dataframe to hold all necessary data
store_list = train_df["store"].unique()
brand_list = train_df["brand"].unique()
week_list = range(TRAIN_START_WEEK, TEST_END_WEEK_LIST[r] + 1)
d = {"store": store_list, "brand": brand_list, "week": week_list}
data_grid = df_from_cartesian_product(d)
data_filled = pd.merge(data_grid, train_df, how="left", on=["store", "brand", "week"])
# Get future price, deal, and advertisement info
aux_df = pd.read_csv(os.path.join(TRAIN_DIR, "aux_round_" + str(r + 1) + ".csv"))
data_filled = pd.merge(data_filled, aux_df, how="left", on=["store", "brand", "week"])
# Create relative price feature
price_cols = [
"price1",
"price2",
"price3",
"price4",
"price5",
"price6",
"price7",
"price8",
"price9",
"price10",
"price11",
]
data_filled["price"] = data_filled.apply(lambda x: x.loc["price" + str(int(x.loc["brand"]))], axis=1)
data_filled["avg_price"] = data_filled[price_cols].sum(axis=1).apply(lambda x: x / len(price_cols))
data_filled["price_ratio"] = data_filled["price"] / data_filled["avg_price"]
data_filled.drop(price_cols, axis=1, inplace=True)
# Fill missing values
data_filled = data_filled.groupby(["store", "brand"]).apply(
lambda x: x.fillna(method="ffill").fillna(method="bfill")
)
# Create datetime features
data_filled["week_start"] = data_filled["week"].apply(
lambda x: FIRST_WEEK_START + datetime.timedelta(days=(x - 1) * 7)
)
data_filled["year"] = data_filled["week_start"].apply(lambda x: x.year)
data_filled["month"] = data_filled["week_start"].apply(lambda x: x.month)
data_filled["week_of_month"] = data_filled["week_start"].apply(lambda x: week_of_month(x))
data_filled["day"] = data_filled["week_start"].apply(lambda x: x.day)
data_filled.drop("week_start", axis=1, inplace=True)
# Create other features (lagged features, moving averages, etc.)
features = data_filled.groupby(["store", "brand"]).apply(
lambda x: combine_features(x, ["move"], lags, args.window_size, used_columns)
)
train_fea = features[features.week <= TRAIN_END_WEEK_LIST[r]].reset_index(drop=True)
# Drop rows with NaN values
train_fea.dropna(inplace=True)
# Model training and validation
# Create a training/validation split
train_fea, valid_fea, train_label, valid_label = train_test_split(
train_fea.drop("move", axis=1, inplace=False), train_fea["move"], test_size=0.05, random_state=1
)
dtrain = lgb.Dataset(train_fea, train_label)
dvalid = lgb.Dataset(valid_fea, valid_label)
# A dictionary to record training results
evals_result = {}
# Train GBM model
bst = lgb.train(
params, dtrain, valid_sets=[dtrain, dvalid], categorical_feature=categ_fea, evals_result=evals_result
)
# Get final training loss & validation loss
train_loss = evals_result["training"]["mape"][-1]
valid_loss = evals_result["valid_1"]["mape"][-1]
print("Final training loss is {}".format(train_loss))
print("Final validation loss is {}".format(valid_loss))
# Log the validation loss/MAPE
run.log("MAPE", np.float(valid_loss) * 100)

View file

@@ -1,41 +0,0 @@
# coding: utf-8
# Utility functions for building the boosted decision tree model.
import pandas as pd
def week_of_month(dt):
"""Get the week of the month for the specified date.
Args:
dt (Datetime): Input date
Returns:
wom (Integer): Week of the month of the input date
"""
from math import ceil
first_day = dt.replace(day=1)
dom = dt.day
adjusted_dom = dom + first_day.weekday()
wom = int(ceil(adjusted_dom / 7.0))
return wom
def df_from_cartesian_product(dict_in):
"""Generate a Pandas dataframe from Cartesian product of lists.
Args:
dict_in (Dictionary): Dictionary containing multiple lists
Returns:
df (Dataframe): Dataframe corresponding to the Cartesian product of the lists
"""
from collections import OrderedDict
from itertools import product
od = OrderedDict(sorted(dict_in.items()))
cart = list(product(*od.values()))
df = pd.DataFrame(cart, columns=od.keys())
return df

View file

@@ -1,51 +0,0 @@
## Download base image
FROM ubuntu:16.04
WORKDIR /tmp
## Install basic packages
RUN apt-get update && apt-get install -y --no-install-recommends \
wget \
zlib1g-dev \
libssl-dev \
libssh2-1-dev \
libcurl4-openssl-dev \
libreadline-gplv2-dev \
libncursesw5-dev \
libsqlite3-dev \
tk-dev \
libgdbm-dev \
libc6-dev \
libbz2-dev \
libffi-dev \
bzip2 \
build-essential \
checkinstall \
ca-certificates \
curl \
lsb-release \
apt-utils \
python3-pip \
vim
## Install R
ENV R_BASE_VERSION 3.5.1
RUN sh -c 'echo "deb http://cloud.r-project.org/bin/linux/ubuntu xenial-cran35/" >> /etc/apt/sources.list' \
&& gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9 \
&& gpg -a --export E084DAB9 | apt-key add -
RUN apt-get update && apt-get install -y --no-install-recommends r-base=${R_BASE_VERSION}-* \
&& echo 'options(repos = c(CRAN = "https://cloud.r-project.org"))' >> /etc/R/Rprofile.site
## Mount R dependency file into the docker container and install dependencies
# Install prerequisites of 'forecast' package
RUN apt-get update && apt-get install -y \
gfortran \
libblas-dev \
liblapack-dev
# Use a MRAN snapshot URL to download packages archived on a specific date
RUN echo 'options(repos = list(CRAN = "http://mran.revolutionanalytics.com/snapshot/2018-08-27/"))' >> /etc/R/Rprofile.site
ADD ./install_R_dependencies.r /tmp
RUN Rscript install_R_dependencies.r
RUN rm ./install_R_dependencies.r
WORKDIR /
ENTRYPOINT ["/bin/bash"]

View file

@@ -1,192 +0,0 @@
# Implementation submission form
## Submission details
**Submission date**: 09/01/2018
**Benchmark name:** OrangeJuice_Pt_3Weeks_Weekly
**Submitter(s):** Chenhui Hu
**Submitter(s) email:** chenhhu@microsoft.com
**Submission name:** MeanForecast
**Submission path:** retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/MeanForecast
## Implementation description
### Modelling approach
In this submission, we implement the mean forecast method using the R package `forecast`.
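For intuition, a mean forecast simply extends the historical average of the series over the forecast horizon. Below is a minimal Python sketch of the idea only; the submission itself calls `meanf()` from the R `forecast` package on series with frequency 52.

```python
import numpy as np

def mean_forecast(history, horizon):
    """Forecast every future step as the mean of the observed series."""
    return np.full(horizon, np.mean(history))

print(mean_forecast([10, 12, 11, 15], horizon=3))  # [12. 12. 12.]
```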
### Feature engineering
Only the weekly sales of each orange juice brand have been used in the implementation of the forecast method.
### Hyperparameter tuning
Default hyperparameters of the forecasting algorithm are used. Additionally, the frequency of the weekly sales time series is set to be 52,
since there are approximately 52 weeks in a year.
### Description of implementation scripts
* `train_score.r`: R script that trains the model and evaluates its performance
* `mean_forecast.Rmd` (optional): R markdown that trains the model and visualizes the results
* `mean_forecast.nb.html` (optional): Html file associated with the R markdown file
### Steps to reproduce results
0. Follow the instructions [here](#resource-deployment-instructions) to provision a Linux virtual machine and log into the provisioned
VM.
1. Clone the Forecasting repo to the home directory of your machine
```bash
cd ~
git clone https://github.com/Microsoft/Forecasting.git
```
Use one of the following options to securely connect to the Git repo:
* [Personal Access Tokens](https://help.github.com/articles/creating-a-personal-access-token-for-the-command-line/)
For this method, the clone command becomes
```bash
git clone https://<username>:<personal access token>@github.com/Microsoft/Forecasting.git
```
* [Git Credential Managers](https://github.com/Microsoft/Git-Credential-Manager-for-Windows)
* [Authenticate with SSH](https://help.github.com/articles/connecting-to-github-with-ssh/)
2. Create a conda environment for running the scripts of data downloading, data preparation, and result evaluation. To do this, you need
to check if conda has been installed by running command `conda -V`. If it is installed, you will see the conda version in the terminal. Otherwise, please follow the instructions [here](https://conda.io/docs/user-guide/install/linux.html) to install conda. Then, you can go to `~/Forecasting` directory in the VM and create a conda environment named `tsperf` by
```bash
conda env create --file ./common/conda_dependencies.yml
```
This will create a conda environment with the Python and R packages listed in `conda_dependencies.yml` being installed. The conda
environment name is also defined in the yml file.
3. Activate the conda environment and download the Orange Juice dataset. Use command `source activate tsperf` to activate the conda environment. Then, download the Orange Juice dataset by running the following command from `~/Forecasting` directory
```bash
Rscript ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/common/download_data.r
```
This will create a data directory `./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/data` and store the dataset in this directory. The dataset has two csv files - `yx.csv` and `storedemo.csv` which contain the sales information and store demographic information, respectively.
4. From `~/Forecasting` directory, run the following command to generate the training data and testing data for each forecast period:
```bash
python ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/common/serve_folds.py --test --save
```
This will generate 12 csv files named `train_round_#.csv` and 12 csv files named `test_round_#.csv` in two subfolders `/train` and
`/test` under the data directory, respectively. After running the above command, you can deactivate the conda environment by running
`source deactivate`.
5. Make sure Docker is installed
You can check if Docker is installed on your VM by running
```bash
sudo docker -v
```
You will see the Docker version if Docker is installed. If not, you can install it by following the instructions [here](https://docs.docker.com/install/linux/docker-ce/ubuntu/). Note that if you want to execute Docker commands without sudo as a non-root user, you need to create a Unix group and add users to it by following the instructions [here](https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user).
6. Build a local Docker image by running the following command from `~/Forecasting` directory
```bash
sudo docker build -t baseline_image:v1 ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/MeanForecast
```
7. Choose a name for a new Docker container (e.g. meanf_container) and create it using command:
```bash
sudo docker run -it -v ~/Forecasting:/Forecasting --name meanf_container baseline_image:v1
```
Note that option `-v ~/Forecasting:/Forecasting` allows you to mount `~/Forecasting` folder (the one you cloned) to the container so that you will have
access to the source code in the container.
8. Inside `/Forecasting` folder, train the model and make predictions by running
```bash
cd /Forecasting
source ./common/train_score_vm ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/MeanForecast R
```
This will generate 5 `submission_seed_<seed number>.csv` files in the submission directory, where \<seed number\>
is between 1 and 5. This command will also output 5 running times of train_score.r. The median of the times
reported in rows starting with 'real' should be compared against the wallclock time declared in benchmark
submission. After generating the forecast results, you can exit the Docker container by command `exit`.
9. Activate conda environment again by `source activate tsperf`. Then, evaluate the benchmark quality by running
```bash
source ./common/evaluate ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/MeanForecast ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly
```
This command will output 5 benchmark quality values (MAPEs). Their median should be compared against the
benchmark quality declared in benchmark submission.
## Implementation resources
**Platform:** Azure Cloud
**Resource location:** East US
**Hardware:** Standard D2s v3 (2 vcpus, 8 GB memory, 16 GB temporary storage) Ubuntu Linux VM
**Data storage:** Premium SSD
**Dockerfile:** [retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/MeanForecast/Dockerfile](https://github.com/Microsoft/Forecasting/blob/master/retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/MeanForecast/Dockerfile)
**Key packages/dependencies:**
* R
- r-base==3.5.1
- forecast==8.1
## Resource deployment instructions
We use Azure Linux VM to develop the baseline methods. Please follow the instructions below to deploy the resource.
* Azure Linux VM deployment
- Create an Azure account and log into [Azure portal](https://portal.azure.com/)
- Refer to the steps [here](https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/dsvm-ubuntu-intro) to deploy a Data
Science Virtual Machine for Linux (Ubuntu). Select *D2s_v3* as the virtual machine size.
## Implementation evaluation
**Quality:**
*MAPE run 1: 70.74%*
*MAPE run 2: 70.74%*
*MAPE run 3: 70.74%*
*MAPE run 4: 70.74%*
*MAPE run 5: 70.74%*
*median MAPE: 70.74%*
**Time:**
*run time 1: 69.85 seconds*
*run time 2: 69.80 seconds*
*run time 3: 68.88 seconds*
*run time 4: 68.54 seconds*
*run time 5: 68.10 seconds*
*median run time: 68.88 seconds*
**Cost:** The hourly cost of the D2s v3 Ubuntu Linux VM in East US Azure region is 0.096 USD, based on the price at the submission date. Thus, the total cost is 68.88/3600 $\times$ 0.096 = $0.0018.
Note that there is no randomness in the forecasts obtained by the above method. Thus, quality values do not change over
different runs.

View file

@@ -1,11 +0,0 @@
pkgs <- c(
'optparse',
'dplyr',
'tidyr',
'forecast',
'MLmetrics'
)
install.packages(pkgs)

View file

@@ -1,184 +0,0 @@
---
title: "Mean Forecast Method for Retail Forecasting Benchmark - OrangeJuice_Pt_3Weeks_Weekly"
output: html_notebook
---
```{r}
## Import packages
library(dplyr)
library(tidyr)
library(forecast)
library(MLmetrics)
## Define parameters
NUM_ROUNDS <- 12
TRAIN_START_WEEK <- 40
TRAIN_END_WEEK_LIST <- seq(135, 157, 2)
TEST_START_WEEK_LIST <- seq(137, 159, 2)
TEST_END_WEEK_LIST <- seq(138, 160, 2)
# Get the path of the current script and paths of data directories
SCRIPT_PATH <- dirname(rstudioapi::getSourceEditorContext()$path)
TRAIN_DIR <- file.path(dirname(dirname(SCRIPT_PATH)), 'data', 'train')
TEST_DIR <- file.path(dirname(dirname(SCRIPT_PATH)), 'data', 'test')
```
```{r}
#### Test meanf method on a subset of the data ####
## Import data
r <- 1
train_df <- read.csv(file.path(TRAIN_DIR, paste0('train_round_', as.character(r), '.csv')))
#head(train_df)
## Fill missing values
store_list <- unique(train_df$store)
brand_list <- unique(train_df$brand)
week_list <- TRAIN_START_WEEK:TRAIN_END_WEEK_LIST[r]
data_grid <- expand.grid(store = store_list,
brand = brand_list,
week = week_list)
train_filled <- merge(data_grid, train_df,
by = c('store', 'brand', 'week'),
all.x = TRUE)
train_filled <- train_filled[,c('store','brand','week','logmove')]
head(train_filled)
print('Number of rows with missing values:')
print(sum(!complete.cases(train_filled)))
# Fill missing logmove
train_filled <-
train_filled %>%
group_by(store, brand) %>%
arrange(week) %>%
fill(logmove) %>%
fill(logmove, .direction = 'up')
head(train_filled)
print('Number of rows with missing values after filling:')
print(sum(!complete.cases(train_filled)))
## Mean forecast method
train_sub <- filter(train_filled, store=='2', brand=='1')
train_ts <- ts(train_sub[c('logmove')], frequency = 52)
horizon <- TEST_END_WEEK_LIST[r] - TRAIN_END_WEEK_LIST[r]
pred_meanf <- meanf(train_ts, h=horizon)
print('Mean forecasts:')
pred_meanf$mean[2:horizon]
plot(pred_meanf, main='Mean Forecasts')
```
```{r}
#### Implement meanf method on all the data ####
basic_method <- 'meanf'
pred_basic_all <- list()
print(paste0('Using ', basic_method))
## Basic methods
apply_basic_methods <- function(train_sub, method, r) {
# Trains a basic model to forecast sales of each store-brand in a certain round.
#
# Args:
# train_sub (Dataframe): Training data of a certain store-brand
# method (String): Name of the basic method which can be 'naive', 'snaive',
# 'meanf', 'ets', or 'arima'
# r (Integer): Index of the forecast round
#
# Returns:
# pred_basic_df (Dataframe): Predicted sales of the current store-brand
cur_store <- train_sub$store[1]
cur_brand <- train_sub$brand[1]
train_ts <- ts(train_sub[c('logmove')], frequency = 52)
if (method == 'naive'){
pred_basic <- naive(train_ts, h=pred_horizon)
} else if (method == 'snaive'){
pred_basic <- snaive(train_ts, h=pred_horizon)
} else if (method == 'meanf'){
pred_basic <- meanf(train_ts, h=pred_horizon)
} else if (method == 'ets') {
fit_ets <- ets(train_ts)
pred_basic <- forecast(fit_ets, h=pred_horizon)
} else if (method == 'arima'){
fit_arima <- auto.arima(train_ts)
pred_basic <- forecast(fit_arima, h=pred_horizon)
}
pred_basic_df <- data.frame(round = rep(r, pred_steps),
store = rep(cur_store, pred_steps),
brand = rep(cur_brand, pred_steps),
week = pred_weeks,
weeks_ahead = pred_weeks_ahead,
prediction = round(exp(pred_basic$mean[2:pred_horizon])))
}
for (r in 1:NUM_ROUNDS) {
print(paste0('---- Round ', r, ' ----'))
pred_horizon <- TEST_END_WEEK_LIST[r] - TRAIN_END_WEEK_LIST[r]
pred_steps <- TEST_END_WEEK_LIST[r] - TEST_START_WEEK_LIST[r] + 1
pred_weeks <- TEST_START_WEEK_LIST[r]:TEST_END_WEEK_LIST[r]
pred_weeks_ahead <- pred_weeks - TRAIN_END_WEEK_LIST[r]
## Import training data
train_df <- read.csv(file.path(TRAIN_DIR, paste0('train_round_', as.character(r), '.csv')))
## Fill missing values
store_list <- unique(train_df$store)
brand_list <- unique(train_df$brand)
week_list <- TRAIN_START_WEEK:TRAIN_END_WEEK_LIST[r]
data_grid <- expand.grid(store = store_list,
brand = brand_list,
week = week_list)
train_filled <- merge(data_grid, train_df,
by = c('store', 'brand', 'week'),
all.x = TRUE)
train_filled <- train_filled[,c('store','brand','week','logmove')]
head(train_filled)
print('Number of rows with missing values:')
print(sum(!complete.cases(train_filled)))
# Fill missing logmove
train_filled <-
train_filled %>%
group_by(store, brand) %>%
arrange(week) %>%
fill(logmove) %>%
fill(logmove, .direction = 'up')
head(train_filled)
print('Number of rows with missing values after filling:')
print(sum(!complete.cases(train_filled)))
# Apply basic method
pred_basic_all[[paste0('Round', r)]] <-
train_filled %>%
group_by(store, brand) %>%
do(apply_basic_methods(., basic_method, r))
}
pred_basic_all <- do.call(rbind, pred_basic_all)
# Save forecast results
write.csv(pred_basic_all, file.path(SCRIPT_PATH, 'submission.csv'), row.names = FALSE)
## Evaluate forecast performance
# Get the true value dataframe
true_sales_all <- list()
for (r in 1:NUM_ROUNDS){
test_df <- read.csv(file.path(TEST_DIR, paste0('test_round_', as.character(r), '.csv')))
true_sales_all[[paste0('Round', r)]] <-
data.frame(round = rep(r, dim(test_df)[1]),
store = test_df$store,
brand = test_df$brand,
week = test_df$week,
truth = round(exp(test_df$logmove)))
}
true_sales_all <- do.call(rbind, true_sales_all)
# Merge prediction and true sales
merged_df <- merge(pred_basic_all, true_sales_all,
by = c('round', 'store', 'brand', 'week'),
all.y = TRUE)
print('MAPE')
print(MAPE(merged_df$prediction, merged_df$truth)*100)
print('MedianAPE')
print(MedianAPE(merged_df$prediction, merged_df$truth)*100)
```

File diff is hidden because one or more lines are too long

View file

@@ -1,109 +0,0 @@
#!/usr/bin/Rscript
#
# Mean Forecast Method for Retail Forecasting Benchmark - OrangeJuice_Pt_3Weeks_Weekly
#
# This script can be executed with the following command
# Rscript <submission folder>/train_score.r --seed <seed value>
# where <seed value> is the random seed value from 1 to 5 (here since the forecast method
# is deterministic, this value will be simply used as a suffix of the output file name).
## Import packages
library(optparse)
library(dplyr)
library(tidyr)
library(forecast)
library(MLmetrics)
## Define parameters
NUM_ROUNDS <- 12
TRAIN_START_WEEK <- 40
TRAIN_END_WEEK_LIST <- seq(135, 157, 2)
TEST_START_WEEK_LIST <- seq(137, 159, 2)
TEST_END_WEEK_LIST <- seq(138, 160, 2)
# Parse input argument
option_list <- list(
make_option(c('-s', '--seed'), type='integer', default=NULL,
help='random seed value from 1 to 5', metavar='integer')
)
opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)
# Paths of the training data and submission folder
DATA_DIR <- './retail_sales/OrangeJuice_Pt_3Weeks_Weekly/data'
TRAIN_DIR <- file.path(DATA_DIR, 'train')
SUBMISSION_DIR <- file.path(dirname(DATA_DIR), 'submissions', 'MeanForecast')
# Generate submission file name
if (is.null(opt$seed)){
output_file_name <- file.path(SUBMISSION_DIR, 'submission.csv')
print('Random seed is not specified. Output file name will be submission.csv.')
} else{
output_file_name <- file.path(SUBMISSION_DIR, paste0('submission_seed_', as.character(opt$seed), '.csv'))
print(paste0('Random seed is specified. Output file name will be submission_seed_',
as.character(opt$seed) , '.csv.'))
}
#### Implement meanf method for every store-brand ####
print('Using Mean Forecast Method')
pred_meanf_all <- list()
## meanf method
apply_meanf_method <- function(train_sub, r) {
# Trains Mean Forecast model to forecast sales of each store-brand in a certain round.
#
# Args:
# train_sub (Dataframe): Training data of a certain store-brand
# r (Integer): Index of the forecast round
#
# Returns:
# pred_meanf_df (Dataframe): Predicted sales of the current store-brand
cur_store <- train_sub$store[1]
cur_brand <- train_sub$brand[1]
train_ts <- ts(train_sub[c('logmove')], frequency = 52)
pred_meanf <- meanf(train_ts, h=pred_horizon)
pred_meanf_df <- data.frame(round = rep(r, pred_steps),
store = rep(cur_store, pred_steps),
brand = rep(cur_brand, pred_steps),
week = pred_weeks,
weeks_ahead = pred_weeks_ahead,
prediction = round(exp(pred_meanf$mean[2:pred_horizon])))
}
for (r in 1:NUM_ROUNDS) {
print(paste0('---- Round ', r, ' ----'))
pred_horizon <- TEST_END_WEEK_LIST[r] - TRAIN_END_WEEK_LIST[r]
pred_steps <- TEST_END_WEEK_LIST[r] - TEST_START_WEEK_LIST[r] + 1
pred_weeks <- TEST_START_WEEK_LIST[r]:TEST_END_WEEK_LIST[r]
pred_weeks_ahead <- pred_weeks - TRAIN_END_WEEK_LIST[r]
## Import training data
train_df <- read.csv(file.path(TRAIN_DIR, paste0('train_round_', as.character(r), '.csv')))
## Fill missing values
store_list <- unique(train_df$store)
brand_list <- unique(train_df$brand)
week_list <- TRAIN_START_WEEK:TRAIN_END_WEEK_LIST[r]
data_grid <- expand.grid(store = store_list,
brand = brand_list,
week = week_list)
train_filled <- merge(data_grid, train_df,
by = c('store', 'brand', 'week'),
all.x = TRUE)
train_filled <- train_filled[,c('store','brand','week','logmove')]
# Fill missing logmove
train_filled <-
train_filled %>%
group_by(store, brand) %>%
arrange(week) %>%
fill(logmove) %>%
fill(logmove, .direction = 'up')
# Apply meanf method
pred_meanf_all[[paste0('Round', r)]] <-
train_filled %>%
group_by(store, brand) %>%
do(apply_meanf_method(., r))
}
# Combine and save forecast results
pred_meanf_all <- do.call(rbind, pred_meanf_all)
write.csv(pred_meanf_all, output_file_name, row.names = FALSE)

View file

@@ -1,85 +0,0 @@
# Problem
Sales forecasting is a key task for the management of retail stores. With projections of future sales, store managers can optimize inventory to meet their business goals, enabling more profitable order fulfillment and reducing inventory costs.
The task of this benchmark is to forecast orange juice sales of different brands for multiple stores with the Orange Juice (OJ) dataset from R package
`bayesm`. The forecast type is point forecasting. The forecast horizon is 3 weeks ahead and granularity is weekly. There are 12 forecast rounds, each of
which involves forecasting the sales during a target period. The training and test data in each round are specified in the subsection [Training and test data
separation](#training-and-test-data-separation). The table below summarizes the characteristics of this benchmark
| | |
| ----------------------------------- | - |
| **Number of time series** | 913 |
| **Forecast frequency** | every two weeks |
| **Forecast granularity** | weekly |
| **Forecast type** | point |
A template of the submission file can be found [here](https://github.com/Microsoft/Forecasting/blob/master/benchmarks/OrangeJuice_Pt_3Weeks_Weekly/sample_submission.csv)
# Data
## Dataset attribution
The OJ dataset is from R package [bayesm](https://cran.r-project.org/web/packages/bayesm/index.html) and is part of the [Dominick's dataset](https://www.chicagobooth.edu/research/kilts/datasets/dominicks).
## Dataset description
This dataset contains the following two tables:
1. Weekly sales of refrigerated orange juice at 83 stores. This table has 106139 rows and 19 columns. It includes weekly sales and prices of 11 orange juice
brands as well as information about profit, deal, and advertisement for each brand. Note that the weekly sales is captured by a column named `logmove` which
corresponds to the natural logarithm of the number of units sold. To get the number of units sold, you need to apply an exponential transform to this column.
2. Demographic information on those stores. This table has 83 rows and 13 columns. For every store, the table describes demographic information of its consumers,
distance to the nearest warehouse store, average distance to the nearest 5 supermarkets, ratio of its sales to the nearest warehouse store, and ratio of its sales
to the average of the nearest 5 stores.
Note that the week number starts from 40 in this dataset, while the full Dominick's dataset has week numbers from 1 to 400. According to [Dominick's Data Manual](https://www.chicagobooth.edu/-/media/enterprise/centers/kilts/datasets/dominicks-dataset/dominicks-manual-and-codebook_kiltscenter.aspx), week 1 starts on 09/14/1989.
Please see pages 40 and 41 of the [bayesm reference manual](https://cran.r-project.org/web/packages/bayesm/bayesm.pdf) and the [Dominick's Data Manual](https://www.chicagobooth.edu/-/media/enterprise/centers/kilts/datasets/dominicks-dataset/dominicks-manual-and-codebook_kiltscenter.aspx) for more details about the data.
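As a sketch of the two conversions described above (assuming the `yx.csv` layout of the sales table), unit sales can be recovered from `logmove` and week numbers mapped to dates:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("yx.csv")  # weekly sales table from the bayesm OJ dataset
# logmove is the natural log of units sold; invert it with an exponential transform.
df["move"] = np.exp(df["logmove"]).round().astype(int)
# Week 1 starts on 1989-09-14, so week w starts (w - 1) weeks later.
df["week_start"] = pd.to_datetime("1989-09-14") + pd.to_timedelta((df["week"] - 1) * 7, unit="D")
```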
## Training and test data separation
For this benchmark, you are provided successive folds of training data in 12 forecast rounds. The goal is to generate forecasts for the forecast periods listed
in the table below, using the available training data:
| **Round** | **Train period start week** | **Train period end week** | **Forecast period start week** | **Forecast period end week** |
| -------- | --------------- | ------------------ | ------------------------- | ----------------------- |
| 1 | 40 | 135 | 137 | 138 |
| 2 | 40 | 137 | 139 | 140 |
| 3 | 40 | 139 | 141 | 142 |
| 4 | 40 | 141 | 143 | 144 |
| 5 | 40 | 143 | 145 | 146 |
| 6 | 40 | 145 | 147 | 148 |
| 7 | 40 | 147 | 149 | 150 |
| 8 | 40 | 149 | 151 | 152 |
| 9 | 40 | 151 | 153 | 154 |
| 10 | 40 | 153 | 155 | 156 |
| 11 | 40 | 155 | 157 | 158 |
| 12 | 40 | 157 | 159 | 160 |
The one-week gap between the training period and the forecast period gives store managers time to prepare the stock to meet the forecasted demand. In addition, we assume that the information about price, deal, and advertisement up until the forecast period end week is available in each round.
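The week ranges in the table above follow a simple arithmetic pattern; the small sketch below reproduces the schedule using the same constants that appear in the scripts elsewhere in this commit:

```python
TRAIN_START_WEEK = 40
TRAIN_END_WEEK_LIST = list(range(135, 159, 2))   # 135, 137, ..., 157
TEST_START_WEEK_LIST = list(range(137, 161, 2))  # 137, 139, ..., 159
TEST_END_WEEK_LIST = list(range(138, 162, 2))    # 138, 140, ..., 160

for r in range(12):
    print(f"Round {r + 1}: train weeks {TRAIN_START_WEEK}-{TRAIN_END_WEEK_LIST[r]}, "
          f"forecast weeks {TEST_START_WEEK_LIST[r]}-{TEST_END_WEEK_LIST[r]}")
```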
# Format of Forecasts
The forecasts should be in the following format
| round | store | brand | week | weeks_ahead | prediction |
| --------- | ---------- | ---------- | ---------- | ---------- | ---------- |
| ... | ... | ... | ... | ... | ... |
with each of the columns explained below
* round: index of the forecast round
* store: store number
* brand: brand indicator
* week: week of the sales that we forecast
* weeks_ahead: number of weeks ahead that we forecast
* prediction: predicted number of units sold
# Quality
**Evaluation metric**: Mean Absolute Percentage Error (MAPE)
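A minimal sketch of the metric, consistent with how the scripts in this commit use it (they import `MAPE` from `common.evaluation_utils` and multiply the fractional value by 100 to report a percentage):

```python
import numpy as np

def MAPE(predictions, actuals):
    """Mean absolute percentage error, as a fraction (multiply by 100 for percent)."""
    predictions = np.asarray(predictions, dtype=float)
    actuals = np.asarray(actuals, dtype=float)
    return np.mean(np.abs(predictions - actuals) / actuals)
```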

View file

@@ -1,77 +0,0 @@
## Download base image
FROM nvidia/cuda:9.0-cudnn7-runtime-ubuntu16.04
WORKDIR /tmp
## Install basic packages
RUN apt-get update && apt-get install -y --no-install-recommends --allow-downgrades --allow-change-held-packages \
wget \
zlib1g-dev \
libssl-dev \
libssh2-1-dev \
libcurl4-openssl-dev \
libreadline-gplv2-dev \
libncursesw5-dev \
libsqlite3-dev \
tk-dev \
libgdbm-dev \
libc6-dev \
libbz2-dev \
libffi-dev \
bzip2 \
build-essential \
checkinstall \
ca-certificates \
curl \
lsb-release \
apt-utils \
python3-pip \
vim \
cuda-command-line-tools-9-0 \
cuda-cublas-9-0 \
cuda-cufft-9-0 \
cuda-curand-9-0 \
cuda-cusolver-9-0 \
cuda-cusparse-9-0 \
libcudnn7=7.2.1.38-1+cuda9.0 \
libnccl2=2.2.13-1+cuda9.0 \
libfreetype6-dev \
libhdf5-serial-dev \
libpng12-dev \
libzmq3-dev
RUN apt-get update && \
apt-get install nvinfer-runtime-trt-repo-ubuntu1604-4.0.1-ga-cuda9.0 && \
apt-get update && \
apt-get install libnvinfer4=4.1.2-1+cuda9.0
# Update pip and setuptools
RUN pip3 install --upgrade pip
RUN pip3 install --upgrade setuptools
# Install the numpy version required by ConfigSpace
RUN pip3 install numpy==1.14.5
# install swig for SMAC package
# https://sillycodes.com/quick-tip-couldnt-create-temporary-file/
RUN apt-get clean
RUN mv /var/lib/apt/lists /tmp
RUN mkdir -p /var/lib/apt/lists/partial
RUN apt-get clean
RUN apt-get update
RUN apt-get -y install python-dev python3-dev swig
## Mount Python dependency file into the docker container and install dependencies
WORKDIR /tmp
ADD ./python_dependencies.txt /tmp
RUN pip3 install -r python_dependencies.txt
RUN pip3 install smac==0.9.0
# Fix the symlink issue of tensorflow (https://github.com/tensorflow/tensorflow/issues/10776)
RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1
RUN export LD_LIBRARY_PATH=/usr/local/cuda/lib64
# ENV LD_LIBRARY_PATH='/usr/local/cuda/lib64/stubs/:${LD_LIBRARY_PATH}'
# RUN rm /usr/local/cuda/lib64/stubs/libcuda.so.1
WORKDIR /
RUN rm -rf tmp
ENTRYPOINT ["/bin/bash"]

View file

@@ -1,203 +0,0 @@
# Implementation submission form
## Submission details
**Submission date**: 12/21/2018
**Benchmark name:** OrangeJuice_Pt_3Weeks_Weekly
**Submitter(s):** Yiyu Chen
**Submitter(s) email:** yiychen@microsoft.com
**Submission name:** RNN
**Submission path:** retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/RNN
## Implementation description
### Modelling approach
In this submission, we implement an Encoder-Decoder Recurrent Neural Network (RNN) model using the [Tensorflow](https://www.tensorflow.org/) package. The implementation heavily references the winning solution of the [Web Traffic Time Series Forecasting Kaggle competition](https://www.kaggle.com/c/web-traffic-time-series-forecasting), hosted on GitHub [here](https://github.com/Arturus/kaggle-web-traffic). For more details about the RNN model and its architecture, please see [here](https://github.com/Arturus/kaggle-web-traffic/blob/master/how_it_works.md#model-core).
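As a conceptual sketch only (the actual submission reuses the lower-level TensorFlow graph code from the Kaggle solution linked above; sequence lengths and feature counts here are illustrative assumptions, with the depth of 400 mirroring the `rnn_depth` hyperparameter listed later in this commit), an encoder-decoder RNN can be expressed in Keras as:

```python
import tensorflow as tf

ENCODER_LEN, HORIZON, N_FEATURES, RNN_DEPTH = 52, 3, 10, 400

# The encoder consumes the feature history and summarizes it in its final state.
encoder_inputs = tf.keras.Input(shape=(ENCODER_LEN, N_FEATURES))
_, state = tf.keras.layers.GRU(RNN_DEPTH, return_state=True)(encoder_inputs)

# The decoder unrolls over the forecast horizon, seeded with the encoder state.
decoder_inputs = tf.keras.Input(shape=(HORIZON, N_FEATURES))
decoder_seq = tf.keras.layers.GRU(RNN_DEPTH, return_sequences=True)(
    decoder_inputs, initial_state=state)
outputs = tf.keras.layers.Dense(1)(decoder_seq)  # one sales prediction per step

model = tf.keras.Model([encoder_inputs, decoder_inputs], outputs)
model.compile(optimizer="adam", loss="mae")
```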
### Feature engineering
The following features have been used in the implementation of the forecast method:
- weekly sales of each orange juice brand in recent weeks.
- series popularity, defined as the sales median of each time series.
- orange juice price and price ratio. The price ratio is defined as the orange juice price divided by the store's average orange juice price, and measures the price competitiveness of an orange juice brand (see the sketch after this list).
- promotion related features: `feat` and `deal`.
- orange juice brand with One Hot Encoding.
All the features are normalized before feeding into the model.
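A minimal pandas sketch (toy values; column names mirror the dataset, where each row is a store-brand-week) of two of the features above:

```python
import pandas as pd

df = pd.DataFrame({
    "store": [2, 2, 2, 2],
    "brand": [1, 2, 1, 2],
    "week":  [40, 40, 41, 41],
    "price": [2.5, 3.1, 2.7, 3.0],
    "move":  [9000, 4000, 11000, 5000],
})

# Series popularity: the median sales of each store-brand series.
df["popularity"] = df.groupby(["store", "brand"])["move"].transform("median")

# Price ratio: a brand's price over the store's average price in the same week.
df["price_ratio"] = df["price"] / df.groupby(["store", "week"])["price"].transform("mean")
```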
### Hyperparameter tuning
The hyperparameters are tuned with [SMAC package](https://github.com/automl/SMAC3).
### Description of implementation scripts
* `train_score.py`: Python script that trains the model and generates forecast results for each round.
* `hyper_parameter_tuning.py`: Python script for hyperparameter tuning.
* `hparams.py`: Python script that contains the manually selected hyperparameters and the hyperparameters selected by the hyperparameter tuning script.
* `make_features.py`: Python script that contains the function for creating the features.
* `rnn_train.py`: Python script that contains the function for creating and training the RNN model.
* `rnn_predict.py`: Python script that contains the function for making predictions by loading the saved model.
* `utils.py`: Python script that contains all the utility functions used across multiple other scripts.
### Steps to reproduce results
0. Follow the instructions [here](#resource-deployment-instructions) to provision a Linux virtual machine and log into the provisioned
VM.
1. Clone the Forecasting repo to the home directory of your machine
```bash
cd ~
git clone https://github.com/Microsoft/Forecasting.git
```
Use one of the following options to securely connect to the Git repo:
* [Personal Access Tokens](https://help.github.com/articles/creating-a-personal-access-token-for-the-command-line/)
For this method, the clone command becomes
```bash
git clone https://<username>:<personal access token>@github.com/Microsoft/Forecasting.git
```
* [Git Credential Managers](https://github.com/Microsoft/Git-Credential-Manager-for-Windows)
* [Authenticate with SSH](https://help.github.com/articles/connecting-to-github-with-ssh/)
2. Create a conda environment for running the scripts of data downloading, data preparation, and result evaluation. To do this, you need
to check if conda has been installed by running command `conda -V`. If it is installed, you will see the conda version in the terminal. Otherwise, please follow the instructions [here](https://conda.io/docs/user-guide/install/linux.html) to install conda. Then, you can go to `~/Forecasting` directory in the VM and create a conda environment named `tsperf` by
```bash
conda env create --file ./common/conda_dependencies.yml
```
This will create a conda environment with the Python and R packages listed in `conda_dependencies.yml` being installed. The conda
environment name is also defined in the yml file.
3. Activate the conda environment and download the Orange Juice dataset. Use command `source activate tsperf` to activate the conda environment. Then, download the Orange Juice dataset by running the following command from `~/Forecasting` directory
```bash
Rscript ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/common/download_data.r
```
This will create a data directory `./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/data` and store the dataset in this directory. The dataset has two csv files - `yx.csv` and `storedemo.csv` which contain the sales information and store demographic information, respectively.
4. From `~/Forecasting` directory, run the following command to generate the training data and testing data for each forecast period:
```bash
python ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/common/serve_folds.py --test --save
```
This will generate 12 csv files named `train_round_#.csv` and 12 csv files named `test_round_#.csv` in two subfolders `/train` and
`/test` under the data directory, respectively. After running the above command, you can deactivate the conda environment by running
`source deactivate`.
5. Make sure Docker is installed
You can check if Docker is installed on your VM by running
```bash
sudo docker -v
```
You will see the Docker version if Docker is installed. If not, you can install it by following the instructions [here](https://docs.docker.com/install/linux/docker-ce/ubuntu/). Note that if you want to execute Docker commands without sudo as a non-root user, you need to create a Unix group and add users to it by following the instructions [here](https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user).
6. Build a local Docker image by running the following command from `~/Forecasting` directory
```bash
sudo docker build -t rnn_image:v1 ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/RNN
```
7. Choose a name for a new Docker container (e.g. rnn_container) and create it using command:
```bash
sudo docker run -it -v ~/Forecasting:/Forecasting --runtime=nvidia --name rnn_container rnn_image:v1
```
Note that option `-v ~/Forecasting:/Forecasting` allows you to mount `~/Forecasting` folder (the one you cloned) to the container so that you will have
access to the source code in the container.
8. Inside `/Forecasting` folder, train the model and make predictions by running
```bash
cd /Forecasting
source ./common/train_score_vm ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/RNN Python3
```
This will generate 5 `submission_seed_<seed number>.csv` files in the submission directory, where \<seed number\>
is between 1 and 5. This command will also output 5 running times of train_score.py. The median of the times
reported in rows starting with 'real' should be compared against the wallclock time declared in benchmark
submission. After generating the forecast results, you can exit the Docker container by command `exit`.
9. Activate conda environment again by `source activate tsperf`. Then, evaluate the benchmark quality by running
```bash
source ./common/evaluate ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/RNN ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly
```
This command will output 5 benchmark quality values (MAPEs). Their median should be compared against the
benchmark quality declared in benchmark submission.
## Implementation resources
**Platform:** Azure Cloud
**Resource location:** East US
**Hardware:** Standard NC6 (1 GPU, 6 vCPUs, 56 GB memory, 340 GB temporary storage) Ubuntu Linux VM
**Data storage:** Standard HDD
**Dockerfile:** [retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/RNN/Dockerfile](https://github.com/Microsoft/Forecasting/blob/master/retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/RNN/Dockerfile)
**Key packages/dependencies:**
* Python
- numpy>=1.7.1
- scikit-learn==0.20.0
- pandas==0.23.0
- tensorflow-gpu==1.10
- smac==0.9.0
## Resource deployment instructions
We use Azure Linux VM to develop the baseline methods. Please follow the instructions below to deploy the resource.
* Azure Linux VM deployment
- Create an Azure account and log into [Azure portal](https://portal.azure.com/)
- Refer to the steps [here](https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/dsvm-ubuntu-intro) to deploy a Data
Science Virtual Machine for Linux (Ubuntu). Select *NC6* as the virtual machine size.
## Implementation evaluation
**Quality:**
*MAPE run 1: 37.68%*
*MAPE run 2: 37.44%*
*MAPE run 3: 36.95%*
*MAPE run 4: 38.56%*
*MAPE run 5: 37.84%*
*median MAPE: 37.68%*
**Time:**
*run time 1: 667.14 seconds*
*run time 2: 672.98 seconds*
*run time 3: 669.18 seconds*
*run time 4: 666.24 seconds*
*run time 5: 668.99 seconds*
*median run time: 668.99 seconds*
**Cost:** The hourly cost of NC6 Ubuntu Linux VM in East US Azure region is 0.90 USD, based on the price at the submission date. Thus, the total cost is 668.99/3600 $\times$ 0.90 = $0.1672.

View file

@@ -1,72 +0,0 @@
"""
This script provides the hyperparameter values. `hparams_manual` is the
hyperparameter set selected manually. `hparams_smac` is the hyperparameter
set selected by SMAC through running `hyper_parameter_tuning.py`, which is
also the final set of parameters used by the submission.
"""
hparams_manual = dict(
train_window=60,
batch_size=64,
encoder_rnn_layers=1,
decoder_rnn_layers=1,
rnn_depth=400,
encoder_dropout=0.03,
gate_dropout=0.997,
decoder_input_dropout=[1.0],
decoder_state_dropout=[0.99],
decoder_output_dropout=[0.975],
decoder_variational_dropout=[False],
asgd_decay=None,
max_epoch=20,
learning_rate=0.001,
beta1=0.9,
beta2=0.999,
epsilon=1e-08,
)
# This is the hyperparameter set selected when running 50 trials of SMAC
# hyperparameter tuning.
hparams_smac = dict(
train_window=26,
batch_size=64,
encoder_rnn_layers=1,
decoder_rnn_layers=1,
rnn_depth=387,
encoder_dropout=0.024688459483309007,
gate_dropout=0.980832247298109,
decoder_input_dropout=[0.9975650671957902],
decoder_state_dropout=[0.9743711264734845],
decoder_output_dropout=[0.9732177111192211],
decoder_variational_dropout=[False],
asgd_decay=None,
max_epoch=100,
learning_rate=0.001,
beta1=0.7763754022206656,
beta2=0.7923825287287111,
epsilon=1e-08,
)
# This is the hyperparameter set selected when running 100 trials of SMAC
# hyperparameter tuning.
# It turns out to overfit the validation data set:
# MAPE on validation dataset: ~34%
# MAPE on test dataset: ~44%
hparams_smac_100 = dict(
train_window=52,
batch_size=256,
encoder_rnn_layers=1,
decoder_rnn_layers=1,
rnn_depth=455,
encoder_dropout=0.0040379628855595154,
gate_dropout=0.9704657028012964,
decoder_input_dropout=[0.9706046837200847],
decoder_state_dropout=[0.9853308617869989],
decoder_output_dropout=[0.9779977163697378],
decoder_variational_dropout=[False],
asgd_decay=None,
max_epoch=200,
learning_rate=0.01,
beta1=0.6011027681578323,
beta2=0.9809964662293627,
epsilon=1e-08,
)

View file

@@ -1,198 +0,0 @@
"""
This script contains the code for hyperparameter tuning using the SMAC package.
The SMAC package (https://github.com/automl/SMAC3) is a tool for algorithm
configuration that optimizes the parameters of arbitrary algorithms across a set
of instances. Its core combines Bayesian Optimization with an aggressive racing
mechanism to efficiently decide which of two configurations performs better.
"""
# import packages
import os
import inspect
import itertools
import sys
import numpy as np
import pandas as pd
import tensorflow.contrib.training as training
from train_score import create_round_prediction
from utils import *
# Add TSPerf root directory to sys.path
file_dir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
tsperf_dir = os.path.join(file_dir, "../../../../")
if tsperf_dir not in sys.path:
sys.path.append(tsperf_dir)
import retail_sales.OrangeJuice_Pt_3Weeks_Weekly.common.benchmark_settings as bs
from common.evaluation_utils import MAPE
from smac.configspace import ConfigurationSpace
from smac.scenario.scenario import Scenario
from ConfigSpace.hyperparameters import (
CategoricalHyperparameter,
UniformFloatHyperparameter,
UniformIntegerHyperparameter,
)
from smac.facade.smac_facade import SMAC
LIST_HYPERPARAMETER = ["decoder_input_dropout", "decoder_state_dropout", "decoder_output_dropout"]
data_relative_dir = "../../data"
def eval_function(hparams_dict):
"""
This function takes a hyperparameter configuration, trains the
corresponding model on the training data set, creates the predictions,
and returns the evaluated MAPE on the evaluation data set.
"""
# set the data directory
file_dir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
data_dir = os.path.join(file_dir, data_relative_dir)
hparams_dict = dict(hparams_dict)
for key in LIST_HYPERPARAMETER:
hparams_dict[key] = [hparams_dict[key]]
# add the value of other hyper parameters which are not tuned
hparams_dict["encoder_rnn_layers"] = 1
hparams_dict["decoder_rnn_layers"] = 1
hparams_dict["decoder_variational_dropout"] = [False]
hparams_dict["asgd_decay"] = None
hparams = training.HParams(**hparams_dict)
# use round 1 training data for hyper parameter tuning to avoid data leakage for later rounds
submission_round = 1
make_features_flag = False
train_model_flag = True
train_back_offset = 3 # equal to predict_window
predict_cut_mode = "eval"
# get prediction
pred_o, train_mape = create_round_prediction(
data_dir,
submission_round,
hparams,
make_features_flag=make_features_flag,
train_model_flag=train_model_flag,
train_back_offset=train_back_offset,
predict_cut_mode=predict_cut_mode,
)
# get rid of prediction at horizon 1
pred_sub = pred_o[:, 1:].reshape((-1))
# evaluate the prediction on the last two weeks of the first round training data
# TODO: get train error and evaluation error for different parameters
train_file = os.path.join(data_dir, "train/train_round_{}.csv".format(submission_round))
train = pd.read_csv(train_file, index_col=False)
train_last_week = bs.TRAIN_END_WEEK_LIST[submission_round - 1]
# filter the train data to contain only the last two weeks' data
train = train.loc[train["week"] >= train_last_week - 1]
# create the data frame without missing dates
store_list = train["store"].unique()
brand_list = train["brand"].unique()
week_list = range(train_last_week - 1, train_last_week + 1)
item_list = list(itertools.product(store_list, brand_list, week_list))
item_df = pd.DataFrame.from_records(item_list, columns=["store", "brand", "week"])
train = item_df.merge(train, how="left", on=["store", "brand", "week"])
result = train.sort_values(by=["store", "brand", "week"], ascending=True)
result["prediction"] = pred_sub
result["sales"] = result["logmove"].apply(lambda x: round(np.exp(x)))
# calculate MAPE on the evaluation set
result = result.loc[result["sales"].notnull()]
eval_mape = MAPE(result["prediction"], result["sales"])
return eval_mape
if __name__ == "__main__":
# Build Configuration Space which defines all parameters and their ranges
cs = ConfigurationSpace()
# add parameters to the configuration space
train_window = UniformIntegerHyperparameter("train_window", 3, 70, default_value=60)
cs.add_hyperparameter(train_window)
batch_size = CategoricalHyperparameter("batch_size", [64, 128, 256, 1024], default_value=64)
cs.add_hyperparameter(batch_size)
rnn_depth = UniformIntegerHyperparameter("rnn_depth", 100, 500, default_value=400)
cs.add_hyperparameter(rnn_depth)
encoder_dropout = UniformFloatHyperparameter("encoder_dropout", 0.0, 0.05, default_value=0.03)
cs.add_hyperparameter(encoder_dropout)
gate_dropout = UniformFloatHyperparameter("gate_dropout", 0.95, 1.0, default_value=0.997)
cs.add_hyperparameter(gate_dropout)
decoder_input_dropout = UniformFloatHyperparameter("decoder_input_dropout", 0.95, 1.0, default_value=1.0)
cs.add_hyperparameter(decoder_input_dropout)
decoder_state_dropout = UniformFloatHyperparameter("decoder_state_dropout", 0.95, 1.0, default_value=0.99)
cs.add_hyperparameter(decoder_state_dropout)
decoder_output_dropout = UniformFloatHyperparameter("decoder_output_dropout", 0.95, 1.0, default_value=0.975)
cs.add_hyperparameter(decoder_output_dropout)
max_epoch = CategoricalHyperparameter("max_epoch", [50, 100, 150, 200], default_value=100)
cs.add_hyperparameter(max_epoch)
learning_rate = CategoricalHyperparameter("learning_rate", [0.001, 0.01, 0.1], default_value=0.001)
cs.add_hyperparameter(learning_rate)
beta1 = UniformFloatHyperparameter("beta1", 0.5, 0.9999, default_value=0.9)
cs.add_hyperparameter(beta1)
beta2 = UniformFloatHyperparameter("beta2", 0.5, 0.9999, default_value=0.999)
cs.add_hyperparameter(beta2)
epsilon = CategoricalHyperparameter("epsilon", [1e-08, 0.00001, 0.0001, 0.1, 1], default_value=1e-08)
cs.add_hyperparameter(epsilon)
scenario = Scenario(
{
"run_obj": "quality", # we optimize quality (alternatively runtime)
"runcount-limit": 50, # maximum function evaluations
"cs": cs, # configuration space
"deterministic": "true",
}
)
# test the default configuration works
# eval_function(cs.get_default_configuration())
# import hyper parameters
# TODO: add EMA in the code to improve the performance
smac = SMAC(scenario=scenario, rng=np.random.RandomState(42), tae_runner=eval_function)
incumbent = smac.optimize()
inc_value = eval_function(incumbent)
print("the best hyper parameter sets are:")
print(incumbent)
print("the corresponding MAPE on validation datset is: {}".format(inc_value))
# the following is the printout:
# the best hyper parameter sets are:
# Configuration:
# batch_size, Value: 64
# beta1, Value: 0.7763754022206656
# beta2, Value: 0.7923825287287111
# decoder_input_dropout, Value: 0.9975650671957902
# decoder_output_dropout, Value: 0.9732177111192211
# decoder_state_dropout, Value: 0.9743711264734845
# encoder_dropout, Value: 0.024688459483309007
# epsilon, Value: 1e-08
# gate_dropout, Value: 0.980832247298109
# learning_rate, Value: 0.001
# max_epoch, Value: 100
# rnn_depth, Value: 387
# train_window, Value: 26
# the corresponding MAPE is: 0.36703585613035433
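
For reference, here is a minimal sketch of what the `MAPE` helper imported from `common.evaluation_utils` presumably computes, inferred from how it is called here and from the fractional values printed above; the actual implementation may differ:

```python
import numpy as np

def MAPE(predictions, actuals):
    """Mean absolute percentage error, returned as a fraction (e.g. 0.367 = 36.7%)."""
    predictions = np.asarray(predictions, dtype=float)
    actuals = np.asarray(actuals, dtype=float)
    return np.mean(np.abs(predictions - actuals) / np.abs(actuals))
```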


@@ -1,181 +0,0 @@
"""
This script contains the function for creating features for the RNN model.
"""
# import packages
import itertools
import sys
import inspect
import os
import pandas as pd
import numpy as np
from utils import *
from sklearn.preprocessing import OneHotEncoder
# Add TSPerf root directory to sys.path
file_dir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
tsperf_dir = os.path.join(file_dir, "../../../../")
if tsperf_dir not in sys.path:
sys.path.append(tsperf_dir)
import retail_sales.OrangeJuice_Pt_3Weeks_Weekly.common.benchmark_settings as bs
data_relative_dir = "../../data"
def make_features(submission_round):
"""
This function makes the features for the data from a certain submission
round and saves the results to disk.
Args:
submission_round: integer. The index of the submission round.
"""
# read in data
file_dir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
data_dir = os.path.join(file_dir, data_relative_dir)
train_file = os.path.join(data_dir, "train/train_round_{}.csv".format(submission_round))
test_file = os.path.join(data_dir, "train/aux_round_{}.csv".format(submission_round))
train = pd.read_csv(train_file, index_col=False)
test = pd.read_csv(test_file, index_col=False)
# select the test data range for test data
train_last_week = train["week"].max()
test = test.loc[test["week"] > train_last_week]
# calculate series popularity
series_popularity = train.groupby(["store", "brand"]).apply(lambda x: x["logmove"].median())
# fill the datetime gaps
# such that every time series has the same length in both train and test
store_list = train["store"].unique()
brand_list = train["brand"].unique()
train_week_list = range(bs.TRAIN_START_WEEK, bs.TRAIN_END_WEEK_LIST[submission_round - 1] + 1)
test_week_list = range(
bs.TEST_START_WEEK_LIST[submission_round - 1] - 1, bs.TEST_END_WEEK_LIST[submission_round - 1] + 1
)
train_item_list = list(itertools.product(store_list, brand_list, train_week_list))
train_item_df = pd.DataFrame.from_records(train_item_list, columns=["store", "brand", "week"])
test_item_list = list(itertools.product(store_list, brand_list, test_week_list))
test_item_df = pd.DataFrame.from_records(test_item_list, columns=["store", "brand", "week"])
train = train_item_df.merge(train, how="left", on=["store", "brand", "week"])
test = test_item_df.merge(test, how="left", on=["store", "brand", "week"])
# sort the train, test, series_popularity by store and brand
train = train.sort_values(by=["store", "brand", "week"], ascending=True)
test = test.sort_values(by=["store", "brand", "week"], ascending=True)
series_popularity = (
series_popularity.reset_index().sort_values(by=["store", "brand"], ascending=True).rename(columns={0: "pop"})
)
# calculate one-hot encoding for brands
enc = OneHotEncoder(categories="auto")
brand_train = np.reshape(train["brand"].values, (-1, 1))
brand_test = np.reshape(test["brand"].values, (-1, 1))
enc = enc.fit(brand_train)
brand_enc_train = enc.transform(brand_train).todense()
brand_enc_test = enc.transform(brand_test).todense()
# calculate price and price_ratio
price_cols = [
"price1",
"price2",
"price3",
"price4",
"price5",
"price6",
"price7",
"price8",
"price9",
"price10",
"price11",
]
train["price"] = train.apply(lambda x: x.loc["price" + str(int(x.loc["brand"]))], axis=1)
train["avg_price"] = train[price_cols].sum(axis=1).apply(lambda x: x / len(price_cols))
train["price_ratio"] = train["price"] / train["avg_price"]
test["price"] = test.apply(lambda x: x.loc["price" + str(int(x.loc["brand"]))], axis=1)
test["avg_price"] = test[price_cols].sum(axis=1).apply(lambda x: x / len(price_cols))
test["price_ratio"] = test.apply(lambda x: x["price"] / x["avg_price"], axis=1)
# fill the missing values for feat, deal, price, price_ratio with 0
for cl in ["price", "price_ratio", "feat", "deal"]:
train.loc[train[cl].isna(), cl] = 0
test.loc[test[cl].isna(), cl] = 0
# normalize features:
# 1) series popularity - 1
# 2) brand - 11
# 3) price: price and price_ratio
# 4) promo: feat and deal
series_popularity = series_popularity["pop"].values
series_popularity = (series_popularity - series_popularity.mean()) / np.std(series_popularity)
brand_enc_mean = brand_enc_train.mean(axis=0)
brand_enc_std = brand_enc_train.std(axis=0)
brand_enc_train = (brand_enc_train - brand_enc_mean) / brand_enc_std
brand_enc_test = (brand_enc_test - brand_enc_mean) / brand_enc_std
for cl in ["price", "price_ratio", "feat", "deal"]:
cl_mean = train[cl].mean()
cl_std = train[cl].std()
train[cl] = (train[cl] - cl_mean) / cl_std
test[cl] = (test[cl] - cl_mean) / cl_std
# create the following numpy array
# 1) ts_value_train (#ts, #train_ts_length)
# 2) feature_train (#ts, #train_ts_length, #features)
# 3) feature_test (#ts, #test_ts_length, #features)
ts_number = len(series_popularity)
train_min_time = bs.TRAIN_START_WEEK
train_max_time = bs.TRAIN_END_WEEK_LIST[submission_round - 1]
test_min_time = bs.TEST_START_WEEK_LIST[submission_round - 1] - 1
test_max_time = bs.TEST_END_WEEK_LIST[submission_round - 1]
train_ts_length = train_max_time - train_min_time + 1
test_ts_length = test_max_time - test_min_time + 1
# ts_value_train
# the target variable fed into the neural network is log(sales + 1), where sales = exp(logmove).
ts_value_train = np.log(np.exp(train["logmove"].values) + 1)
ts_value_train = ts_value_train.reshape((ts_number, train_ts_length))
# fill missing value with zero
ts_value_train = np.nan_to_num(ts_value_train)
# feature_train
series_popularity_train = np.repeat(series_popularity, train_ts_length).reshape((ts_number, train_ts_length, 1))
brand_number = brand_enc_train.shape[1]
brand_enc_train = np.array(brand_enc_train).reshape((ts_number, train_ts_length, brand_number))
price_promo_features_train = train[["price", "price_ratio", "feat", "deal"]].values.reshape(
(ts_number, train_ts_length, 4)
)
feature_train = np.concatenate((series_popularity_train, brand_enc_train, price_promo_features_train), axis=-1)
# feature_test
series_popularity_test = np.repeat(series_popularity, test_ts_length).reshape((ts_number, test_ts_length, 1))
brand_enc_test = np.array(brand_enc_test).reshape((ts_number, test_ts_length, brand_number))
price_promo_features_test = test[["price", "price_ratio", "feat", "deal"]].values.reshape(
(ts_number, test_ts_length, 4)
)
feature_test = np.concatenate((series_popularity_test, brand_enc_test, price_promo_features_test), axis=-1)
# save the numpy arrays
intermediate_data_dir = os.path.join(data_dir, "intermediate/round_{}".format(submission_round))
if not os.path.isdir(intermediate_data_dir):
os.makedirs(intermediate_data_dir)
np.save(os.path.join(intermediate_data_dir, "ts_value_train.npy"), ts_value_train)
np.save(os.path.join(intermediate_data_dir, "feature_train.npy"), feature_train)
np.save(os.path.join(intermediate_data_dir, "feature_test.npy"), feature_test)


@@ -1,28 +0,0 @@
# other requirements
pandas==0.23.0
tensorflow-gpu==1.10
# SMAC package pre - requirements
setuptools
cython
numpy>=1.7.1
scipy>=0.18.1
six
psutil
pynisher>=0.4.1
ConfigSpace>=0.4.6,<0.5
# scikit-learn>=0.18.0
scikit-learn==0.20.0
typing
pyrfr>=0.5.0
sphinx
sphinx_rtd_theme
joblib
nose>=1.3.0
pyDOE
sobol_seq
statsmodels
emcee>=2.1.0
george


@@ -1,92 +0,0 @@
"""
This script contains the function for creating predictions for the RNN model.
"""
# import packages
import os
import numpy as np
import tensorflow as tf
from utils import *
# define parameters
IS_TRAIN = False
def rnn_predict(
ts_value_train,
feature_train,
feature_test,
hparams,
predict_window,
intermediate_data_dir,
submission_round,
batch_size,
cut_mode="predict",
):
"""
This function creates predictions by loading the trained RNN model.
Args:
ts_value_train: Numpy array which contains the time series value in the
training dataset in shape of (#time series, #train_ts_length)
feature_train: Numpy array which contains the feature values in the
training dataset in shape of (#time series, #train_ts_length,
#features)
feature_test: Numpy array which contains the feature values for the
test dataset in shape of (#time series, #test_ts_length)
hparams: the tensorflow HParams object which contains the
hyperparameter of the RNN model.
predict_window: Integer, predict horizon.
intermediate_data_dir: String, the directory which stores the
intermediate results.
submission_round: Integer, the submission round.
batch_size: Integer, the batch size for making RNN predictions.
cut_mode: 'train', 'eval' or 'predict'.
Returns:
pred_o: Numpy array which contains the predictions in shape of
(#time series, #predict_window)
"""
# build the dataset
root_ds = tf.data.Dataset.from_tensor_slices((ts_value_train, feature_train, feature_test)).repeat(1)
batch = (
root_ds.map(
lambda *x: cut(
*x,
cut_mode=cut_mode,
train_window=hparams.train_window,
predict_window=predict_window,
ts_length=ts_value_train.shape[1],
back_offset=0
)
)
.map(normalize_target)
.batch(batch_size)
)
iterator = batch.make_initializable_iterator()
it_tensors = iterator.get_next()
true_x, true_y, feature_x, feature_y, norm_x, norm_mean, norm_std = it_tensors
# build the model, get the predictions
predictions = build_rnn_model(norm_x, feature_x, feature_y, norm_mean, norm_std, predict_window, IS_TRAIN, hparams)
# init the saver
saver = tf.train.Saver(name="eval_saver", var_list=None)
# read the saver from checkpoint
saver_path = os.path.join(intermediate_data_dir, "cpt_round_{}".format(submission_round))
paths = [p for p in tf.train.get_checkpoint_state(saver_path).all_model_checkpoint_paths]
checkpoint = paths[0]
# run the session
with tf.Session(config=tf.ConfigProto(gpu_options=tf.GPUOptions(allow_growth=True))) as sess:
sess.run(iterator.initializer)
saver.restore(sess, checkpoint)
(pred,) = sess.run([predictions])
# invert the prediction back to original scale
pred_o = np.exp(pred) - 1
pred_o = pred_o.astype(int)
return pred_o


@@ -1,176 +0,0 @@
"""
This script contains the function for training the RNN model.
"""
import tensorflow as tf
import os
from utils import *
import numpy as np
import shutil
MODE = "train"
IS_TRAIN = True
def rnn_train(
ts_value_train,
feature_train,
feature_test,
hparams,
predict_window,
intermediate_data_dir,
submission_round,
back_offset=0,
):
"""
This function trains the RNN model and saves it to the disk.
Args:
ts_value_train: Numpy array which contains the time series value in the
training dataset in shape of (#time series, #train_ts_length)
feature_train: Numpy array which contains the feature values in the
training dataset in shape of (#time series, #train_ts_length,
#features)
feature_test: Numpy array which contains the feature values for the
test dataset in shape of (#time series, #test_ts_length)
hparams: the tensorflow HParams object which contains the
hyperparameter of the RNN model.
predict_window: Integer, predict horizon.
intermediate_data_dir: String, the directory which stores the
intermediate results.
submission_round: Integer, the submission round.
back_offset: how many data points at end of time series
cannot be used for training.
Returns:
training MAPE.
"""
max_train_empty_percentage = 0.5
max_train_empty = int(round(hparams.train_window * max_train_empty_percentage))
# build the dataset
root_ds = (
tf.data.Dataset.from_tensor_slices((ts_value_train, feature_train, feature_test))
.shuffle(ts_value_train.shape[0], reshuffle_each_iteration=True)
.repeat()
)
batch = (
root_ds.map(
lambda *x: cut(
*x,
cut_mode=MODE,
train_window=hparams.train_window,
predict_window=predict_window,
ts_length=ts_value_train.shape[1],
back_offset=back_offset
)
)
.filter(lambda *x: reject_filter(max_train_empty, *x))
.map(normalize_target)
.batch(hparams.batch_size)
)
iterator = batch.make_initializable_iterator()
it_tensors = iterator.get_next()
true_x, true_y, feature_x, feature_y, norm_x, norm_mean, norm_std = it_tensors
# build the model, get the predictions
predictions = build_rnn_model(norm_x, feature_x, feature_y, norm_mean, norm_std, predict_window, IS_TRAIN, hparams)
# calculate loss on log scale
mae_loss = calc_mae_loss(true_y, predictions)
# calculate the differentiable mape loss on the original scale; this is the metric to be optimized
mape_loss = calc_differentiable_mape_loss(true_y, predictions)
# calculate the rounded mape on the original scale; this is the metric identical to the final evaluation metric
mape = calc_rounded_mape(true_y, predictions)
# Sum all losses
total_loss = mape_loss
train_op, glob_norm, ema = make_train_op(
total_loss, hparams.learning_rate, hparams.beta1, hparams.beta2, hparams.epsilon, hparams.asgd_decay
)
train_size = ts_value_train.shape[0]
steps_per_epoch = train_size // hparams.batch_size
global_step = tf.Variable(0, name="global_step", trainable=False)
inc_step = tf.assign_add(global_step, 1)
saver = tf.train.Saver(max_to_keep=1, name="train_saver")
init = tf.global_variables_initializer()
results_mae = []
results_mape = []
results_mape_loss = []
with tf.Session(
config=tf.ConfigProto(allow_soft_placement=True, gpu_options=tf.GPUOptions(allow_growth=False))
) as sess:
sess.run(init)
sess.run(iterator.initializer)
for epoch in range(hparams.max_epoch):
results_epoch_mae = []
results_epoch_mape = []
results_epoch_mape_loss = []
for _ in range(steps_per_epoch):
try:
ops = [inc_step]
ops.extend([train_op])
ops.extend([mae_loss, mape, mape_loss, glob_norm])
# for debug
# ops.extend([predictions, true_x, true_y, feature_x, feature_y, norm_x, norm_mean, norm_std])
results = sess.run(ops)
# get the results
step = results[0]
step_mae = results[2]
step_mape = results[3]
step_mape_loss = results[4]
# for debug
# step_predictions = results[6]
# step_true_x = results[7]
# step_true_y = results[8]
# step_feature_x = results[9]
# step_feature_y = results[10]
# step_norm_x = results[11]
# step_norm_mean = results[12]
# step_norm_std = results[13]
#
# print(
# 'step: {}, MAE: {}, MAPE: {}, MAPE_LOSS: {}'.format(step, step_mae, step_mape, step_mape_loss))
results_epoch_mae.append(step_mae)
results_epoch_mape.append(step_mape)
results_epoch_mape_loss.append(step_mape_loss)
except tf.errors.OutOfRangeError:
break
# append the results
results_mae.append(results_epoch_mae)
results_mape.append(results_epoch_mape)
results_mape_loss.append(results_epoch_mape_loss)
step = results[0]
saver_path = os.path.join(intermediate_data_dir, "cpt_round_{}".format(submission_round))
if os.path.exists(saver_path):
shutil.rmtree(saver_path)
saver.save(sess, os.path.join(saver_path, "cpt"), global_step=step, write_state=True)
# look at the training results
# examine step_mae and step_mape_loss
# print('MAE in epochs')
# print(np.mean(results_mae, axis=1))
# print('MAPE LOSS in epochs')
# print(np.mean(results_mape_loss, axis=1))
# print('MAPE in epochs')
# print(np.mean(results_mape, axis=1))
return np.mean(results_mape, axis=1)[-1]


@@ -1,189 +0,0 @@
"""
This script trains the RNN model and creates the predictions for each
submission round required by the benchmark.
"""
# import packages
import os
import inspect
import itertools
import argparse
import sys
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow.contrib.training as training
from rnn_train import rnn_train
from rnn_predict import rnn_predict
from make_features import make_features
import hparams
from utils import *
# Add TSPerf root directory to sys.path
file_dir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
tsperf_dir = os.path.join(file_dir, "../../../../")
if tsperf_dir not in sys.path:
sys.path.append(tsperf_dir)
import retail_sales.OrangeJuice_Pt_3Weeks_Weekly.common.benchmark_settings as bs
data_relative_dir = "../../data"
def create_round_prediction(
data_dir,
submission_round,
hparams,
make_features_flag=True,
train_model_flag=True,
train_back_offset=0,
predict_cut_mode="predict",
random_seed=1,
):
"""
This function trains the model and creates the predictions for a certain
submission round.
"""
# conduct feature engineering and save related numpy array to disk
if make_features_flag:
make_features(submission_round=submission_round)
# read the numpy arrays output from the make_features.py
# file_dir = './prototypes/retail_rnn_model'
intermediate_data_dir = os.path.join(data_dir, "intermediate/round_{}".format(submission_round))
ts_value_train = np.load(os.path.join(intermediate_data_dir, "ts_value_train.npy"))
feature_train = np.load(os.path.join(intermediate_data_dir, "feature_train.npy"))
feature_test = np.load(os.path.join(intermediate_data_dir, "feature_test.npy"))
# convert the dtype to float32 to satisfy tensorflow cudnn_rnn requirements.
ts_value_train = ts_value_train.astype(dtype="float32")
feature_train = feature_train.astype(dtype="float32")
feature_test = feature_test.astype(dtype="float32")
# define parameters
# constant
predict_window = feature_test.shape[1]
# train the rnn model
if train_model_flag:
tf.reset_default_graph()
tf.set_random_seed(seed=random_seed)
train_error = rnn_train(
ts_value_train,
feature_train,
feature_test,
hparams,
predict_window,
intermediate_data_dir,
submission_round,
back_offset=train_back_offset,
)
# make prediction
tf.reset_default_graph()
pred_batch_size = 1024
pred_o = rnn_predict(
ts_value_train,
feature_train,
feature_test,
hparams,
predict_window,
intermediate_data_dir,
submission_round,
pred_batch_size,
cut_mode=predict_cut_mode,
)
return pred_o, train_error
def create_round_submission(
data_dir,
submission_round,
hparams,
make_features_flag=True,
train_model_flag=True,
train_back_offset=0,
predict_cut_mode="predict",
random_seed=1,
):
"""
This function trains the model and creates the submission in pandas
DataFrame for a certain submission round.
"""
pred_o, _ = create_round_prediction(
data_dir,
submission_round,
hparams,
make_features_flag=make_features_flag,
train_model_flag=train_model_flag,
train_back_offset=train_back_offset,
predict_cut_mode=predict_cut_mode,
random_seed=random_seed,
)
# get rid of prediction at horizon 1
pred_sub = pred_o[:, 1:].reshape((-1))
# arrange the predictions into pd.DataFrame
# read in the test_file for this round
train_file = os.path.join(data_dir, "train/train_round_{}.csv".format(submission_round))
test_file = os.path.join(data_dir, "train/aux_round_{}.csv".format(submission_round))
train = pd.read_csv(train_file, index_col=False)
test = pd.read_csv(test_file, index_col=False)
train_last_week = bs.TRAIN_END_WEEK_LIST[submission_round - 1]
store_list = train["store"].unique()
brand_list = train["brand"].unique()
test_week_list = range(
bs.TEST_START_WEEK_LIST[submission_round - 1], bs.TEST_END_WEEK_LIST[submission_round - 1] + 1
)
test_item_list = list(itertools.product(store_list, brand_list, test_week_list))
test_item_df = pd.DataFrame.from_records(test_item_list, columns=["store", "brand", "week"])
test = test_item_df.merge(test, how="left", on=["store", "brand", "week"])
submission = test.sort_values(by=["store", "brand", "week"], ascending=True)
submission["round"] = submission_round
submission["weeks_ahead"] = submission["week"] - train_last_week
submission["prediction"] = pred_sub
submission = submission[["round", "store", "brand", "week", "weeks_ahead", "prediction"]]
return submission
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=int, dest="seed", default=1, help="random seed")
args = parser.parse_args()
random_seed = args.seed
# set the data directory
file_dir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
data_dir = os.path.join(file_dir, data_relative_dir)
# import hyper parameters
# TODO: add EMA in the code to improve the performance
hparams_dict = hparams.hparams_smac
hparams = training.HParams(**hparams_dict)
num_round = len(bs.TEST_END_WEEK_LIST)
pred_all = pd.DataFrame()
for R in range(1, num_round + 1):
print("create submission for round {}...".format(R))
round_submission = create_round_submission(data_dir, R, hparams, random_seed=random_seed)
pred_all = pred_all.append(round_submission)
file_dir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
submission_dir = file_dir
if not os.path.isdir(submission_dir):
os.makedirs(submission_dir)
submission_file = os.path.join(submission_dir, "submission_seed_{}.csv".format(str(random_seed)))
pred_all.to_csv(submission_file, index=False)


@@ -1,415 +0,0 @@
import numpy as np
import tensorflow as tf
from tensorflow.python.util import nest
import tensorflow.contrib.layers as layers
import tensorflow.contrib.rnn as rnn
import tensorflow.contrib.cudnn_rnn as cudnn_rnn
RNN = cudnn_rnn.CudnnGRU
GRAD_CLIP_THRESHOLD = 10
# input pipe utils
def cut(
ts_value_train_slice,
feature_train_slice,
feature_test_slice,
train_window,
predict_window,
ts_length,
cut_mode="train",
back_offset=0,
):
"""
Cut each element of the tensorflow dataset into x and y for supervised
learning.
Args:
ts_value_train_slice: shape of (#train_ts_length,)
feature_train_slice: shape of (#train_ts_length, #features)
feature_test_slice: shape of (#test_ts_length, #features)
cut_mode: 'train', 'eval' or 'predict'.
back_offset: how many data points at end of time series
cannot be used for training.
set back_offset = predict_window for training
during hyper parameter tuning.
Returns:
an element of a tensorflow dataset which contains:
true_x: (#train_window,)
true_y: (#predict_window,)
feature_x: (#train_window, #features)
feature_y: (#predict_window, #features)
"""
if cut_mode in ["train", "eval"]:
if cut_mode == "train":
min_start_idx = 0
max_start_idx = (ts_length - back_offset) - (train_window + predict_window)
train_start = tf.random_uniform((), min_start_idx, max_start_idx, dtype=tf.int32)
elif cut_mode == "eval":
train_start = ts_length - (train_window + predict_window)
train_end = train_start + train_window
test_start = train_end
test_end = test_start + predict_window
true_x = ts_value_train_slice[train_start:train_end]
true_y = ts_value_train_slice[test_start:test_end]
feature_x = feature_train_slice[train_start:train_end]
feature_y = feature_train_slice[test_start:test_end]
else:
train_start = ts_length - train_window
train_end = ts_length
true_x = ts_value_train_slice[train_start:train_end]
true_y = tf.fill((predict_window,), np.nan)
feature_x = feature_train_slice[train_start:train_end]
feature_y = feature_test_slice
return true_x, true_y, feature_x, feature_y
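# Worked example (hypothetical numbers): with ts_length=100, train_window=60,
# predict_window=3 and back_offset=3, 'train' mode draws train_start uniformly
# from [0, 34) so the window never touches the held-back points, while 'eval'
# mode pins train_start to 37 so the final 3 points of the series become the
# evaluation target.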
def reject_filter(max_train_empty, true_x, *args):
"""
Rejects time series having too many zero data points (more than
max_train_empty)
"""
zeros_x = tf.reduce_sum(tf.to_int32(tf.equal(true_x, 0.0)))
keep = zeros_x <= max_train_empty
return keep
def normalize_target(true_x, true_y, feature_x, feature_y):
"""
Normalize the target variable.
"""
masked_true_x = tf.boolean_mask(true_x, tf.logical_not(tf.is_nan(true_x)))
norm_mean = tf.reduce_mean(masked_true_x)
norm_std = tf.sqrt(tf.reduce_mean(tf.squared_difference(masked_true_x, norm_mean)))
norm_x = (true_x - norm_mean) / norm_std
return true_x, true_y, feature_x, feature_y, norm_x, norm_mean, norm_std
def make_encoder(time_inputs, is_train, hparams):
"""
Builds the encoder part of the RNN model.
Args:
time_inputs: The input to the encoder with shape (batch, time, features)
is_train: whether it is during training or prediction. The dropout is
only applied during training.
hparams: the tensorflow HParams object which contains the
hyperparameter of the RNN model.
Returns:
rnn_out: the output of the RNN encoder with shape of (time, batch,
rnn_depth)
rnn_state: the final output state of the RNN encoder with shape of (
num_layers, batch, rnn_depth)
"""
def build_rnn():
return RNN(
num_layers=hparams.encoder_rnn_layers,
num_units=hparams.rnn_depth,
kernel_initializer=tf.initializers.random_uniform(minval=-0.05, maxval=0.05),
direction="unidirectional",
dropout=hparams.encoder_dropout if is_train else 0,
)
cuda_model = build_rnn()
# [batch, time, features] -> [time, batch, features]
time_first = tf.transpose(time_inputs, [1, 0, 2])
rnn_time_input = time_first
# rnn_out: (time, batch, rnn_depth)
# rnn_state: (num_layers, batch, rnn_depth)
rnn_out, (rnn_state,) = cuda_model(inputs=rnn_time_input)
return rnn_out, rnn_state
def convert_cudnn_state_v2(h_state, hparams, dropout=1.0):
"""
Converts RNN state tensor from cuDNN representation to TF RNNCell
compatible representation.
Args:
h_state: tensor [num_layers, batch_size, depth]
hparams: the tensorflow HParams object which contains the
hyperparameter of the RNN model.
dropout: The dropout rate between encoder and decoder.
Returns:
The input to the decoder, which is the TF cell representation matching
RNNCell.state_size structure for compatible cell.
"""
def squeeze(seq):
return tuple(seq) if len(seq) > 1 else seq[0]
def wrap_dropout(structure):
if dropout < 1.0:
return nest.map_structure(lambda x: tf.nn.dropout(x, keep_prob=dropout), structure)
else:
return structure
# Cases:
# decoder_layer = encoder_layers, straight mapping
# encoder_layers > decoder_layers: get outputs of upper encoder layers
# encoder_layers < decoder_layers: feed encoder outputs to lower decoder layers, feed zeros to top layers
h_layers = tf.unstack(h_state)
if hparams.encoder_rnn_layers >= hparams.decoder_rnn_layers:
return squeeze(wrap_dropout(h_layers[hparams.encoder_rnn_layers - hparams.decoder_rnn_layers :]))
else:
lower_inputs = wrap_dropout(h_layers)
upper_inputs = [
tf.zeros_like(h_layers[0]) for _ in range(hparams.decoder_rnn_layers - hparams.encoder_rnn_layers)
]
return squeeze(lower_inputs + upper_inputs)
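# Note: the tuning script's eval_function fixes encoder_rnn_layers and
# decoder_rnn_layers to 1, in which case the state of the single cuDNN layer
# is passed straight through to the decoder (with optional gate dropout).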
def default_init():
# replica of tf.glorot_uniform_initializer(seed=seed)
return layers.variance_scaling_initializer(factor=1.0, mode="FAN_AVG", uniform=True)
def decoder(encoder_state, prediction_inputs, previous_y, hparams, is_train, predict_window):
"""
Build the decoder part for the RNN model.
Args:
encoder_state: shape [batch_size, encoder_rnn_depth]
prediction_inputs: features for prediction days,
tensor[batch_size, time, input_depth]
previous_y: the last observed target value, shape [batch_size]
hparams: the tensorflow HParams object which contains the
hyperparameter of the RNN model.
is_train: whether it is during training or prediction. The dropout is
only applied during training.
predict_window: the horizon of the prediction.
Returns:
The time series predictions with length of predict_window.
"""
def build_cell(idx):
with tf.variable_scope("decoder_cell", initializer=default_init()):
cell = rnn.GRUBlockCell(hparams.rnn_depth)
has_dropout = (
hparams.decoder_input_dropout[idx] < 1
or hparams.decoder_state_dropout[idx] < 1
or hparams.decoder_output_dropout[idx] < 1
)
if is_train and has_dropout:
input_size = prediction_inputs.shape[-1].value + 1 if idx == 0 else hparams.rnn_depth
cell = rnn.DropoutWrapper(
cell,
dtype=tf.float32,
input_size=input_size,
variational_recurrent=hparams.decoder_variational_dropout[idx],
input_keep_prob=hparams.decoder_input_dropout[idx],
output_keep_prob=hparams.decoder_output_dropout[idx],
state_keep_prob=hparams.decoder_state_dropout[idx],
)
return cell
if hparams.decoder_rnn_layers > 1:
cells = [build_cell(idx) for idx in range(hparams.decoder_rnn_layers)]
cell = rnn.MultiRNNCell(cells)
else:
cell = build_cell(0)
nest.assert_same_structure(encoder_state, cell.state_size)
# [batch_size, time, input_depth] -> [time, batch_size, input_depth]
inputs_by_time = tf.transpose(prediction_inputs, [1, 0, 2])
# Stop condition for decoding loop
def cond_fn(time, prev_output, prev_state, array_targets: tf.TensorArray, array_outputs: tf.TensorArray):
return time < predict_window
# FC projecting layer to get single predicted value from RNN output
def project_output(tensor):
return tf.layers.dense(tensor, 1, name="decoder_output_proj", kernel_initializer=default_init())
def loop_fn(time, prev_output, prev_state, array_targets: tf.TensorArray, array_outputs: tf.TensorArray):
"""
Main decoder loop.
Args:
time: time series step number.
prev_output: Output(prediction) from previous step.
prev_state: RNN state tensor from previous step.
array_targets: Predictions, each step will append new value to
this array.
array_outputs: Raw RNN outputs (for regularization losses)
Returns:
(time + 1, projected_output, state, array_targets, array_outputs)
projected_output: the prediction for this step.
state: the updated state for this step.
array_targets: the updated targets array.
array_outputs: the updated hidden states array.
"""
# RNN inputs for current step
features = inputs_by_time[time]
# inputs_by_time[time] has shape [batch_size, input_depth]
# Append previous predicted value to input features
next_input = tf.concat([prev_output, features], axis=1)
# Run RNN cell
output, state = cell(next_input, prev_state)
# Make prediction from RNN outputs
projected_output = project_output(output)
# Append step results to the buffer arrays
array_targets = array_targets.write(time, projected_output)
# Increment time and return
return time + 1, projected_output, state, array_targets, array_outputs
# Initial values for loop
loop_init = [
tf.constant(0, dtype=tf.int32),
tf.expand_dims(previous_y, -1),
encoder_state,
tf.TensorArray(dtype=tf.float32, size=predict_window),
tf.constant(0),
]
# Run the loop
_, _, _, targets_ta, outputs_ta = tf.while_loop(cond_fn, loop_fn, loop_init)
# Get final tensors from buffer arrays
targets = targets_ta.stack()
# [time, batch_size, 1] -> [time, batch_size]
targets = tf.squeeze(targets, axis=-1)
return targets
def decode_predictions(decoder_readout, norm_mean, norm_std):
"""
Reverts normalization on the prediction.
Args:
decoder_readout: Decoder output, shape [predict_window, batch]
norm_mean: normalized mean for this time series sample.
norm_std: normalized standard deviation for this time series sample.
Returns:
The de-normalized prediction in original data scale.
"""
# [n_days, batch] -> [batch, n_days]
batch_readout = tf.transpose(decoder_readout)
batch_std = tf.expand_dims(norm_std, -1)
batch_mean = tf.expand_dims(norm_mean, -1)
return batch_readout * batch_std + batch_mean
def build_rnn_model(norm_x, feature_x, feature_y, norm_mean, norm_std, predict_window, is_train, hparams):
"""
For a single supervised learning time series sample, feed the input
features and historical time series value into the RNN model, and create
the predictions for this time series sample.
"""
# build the encoder-decoder RNN model
# make encoder
x_all_features = tf.concat([tf.expand_dims(norm_x, -1), feature_x], axis=-1)
encoder_output, h_state = make_encoder(x_all_features, is_train, hparams)
# convert the encoder state
encoder_state = convert_cudnn_state_v2(h_state, hparams, dropout=hparams.gate_dropout if is_train else 1.0)
# Run decoder
decoder_targets = decoder(
encoder_state, feature_y, norm_x[:, -1], hparams, is_train=is_train, predict_window=predict_window
)
# get predictions
predictions = decode_predictions(decoder_targets, norm_mean, norm_std)
return predictions
def calc_mae_loss(true_y, predictions):
"""
Calculate the MAE loss.
"""
# calculate loss
mask = tf.logical_not(tf.math.equal(true_y, tf.zeros_like(true_y)))
# Fill NaNs by zeros (can use any value)
# Assign zero weight to zeros, will not calculate loss for those true_y.
weights = tf.to_float(mask)
mae_loss = tf.losses.absolute_difference(labels=true_y, predictions=predictions, weights=weights)
return mae_loss
def calc_differentiable_mape_loss(true_y, predictions):
"""
Calculate the differentiable MAPE loss.
"""
# calculate loss
mask = tf.logical_not(tf.math.equal(true_y, tf.zeros_like(true_y)))
# Fill NaNs by zeros (can use any value)
# Assign zero weight to zeros, will not calculate loss for those true_y.
weights = tf.to_float(mask)
# mape_loss
epsilon = 0.1 # Smoothing factor, helps MAPE to be well-behaved near zero
true_o = tf.expm1(true_y)
pred_o = tf.expm1(predictions)
mape_loss_origin = tf.abs(pred_o - true_o) / (tf.abs(true_o) + epsilon)
mape_loss = tf.losses.compute_weighted_loss(mape_loss_origin, weights, loss_collection=None)
return mape_loss
def calc_rounded_mape(true_y, predictions):
"""
Calculate the rounded MAPE.
"""
# mape
true_o1 = tf.round(tf.expm1(true_y))
pred_o1 = tf.maximum(tf.round(tf.expm1(predictions)), 0.0)
raw_mape = tf.abs(pred_o1 - true_o1) / tf.abs(true_o1)
raw_mape_mask = tf.is_finite(raw_mape)
raw_mape_weights = tf.to_float(raw_mape_mask)
raw_mape_filled = tf.where(raw_mape_mask, raw_mape, tf.zeros_like(raw_mape))
mape = tf.losses.compute_weighted_loss(raw_mape_filled, raw_mape_weights, loss_collection=None)
return mape
def make_train_op(loss, learning_rate, beta1, beta2, epsilon, ema_decay=None, prefix=None):
"""
Creates the training operation which updates the gradient using the
AdamOptimizer.
"""
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate, beta1=beta1, beta2=beta2, epsilon=epsilon)
glob_step = tf.train.get_global_step()
# Add regularization losses
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
total_loss = loss + reg_losses if reg_losses else loss
# Clip gradients
grads_and_vars = optimizer.compute_gradients(total_loss)
gradients, variables = zip(*grads_and_vars)
clipped_gradients, glob_norm = tf.clip_by_global_norm(gradients, GRAD_CLIP_THRESHOLD)
sgd_op, glob_norm = optimizer.apply_gradients(zip(clipped_gradients, variables)), glob_norm
# Apply SGD averaging
if ema_decay:
ema = tf.train.ExponentialMovingAverage(decay=ema_decay, num_updates=glob_step)
if prefix:
# Some magic to handle multiple models trained in single graph
ema_vars = [var for var in variables if var.name.startswith(prefix)]
else:
ema_vars = variables
update_ema = ema.apply(ema_vars)
with tf.control_dependencies([sgd_op]):
training_op = tf.group(update_ema)
else:
training_op = sgd_op
ema = None
return training_op, glob_norm, ema
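
To make the epsilon smoothing in `calc_differentiable_mape_loss` concrete, here is an equivalent unweighted NumPy sketch; it ignores the zero-weight masking used above, for brevity:

```python
import numpy as np

def smoothed_mape_loss(true_y, pred_y, eps=0.1):
    """true_y and pred_y are on the log(sales + 1) scale, as in training."""
    true_o = np.expm1(true_y)   # back to the original sales scale
    pred_o = np.expm1(pred_y)
    return np.mean(np.abs(pred_o - true_o) / (np.abs(true_o) + eps))
```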


@@ -1,51 +0,0 @@
## Download base image
FROM ubuntu:16.04
WORKDIR /tmp
## Install basic packages
RUN apt-get update && apt-get install -y --no-install-recommends \
wget \
zlib1g-dev \
libssl-dev \
libssh2-1-dev \
libcurl4-openssl-dev \
libreadline-gplv2-dev \
libncursesw5-dev \
libsqlite3-dev \
tk-dev \
libgdbm-dev \
libc6-dev \
libbz2-dev \
libffi-dev \
bzip2 \
build-essential \
checkinstall \
ca-certificates \
curl \
lsb-release \
apt-utils \
python3-pip \
vim
## Install R
ENV R_BASE_VERSION 3.5.1
RUN sh -c 'echo "deb http://cloud.r-project.org/bin/linux/ubuntu xenial-cran35/" >> /etc/apt/sources.list' \
&& gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9 \
&& gpg -a --export E084DAB9 | apt-key add -
RUN apt-get update && apt-get install -y --no-install-recommends r-base=${R_BASE_VERSION}-* \
&& echo 'options(repos = c(CRAN = "https://cloud.r-project.org"))' >> /etc/R/Rprofile.site
## Mount R dependency file into the docker container and install dependencies
# Install prerequisites of 'forecast' package
RUN apt-get update && apt-get install -y \
gfortran \
libblas-dev \
liblapack-dev
# Use a MRAN snapshot URL to download packages archived on a specific date
RUN echo 'options(repos = list(CRAN = "http://mran.revolutionanalytics.com/snapshot/2018-08-27/"))' >> /etc/R/Rprofile.site
ADD ./install_R_dependencies.r /tmp
RUN Rscript install_R_dependencies.r
RUN rm ./install_R_dependencies.r
WORKDIR /
ENTRYPOINT ["/bin/bash"]


@@ -1,192 +0,0 @@
# Implementation submission form
## Submission details
**Submission date**: 09/01/2018
**Benchmark name:** OrangeJuice_Pt_3Weeks_Weekly
**Submitter(s):** Chenhui Hu
**Submitter(s) email:** chenhhu@microsoft.com
**Submission name:** SeasonalNaive
**Submission path:** retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/SeasonalNaive
## Implementation description
### Modelling approach
In this submission, we implement the seasonal naive forecast method using the R package `forecast`.
### Feature engineering
Only the weekly sales of each orange juice brand have been used in the implementation of the forecast method.
### Hyperparameter tuning
Default hyperparameters of the forecasting algorithm are used. Additionally, the frequency of the weekly sales time series is set to 52,
since there are approximately 52 weeks in a year.
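Concretely, with the frequency set to $m = 52$, the seasonal naive forecast at horizon $h \le m$ simply repeats the value observed one year earlier: $\hat{y}_{T+h} = y_{T+h-m}$.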
### Description of implementation scripts
* `train_score.r`: R script that trains the model and evaluates its performance
* `seasonal_naive.Rmd` (optional): R markdown that trains the model and visualizes the results
* `seasonal_naive.nb.html` (optional): Html file associated with the R markdown file
### Steps to reproduce results
0. Follow the instructions [here](#resource-deployment-instructions) to provision a Linux virtual machine and log into the provisioned
VM.
1. Clone the Forecasting repo to the home directory of your machine
```bash
cd ~
git clone https://github.com/Microsoft/Forecasting.git
```
Use one of the following options to securely connect to the Git repo:
* [Personal Access Tokens](https://help.github.com/articles/creating-a-personal-access-token-for-the-command-line/)
For this method, the clone command becomes
```bash
git clone https://<username>:<personal access token>@github.com/Microsoft/Forecasting.git
```
* [Git Credential Managers](https://github.com/Microsoft/Git-Credential-Manager-for-Windows)
* [Authenticate with SSH](https://help.github.com/articles/connecting-to-github-with-ssh/)
2. Create a conda environment for running the scripts of data downloading, data preparation, and result evaluation. To do this, you need
to check if conda has been installed by running the command `conda -V`. If it is installed, you will see the conda version in the terminal. Otherwise, please follow the instructions [here](https://conda.io/docs/user-guide/install/linux.html) to install conda. Then, you can go to the `~/Forecasting` directory in the VM and create a conda environment named `tsperf` by
```bash
conda env create --file ./common/conda_dependencies.yml
```
This will create a conda environment with the Python and R packages listed in `conda_dependencies.yml` being installed. The conda
environment name is also defined in the yml file.
3. Activate the conda environment and download the Orange Juice dataset. Use command `source activate tsperf` to activate the conda environment. Then, download the Orange Juice dataset by running the following command from `~/Forecasting` directory
```bash
Rscript ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/common/download_data.r
```
This will create a data directory `./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/data` and store the dataset in this directory. The dataset has two csv files - `yx.csv` and `storedemo.csv` which contain the sales information and store demographic information, respectively.
4. From `~/Forecasting` directory, run the following command to generate the training data and testing data for each forecast period:
```bash
python ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/common/serve_folds.py --test --save
```
This will generate 12 csv files named `train_round_#.csv` and 12 csv files named `test_round_#.csv` in two subfolders `/train` and
`/test` under the data directory, respectively. After running the above command, you can deactivate the conda environment by running
`source deactivate`.
5. Make sure Docker is installed
You can check if Docker is installed on your VM by running
```bash
sudo docker -v
```
You will see the Docker version if Docker is installed. If not, you can install it by following the instructions [here](https://docs.docker.com/install/linux/docker-ce/ubuntu/). Note that if you want to execute Docker commands without sudo as a non-root user, you need to create a Unix group and add users to it by following the instructions [here](https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user).
6. Build a local Docker image by running the following command from `~/Forecasting` directory
```bash
sudo docker build -t baseline_image:v1 ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/SeasonalNaive
```
7. Choose a name for a new Docker container (e.g. snaive_container) and create it using command:
```bash
sudo docker run -it -v ~/Forecasting:/Forecasting --name snaive_container baseline_image:v1
```
Note that option `-v ~/Forecasting:/Forecasting` allows you to mount `~/Forecasting` folder (the one you cloned) to the container so that you will have
access to the source code in the container.
8. Inside `/Forecasting` folder, train the model and make predictions by running
```bash
cd /Forecasting
source ./common/train_score_vm ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/SeasonalNaive R
```
This will generate 5 `submission_seed_<seed number>.csv` files in the submission directory, where \<seed number\>
is between 1 and 5. This command will also output 5 running times of train_score.r. The median of the times
reported in rows starting with 'real' should be compared against the wallclock time declared in benchmark
submission. After generating the forecast results, you can exit the Docker container by command `exit`.
9. Activate conda environment again by `source activate tsperf`. Then, evaluate the benchmark quality by running
```bash
source ./common/evaluate ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/SeasonalNaive ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly
```
This command will output 5 benchmark quality values (MAPEs). Their median should be compared against the
benchmark quality declared in benchmark submission.
## Implementation resources
**Platform:** Azure Cloud
**Resource location:** East US
**Hardware:** Standard D2s v3 (2 vcpus, 8 GB memory, 16 GB temporary storage) Ubuntu Linux VM
**Data storage:** Premium SSD
**Dockerfile:** [retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/SeasonalNaive/Dockerfile](https://github.com/Microsoft/Forecasting/blob/master/retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/SeasonalNaive/Dockerfile)
**Key packages/dependencies:**
* R
- r-base==3.5.1
- forecast==8.1
## Resource deployment instructions
We use Azure Linux VM to develop the baseline methods. Please follow the instructions below to deploy the resource.
* Azure Linux VM deployment
- Create an Azure account and log into the [Azure portal](https://portal.azure.com/)
- Refer to the steps [here](https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/dsvm-ubuntu-intro) to deploy a Data
Science Virtual Machine for Linux (Ubuntu). Select *D2s_v3* as the virtual machine size.
## Implementation evaluation
**Quality:**
*MAPE run 1: 165.06%*
*MAPE run 2: 165.06%*
*MAPE run 3: 165.06%*
*MAPE run 4: 165.06%*
*MAPE run 5: 165.06%*
*median MAPE: 165.06%*
**Time:**
*run time 1: 159.44 seconds*
*run time 2: 160.45 seconds*
*run time 3: 162.24 seconds*
*run time 4: 158.73 seconds*
*run time 5: 160.83 seconds*
*median run time: 160.45 seconds*
**Cost:** The hourly cost of the D2s v3 Ubuntu Linux VM in East US Azure region is 0.096 USD, based on the price at the submission date. Thus, the total cost is 160.45/3600 $\times$ 0.096 = $0.0043.
Note that there is no randomness in the forecasts obtained by the above method. Thus, quality values do not change over
different runs.


@@ -1,11 +0,0 @@
pkgs <- c(
'optparse',
'dplyr',
'tidyr',
'forecast',
'MLmetrics'
)
install.packages(pkgs)


@@ -1,184 +0,0 @@
---
title: "Seasonal Naive Method for Retail Forecasting Benchmark - OrangeJuice_Pt_3Weeks_Weekly"
output: html_notebook
---
```{r}
## Import packages
library(dplyr)
library(tidyr)
library(forecast)
library(MLmetrics)
## Define parameters
NUM_ROUNDS <- 12
TRAIN_START_WEEK <- 40
TRAIN_END_WEEK_LIST <- seq(135, 157, 2)
TEST_START_WEEK_LIST <- seq(137, 159, 2)
TEST_END_WEEK_LIST <- seq(138, 160, 2)
# Get the path of the current script and paths of data directories
SCRIPT_PATH <- dirname(rstudioapi::getSourceEditorContext()$path)
TRAIN_DIR <- file.path(dirname(dirname(SCRIPT_PATH)), 'data', 'train')
TEST_DIR <- file.path(dirname(dirname(SCRIPT_PATH)), 'data', 'test')
```
```{r}
#### Test snaive method on a subset of the data ####
## Import data
r <- 1
train_df <- read.csv(file.path(TRAIN_DIR, paste0('train_round_', as.character(r), '.csv')))
#head(train_df)
## Fill missing values
store_list <- unique(train_df$store)
brand_list <- unique(train_df$brand)
week_list <- TRAIN_START_WEEK:TRAIN_END_WEEK_LIST[r]
data_grid <- expand.grid(store = store_list,
brand = brand_list,
week = week_list)
train_filled <- merge(data_grid, train_df,
by = c('store', 'brand', 'week'),
all.x = TRUE)
train_filled <- train_filled[,c('store','brand','week','logmove')]
head(train_filled)
print('Number of rows with missing values:')
print(sum(!complete.cases(train_filled)))
# Fill missing logmove
train_filled <-
train_filled %>%
group_by(store, brand) %>%
arrange(week) %>%
fill(logmove) %>%
fill(logmove, .direction = 'up')
head(train_filled)
print('Number of rows with missing values after filling:')
print(sum(!complete.cases(train_filled)))
## Seasonal naive method
train_sub <- filter(train_filled, store=='2', brand=='1')
train_ts <- ts(train_sub[c('logmove')], frequency = 52)
horizon <- TEST_END_WEEK_LIST[r] - TRAIN_END_WEEK_LIST[r]
pred_snaive <- snaive(train_ts, h=horizon)
print('Seasonal naive forecasts:')
pred_snaive$mean[2:horizon]
plot(pred_snaive, main='Seasonal Naive')
```
```{r}
#### Implement snaive method on all the data ####
basic_method <- 'snaive'
pred_basic_all <- list()
print(paste0('Using ', basic_method))
## Basic methods
apply_basic_methods <- function(train_sub, method, r) {
# Trains a basic model to forecast sales of each store-brand in a certain round.
#
# Args:
# train_sub (Dataframe): Training data of a certain store-brand
# method (String): Name of the basic method which can be 'naive', 'snaive',
# 'meanf', 'ets', or 'arima'
# r (Integer): Index of the forecast round
#
# Returns:
# pred_basic_df (Dataframe): Predicted sales of the current store-brand
cur_store <- train_sub$store[1]
cur_brand <- train_sub$brand[1]
train_ts <- ts(train_sub[c('logmove')], frequency = 52)
if (method == 'naive'){
pred_basic <- naive(train_ts, h=pred_horizon)
} else if (method == 'snaive'){
pred_basic <- snaive(train_ts, h=pred_horizon)
} else if (method == 'meanf'){
pred_basic <- meanf(train_ts, h=pred_horizon)
} else if (method == 'ets') {
fit_ets <- ets(train_ts)
pred_basic <- forecast(fit_ets, h=pred_horizon)
} else if (method == 'arima'){
fit_arima <- auto.arima(train_ts)
pred_basic <- forecast(fit_arima, h=pred_horizon)
}
pred_basic_df <- data.frame(round = rep(r, pred_steps),
store = rep(cur_store, pred_steps),
brand = rep(cur_brand, pred_steps),
week = pred_weeks,
weeks_ahead = pred_weeks_ahead,
prediction = round(exp(pred_basic$mean[2:pred_horizon])))
}
for (r in 1:NUM_ROUNDS) {
print(paste0('---- Round ', r, ' ----'))
pred_horizon <- TEST_END_WEEK_LIST[r] - TRAIN_END_WEEK_LIST[r]
pred_steps <- TEST_END_WEEK_LIST[r] - TEST_START_WEEK_LIST[r] + 1
pred_weeks <- TEST_START_WEEK_LIST[r]:TEST_END_WEEK_LIST[r]
pred_weeks_ahead <- pred_weeks - TRAIN_END_WEEK_LIST[r]
## Import training data
train_df <- read.csv(file.path(TRAIN_DIR, paste0('train_round_', as.character(r), '.csv')))
## Fill missing values
store_list <- unique(train_df$store)
brand_list <- unique(train_df$brand)
week_list <- TRAIN_START_WEEK:TRAIN_END_WEEK_LIST[r]
data_grid <- expand.grid(store = store_list,
brand = brand_list,
week = week_list)
train_filled <- merge(data_grid, train_df,
by = c('store', 'brand', 'week'),
all.x = TRUE)
train_filled <- train_filled[,c('store','brand','week','logmove')]
head(train_filled)
print('Number of rows with missing values:')
print(sum(!complete.cases(train_filled)))
# Fill missing logmove
train_filled <-
train_filled %>%
group_by(store, brand) %>%
arrange(week) %>%
fill(logmove) %>%
fill(logmove, .direction = 'up')
head(train_filled)
print('Number of rows with missing values after filling:')
print(sum(!complete.cases(train_filled)))
# Apply basic method
pred_basic_all[[paste0('Round', r)]] <-
train_filled %>%
group_by(store, brand) %>%
do(apply_basic_methods(., basic_method, r))
}
pred_basic_all <- do.call(rbind, pred_basic_all)
# Save forecast results
write.csv(pred_basic_all, file.path(SCRIPT_PATH, 'submission.csv'), row.names = FALSE)
## Evaluate forecast performance
# Get the true value dataframe
true_sales_all <- list()
for (r in 1:NUM_ROUNDS){
test_df <- read.csv(file.path(TEST_DIR, paste0('test_round_', as.character(r), '.csv')))
true_sales_all[[paste0('Round', r)]] <-
data.frame(round = rep(r, dim(test_df)[1]),
store = test_df$store,
brand = test_df$brand,
week = test_df$week,
truth = round(exp(test_df$logmove)))
}
true_sales_all <- do.call(rbind, true_sales_all)
# Merge prediction and true sales
merged_df <- merge(pred_basic_all, true_sales_all,
by = c('round', 'store', 'brand', 'week'),
all.y = TRUE)
print('MAPE')
print(MAPE(merged_df$prediction, merged_df$truth)*100)
print('MedianAPE')
print(MedianAPE(merged_df$prediction, merged_df$truth)*100)
```

File diff hidden because one or more lines are too long


@@ -1,109 +0,0 @@
#!/usr/bin/Rscript
#
# Seasonal Naive Method for Retail Forecasting Benchmark - OrangeJuice_Pt_3Weeks_Weekly
#
# This script can be executed with the following command
# Rscript <submission folder>/train_score.r --seed <seed value>
# where <seed value> is the random seed value from 1 to 5 (here since the forecast method
# is deterministic, this value will simply be used as a suffix of the output file name).
## Import packages
library(optparse)
library(dplyr)
library(tidyr)
library(forecast)
library(MLmetrics)
## Define parameters
NUM_ROUNDS <- 12
TRAIN_START_WEEK <- 40
TRAIN_END_WEEK_LIST <- seq(135, 157, 2)
TEST_START_WEEK_LIST <- seq(137, 159, 2)
TEST_END_WEEK_LIST <- seq(138, 160, 2)
# Parse input argument
option_list <- list(
make_option(c('-s', '--seed'), type='integer', default=NULL,
help='random seed value from 1 to 5', metavar='integer')
)
opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)
# Paths of the training data and submission folder
DATA_DIR <- './retail_sales/OrangeJuice_Pt_3Weeks_Weekly/data'
TRAIN_DIR <- file.path(DATA_DIR, 'train')
SUBMISSION_DIR <- file.path(dirname(DATA_DIR), 'submissions', 'SeasonalNaive')
# Generate submission file name
if (is.null(opt$seed)){
output_file_name <- file.path(SUBMISSION_DIR, 'submission.csv')
print('Random seed is not specified. Output file name will be submission.csv.')
} else{
output_file_name <- file.path(SUBMISSION_DIR, paste0('submission_seed_', as.character(opt$seed), '.csv'))
print(paste0('Random seed is specified. Output file name will be submission_seed_',
as.character(opt$seed) , '.csv.'))
}
#### Implement snaive method for every store-brand ####
print('Using Seasonal Naive Method')
pred_snaive_all <- list()
## snaive method
apply_snaive_method <- function(train_sub, r) {
# Trains Seasonal Naive model to forecast sales of each store-brand in a certain round.
#
# Args:
# train_sub (Dataframe): Training data of a certain store-brand
# r (Integer): Index of the forecast round
#
# Returns:
# pred_snaive_df (Dataframe): Predicted sales of the current store-brand
cur_store <- train_sub$store[1]
cur_brand <- train_sub$brand[1]
train_ts <- ts(train_sub[c('logmove')], frequency = 52)
pred_snaive <- snaive(train_ts, h=pred_horizon)
pred_snaive_df <- data.frame(round = rep(r, pred_steps),
store = rep(cur_store, pred_steps),
brand = rep(cur_brand, pred_steps),
week = pred_weeks,
weeks_ahead = pred_weeks_ahead,
prediction = round(exp(pred_snaive$mean[2:pred_horizon])))
}
for (r in 1:NUM_ROUNDS) {
print(paste0('---- Round ', r, ' ----'))
pred_horizon <- TEST_END_WEEK_LIST[r] - TRAIN_END_WEEK_LIST[r]
pred_steps <- TEST_END_WEEK_LIST[r] - TEST_START_WEEK_LIST[r] + 1
pred_weeks <- TEST_START_WEEK_LIST[r]:TEST_END_WEEK_LIST[r]
pred_weeks_ahead <- pred_weeks - TRAIN_END_WEEK_LIST[r]
## Import training data
train_df <- read.csv(file.path(TRAIN_DIR, paste0('train_round_', as.character(r), '.csv')))
## Fill missing values
store_list <- unique(train_df$store)
brand_list <- unique(train_df$brand)
week_list <- TRAIN_START_WEEK:TRAIN_END_WEEK_LIST[r]
data_grid <- expand.grid(store = store_list,
brand = brand_list,
week = week_list)
train_filled <- merge(data_grid, train_df,
by = c('store', 'brand', 'week'),
all.x = TRUE)
train_filled <- train_filled[,c('store','brand','week','logmove')]
# Fill missing logmove
train_filled <-
train_filled %>%
group_by(store, brand) %>%
arrange(week) %>%
fill(logmove) %>%
fill(logmove, .direction = 'up')
# Apply snaive method
pred_snaive_all[[paste0('Round', r)]] <-
train_filled %>%
group_by(store, brand) %>%
do(apply_snaive_method(., r))
}
# Combine and save forecast results
pred_snaive_all <- do.call(rbind, pred_snaive_all)
write.csv(pred_snaive_all, output_file_name, row.names = FALSE)

Some files were not shown because too many files changed in this diff