First Release of Forecasting Repo (#181)

* Handled edge case where ts_id_col_names is None

* Split long line into separate lines

* Added notebook template

* Added a test yml file

* Added yml file for python unit test pipeline

* Minor update

* Minor update

* Minor update

* Minor update

* Removed triggers

* Removed triggers

* Created a base TS estimator and inherited BaseTSFeaturizer from BaseTSEstimator.

* Refactored featurizer class hierarchy.

* Added week of month method.
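
A minimal sketch of what such a helper might look like (hedged: the repo's exact convention may differ; this version counts weeks from the 1st of the month):

```python
import pandas as pd

def week_of_month(date_time):
    """Week of the month (1-5): days 1-7 are week 1, days 8-14 are week 2, etc."""
    return (date_time.day - 1) // 7 + 1

week_of_month(pd.Timestamp("2020-04-06"))  # -> 1
```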

* add script to source entire

* formatting

* source only test files

* Inherit temporal featurizers from BaseTSFeaturizer.

* Minor update.

* Replaced max_test_timestamp with max_horizon

* Refactored rolling window featurizers.

* Renamed hour_of_year feature to normalized_hour_of_year

* Inherit all normalizers from base normalizer class.

* address review comments for the PR of contributing

* minor update

* address review comments for PR of R test pipeline

* add a test yml file

* Removed the check for target column existence, since testing data may not have the target column.

* Create setter and getter of ts_id_col_names.
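
A hedged sketch of how the setter/getter pair might normalize the value, including the None edge case handled earlier (class and attribute names come from the changelog; the wiring is an assumption):

```python
class BaseTSEstimator:
    """Illustrative only; not the repo's actual implementation."""

    def __init__(self, ts_id_col_names=None):
        self.ts_id_col_names = ts_id_col_names

    @property
    def ts_id_col_names(self):
        return self._ts_id_col_names

    @ts_id_col_names.setter
    def ts_id_col_names(self, val):
        # None means the data holds a single time series; a lone string
        # becomes a one-element list so downstream groupby calls work.
        if val is None:
            self._ts_id_col_names = None
        elif isinstance(val, str):
            self._ts_id_col_names = [val]
        else:
            self._ts_id_col_names = list(val)
```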

* Fixed bug caused by unexpected behavior of pandas.shift
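
The changelog does not say which behavior caused it, but a classic `pandas.shift` pitfall in multi-series data is leaking lagged values across series boundaries:

```python
import pandas as pd

df = pd.DataFrame({
    "ts_id": ["a", "a", "b", "b"],
    "y": [1, 2, 3, 4],
})

# Leaks the last value of series "a" into the first row of series "b":
df["lag_naive"] = df["y"].shift(1)

# Shifting within each series keeps them independent:
df["lag_grouped"] = df.groupby("ts_id")["y"].shift(1)
```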

* Some code cleanup.

* Updated some featurizer names.

* Some minor changes in df_config and feature configs.

* Some minor changes in feature names.

* Added usage examples in docstring.

* Computation time update after feature engineering refactoring.

* Removed setting frequency.

* Added docstring to convert_to_tsdf function.

* Removed frequency in convert_to_tsdf call.

* Fixed week_of_month function.

* Added popularity featurizer

* Added utility function for checking Iterable but not string.
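
Such a check is typically a one-liner; strings are themselves `Iterable`, so they must be excluded explicitly (a sketch, not necessarily the repo's exact code):

```python
from collections.abc import Iterable

def is_iterable_but_not_string(obj):
    """True for lists, tuples, sets, etc.; False for str and bytes."""
    return isinstance(obj, Iterable) and not isinstance(obj, (str, bytes))

# Typical use: normalize a column-name argument to a list.
def to_list(col_names):
    return list(col_names) if is_iterable_but_not_string(col_names) else [col_names]
```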

* Updated LightGBM feature engineering code to use new feature engineering classes.

* Improved checking whether input column names are Iterable and converting them to a list.

* Made future_value_available a read-only property.

* Minor docstring update.

* Removed extra space in docstring examples.

* Made some methods staticmethods.

* Minor QRF result update after feature engineering code change.

* Removed calling of validate_file and added catching of the exception

* Update python_unit_tests_base.yml for Azure Pipelines [skip ci]

Updated path of the test results

* Test if the download link is wrong

* Fixed minor format issues.

* Fixed minor format issues.

* Fixed formatting issues.

* Fixed line length.

* Removed data files before downloading and checked dimensions of energy data

* Removed the change made for testing

* Changed folder structure of tests and added table to show build status

* Added missing files

* Updated based on review comments

* new folder structure

* add repo metrics

* remove prototypes folder

* add models placeholder

* adjust featurizers to the new structure of folders

* changes in README and evaluation files

* adjust data download to new folders

* delete unnecessary files

* energy load baseline model with new folders

* delete data files

* fix links in benchmarks file

* fix bug

* adjust GBM, QRF and FNN submissions to the new folder structure

* Replace pd.to_timedelta with pd.offsets.

* Added get_offset_by_frequency helper function.
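
The likely motivation: `pd.to_timedelta` only represents fixed durations, while `pd.offsets` understands calendar-aware steps such as month ends. A sketch of the idea (the repo's helper may be implemented differently):

```python
import pandas as pd

ts = pd.Timestamp("2020-01-31")
print(ts + pd.offsets.MonthEnd(1))  # 2020-02-29: calendar-aware month step
print(ts + pd.offsets.Day(7))       # fixed 7-day step

def get_offset_by_frequency(frequency):
    # pandas already maps frequency strings ("D", "W-SUN", "MS", ...)
    # to DateOffset objects; the repo's helper may hand-roll this mapping.
    return pd.tseries.frequencies.to_offset(frequency)

print(get_offset_by_frequency("W-SUN"))  # <Week: weekday=6>
```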

* fix small bugs

* fix small bugs

* Update TSCVSplitter.

* refactored high-level folders

* added a placeholder folder for PR/issue templates

* added subfolders under notebooks/

* updated tests folder

* renamed notebooks/ to examples/

* Update to CONTRIBUTING instructions (#34)

* style checking and formatting files

* git hook installation guide

* issue and PR templates

* minor change

* working with GitHub instructions

* added specific issue templates

* addressed PR comments

* addressed Chenhui's comment

* addressed Chenhui's comments

* conda environment file (#36)

* conda environment file

* updated environment file

* updated instructions for installing conda env

* Vapaunic/lib (#37)

* initial core for forecasting library

* syncing with new structure

* __init__ files in modules

* renamed lib directory

* Added legal headers and some formatting of py files

* restructured benchmarking directory in lib

* fixed imports, warnings, legal headers

* more import fixes and legal headers

* updated instructions with package installation

* barebones library README

* moved energy benchmark to contrib

* formatting changes plus more legal headers

* Added license to the setup

* moved .swp file to contrib, not sure we need to keep it at all

* added missing headers and a brief snippet to README file

* minor wording change in readme

* Chenhui/cpu unit test pipeline (#38)

* address review comments

* added full conda path

* minor change

* added conda to PATH

* added build status in README

* removed energy data prep placeholder notebook

* moved out data energy explore notebook into contrib

* moved data download script to tools/

* Added getting started section to readme

* Added rbase and rbayesm to conda environment

* modified data download script

* added instructions for data download

* renamed data download script

* fixing issues with test pipeline

* parsing issue in yml file

* cleaning up ci test yaml file for more diagnostic info

* fixed a missing argument in instructions

* removed retail directory under dataset module

* moved feature_engineering.py to the feature engineering module

* moved evaluate.py to evaluation module

* combined benchmark settings into a single file

* moved download script to the package and modified the tests

* modified instructions

* fixed the build pipeline yml

* fix to the pipeline yml

* fix to the pipeline yml

* moved serve_folds into ojdata.py

* removed data_schema.py file as all content moved to ojdata.py

* fixed split_train_test in ojdata.py

* moved retail_data_schema into ojdata.py

* moved all oj utilities to ojdata.py

* removed paths from benchmark_settings

* fixed up a docstring

* quick fix a typo

* removed benchmark_settings

* parameterized experiment settings

* refactored experiment settings

* Fixed docstrings

* addressed Chenhui's comment around round file naming

* renamed experiment to forecast settings

* Chenhui/light gbm quick start (#40)

* initial example notebook for lightgbm

* reduced to one round forecast

* added text

* added text

* added text

* moved week_of_month to feature engineering utils

* moved df_from_cartesian_product to feature utils
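
Judging by the name, this utility builds a grid of all combinations (e.g. store x brand x week); a minimal version under that assumption:

```python
import itertools
import pandas as pd

def df_from_cartesian_product(dict_in):
    """One row per combination of the input values."""
    rows = itertools.product(*dict_in.values())
    return pd.DataFrame.from_records(rows, columns=list(dict_in.keys()))

grid = df_from_cartesian_product({"store": [1, 2], "brand": ["a", "b"], "week": [40, 41]})
# 2 * 2 * 2 = 8 rows
```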

* moved functions to feature utils

* moved functions to feature utils

* added lightgbm model utils

* updated plots

* added text and renamed predict function

* reduced print out frequency in model training

* moved data visualization code to utils

* added text

* updated plot function and added docstring

* renamed the notebook

* updated text

* added NOTICE file, currently empty as we're not redistributing any packages

* Chenhui/add scrapbook (#43)

* added scrapbook support

* Added gitpython to environment.yml file

* added git_repo_path function to utils
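
With gitpython (added to the environment above), locating the repository root from any working directory is short; a plausible sketch:

```python
import git

def git_repo_path():
    """Absolute path of the enclosing git repository's root."""
    repo = git.Repo(".", search_parent_directories=True)
    return repo.working_tree_dir
```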

* updated notebook

* added test for lightgbm notebook

* included testing of notebooks

* resolve test error

* resolve test error

* added kernel name

* updated kernel name

* trying installing bayesm from cmd

* trying installing bayesm from cmd

* trying installing bayesm from cmd

* excluded notebook test

* excluded notebook test

* added lapack.so link fix

* included notebook tests

* excluded files for notebook test

Co-authored-by: vapaunic <15053814+vapaunic@users.noreply.github.com>

* added integration test

* added initial data prep notebook

* updated notebook

* updated notebook

* updated notebook

* updated url

* init

* model parameters

* removed blank quick start notebooks

* removed blank modeling notebooks

* removed blank evaluation notebooks

* Removed blank model selection notebooks

* removed blank o16n notebooks

* removed outdated text from contrib/README

* removed outdated swp file

* updating .gitignore

* removed change log, as we don't plan to maintain this

* Excluding irrelevant directories

* fix settings

* separated out the setup guide

* fix settings

* simplemodel init

* typo

* add rproj file

* Renaming forecasting_lib to fclib (#59)

* renamed forecasting_lib directory

* modified references to forecasting_lib

* Vapaunic/envname (#61)

* renamed conda env

* modified setup instructions

* minor change in contributing guide

* keep top-level gitignore only

* formatting fixes

* Chenhui/add automl example (#62)

* added multiple linear models and example notebook for AutoML

* removed commented code

* address review comments

* minor update to the notebook

* minor update to the notebook

* added text

* changed types in lightgbm to be consistent with the rest of the code

* modified docstrings in multiple_linear_regression.py

* updated ci yaml files

* changed import statement in conftest.py

* updated gitpython version to the latest

Co-authored-by: vapaunic <15053814+vapaunic@users.noreply.github.com>

* Vapaunic/split bug (#65)

* fixed a yield bug

* removed two blank files

* modified split data function to auto-calculate the splits based on the parameters

* removed forecast_settings module

* removed unused parameter

* modified splitting function to use non-overlapping testing
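
One way to realize non-overlapping test windows, sketched with integer week numbers as in the OJ dataset (parameter names are illustrative):

```python
def non_overlapping_splits(last_train_week, n_splits, horizon, gap=1):
    """Yield (train_end, test_weeks) per round; test windows never overlap."""
    for i in range(n_splits):
        train_end = last_train_week + i * horizon
        test_start = train_end + gap
        yield train_end, list(range(test_start, test_start + horizon))

for train_end, test_weeks in non_overlapping_splits(135, n_splits=3, horizon=2):
    print(train_end, test_weeks)
# 135 [136, 137]
# 137 [138, 139]
# 139 [140, 141]
```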

* tested the split function after the update

* minor fix

* defaults changed in split function

* modified lightgbm example with new split function

* modified automl example (needs verification)

* modified data explore notebook

* quick fix

* updated data preparation notebook

* changed defaults in split function

* Addressed changes in lightgbm

* addressed issues in automl notebook

* fixed typo in lightgbm plot

* first images of time series split

* updated the pictures

* updated evaluation periods (#66)

* Chenhui/env setup script (#67)

* added a shell script for setting up environment

* changed yaml to yml

* added comments and updated SETUP.md

* modified data preparation notebook with images

* moved r exploration notebook to contrib directory

* modified data explore notebook, updated info about the data, and removed reference to TSPerf

* addressed review feedback and fixed the explore notebook

* Chenhui/multiround lightgbm (#68)

* added initial multiround notebook for lightgbm

* updated data splitting

* updated text

* updated week list

* addressed review comments

* added pyramid-automl to conda file

* first draft of arima notebook

* replace pyramid with pmdarima

* Added a complete function

* minor typo

* forecasting across many stores/brands

* complete arima notebook

* renamed data preparation/exploration notebooks

* added git clone to setup

* addressed PR comments

* typo

* Arima to ARIMA

* fixed docstring in plot function

* fixed a bug in MAPE calculation and added plotting

* fixed a bug in predict

* modeling arima on log scale
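
Fitting on the log scale turns the multiplicative swings of sales data into roughly additive ones; a toy sketch with pmdarima (named above), back-transforming the forecasts:

```python
import numpy as np
import pmdarima as pm

y_train = np.array([120.0, 135.0, 150.0, 160.0, 180.0, 210.0, 230.0, 260.0])  # toy series
model = pm.auto_arima(np.log(y_train), seasonal=False, suppress_warnings=True)
forecast = np.exp(model.predict(n_periods=4))  # back to the original scale
```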

* Fixing AML Example Notebook (#84)

* Cleaning notebook output, adding get_or_create workspace call, and fixing get_or_create AmlCompute

* Add regression-based models (#64)

* modelling updates

* code tweak

* rebuild

* update mape

* update mape 2

* new forecasting structure

* update eval

* rebuild dataprep

* rebuild with profit

* rm profit

* add plot

* typo

* tidy up

* expand readme

* oops

* clarified setup guide (#94)

* Update SETUP (#95)

minor fix

* Cleaned up unused files and directories (#96)

* removed non-used files

* moved docs into a docs/ dir

* fixed broken links

* Chenhui/dilated cnn example and utils (#76)

* added initial model util file for DCNN

* initial notebook

* added feature utils for DCNN

* updated evaluation and visualization

* removed plot function

* replaced PRED_HORIZON, PRED_STEPS by HORIZON, GAP

* removed log dir if it exists

* updated model utils

* generalized categorical features in dcnn model util

* generalized network definition

* update training code

* format with blackcellmagic

* address review comments and added README

* Chenhui/add ci tests (#146)

* Update conda env with versions (#99)

* 💥

* revert

* minor changes

Co-authored-by: Chenhui Hu <chenhhu@microsoft.com>

* Adding missing Jupyter Extension (#90)

* Update environment.yml

* specified version

Co-authored-by: Chenhui Hu <chenhhu@microsoft.com>

* fix links to examples/ (#104)

* Chenhui/rename notebooks and update automl notebook (#106)

* removed unused module

* added outputs in automl notebook

* fixed a notebook name

* Arima multi-round notebook (#91)

* working arima model

* final auto arima example

* added tqdm to requirements

* addressed review comments

* Revert "Chenhui/rename notebooks and update automl notebook (#106)" (#107)

This reverts commit 032c91d9bfa389f22ae1f1f2150913a4f063bd18 [formerly 15d25213dc].

Co-authored-by: Chenhui Hu <chenhhu@microsoft.com>

* Fixing data download issue (#109)

* removed dependency on __file__ from data download, as it doesn't work in Jupyter

* changed aux to auxdata

* fixed data download function

* fixed path

* auxdata -> auxi

* adding tl;dr directions for setup to README.md (#88)

* adding tl;dr directions for setup to README.md

* added a bit more text

* Cleaned up obsolete (tsperf) code in fclib (#112)

* moved out tsperf files from evaluation module

* moved out tsperf tuning code

* removed more unused files

* Addressing documentation related issues (#111)

* Added conda activate to the setup readme

* added instructions for starting jupyter to setup

* minor

* deleted duplicate instructions

* addressed PR comments

* Chenhui/rename notebooks and updated AutoML example (#108)

* removed unused module

* added outputs in automl notebook

* fixed a notebook name

* updated pytest file

* address review comments

* reran notebook with blackcellmagic

* adding pylint  (#93)

* adding tl;dr directions for setup to README.md

* removing pylint hook and pylint_junit from the env file

* removed pylint config file

* Chenhui/update example folder (#115)

* restructure examples folder

* updated readme

* added readme

* minor update

* removed R folder

* minor change

* fixed a broken link

* another broken link

* fixing notebook tests

* Chenhui/fix aux file path (#118)

* fixed figure links

* changed to auxi_i.csv

* minor change

* [MINOR] Small changes to Arima notebooks (#121)

* fixed a broken link

* minor text changes

* Documentation (#120)

* added target audience section

* added intro on forecasting

* Added fclib documentation

* improved examples readme

* address comments

* added info about the dataset

* added items to be ignored (#123)

* added items to be ignored

* added *.log and score.py

* Chenhui/toplevel readme (#127)

* added content table

* added references

* added external repo links

* minor update

* Chenhui/tune deploy lgbm (#122)

* added notebook and utils

* updated readme links

* fix data path

* updated text

* group imports

* minor update

* using azureml utils to create workspace and compute (#126)

* using azureml utils to create workspace and compute

* group imports

* Download ojdata directly from github (#128)

* new function to download and load oj data directly from bayesm repo

* removed bayesm

* new R function to only load the data

* removed download R function

* minor fix

* added documentation to load_oj_data.R

* added requests to requirements

* fixed a syntax error (#130)

* fix setup.md link (#129)

* fix setup.md link

* mention related use cases

* Vapaunic/cgbuild (#133)

* added files to generate reqs.txt and the ci yml file

* Added notice generation task

* Checking if notice is there

* Update component_governance.yml for Azure Pipelines

* check in notice file

* Update component_governance.yml for Azure Pipelines

* fixed heading

* Chenhui/windows setup (#131)

* initial test

* added batch script and instructions

* align image to center

* adjust image size

* added text

* adjust image size

* address comments

* Readds R material (#116)

* redo R stuff in new dirs

* dirname fixup

* add Rproj file

* rebuild

* fixups

* roxygenise

* copyright notice

* dataprep

* updated yaml

* more updates

* more tweaks

* reg models

* update reg models

* more updates

* reword

* rendered prophet html

* name fix

* add lintr file

* move stuff

* renamed use case folder (#138)

* renamed use case folder

* dirname change

* updated readme

* added notebooks

* fix ci test

* Vapaunic/featutils (#137)

* moved feature engineering module to contrib

* removed lag submod

* cleaned up feature engineering

* rebuild R notebooks (#139)

* Chenhui/toplevel readme (#140)

* added content table

* added references

* added external repo links

* minor update

* updated setup instructions

* added text

* align text

* removed duplicated Content section

* address review comments

* Chenhui/hyperdrive example update (#142)

* removed blackcellmagic

* removed utils under aml_scripts and updated notebook

* added notebook path

* added ci test of lightgbm multi round example

* make forecast round as parameter

* Make -Agent Name

* resolve duplicated function name

* increased time limit and reduced number of rounds

* increase time limit

* added parameters tag to multiround lightgbm and dilatedcnn

* README change (#147)

* minor change

* hide tags

* hide tags

* added parameters tag

* Revert "Chenhui/add ci tests (#146)" (#149)

This reverts commit de7a19cfa7637476b9ebfc92f5c18a26a8eca4da [formerly f8bd22733c].

* Chenhui/add ci tests (#150)

* Revert "Chenhui/add ci tests (#150)" (#151)

This reverts commit 357453234088f2ebb8453bd8cd77527a1c6c2130 [formerly 21846168a7].

* Chenhui/Add CI tests for notebooks

This reverts commit 8a99549da8b9096b65130fd2f6634e2a217b2dd9 [formerly 89e986fe2c].

* minor update

* Added CI tests for example notebooks

* Update component governance pipeline

* Update component governance pipeline

* add ignored items

* Readds R material (#116)

* Chenhui/windows setup (#131)

* Vapaunic/featutils (#137)

* Chenhui/add CI tests for notebooks

* Vapaunic/arimaint (#154)


* modified conftests to add arima

* added tests

* modified notebooks with parameters

* Chenhui/code improvments (#157)

* updated docstring

* pinned package versions

* minor improvements

* minor improvement

* modified metrics to take any iterable (#158)
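
Accepting any iterable usually just means coercing the inputs before computing; a sketch of a MAPE written that way (fclib's actual implementation may differ):

```python
import numpy as np

def MAPE(predictions, actuals):
    """Mean absolute percentage error, in percent; accepts any iterable."""
    predictions = np.asarray(list(predictions), dtype=float)
    actuals = np.asarray(list(actuals), dtype=float)
    return np.mean(np.abs((predictions - actuals) / actuals)) * 100

MAPE([105, 198], (100, 200))  # -> 3.0
```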

* improvement: using Ray to parallelize arima fitting (#159)

* using Ray to parallelize arima fitting
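
The pattern is to turn each per-series fit into a Ray task and fan the fits out across cores; a sketch assuming one pmdarima model per store/brand series:

```python
import numpy as np
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
def fit_one_series(y):
    import pmdarima as pm
    return pm.auto_arima(y, seasonal=False, suppress_warnings=True)

series_list = [np.linspace(100, 200, 50) + np.random.rand(50) for _ in range(8)]  # toy stand-ins
futures = [fit_one_series.remote(y) for y in series_list]
models = ray.get(futures)  # blocks until all fits finish
```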

* added ray as dependency

* text about ray, disable warnings, and minor stuff

* scipy 1.4.1 or above

* reverting scipy, azuremlsdk issue

* minor mod

Co-authored-by: Vanja Paunic <15053814+vapaunic@users.noreply.github.com>

* chenhui/improve ray output (#166)

* modified arima multiround to run with ray (#167)

* Chenhui/improve doc (#168)

* minor changes

* remove redundancy

* updated text

* improved text in model tuning and deployment notebook

* clarify the data used

* updated text

* added description of the script

* add explanation of gaps in the curve

* add explanation of gaps in the curve

* updated text

* fix typos

* improve documentation and format

* Addressing a few issues around package dependencies (#169)

* synchronizing utils with other OSS AI repos

* exclude xlrd, leftover from tsperf

* exclude urllib3, leftover from tsperf

* moving tqdm to fclib as it is only used by the library at the moment

* included fclib dependencies in requirements.txt

* lower-bounded package versions that we don't need specific versions of

* lower bound gitpython

* Chenhui/improve checking of run completion (#170)

* Chenhui/added ray dashboard (#171)

* Chenhui/update diagram (#172)

* update multiround training diagram

* minor change

* update diagram and minor change

* Addressing doc related issues (#173)

* taking out inventory optimization link

* pulled contributing out of docs

* Chenhui/ray windows (#177)

* add util to check if module exists
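
A common way to write such a check with the standard library, letting the notebook fall back to sequential training when Ray is unavailable:

```python
import importlib.util

def module_exists(module_name):
    """True if the module can be imported in the current environment."""
    return importlib.util.find_spec(module_name) is not None

use_ray = module_exists("ray")
```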

* use ray if available or use sequential training

* updated text

* updated text

* reduce code redundancy

* Chenhui/setup scripts (#178)

* move ray to linux setup script

* remove duplicated azureml-sdk to avoid errors

* add ray to ci yaml files

* update azureml-sdk

* update manual setup instructions

* minor change

* Chenhui/content table (#179)

* update readme

* minor change

* minor update

* Chenhui/multiround arima (#180)

* use ray if it is installed

* update text and reran notebook

* add reference

* Chenhui/dilatedcnn windows (#184)

* resolve format issues

* update log path and tensorboard path

* remove subprocess import

* fix path

* change env name to resolve pipeline failures

* Chenhui/hyperdrive windows (#185)

* resolve format issues

* update log path and tensorboard path

* remove subprocess import

* fetch common utils from chenhui/dilatedcnn_windows

* update notebook

* removed explain module and added notebooks module

* get updated ci yml files

* updated kernel name

* Chenhui/enhancement (#186)

* modified module_path

* updated tensorboard section

* rerun notebook

* only submit local run if python path is found

* minor change and rerun notebook

* updated content section (#187)

* updated content section

* minor change

* address comments

* add links

Co-authored-by: Hong Lu <honglu@microsoft.com>
Co-authored-by: ZhouFang928 <ZhouFang928@users.noreply.github.com>
Co-authored-by: pechyony <pechyony@outlook.com>
Co-authored-by: Ubuntu <chenhui@chhdsvmnc6.hyjxgt1qggauhj0g0g2jh3guwb.bx.internal.cloudapp.net>
Co-authored-by: vapaunic <15053814+vapaunic@users.noreply.github.com>
Co-authored-by: Hong Ooi <hongooi@microsoft.com>
Co-authored-by: Daniel Ciborowski <dciborow@microsoft.com>
Co-authored-by: Markus Cozowicz <marcozo@microsoft.com>
Former-commit-id: 6098ecf68c
Committed by Chenhui Hu on 2020-04-06 16:17:18 -04:00 via GitHub
Parent: a409804093
Commit: 0607fd568f
369 changed files: 27322 additions and 65345 deletions

.flake8 (new file, 21 lines)

@@ -0,0 +1,21 @@
[flake8]
max-line-length = 120
max-complexity = 18
select = B,C,E,F,W,T4,B9
ignore =
    # slice notation whitespace, invalid
    E203
    # too many leading # for block comment
    E266
    # module level import not at top of file
    E402
    # line break before binary operator
    W503
    # blank line contains whitespace
    W293
    # line too long
    E501
    # trailing white spaces
    W291
    # missing white space after ,
    E231

.github/ISSUE_TEMPLATE.md (new file, 25 lines)

@@ -0,0 +1,25 @@
### Description
<!--- Describe your issue/bug/request in detail -->
### On which platform does it happen?
<!--- Describe the platform where the issue is happening (use a list if needed) -->
<!--- For example: -->
<!--- * Azure Ubuntu Data Science Virtual Machine. -->
<!--- * Other platforms. -->
### How do we replicate the issue?
<!--- Please be as specific as possible (use a list if needed). -->
<!--- For example: -->
<!--- * Create a conda environment for gpu -->
<!--- * Run unit test `test_timer.py` -->
<!--- * ... -->
### Expected behavior (i.e. solution)
<!--- For example: -->
<!--- * The tests for the timer should pass successfully. -->
### Other Comments

.github/ISSUE_TEMPLATE/bug_report.md (new file, 27 lines)

@@ -0,0 +1,27 @@
---
name: Bug report
about: Create a report to help us improve
title: "[BUG] "
labels: 'bug'
assignees: ''
---
### Description
<!--- Describe your bug in detail -->
### How do we replicate the bug?
<!--- Please be as specific as possible (use a list if needed). -->
<!--- For example: -->
<!--- * Create a conda environment for gpu -->
<!--- * Run unit test `test_timer.py` -->
<!--- * ... -->
### Expected behavior (i.e. solution)
<!--- For example: -->
<!--- * The tests for the timer should pass successfully. -->
### Other Comments

.github/ISSUE_TEMPLATE/feature_request.md (new file, 19 lines)

@@ -0,0 +1,19 @@
---
name: Feature request
about: Suggest an idea for this project
title: "[FEATURE] "
labels: 'enhancement'
assignees: ''
---
### Description
<!--- Describe your expected feature in detail -->
### Expected behavior with the suggested feature
<!--- For example: -->
<!--- *Adding algorithm xxx will help people understand more about xxx use case scenarios. -->
### Other Comments

.github/ISSUE_TEMPLATE/general_ask.md (new file, 14 lines)

@@ -0,0 +1,14 @@
---
name: General ask
about: Technical/non-technical asks about the repo
title: "[ASK] "
labels: ''
assignees: ''
---
### Description
<!--- Describe your general ask in detail -->
### Other Comments

.github/PULL_REQUEST_TEMPLATE.md (new file, 15 lines)

@@ -0,0 +1,15 @@
### Description
<!--- Describe your changes in detail -->
<!--- Why is this change required? What problem does it solve? -->
### Related Issues
<!--- If it fixes an open issue, please link to the issue here. -->
### Checklist:
<!--- Go over all the following points, and put an `x` in all the boxes that apply. -->
<!--- If you're unsure about any of these, don't hesitate to ask. We're here to help! -->
- [ ] My code follows the code style of this project, as detailed in our [contribution guidelines](../CONTRIBUTING.md).
- [ ] I have added tests.
- [ ] I have updated the documentation accordingly.

.gitignore (modified, 33 lines changed)

@@ -1,5 +1,28 @@
**/__pycache__
**/.ipynb_checkpoints
data/*
energy_load/GEFCom2017-D_Prob_MT_hourly/data/*
**/__pycache__
**/.ipynb_checkpoints
*.egg-info/
.vscode/
*.pkl
*.h5
# Data
ojdata/*
*.Rdata
# AML Config
aml_config/
.azureml/
.config/
# Pytests
.pytest_cache/
# File for model deployment
score.py
# Environments
myenv.yml
# Logs
logs/
*.log

.lintr (new file, 18 lines)

@@ -0,0 +1,18 @@
linters: with_defaults(
    infix_spaces_linter = NULL,
    spaces_left_parentheses_linter = NULL,
    open_curly_linter = NULL,
    line_length_linter = NULL,
    camel_case_linter = NULL,
    object_name_linter = NULL,
    object_usage_linter = NULL,
    object_length_linter = NULL,
    trailing_blank_lines_linter = NULL,
    absolute_paths_linter = NULL,
    commented_code_linter = NULL,
    implicit_integer_linter = NULL,
    extraction_operator_linter = NULL,
    single_quotes_linter = NULL,
    pipe_continuation_linter = NULL,
    cyclocomp_linter = NULL
)

.pre-commit-config.yaml (new file, 17 lines)

@@ -0,0 +1,17 @@
repos:
  - repo: https://github.com/psf/black
    rev: stable
    hooks:
      - id: black
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v1.2.3
    hooks:
      - id: flake8
  - repo: local
    hooks:
      - id: jupytext
        name: jupytext
        entry: jupytext --from ipynb --pipe black --check flake8
        pass_filenames: true
        files: .ipynb
        language: python

CONTRIBUTING.md (new file, 139 lines)

@@ -0,0 +1,139 @@
# Contribution Guidelines
Contributions are welcome! Here are a few things to know:
* [Setup](./SETUP.md)
* [Microsoft Contributor License Agreement](#microsoft-contributor-license-agreement)
* [Steps to Contributing](#steps-to-contributing)
* [Coding Guidelines](#coding-guidelines)
* [Code of Conduct](#code-of-conduct)
## Setup
To get started, navigate to the [Setup Guide](./SETUP.md), which lists instructions on how to set up your environment and dependencies.
## Microsoft Contributor License Agreement
Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.
When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
## Steps to Contributing
Here are the basic steps to get started with your first contribution. Please reach out with any questions.
1. Use [open issues](https://github.com/Microsoft/Forecasting/issues) to discuss the proposed changes. Create an issue describing changes if necessary to collect feedback. Also, please use provided labels to tag issues so everyone can easily sort issues of interest.
2. [Fork the repo](https://help.github.com/articles/fork-a-repo/) so you can make and test local changes.
3. Create a new branch for the issue. We suggest prefixing the branch with your username and then a descriptive title, e.g. chenhui/python_test_pipeline.
4. Make code changes.
5. Ensure unit tests pass and code style / formatting is consistent (see [wiki](https://github.com/Microsoft/Recommenders/wiki/Coding-Guidelines#python-and-docstrings-style) for more details).
6. We use the [pre-commit](https://pre-commit.com/) package to run our pre-commit hooks, with the [black](https://github.com/ambv/black) formatter and [flake8](https://pypi.org/project/flake8/) linter running on each commit. To set up pre-commit on your machine, follow the steps below; note that you only need to run these steps the first time you use `pre-commit` for this project.
* Update your conda environment (`pre-commit` is part of the yaml file), or just run
```
$ pip install pre-commit
```
* Set up `pre-commit` by running the following command; this will install the hooks under your .git/hooks directory.
```
$ pre-commit install
```
> Note: Git hooks to install are specified in the pre-commit configuration file `.pre-commit-config.yaml`. Settings used by `black` and `flake8` are specified in `pyproject.toml` and `.flake8` files, respectively.
* When you've made changes on local files and are ready to commit, run
```
$ git commit -m "message"
```
* Each time you commit, git will run the pre-commit hooks on any Python files that are being committed and are part of the git index. If `black` modifies/formats a file, or if `flake8` finds any linting errors, the commit will not succeed. You will need to stage the file again if `black` changed it, or fix the issues identified by `flake8` and stage it again.
* To run pre-commit on all files, just run
```
$ pre-commit run --all-files
```
7. Create a pull request (PR) against the __`staging`__ branch.
We use the `staging` branch to land all new features, so please remember to create the pull request against `staging`. See the next section for more detail about [working with GitHub](#working-with-github).
Once the features included in a milestone are complete, we will merge `staging` into the `master` branch and make a release. See the wiki for more detail about our [merge strategy](https://github.com/Microsoft/Forecasting/wiki/Strategy-to-merge-the-code-to-master-branch).
### Working with GitHub
1. All development is done in a branch off of `staging`, named following this convention: `<user>/<topic>`.
To create a new branch, run this command:
```shell
$ git checkout -b <user>/<topic>
```
When done making the changes locally, push your branch to the server, but make sure to sync with the remote first.
```
$ git pull origin staging
$ git push origin <your branch>
```
2. To merge a new branch into the `staging` branch, please open a pull request.
3. The person who opens a PR should complete the PR, once it has been reviewed and all comments addressed.
4. We will use *Squash and Merge* when completing PRs, to maintain a clean merge history on the repo.
5. When a branch is merged into `staging`, it must be deleted from the remote repository.
```shell
# Delete local branch
$ git branch -d <your branch>
# Delete remote branch
$ git push origin --delete <your branch>
```
## Coding Guidelines
We strive to maintain high quality code to make it easy to understand, use, and extend. We also work hard to maintain a friendly and constructive environment. We've found that having clear expectations on the development process and consistent style helps to ensure everyone can contribute and collaborate effectively.
Please review the [coding guidelines](https://github.com/Microsoft/Recommenders/wiki/Coding-Guidelines) wiki page to see more details about the expectations for development approach and style.
## Code of Conduct
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
Apart from the official Code of Conduct developed by Microsoft, in the Forecasting team we adopt the following behaviors, to ensure a great working environment:
#### Do not point fingers
Let's be constructive.
<details>
<summary><em>Click here to see some examples</em></summary>
"This method is missing docstrings" instead of "YOU forgot to put docstrings".
</details>
#### Provide code feedback based on evidence
When making code reviews, try to support your ideas based on evidence (papers, library documentation, stackoverflow, etc) rather than your personal preferences.
<details>
<summary><em>Click here to see some examples</em></summary>
"When reviewing this code, I saw that the Python implementation the metrics are based on classes, however, [scikit-learn](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics) and [tensorflow](https://www.tensorflow.org/api_docs/python/tf/metrics) use functions. We should follow the standard in the industry."
</details>
#### Ask questions, do not give answers
Try to be empathic.
<details>
<summary><em>Click here to see some examples</em></summary>
* Would it make more sense if ...?
* Have you considered this ... ?
</details>

LICENSE (new file, 21 lines)

@@ -0,0 +1,21 @@
MIT License
Copyright (c) Microsoft Corporation. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

NOTICE.txt (new file, 17 lines)

@@ -0,0 +1,17 @@
NOTICES AND INFORMATION
Do Not Translate or Localize
This software incorporates material from third parties.
Microsoft makes certain open source code available at https://3rdpartysource.microsoft.com,
or you may send a check or money order for US $5.00, including the product name,
the open source component name, platform, and version number, to:
Source Code Compliance Team
Microsoft Corporation
One Microsoft Way
Redmond, WA 98052
USA
Notwithstanding any other terms, you may reverse engineer this software to the extent
required to debug changes to any libraries licensed under the GNU Lesser General Public License.

README.md (modified, 153 lines changed)

@@ -1,69 +1,98 @@
# TSPerf
# Forecasting Best Practices
TSPerf is a repository of time-series forecasting models with a comprehensive comparison of their performance over provided benchmark data sets, implemented on Azure. Model implementations are compared by forecasting accuracy, training and scoring time and cost on Azure compute. Each implementation includes all the necessary instructions and tools that ensure its reproducibility. We envision TSPerf to become a central repository of time-series forecasting that provides wide coverage of time-series algorithms, from the very simple to the state of the art in the industry. The roadmap of TSPerf can be found [here](docs/roadmap.md).
Time series forecasting is one of the most important topics in data science. Almost every business needs to predict the future in order to make better decisions and allocate resources more effectively.
This repository provides examples and best practice guidelines for building forecasting solutions. The goal of this repository is to build a comprehensive set of tools and examples that leverage recent advances in forecasting algorithms to build solutions and operationalize them. Rather than creating implementations from scratch, we draw from existing state-of-the-art libraries and build additional utilities around processing and featurizing the data, optimizing and evaluating models, and scaling up to the cloud.
The examples and best practices are provided as [Python Jupyter notebooks and R markdown files](examples) and [a library of utility functions](fclib). We hope that these examples and utilities can significantly reduce the “time to market” by simplifying the experience from defining the business problem to developing a solution. In addition, the example notebooks serve as guidelines and showcase best practices and usage of the tools in a wide variety of languages.
The following table summarizes benchmarks that are currently included in TSPerf.
Benchmark | Dataset | Benchmark directory
--------------------------------------------|------------------------|---------------------------------------------
Probabilistic electricity load forecasting | GEFCom2017 | `energy_load/GEFCom2017-D_Prob_MT_Hourly`
Retail sales forecasting | Orange Juice dataset | `retail_sales/OrangeJuice_Pt_3Weeks_Weekly`
A complete documentation of TSPerf, along with the instructions for submitting and reviewing implementations, can be found [here](./docs/tsperf_rules.md). The tables below show performance of implementations that are developed so far. Source code of implementations and instructions for reproducing their performance can be found in submission folders, which are linked in the first column.
## Probabilistic energy forecasting performance board
The following table lists the current submissions for energy forecasting and their respective performance.
Submission Name | Pinball Loss | Training and Scoring Time (sec) | Training and Scoring Cost($) | Architecture | Framework | Algorithm | Uni/Multivariate | External Feature Support
--------------------------------------------------------------------------------|----------------|-----------------------------------|--------------------------------|-----------------------------------------------|------------------------------------|---------------------------------------|--------------------|--------------------------
[Baseline](energy_load%2FGEFCom2017_D_Prob_MT_hourly%2Fsubmissions%2Fbaseline) | 84.11 | 444 | 0.0474 | Linux DSVM (Standard D8s v3 - Premium SSD) | quantreg package of R | Linear Quantile Regression | Multivariate | Yes
[GBM](energy_load%2FGEFCom2017_D_Prob_MT_hourly%2Fsubmissions%2FGBM) | 78.71 | 888 | 0.0947 | Linux DSVM (Standard D8s v3 - Premium SSD) | gbm package of R | Gradient Boosting Decision Tree | Multivariate | Yes
[QRF](energy_load%2FGEFCom2017_D_Prob_MT_hourly%2Fsubmissions%2Fqrf) | 76.48 | 22709 | 19.03 | Linux DSVM (F72s v2 - Premium SSD) | scikit-garden package of Python | Quantile Regression Forest | Multivariate | Yes
[FNN](energy_load%2FGEFCom2017_D_Prob_MT_hourly%2Fsubmissions%2Ffnn) | 79.27 | 4604 | 0.4911 | Linux DSVM (Standard D8s v3 - Premium SSD) | qrnn package of R | Quantile Regression Neural Network | Multivariate | Yes
The following chart compares the submissions performance on accuracy in Pinball Loss vs. Training and Scoring cost in $:
![EnergyPBLvsTime](./docs/images/Energy-Cost.png)
## Retail sales forecasting performance board
The following table lists the current submissions for retail forecasting and their respective performance.
Submission Name | MAPE (%) | Training and Scoring Time (sec) | Training and Scoring Cost ($) | Architecture | Framework | Algorithm | Uni/Multivariate | External Feature Support
--------------------------------------------------------------------------------------------|------------|-----------------------------------|---------------------------------|----------------------------------------------|------------------------------|---------------------------------------------------------------------|--------------------|--------------------------
[Baseline](retail_sales%2FOrangeJuice_Pt_3Weeks_Weekly%2Fsubmissions%2Fbaseline) | 109.67 | 114.06 | 0.003 | Linux DSVM(Standard D2s v3 - Premium SSD) | forecast package of R | Naive Forecast | Univariate | No
[AutoARIMA](retail_sales%2FOrangeJuice_Pt_3Weeks_Weekly%2Fsubmissions%2FARIMA) | 70.80 | 265.94 | 0.0071 | Linux DSVM(Standard D2s v3 - Premium SSD) | forecast package of R | Auto ARIMA | Multivariate | Yes
[ETS](retail_sales%2FOrangeJuice_Pt_3Weeks_Weekly%2Fsubmissions%2FETS) | 70.99 | 277 | 0.01 | Linux DSVM(Standard D2s v3 - Premium SSD) | forecast package of R | ETS | Multivariate | No
[MeanForecast](retail_sales%2FOrangeJuice_Pt_3Weeks_Weekly%2Fsubmissions%2FMeanForecast) | 70.74 | 69.88 | 0.002 | Linux DSVM(Standard D2s v3 - Premium SSD) | forecast package of R | Mean forecast | Univariate | No
[SeasonalNaive](retail_sales%2FOrangeJuice_Pt_3Weeks_Weekly%2Fsubmissions%2FSeasonalNaive) | 165.06 | 160.45 | 0.004 | Linux DSVM(Standard D2s v3 - Premium SSD) | forecast package of R | Seasonal Naive | Univariate | No
[LightGBM](retail_sales%2FOrangeJuice_Pt_3Weeks_Weekly%2Fsubmissions%2FLightGBM) | 36.28 | 625.10 | 0.0167 | Linux DSVM (Standard D2s v3 - Premium SSD) | lightGBM package of Python | Gradient Boosting Decision Tree | Multivariate | Yes
[DilatedCNN](retail_sales%2FOrangeJuice_Pt_3Weeks_Weekly%2Fsubmissions%2FDilatedCNN) | 37.09 | 413 | 0.1032 | Ubuntu VM(NC6 - Standard HDD) | Keras and Tensorflow | Python + Dilated convolutional neural network | Multivariate | Yes
[RNN Encoder-Decoder](retail_sales%2FOrangeJuice_Pt_3Weeks_Weekly%2Fsubmissions%2FRNN) | 37.68 | 669 | 0.2 | Ubuntu VM(NC6 - Standard HDD) | Tensorflow | Python + Encoder-decoder architecture of recurrent neural network | Multivariate | Yes
The following chart compares the submissions performance on accuracy in %MAPE vs. Training and Scoring cost in $:
![EnergyPBLvsTime](./docs/images/Retail-Cost.png)
## Content
The following is a summary of models and methods for developing forecasting solutions covered in this repository. The [examples](examples) are organized according to use cases. Currently, we focus on a retail sales forecasting use case as it is widely used in [assortment planning](https://repository.upenn.edu/cgi/viewcontent.cgi?article=1569&context=edissertations), [inventory optimization](https://en.wikipedia.org/wiki/Inventory_optimization), and [price optimization](https://en.wikipedia.org/wiki/Price_optimization). To enable high-throughput forecasting scenarios, we have included examples for forecasting multiple time series with distributed training techniques such as Ray in Python, parallel package in R, and multi-threading in LightGBM.
| Model | Language | Description |
|---------------------------------------------------------------------------------------------------|----------|-------------------------------------------------------------------------------------------------------------|
| [Auto ARIMA](examples/grocery_sales/python/00_quick_start/autoarima_single_round.ipynb) | Python | Auto Regressive Integrated Moving Average (ARIMA) model that is automatically selected |
| [Linear Regression](examples/grocery_sales/python/00_quick_start/azure_automl_single_round.ipynb) | Python | Linear regression model trained on lagged features of the target variable and external features |
| [LightGBM](examples/grocery_sales/python/00_quick_start/lightgbm_single_round.ipynb) | Python | Gradient boosting decision tree implemented with LightGBM package for high accuracy and fast speed |
| [DilatedCNN](examples/grocery_sales/python/02_model/dilatedcnn_multi_round.ipynb) | Python | Dilated Convolutional Neural Network that captures long-range temporal flow with dilated causal connections |
| [Mean Forecast](examples/grocery_sales/R/02_basic_models.Rmd) | R | Simple forecasting method based on historical mean |
| [ARIMA](examples/grocery_sales/R/02a_reg_models.Rmd) | R | ARIMA model without or with external features |
| [ETS](examples/grocery_sales/R/02_basic_models.Rmd) | R | Exponential Smoothing algorithm with additive errors |
| [Prophet](examples/grocery_sales/R/02b_prophet_models.Rmd) | R | Automated forecasting procedure based on an additive model with non-linear trends |
The repository also comes with AzureML-themed notebooks and best practices recipes to accelerate the development of scalable, production-grade forecasting solutions on Azure. In particular, we have the following examples for forecasting with Azure AutoML as well as tuning and deploying a forecasting model on Azure.
| Method | Language | Description |
|-----------------------------------------------------------------------------------------------------------|----------|------------------------------------------------------------------------------------------------------------|
| [Azure AutoML](examples/grocery_sales/python/00_quick_start/azure_automl_single_round.ipynb) | Python | AzureML service that automates model development process and identifies the best machine learning pipeline |
| [HyperDrive](examples/grocery_sales/python/03_model_tune_deploy/azure_hyperdrive_lightgbm.ipynb) | Python | AzureML service for tuning hyperparameters of machine learning models in parallel on cloud |
| [AzureML Web Service](examples/grocery_sales/python/03_model_tune_deploy/azure_hyperdrive_lightgbm.ipynb) | Python | AzureML service for deploying a model as a web service on Azure Container Instances |
## Getting Started in Python
To quickly get started with the repository on your local machine, use the following commands.
1. Install Anaconda with Python >= 3.6. [Miniconda](https://conda.io/miniconda.html) is a quick way to get started.
2. Clone the repository
```
git clone https://github.com/microsoft/forecasting
cd forecasting/
```
3. Run setup scripts to create conda environment. Please execute one of the following commands from the root of Forecasting repo based on your operating system.
- Linux
```
./tools/environment_setup.sh
```
- Windows
```
tools\environment_setup.bat
```
Note that for Windows you need to run the batch script from Anaconda Prompt. The script creates a conda environment `forecasting_env` and installs the forecasting utility library `fclib`.
4. Start the Jupyter notebook server
```
jupyter notebook
```
5. Run the [LightGBM single-round](examples/oj_retail/python/00_quick_start/lightgbm_single_round.ipynb) notebook under the `00_quick_start` folder. Make sure that the selected Jupyter kernel is `forecasting_env`.
If you have any issues with the above setup, or want to find more detailed instructions on how to set up your environment and run examples provided in the repository, on local or a remote machine, please navigate to the [Setup Guide](./docs/SETUP.md).
## Getting Started in R
We assume you already have R installed on your machine. If not, simply follow the [instructions on CRAN](https://cloud.r-project.org/) to download and install R.
The recommended editor is [RStudio](https://rstudio.com), which supports interactive editing and previewing of R notebooks. However, you can use any editor or IDE that supports RMarkdown. In particular, [Visual Studio Code](https://code.visualstudio.com) with the [R extension](https://marketplace.visualstudio.com/items?itemName=Ikuyadeu.r) can be used to edit and render the notebook files. The rendered `.nb.html` files can be viewed in any modern web browser.
The examples use the [Tidyverts](https://tidyverts.org) family of packages, which is a modern framework for time series analysis that builds on the widely-used [Tidyverse](https://tidyverse.org) family. The Tidyverts framework is still under active development, so it's recommended that you update your packages regularly to get the latest bug fixes and features.
## Target Audience
Our target audience for this repository includes data scientists and machine learning engineers with varying levels of knowledge in forecasting as our content is source-only and targets custom machine learning modelling. The utilities and examples provided are intended to be solution accelerators for real-world forecasting problems.
## Contributing
We hope that the open source community will contribute to the content and bring in the latest SOTA algorithms. This project welcomes contributions and suggestions. Before contributing, please see our [Contributing Guide](CONTRIBUTING.md).
## Reference
The following is a list of related repositories that you may find helpful.
| | |
|------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------|
| [Deep Learning for Time Series Forecasting](https://github.com/Azure/DeepLearningForTimeSeriesForecasting) | A collection of examples for using deep neural networks for time series forecasting with Keras. |
| [Microsoft AI Github](https://github.com/microsoft/ai) | Find other Best Practice projects, and Azure AI designed patterns in our central repository. |
## Build Status
| Build | Branch | Status |
|---------------|---------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Linux CPU** | master | [![Build Status](https://dev.azure.com/best-practices/forecasting/_apis/build/status/cpu_unit_tests_linux?branchName=master)](https://dev.azure.com/best-practices/forecasting/_build/latest?definitionId=128&branchName=master) |
| **Linux CPU** | staging | [![Build Status](https://dev.azure.com/best-practices/forecasting/_apis/build/status/cpu_unit_tests_linux?branchName=staging)](https://dev.azure.com/best-practices/forecasting/_build/latest?definitionId=128&branchName=staging) |

R_utils/cluster.R (new file, 36 lines)

@@ -0,0 +1,36 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
#' Creates a local background cluster for parallel computations
#'
#' @param ncores The number of nodes (cores) for the cluster. The default is 2 less than the number of physical cores.
#' @param libs The packages to load on each node, as a character vector.
#' @param useXDR For most platforms, this can be left at its default `FALSE` value.
#' @return
#' A cluster object.
make_cluster <- function(ncores=NULL, libs=character(0), useXDR=FALSE)
{
    if(is.null(ncores))
        ncores <- max(2, parallel::detectCores(logical=FALSE) - 2)
    cl <- parallel::makeCluster(ncores, type="PSOCK", useXDR=useXDR)
    res <- try(parallel::clusterCall(
        cl,
        function(libs)
        {
            for(lib in libs) library(lib, character.only=TRUE)
        },
        libs
    ), silent=TRUE)
    if(inherits(res, "try-error"))
        parallel::stopCluster(cl)
    else cl
}
#' Deletes a local background cluster
#'
#' @param cl The cluster object, as returned from `make_cluster`.
destroy_cluster <- function(cl)
{
    try(parallel::stopCluster(cl), silent=TRUE)
}

R_utils/model_eval.R (new file, 50 lines)

@@ -0,0 +1,50 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
#' Computes forecast values on a dataset
#'
#' @param mable A mable (model table) as returned by `fabletools::model`.
#' @param newdata The dataset for which to compute forecasts.
#' @param ... Further arguments to `fabletools::forecast`.
#' @return
#' A tsibble, with one column per model type in `mable`, and one column named `.response` containing the response variable from `newdata`.
get_forecasts <- function(mable, newdata, ...)
{
    fcast <- forecast(mable, new_data=newdata, ...)
    # Drop the last key (.model); the remaining keys identify each series
    keyvars <- key_vars(fcast)
    keyvars <- keyvars[-length(keyvars)]
    indexvar <- index_var(fcast)
    fcastvar <- as.character(attr(fcast, "response")[[1]])
    # Reshape so that each model's forecasts become a separate column
    fcast <- fcast %>%
        as_tibble() %>%
        pivot_wider(
            id_cols=all_of(c(keyvars, indexvar)),
            names_from=.model,
            values_from=all_of(fcastvar))
    # Attach the actuals from newdata as a .response column
    select(newdata, !!keyvars, !!indexvar, !!fcastvar) %>%
        rename(.response=!!fcastvar) %>%
        inner_join(fcast)
}
#' Evaluate quality of forecasts given a criterion
#'
#' @param fcast_df A tsibble as returned from `get_forecasts`.
#' @param gof A goodness-of-fit function. The default is to use `fabletools::MAPE`, which computes the mean absolute percentage error.
#' @return
#' A single-row data frame with the computed goodness-of-fit statistic for each model.
eval_forecasts <- function(fcast_df, gof=fabletools::MAPE)
{
    if(!is.function(gof))
        gof <- get(gof, mode="function")
    resp <- fcast_df$.response
    keyvars <- key_vars(fcast_df)
    indexvar <- index_var(fcast_df)
    # Apply the goodness-of-fit function to every model column
    fcast_df %>%
        as_tibble() %>%
        select(-all_of(c(keyvars, indexvar, ".response"))) %>%
        summarise_all(
            function(x, .actual) gof(x - .actual, .actual=.actual),
            .actual=resp
        )
}
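An illustrative sketch of how these two helpers fit together, assuming `train` and `test` are tsibbles with a `DEMAND` response; the model names are arbitrary.

```r
library(fable)
library(fabletools)

# Fit two candidate models per series, forecast over the test period,
# then score each model with MAPE (the default) and RMSE.
mbl <- model(train, ets=ETS(DEMAND), arima=ARIMA(DEMAND))
fcast_df <- get_forecasts(mbl, test)
eval_forecasts(fcast_df)
eval_forecasts(fcast_df, gof=fabletools::RMSE)
```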

R_utils/save_objects.R Normal file

@ -0,0 +1,25 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
#' Loads serialised objects relating to a given forecasting example into the current workspace
#'
#' @param example The particular forecasting example.
#' @param file The name of the file (with extension).
#' @return
#' This function is run for its side effect, namely loading the given file into the global environment.
load_objects <- function(example, file)
{
    examp_dir <- here::here("examples", example, "R")
    load(file.path(examp_dir, file), envir=globalenv())
}
#' Saves R objects for a forecasting example to a file
#'
#' @param ... Objects to save, as unquoted names.
#' @param example The particular forecasting example.
#' @param file The name of the file (with extension).
save_objects <- function(..., example, file)
{
    examp_dir <- here::here("examples", example, "R")
    save(..., file=file.path(examp_dir, file))
}
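A hypothetical round trip with these helpers; the example name and object names are placeholders.

```r
# Persist fitted objects for an example, then restore them in a fresh session
save_objects(mbl, fcast_df, example="grocery_sales", file="models.Rdata")
load_objects("grocery_sales", "models.Rdata")
```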

assets/time_series_split_multiround.jpg Normal file (binary data, 125 KiB; binary file not shown)

assets/time_series_split_singleround.jpg Normal file (binary data, 52 KiB; binary file not shown)

codeofconduct.md Normal file

@ -0,0 +1 @@
[Our Code of Conduct](https://opensource.microsoft.com/codeofconduct/faq/)

@ -1,112 +0,0 @@
#!/usr/bin/env python
# coding: utf-8
import csvtomd
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

### Generating performance charts
#################################################

# Function to plot a performance chart
def plot_perf(x, y, df):
    # extract submission name from submission URL
    labels = df.apply(lambda x: x['Submission Name'][1:].split(']')[0], axis=1)
    fig = plt.scatter(x=df[x], y=df[y], label=labels, s=150, alpha=0.5,
                      c=['b', 'g', 'r', 'c', 'm', 'y', 'k'])
    plt.xlabel(x)
    plt.ylabel(y)
    plt.title(y + ' by ' + x)
    offset = (max(df[y]) - min(df[y])) / 50
    for i, name in enumerate(labels):
        ax = df[x][i]
        ay = df[y][i] + offset * (-2.5 + i % 5)
        plt.text(ax, ay, name, fontsize=10)
    return fig

### Printing the Readme.md file
############################################
readmefile = '../Readme.md'

# Write header
print('# TSPerf\n', file=open(readmefile, "w"))
print('TSPerf is a collection of implementations of time-series forecasting algorithms in Azure cloud and comparison of their performance over benchmark datasets. \
Algorithm implementations are compared by model accuracy, training and scoring time and cost. Each implementation includes all the necessary \
instructions and tools that ensure its reproducibility.', file=open(readmefile, "a"))
print('The following table summarizes benchmarks that are currently included in TSPerf.\n', file=open(readmefile, "a"))

# Read the benchmark table from the CSV file and convert it to a table in md format
with open('Benchmarks.csv', 'r') as f:
    table = csvtomd.csv_to_table(f, ',')
print(csvtomd.md_table(table), file=open(readmefile, "a"))
print('\n\n\n', file=open(readmefile, "a"))
print('A complete documentation of TSPerf, along with the instructions for submitting and reviewing implementations, \
can be found [here](./docs/tsperf_rules.md). The tables below show performance of implementations that are developed so far. Source code of \
implementations and instructions for reproducing their performance can be found in submission folders, which are linked in the first column.\n', file=open(readmefile, "a"))

### Write the Energy section
#============================
print('## Probabilistic energy forecasting performance board\n\n', file=open(readmefile, "a"))
print('The following table lists the current submissions for the energy forecasting and their respective performances.\n\n', file=open(readmefile, "a"))

# Read the energy performance board from the CSV file and convert it to a table in md format
with open('TSPerfBoard-Energy.csv', 'r') as f:
    table = csvtomd.csv_to_table(f, ',')
print(csvtomd.md_table(table), file=open(readmefile, "a"))

# Read Energy Performance Board CSV file
df = pd.read_csv('TSPerfBoard-Energy.csv', engine='python')

# Plot 'Pinball Loss' by 'Training and Scoring Cost($)' chart
fig4 = plt.figure(figsize=(12, 8), dpi=80, facecolor='w', edgecolor='k')  # this sets the plotting area size
fig4 = plot_perf('Training and Scoring Cost($)', 'Pinball Loss', df)
plt.savefig('../docs/images/Energy-Cost.png')

# Insert the performance charts
print('\n\nThe following chart compares the submissions performance on accuracy in Pinball Loss vs. Training and Scoring cost in $:\n\n ', file=open(readmefile, "a"))
print('![EnergyPBLvsTime](./docs/images/Energy-Cost.png)', file=open(readmefile, "a"))
print('\n\n\n', file=open(readmefile, "a"))

### Print the retail sales forecasting section
#========================================
print('## Retail sales forecasting performance board\n\n', file=open(readmefile, "a"))
print('The following table lists the current submissions for the retail forecasting and their respective performances.\n\n', file=open(readmefile, "a"))

# Read the retail performance board from the CSV file and convert it to a table in md format
with open('TSPerfBoard-Retail.csv', 'r') as f:
    table = csvtomd.csv_to_table(f, ',')
print(csvtomd.md_table(table), file=open(readmefile, "a"))
print('\n\n\n', file=open(readmefile, "a"))

# Read Retail Performance Board CSV file
df = pd.read_csv('TSPerfBoard-Retail.csv', engine='python')

# Plot MAPE (%) by Training and Scoring Cost ($) chart
fig2 = plt.figure(figsize=(12, 8), dpi=80, facecolor='w', edgecolor='k')  # this sets the plotting area size
fig2 = plot_perf('Training and Scoring Cost ($)', 'MAPE (%)', df)
plt.savefig('../docs/images/Retail-Cost.png')

# Insert the performance charts
print('\n\nThe following chart compares the submissions performance on accuracy in %MAPE vs. Training and Scoring cost in $:\n\n ', file=open(readmefile, "a"))
print('![EnergyPBLvsTime](./docs/images/Retail-Cost.png)', file=open(readmefile, "a"))
print('\n\n\n', file=open(readmefile, "a"))

print('A new Readme.md file has been generated successfully.')

@ -1,17 +0,0 @@
name: tsperf
channels:
- defaults
- r
- conda-forge
dependencies:
- python=3.6
- numpy=1.15.0
- pandas=0.23.4
- xlrd=1.1.0
- urllib3=1.21.1
- jupyter=1.0.0
- r-essentials=3.5.1
- matplotlib=2.2.3
- pip:
- csvtomd==0.3.0


@ -1 +0,0 @@
5/cVuditI8OEN7ADztEWg6k+91MTQVbt


@ -1,118 +0,0 @@
import datetime

import pandas as pd
from dateutil.relativedelta import relativedelta

ALLOWED_TIME_COLUMN_TYPES = [pd.Timestamp, pd.DatetimeIndex,
                             datetime.datetime, datetime.date]


def is_datetime_like(x):
    """Function that checks if a data frame column x is of a datetime type."""
    return any(isinstance(x, col_type)
               for col_type in ALLOWED_TIME_COLUMN_TYPES)


def get_datetime_col(df, datetime_colname):
    """
    Helper function for extracting the datetime column as datetime type from
    a data frame.

    Args:
        df: pandas DataFrame containing the column to convert
        datetime_colname: name of the column to be converted

    Returns:
        pandas.Series: converted column

    Raises:
        Exception: if datetime_colname does not exist in the data frame df.
        Exception: if datetime_colname cannot be converted to datetime type.
    """
    if datetime_colname in df.index.names:
        datetime_col = df.index.get_level_values(datetime_colname)
    elif datetime_colname in df.columns:
        datetime_col = df[datetime_colname]
    else:
        raise Exception('Column or index {0} does not exist in the data '
                        'frame'.format(datetime_colname))
    if not is_datetime_like(datetime_col):
        try:
            # Convert the extracted column rather than re-indexing df, so this
            # also works when datetime_colname is an index level
            datetime_col = pd.to_datetime(datetime_col)
        except Exception:
            raise Exception('Column or index {0} can not be converted to '
                            'datetime type.'.format(datetime_colname))
    return datetime_col


def get_month_day_range(date):
    """
    Returns the first date and last date of the month of the given date.
    """
    # Replace the day in the original timestamp with day 1
    first_day = date + relativedelta(day=1)
    # Go to day 1, add a month to reach the first day of the next month,
    # then subtract one day to get the last day of the current month
    last_day = date + relativedelta(day=1, months=1, days=-1, hours=23)
    return first_day, last_day


def split_train_validation(df, fct_horizon, datetime_colname):
    """
    Splits the input dataframe into train and validate folds based on the forecast
    creation time (fct) and forecast horizon specified by fct_horizon.

    Args:
        df: The input data frame to split.
        fct_horizon: list of tuples in the format of
            (fct, (forecast_horizon_start, forecast_horizon_end))
        datetime_colname: name of the datetime column

    Note: df[datetime_colname] needs to be a datetime type.
    """
    i_round = 0
    for fct, horizon in fct_horizon:
        i_round += 1
        train = df.loc[df[datetime_colname] < fct, ].copy()
        validation = df.loc[(df[datetime_colname] >= horizon[0]) &
                            (df[datetime_colname] <= horizon[1]), ].copy()
        yield i_round, train, validation


def add_datetime(input_datetime, unit, add_count):
    """
    Function to add a specified number of units of time (years, months, weeks,
    days, hours, or minutes) to the input datetime.

    Args:
        input_datetime: datetime to be added to
        unit: unit of time, valid values: 'year', 'month', 'week',
            'day', 'hour', 'minute'.
        add_count: number of units to add

    Returns:
        New datetime after adding the time difference to input datetime.

    Raises:
        Exception: if an invalid unit is provided. Valid units are:
            'year', 'month', 'week', 'day', 'hour', 'minute'.
    """
    if unit == 'year':
        new_datetime = input_datetime + relativedelta(years=add_count)
    elif unit == 'month':
        new_datetime = input_datetime + relativedelta(months=add_count)
    elif unit == 'week':
        new_datetime = input_datetime + relativedelta(weeks=add_count)
    elif unit == 'day':
        new_datetime = input_datetime + relativedelta(days=add_count)
    elif unit == 'hour':
        new_datetime = input_datetime + relativedelta(hours=add_count)
    elif unit == 'minute':
        new_datetime = input_datetime + relativedelta(minutes=add_count)
    else:
        raise Exception('Invalid backtest step unit, {}, provided. Valid '
                        'step units are year, month, week, day, hour, and minute'
                        .format(unit))
    return new_datetime

contrib/README.md Normal file

@ -0,0 +1,3 @@
# Contrib
Independent or incubating algorithms and utilities are candidates for the `contrib` folder. This folder houses contributions that may not fit easily into the core repository, or that need time for code refactoring and the addition of the necessary tests.


@ -1,8 +1,6 @@
## Download base image
FROM continuumio/anaconda3:4.4.0
#ADD TSPerf/energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/GBM/conda_dependencies.yml /tmp/conda_dependencies.yml
FROM rocker/r-base
ADD ./conda_dependencies.yml /tmp
#ADD TSPerf/energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/GBM/install_R_dependencies.R /tmp/install_R_dependencies.R
ADD ./install_R_dependencies.R /tmp
WORKDIR /tmp
@ -13,12 +11,9 @@ RUN apt-get install -y --no-install-recommends \
zlib1g-dev \
libssl-dev \
libssh2-1-dev \
libcurl4-openssl-dev \
libreadline-gplv2-dev \
libncursesw5-dev \
libsqlite3-dev \
tk-dev \
libgdbm-dev \
libc6-dev \
libbz2-dev \
libffi-dev \
@ -26,26 +21,20 @@ RUN apt-get install -y --no-install-recommends \
build-essential \
checkinstall \
ca-certificates \
curl \
lsb-release \
apt-utils \
python3-pip \
vim
vim
## Create and activate conda environment
# Install miniconda
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
RUN bash ~/miniconda.sh -b -p $HOME/miniconda
ENV PATH="/root/miniconda/bin:${PATH}"
## Create conda environment
RUN conda update -y conda
RUN conda env create --file conda_dependencies.yml
## Install R
ENV R_BASE_VERSION 3.5.1
RUN apt-get install -y aptitude
RUN echo "deb http://http.debian.net/debian sid main" > /etc/apt/sources.list.d/debian-unstable.list \
&& aptitude install -y debian-keyring debian-archive-keyring
RUN apt-get remove -y binutils
RUN apt-get update \
&& apt-get install -t unstable -y --no-install-recommends \
r-base=${R_BASE_VERSION}-*
# Install prerequisites of R packages
RUN apt-get install -y \
gfortran \
@ -62,7 +51,7 @@ RUN Rscript install_R_dependencies.R
RUN rm install_R_dependencies.R
RUN rm conda_dependencies.yml
RUN mkdir /TSPerf
WORKDIR /TSPerf
RUN mkdir /Forecasting
WORKDIR /Forecasting
ENTRYPOINT ["/bin/bash"]


@ -12,7 +12,7 @@
**Submission name:** GBM
**Submission path:** energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/GBM
**Submission path:** benchmarks/GEFCom2017_D_Prob_MT_hourly/GBM
## Implementation description
@ -35,105 +35,105 @@ The data of January - April of 2016 were used as validation dataset for some min
### Description of implementation scripts
* `feature_engineering.py`: Python script for computing features and generating feature files.
* `compute_features.py`: Python script for computing features and generating feature files.
* `train_predict.R`: R script that trains Gradient Boosting Machine model for quantile regression task and predicts on each round of test data.
* `train_score_vm.sh`: Bash script that runs `feature_engineering.py`and `train_predict.R` five times to generate five submission files and measure model running time.
* `train_score_vm.sh`: Bash script that runs `compute_features.py` and `train_predict.R` five times to generate five submission files and measure model running time.
### Steps to reproduce results
0. Follow the instructions [here](#resource-deployment-instructions) to provision a Linux virtual machine and log into the provisioned
1. Follow the instructions [here](#resource-deployment-instructions) to provision a Linux virtual machine and log into the provisioned
VM.
1. Clone the Forecasting repository to the home directory of your machine
2. Clone the Forecasting repository to the home directory of your machine
```bash
cd ~
git clone https://github.com/Microsoft/Forecasting.git
```
Use one of the following options to securely connect to the Git repo:
* [Personal Access Tokens](https://help.github.com/articles/creating-a-personal-access-token-for-the-command-line/)
For this method, the clone command becomes
Use one of the following options to securely connect to the Git repo:
* [Personal Access Tokens](https://help.github.com/articles/creating-a-personal-access-token-for-the-command-line/)
For this method, the clone command becomes
```bash
git clone https://<username>:<personal access token>@github.com/Microsoft/Forecasting.git
```
* [Git Credential Managers](https://github.com/Microsoft/Git-Credential-Manager-for-Windows)
* [Authenticate with SSH](https://help.github.com/articles/connecting-to-github-with-ssh/)
2. Create a conda environment for running the scripts of data downloading, data preparation, and result evaluation.
* [Git Credential Managers](https://github.com/Microsoft/Git-Credential-Manager-for-Windows)
* [Authenticate with SSH](https://help.github.com/articles/connecting-to-github-with-ssh/)
3. Create a conda environment for running the scripts of data downloading, data preparation, and result evaluation.
To do this, you need to check if conda has been installed by running the command `conda -V`. If it is installed, you will see the conda version in the terminal. Otherwise, please follow the instructions [here](https://conda.io/docs/user-guide/install/linux.html) to install conda.
From the `~/Forecasting` directory on the VM create a conda environment named `tsperf` by running:
From the `~/Forecasting` directory on the VM create a conda environment named `tsperf` by running:
```bash
conda env create --file ./common/conda_dependencies.yml
```
```bash
conda env create --file tsperf/benchmarking/conda_dependencies.yml
```
3. Download and extract data **on the VM**.
4. Download and extract data **on the VM**.
```bash
source activate tsperf
python energy_load/GEFCom2017_D_Prob_MT_hourly/common/download_data.py
python energy_load/GEFCom2017_D_Prob_MT_hourly/common/extract_data.py
python tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/download_data.py
python tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/extract_data.py
```
4. Prepare Docker container for model training and predicting.
5. Prepare Docker container for model training and predicting.
> NOTE: To execute docker commands without sudo as a non-root user, you need to create a Unix group and add users to it by following the instructions [here](https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user). Otherwise, simply prefix all docker commands with sudo.
4.1 Log into Azure Container Registry (ACR)
4.1 Make sure Docker is installed
You can check if Docker is installed on your VM by running
```bash
sudo docker login --username tsperf --password <ACR Access Key> tsperf.azurecr.io
sudo docker -v
```
You will see the Docker version if Docker is installed. If not, you can install it by following the instructions [here](https://docs.docker.com/install/linux/docker-ce/ubuntu/).
4.2 Build a local Docker image
```bash
sudo docker build -t gbm_image benchmarks/GEFCom2017_D_Prob_MT_hourly/GBM
```
The `<ACR Access Key>` can be found [here](https://github.com/Microsoft/Forecasting/blob/master/common/key.txt).
6. Train and predict **within Docker container**
4.2 Pull the Docker image from ACR to your VM
6.1 Start a Docker container from the image
```bash
sudo docker pull tsperf.azurecr.io/energy_load/gefcom2017_d_prob_mt_hourly/gbm_image:v1
```
5. Train and predict **within Docker container**
5.1 Start a Docker container from the image
```bash
sudo docker run -it -v ~/Forecasting:/Forecasting --name gbm_container tsperf.azurecr.io/energy_load/gefcom2017_d_prob_mt_hourly/gbm_image:v1
sudo docker run -it -v ~/Forecasting:/Forecasting --name gbm_container gbm_image
```
Note that option `-v ~/Forecasting:/Forecasting` mounts the `~/Forecasting` folder (the one you cloned) to the container so that you can access the code and data on your VM within the container.
5.2 Train and predict
6.2 Train and predict
```
source activate tsperf
cd /Forecasting
bash ./energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/GBM/train_score_vm.sh > out.txt &
bash benchmarks/GEFCom2017_D_Prob_MT_hourly/GBM/train_score_vm.sh > out.txt &
```
After generating the forecast results, you can exit the Docker container with command `exit`.
6. Model evaluation **on the VM**
7. Model evaluation **on the VM**
```bash
source activate tsperf
cd ~/Forecasting
bash ./common/evaluate submissions/GBM energy_load/GEFCom2017_D_Prob_MT_hourly
bash tsperf/benchmarking/evaluate GBM tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly
```
## Implementation resources
**Platform:** Azure Cloud
**Resource location:** East US region
**Hardware:** Standard D8s v3 (8 vcpus, 32 GB memory) Ubuntu Linux VM
**Data storage:** Premium SSD
**Docker image:** tsperf.azurecr.io/energy_load/gefcom2017_d_prob_mt_hourly/gbm_image
**Hardware:** Standard D8s v3 (8 vcpus, 32 GB memory) Ubuntu Linux VM
**Data storage:** Premium SSD
**Dockerfile:** [energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/GBM/Dockerfile](https://github.com/Microsoft/Forecasting/blob/master/energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/GBM/Dockerfile)
**Key packages/dependencies:**
* Python
- python==3.6
- python==3.7
* R
- r-base==3.5.1
- r-base==3.5.3
- gbm==2.1.3
- data.table==1.11.4
@ -145,34 +145,34 @@ Please follow the instructions below to deploy the Linux DSVM.
## Implementation evaluation
**Quality:**
* Pinball loss run 1: 78.71
* Pinball loss run 2: 78.72
* Pinball loss run 3: 78.69
* Pinball loss run 4: 78.71
* Pinball loss run 5: 78.71
* Pinball loss run 1: 78.85
* Pinball loss run 2: 78.84
* Pinball loss run 3: 78.86
* Pinball loss run 4: 78.76
* Pinball loss run 5: 78.82
Median Pinball loss: **78.71**
Median Pinball loss: **78.84**
**Time:**
* Run time 1: 878 seconds
* Run time 2: 888 seconds
* Run time 3: 894 seconds
* Run time 4: 894 seconds
* Run time 5: 878 seconds
* Run time 1: 268 seconds
* Run time 2: 269 seconds
* Run time 3: 269 seconds
* Run time 4: 269 seconds
* Run time 5: 266 seconds
Median run time: **888 seconds**
Median run time: **269 seconds**
**Cost:**
The hourly cost of the Standard D8s Ubuntu Linux VM in East US Azure region is 0.3840 USD, based on the price at the submission date. Thus, the total cost is `888/3600 * 0.3840 = $0.0947`.
The hourly cost of the Standard D8s Ubuntu Linux VM in East US Azure region is 0.3840 USD, based on the price at the submission date. Thus, the total cost is `269/3600 * 0.3840 = $0.0287`.
**Average relative improvement (in %) over GEFCom2017 benchmark model** (measured over the first run)
Round 1: 9.57
Round 2: 18.17
Round 3: 17.83
Round 4: 8.58
Round 5: 7.54
Round 6: 6.96
Round 1: 9.55
Round 2: 18.24
Round 3: 17.90
Round 4: 8.27
Round 5: 7.22
Round 6: 6.80
**Ranking in the qualifying round of GEFCom2017 competition**
4


@ -0,0 +1,66 @@
"""
This script uses
energy_load/GEFCom2017_D_Prob_MT_hourly/common/feature_engineering.py to
compute a list of features needed by the Gradient Boosting Machines model.
"""
import os
import sys
import getopt
import localpath
from tsperf.benchmarking.GEFCom2017_D_Prob_MT_hourly.feature_engineering import compute_features
SUBMISSIONS_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
DATA_DIR = os.path.join(SUBMISSIONS_DIR, "data")
print("Data directory used: {}".format(DATA_DIR))
OUTPUT_DIR = os.path.join(DATA_DIR, "features")
TRAIN_DATA_DIR = os.path.join(DATA_DIR, "train")
TEST_DATA_DIR = os.path.join(DATA_DIR, "test")
DF_CONFIG = {
"time_col_name": "Datetime",
"ts_id_col_names": "Zone",
"target_col_name": "DEMAND",
"frequency": "H",
"time_format": "%Y-%m-%d %H:%M:%S",
}
# Feature configuration list used to specify the features to be computed by
# compute_features.
# Each feature configuration is a tuple in the format of (feature_name,
# featurizer_args)
# feature_name is used to determine the featurizer to use, see FEATURE_MAP in
# energy_load/GEFCom2017_D_Prob_MT_hourly/common/feature_engineering.py
# featurizer_args is a dictionary of arguments passed to the
# featurizer
feature_config_list = [
("temporal", {"feature_list": ["hour_of_day", "month_of_year"]}),
("annual_fourier", {"n_harmonics": 3}),
("weekly_fourier", {"n_harmonics": 3}),
("previous_year_load_lag", {"input_col_names": "DEMAND", "round_agg_result": True},),
("previous_year_temp_lag", {"input_col_names": "DryBulb", "round_agg_result": True},),
]
if __name__ == "__main__":
opts, args = getopt.getopt(sys.argv[1:], "", ["submission="])
for opt, arg in opts:
if opt == "--submission":
submission_folder = arg
output_data_dir = os.path.join(SUBMISSIONS_DIR, submission_folder, "data")
if not os.path.isdir(output_data_dir):
os.mkdir(output_data_dir)
OUTPUT_DIR = os.path.join(output_data_dir, "features")
if not os.path.isdir(OUTPUT_DIR):
os.mkdir(OUTPUT_DIR)
compute_features(
TRAIN_DATA_DIR,
TEST_DATA_DIR,
OUTPUT_DIR,
DF_CONFIG,
feature_config_list,
filter_by_month=True,
compute_load_ratio=True,
)


@ -6,4 +6,5 @@ dependencies:
- numpy=1.15.1
- pandas=0.23.4
- xlrd=1.1.0
- urllib3=1.21.1
- urllib3=1.21.1
- scikit-learn=0.20.3


@ -0,0 +1,7 @@
pkgs <- c(
'data.table',
'gbm',
'doParallel'
)
install.packages(pkgs)


@ -3,10 +3,9 @@ This script inserts the TSPerf directory into sys.path, so that scripts can impo
"""
import os, sys
_CUR_DIR = os.path.dirname(os.path.abspath(__file__))
_SUBMISSIONS_DIR = os.path.dirname(_CUR_DIR)
_BENCHMARK_DIR = os.path.dirname(_SUBMISSIONS_DIR)
TSPERF_DIR = os.path.dirname(os.path.dirname(_BENCHMARK_DIR))
_CURR_DIR = os.path.dirname(os.path.abspath(__file__))
TSPERF_DIR = os.path.dirname(os.path.dirname(os.path.dirname(_CURR_DIR)))
if TSPERF_DIR not in sys.path:
sys.path.insert(0, TSPERF_DIR)


@ -0,0 +1,101 @@
args = commandArgs(trailingOnly=TRUE)
seed_value = args[1]

library('data.table')
library('gbm')
library('doParallel')

# Set up a parallel backend with one worker per core
n_cores = detectCores()
cl <- parallel::makeCluster(n_cores)
parallel::clusterEvalQ(cl, lapply(c("gbm", "data.table"), library, character.only = TRUE))
registerDoParallel(cl)

data_dir = 'benchmarks/GEFCom2017_D_Prob_MT_hourly/GBM/data/features'
train_dir = file.path(data_dir, 'train')
test_dir = file.path(data_dir, 'test')
train_file_prefix = 'train_round_'
test_file_prefix = 'test_round_'
output_file = file.path(paste('benchmarks/GEFCom2017_D_Prob_MT_hourly/GBM/submission_seed_', seed_value, '.csv', sep=""))

normalize_columns = list('DEMAND_same_woy_lag', 'DryBulb_same_doy_lag')
quantiles = seq(0.1, 0.9, by = 0.1)
result_all = list()
N_ROUNDS = 6

for (iR in 1:N_ROUNDS){
    print(paste('Round', iR))
    train_file = file.path(train_dir, paste(train_file_prefix, iR, '.csv', sep=''))
    test_file = file.path(test_dir, paste(test_file_prefix, iR, '.csv', sep=''))
    train_df = fread(train_file)
    test_df = fread(test_file)
    # Min-max normalize the lag columns, using the training data range
    for (c in normalize_columns){
        min_c = min(train_df[, ..c])
        max_c = max(train_df[, ..c])
        train_df[, c] = (train_df[, ..c] - min_c)/(max_c - min_c)
        test_df[, c] = (test_df[, ..c] - min_c)/(max_c - min_c)
    }
    zones = unique(train_df[, Zone])
    hours = unique(train_df[, hour_of_day])
    all_zones_hours = expand.grid(zones, hours)
    colnames(all_zones_hours) = c('Zone', 'hour_of_day')
    test_df$average_load_ratio = rowMeans(test_df[, c('recent_load_ratio_10', 'recent_load_ratio_11', 'recent_load_ratio_12',
                                                      'recent_load_ratio_13', 'recent_load_ratio_14', 'recent_load_ratio_15', 'recent_load_ratio_16')], na.rm=TRUE)
    test_df[, load_ratio:=mean(average_load_ratio), by=list(hour_of_day, month_of_year)]
    ntrees = 1000
    shrinkage = 0.005
    # Fit one quantile GBM per (zone, hour) combination, in parallel
    result_all_zones_hours = foreach(i = 1:nrow(all_zones_hours), .combine = rbind) %dopar% {
        set.seed(seed_value)
        z = all_zones_hours[i, 'Zone']
        h = all_zones_hours[i, 'hour_of_day']
        train_df_sub = train_df[Zone == z & hour_of_day == h]
        test_df_sub = test_df[Zone == z & hour_of_day == h]
        result_all_quantiles = list()
        q_counter = 1
        for (tau in quantiles) {
            result = data.table(Zone=test_df_sub$Zone, Datetime = test_df_sub$Datetime, Round=iR)
            gbmModel = gbm(formula = DEMAND ~ DEMAND_same_woy_lag + DryBulb_same_doy_lag +
                               annual_sin_1 + annual_cos_1 + annual_sin_2 + annual_cos_2 + annual_sin_3 + annual_cos_3 +
                               weekly_sin_1 + weekly_cos_1 + weekly_sin_2 + weekly_cos_2 + weekly_sin_3 + weekly_cos_3,
                           distribution = list(name = "quantile", alpha = tau),
                           data = train_df_sub,
                           n.trees = ntrees,
                           shrinkage = shrinkage)
            gbmPredictions = predict(object = gbmModel,
                                     newdata = test_df_sub,
                                     n.trees = ntrees,
                                     type = "response") * test_df_sub$load_ratio
            result$Prediction = gbmPredictions
            result$q = tau
            result_all_quantiles[[q_counter]] = result
            q_counter = q_counter + 1
        }
        rbindlist(result_all_quantiles)
    }
    result_all[[iR]] = result_all_zones_hours
}

result_final = rbindlist(result_all)
# Sort predictions within each (Zone, Datetime, Round) group and reassign the
# quantile labels in increasing order, so quantiles never cross
result_final = result_final[order(Prediction), q:=quantiles, by=c('Zone', 'Datetime', 'Round')]
result_final$Prediction = round(result_final$Prediction)
fwrite(result_final, output_file)
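The quantile re-labelling step above can be illustrated on a toy table; the data here is made up. Within each (Zone, Datetime, Round) group, the rows are visited in order of increasing Prediction and the quantile labels reassigned, which removes quantile crossing without changing any predicted value.

```r
library(data.table)

dt <- data.table(Zone="A", Datetime="2017-02-01 01:00:00", Round=1,
                 Prediction=c(105, 98, 101), q=c(0.1, 0.2, 0.3))
quantiles <- c(0.1, 0.2, 0.3)
# After this, q increases with Prediction: 98 -> 0.1, 101 -> 0.2, 105 -> 0.3
dt <- dt[order(Prediction), q:=quantiles, by=c('Zone', 'Datetime', 'Round')]
```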


@ -1,14 +1,14 @@
#!/bin/bash
path=energy_load/GEFCom2017_D_Prob_MT_hourly
path=benchmarks/GEFCom2017_D_Prob_MT_hourly
for i in `seq 1 5`;
do
echo "Run $i"
start=`date +%s`
echo 'Creating features...'
python $path/submissions/fnn/feature_engineering.py --submission fnn
python $path/GBM/compute_features.py --submission GBM
echo 'Training and predicting...'
Rscript $path/submissions/fnn/train_predict.R $i
Rscript $path/GBM/train_predict.R $i
end=`date +%s`
echo 'Running time '$((end-start))' seconds'
done


@ -22,7 +22,7 @@ The table below summarizes the benchmark problem definition:
| **Forecast granularity** | hourly |
| **Forecast type** | probabilistic, 9 quantiles: 10th, 20th, ...90th percentiles|
A template of the submission file can be found [here](https://github.com/Microsoft/Forecasting/blob/master/energy_load/GEFCom2017_D_Prob_MT_hourly/reference/submission.csv)
A template of the submission file can be found [here](https://github.com/Microsoft/Forecasting/blob/master/benchmarks/GEFCom2017_D_Prob_MT_hourly/sample_submission.csv)
# Data
### Dataset attribution
@ -31,8 +31,7 @@ A template of the submission file can be found [here](https://github.com/Microso
### Dataset description
1. The data files can be downloaded from ISO New England website via the
[zonal information page of the energy, load and demand reports](https://www
.iso-ne.com/isoexpress/web/reports/load-and-demand/-/tree/zone-info). If you
[zonal information page of the energy, load and demand reports](https://www.iso-ne.com/isoexpress/web/reports/load-and-demand/-/tree/zone-info). If you
are outside the United States, you may need a VPN to access the data. Use columns
A, B, D, M and N in the worksheets of "YYYY SMD Hourly Data" files, where YYYY
represents the year. Detailed information of each column can be found in the
@ -68,6 +67,45 @@ using the available training data:
| 5 | 2011-01-01 01:00:00 | 2017-01-31 00:00:00 | 2017-03-01 01:00:00 | 2017-03-31 00:00:00 |
| 6 | 2011-01-01 01:00:00 | 2017-01-31 00:00:00 | 2017-04-01 01:00:00 | 2017-04-30 00:00:00 |
### Feature engineering
A common feature engineering script, common/feature_engineering.py, is provided to be used by individual submissions.
Below is an example of using this script.
The feature configuration list is used to specify the features to be computed by the compute_features function.
Each feature configuration is a tuple in the format of (feature_name, featurizer_args).
* feature_name is used to determine the featurizer to use, see FEATURE_MAP in
common/feature_engineering.py.
* featurizer_args is a dictionary of arguments passed to the featurizer.
```python
from energy_load.GEFCom2017_D_Prob_MT_hourly.common.feature_engineering\
import compute_features
DF_CONFIG = {
'time_col_name': 'Datetime',
'grain_col_name': 'Zone',
'value_col_name': 'DEMAND',
'frequency': 'hourly',
'time_format': '%Y-%m-%d %H:%M:%S'
}
feature_config_list = \
[('temporal', {'feature_list': ['hour_of_day', 'month_of_year']}),
('annual_fourier', {'n_harmonics': 3}),
('weekly_fourier', {'n_harmonics': 3}),
('previous_year_load_lag',
{'input_col_name': 'DEMAND', 'output_col_name': 'load_lag'}),
('previous_year_dry_bulb_lag',
{'input_col_name': 'DryBulb', 'output_col_name': 'dry_bulb_lag'})]
TRAIN_DATA_DIR = './data/train'
TEST_DATA_DIR = './data/test'
OUTPUT_DIR = './data/features'
compute_features(TRAIN_DATA_DIR, TEST_DATA_DIR, OUTPUT_DIR, DF_CONFIG,
feature_config_list,
filter_by_month=True)
```
# Model Evaluation
**Evaluation metric**: Pinball loss


@ -1,24 +1,19 @@
## Download base image
FROM continuumio/anaconda3:4.4.0
# ADD TSPerf/energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/fnn/conda_dependencies.yml /tmp/conda_dependencies.yml
# ADD TSPerf/energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/fnn/install_R_dependencies.R /tmp/install_R_dependencies.R
ADD ./conda_dependencies.yml /tmp/conda_dependencies.yml
ADD ./install_R_dependencies.R /tmp/install_R_dependencies.R
FROM rocker/r-base
ADD ./conda_dependencies.yml /tmp
ADD ./install_R_dependencies.R /tmp
WORKDIR /tmp
## Install basic packages
RUN apt-get update
RUN apt-get update
RUN apt-get install -y --no-install-recommends \
wget \
zlib1g-dev \
libssl-dev \
libssh2-1-dev \
libcurl4-openssl-dev \
libreadline-gplv2-dev \
libncursesw5-dev \
libsqlite3-dev \
tk-dev \
libgdbm-dev \
libc6-dev \
libbz2-dev \
libffi-dev \
@ -26,42 +21,37 @@ RUN apt-get install -y --no-install-recommends \
build-essential \
checkinstall \
ca-certificates \
curl \
lsb-release \
apt-utils \
python3-pip \
vim
## Create and activate conda environment
# Install miniconda
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
RUN bash ~/miniconda.sh -b -p $HOME/miniconda
ENV PATH="/root/miniconda/bin:${PATH}"
## Create conda environment
RUN conda update -y conda
RUN conda env create --file conda_dependencies.yml
## Install R
ENV R_BASE_VERSION 3.5.1
RUN apt-get install -y aptitude
RUN echo "deb http://http.debian.net/debian sid main" > /etc/apt/sources.list.d/debian-unstable.list \
&& aptitude install -y debian-keyring debian-archive-keyring
RUN apt-get remove -y binutils
RUN apt-get update \
&& apt-get install -t unstable -y --no-install-recommends \
r-base=${R_BASE_VERSION}-*
# Install prerequisites of R packages
RUN apt-get install -y \
gfortran \
liblapack-dev \
liblapack3 \
libopenblas-base \
libopenblas-dev
libopenblas-dev \
g++
## Mount R dependency file into the docker container and install dependencies
# Use a MRAN snapshot URL to download packages archived on a specific date
RUN echo 'options(repos = list(CRAN = "http://mran.revolutionanalytics.com/snapshot/2018-09-01/"))' >> /etc/R/Rprofile.site
RUN Rscript install_R_dependencies.R
RUN rm conda_dependencies.yml
RUN rm install_R_dependencies.R
RUN rm conda_dependencies.yml
RUN mkdir /TSPerf
WORKDIR /TSPerf
RUN mkdir /Forecasting
WORKDIR /Forecasting
ENTRYPOINT ["/bin/bash"]
ENTRYPOINT ["/bin/bash"]


@ -12,7 +12,7 @@
**Submission name:** baseline
**Submission path:** energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/baseline
**Submission path:** benchmarks/GEFCom2017_D_Prob_MT_hourly/baseline
## Implementation description
@ -36,16 +36,16 @@ No parameter tuning was done.
### Description of implementation scripts
* `feature_engineering.py`: Python script for computing features and generating feature files.
* `compute_features.py`: Python script for computing features and generating feature files.
* `train_predict.R`: R script that trains Quantile Regression models and predicts on each round of test data.
* `train_score_vm.sh`: Bash script that runs `feature_engineering.py`and `train_predict.R` five times to generate five submission files and measure model running time.
* `train_score_vm.sh`: Bash script that runs `compute_features.py` and `train_predict.R` five times to generate five submission files and measure model running time.
### Steps to reproduce results
0. Follow the instructions [here](#resource-deployment-instructions) to provision a Linux virtual machine and log into the provisioned
1. Follow the instructions [here](#resource-deployment-instructions) to provision a Linux virtual machine and log into the provisioned
VM.
1. Clone the Forecasting repo to the home directory of your machine
2. Clone the Forecasting repo to the home directory of your machine
```bash
cd ~
@ -60,81 +60,80 @@ VM.
* [Git Credential Managers](https://github.com/Microsoft/Git-Credential-Manager-for-Windows)
* [Authenticate with SSH](https://help.github.com/articles/connecting-to-github-with-ssh/)
2. Create a conda environment for running the scripts of data downloading, data preparation, and result evaluation.
3. Create a conda environment for running the scripts of data downloading, data preparation, and result evaluation.
To do this, you need to check if conda has been installed by running the command `conda -V`. If it is installed, you will see the conda version in the terminal. Otherwise, please follow the instructions [here](https://conda.io/docs/user-guide/install/linux.html) to install conda.
Then, you can go to `~/Forecasting` directory in the VM and create a conda environment named `tsperf` by running
```bash
cd ~/Forecasting
conda env create --file ./common/conda_dependencies.yml
conda env create --file tsperf/benchmarking/conda_dependencies.yml
```
3. Download and extract data **on the VM**.
4. Download and extract data **on the VM**.
```bash
source activate tsperf
python energy_load/GEFCom2017_D_Prob_MT_hourly/common/download_data.py
python energy_load/GEFCom2017_D_Prob_MT_hourly/common/extract_data.py
python tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/download_data.py
python tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/extract_data.py
```
4. Prepare Docker container for model training and predicting.
4.1 Log into Azure Container Registry (ACR)
5. Prepare Docker container for model training and predicting.
5.1 Make sure Docker is installed
You can check if Docker is installed on your VM by running
```bash
sudo docker login --username tsperf --password <ACR Access Key> tsperf.azurecr.io
sudo docker -v
```
You will see the Docker version if Docker is installed. If not, you can install it by following the instructions [here](https://docs.docker.com/install/linux/docker-ce/ubuntu/). Note that if you want to execute Docker commands without sudo as a non-root user, you need to create a Unix group and add users to it by following the instructions [here](https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user).
5.2 Build a local Docker image
```bash
sudo docker build -t baseline_image benchmarks/GEFCom2017_D_Prob_MT_hourly/baseline
```
The `<ACR Access Key>` can be found [here](https://github.com/Microsoft/Forecasting/blob/master/common/key.txt).
If you want to execute docker commands without sudo as a non-root user, you need to create a Unix group and add users to it by following the instructions [here](https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user).
6. Train and predict **within Docker container**
4.2 Pull the Docker image from ACR to your VM
6.1 Start a Docker container from the image
```bash
sudo docker pull tsperf.azurecr.io/energy_load/gefcom2017_d_prob_mt_hourly/baseline_image
```
5. Train and predict **within Docker container**
5.1 Start a Docker container from the image
```bash
sudo docker run -it -v ~/Forecasting:/Forecasting --name baseline_container tsperf.azurecr.io/energy_load/gefcom2017_d_prob_mt_hourly/baseline_image
sudo docker run -it -v ~/Forecasting:/Forecasting --name baseline_container baseline_image
```
Note that option `-v ~/Forecasting:/Forecasting` mounts the `~/Forecasting` folder (the one you cloned) to the container so that you can access the code and data on your VM within the container.
5.2 Train and predict
6.2 Train and predict
```
source activate tsperf
cd /Forecasting
bash ./energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/baseline/train_score_vm.sh
bash benchmarks/GEFCom2017_D_Prob_MT_hourly/baseline/train_score_vm.sh
```
After generating the forecast results, you can exit the Docker container by command `exit`.
6. Model evaluation **on the VM**
7. Model evaluation **on the VM**
```bash
source activate tsperf
cd ~/Forecasting
bash ./common/evaluate submissions/baseline energy_load/GEFCom2017_D_Prob_MT_hourly
```
```bash
source activate tsperf
cd ~/Forecasting
bash tsperf/benchmarking/evaluate baseline tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly
```
## Implementation resources
**Platform:** Azure Cloud
**Resource location:** East US region
**Hardware:** Standard D8s v3 (8 vcpus, 32 GB memory) Ubuntu Linux VM
**Hardware:** Standard D8s v3 (8 vcpus, 32 GB memory) Ubuntu Linux VM
**Data storage:** Premium SSD
**Docker image:** tsperf.azurecr.io/energy_load/gefcom2017_d_prob_mt_hourly/baseline_image
**Dockerfile:** [energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/baseline/Dockerfile](https://github.com/Microsoft/Forecasting/blob/master/energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/baseline/Dockerfile)
**Key packages/dependencies:**
* Python
- python==3.6
- python==3.7
* R
- r-base==3.5.1
- r-base==3.5.3
- quantreg==5.34
- data.table==1.10.4.3
@ -147,43 +146,43 @@ Please follow the instructions below to deploy the Linux DSVM.
**Quality:**
Note there is no randomness in this baseline model, so the model quality is the same for all five runs.
* Pinball loss run 1: 84.11
* Pinball loss run 1: 84.12
* Pinball loss run 2: 84.11
* Pinball loss run 2: 84.12
* Pinball loss run 3: 84.11
* Pinball loss run 3: 84.12
* Pinball loss run 4: 84.11
* Pinball loss run 4: 84.12
* Pinball loss run 5: 84.11
* Pinball loss run 5: 84.12
* Median Pinball loss: 84.11
* Median Pinball loss: 84.12
**Time:**
* Run time 1: 425 seconds
* Run time 1: 188 seconds
* Run time 2: 462 seconds
* Run time 2: 185 seconds
* Run time 3: 441 seconds
* Run time 3: 185 seconds
* Run time 4: 458 seconds
* Run time 4: 189 seconds
* Run time 5: 444 seconds
* Run time 5: 189 seconds
* Median run time: **444 seconds**
* Median run time: **188 seconds**
**Cost:**
The hourly cost of the Standard D8s Ubuntu Linux VM in East US Azure region is 0.3840 USD, based on the price at the submission date.
Thus, the total cost is 444/3600 * 0.3840 = $0.0474.
Thus, the total cost is 188/3600 * 0.3840 = $0.0201.
**Average relative improvement (in %) over GEFCom2017 benchmark model** (measured over the first run)
Round 1: -6.67
Round 2: 20.25
Round 3: 20.04
Round 2: 20.26
Round 3: 20.05
Round 4: -5.61
Round 5: -6.45
Round 6: 11.22
Round 6: 11.21
**Ranking in the qualifying round of GEFCom2017 competition**
10


@ -0,0 +1,67 @@
"""
This script uses
tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/feature_engineering.py to
compute a list of features needed by the Quantile Regression model.
"""
import os
import sys
import getopt
import localpath
from tsperf.benchmarking.GEFCom2017_D_Prob_MT_hourly.feature_engineering import compute_features
SUBMISSIONS_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
DATA_DIR = os.path.join(SUBMISSIONS_DIR, "data")
print("Data directory used: {}".format(DATA_DIR))
OUTPUT_DIR = os.path.join(DATA_DIR, "features")
TRAIN_DATA_DIR = os.path.join(DATA_DIR, "train")
TEST_DATA_DIR = os.path.join(DATA_DIR, "test")
DF_CONFIG = {
"time_col_name": "Datetime",
"ts_id_col_names": "Zone",
"target_col_name": "DEMAND",
"frequency": "H",
"time_format": "%Y-%m-%d %H:%M:%S",
}
# Feature configuration list used to specify the features to be computed by
# compute_features.
# Each feature configuration is a tuple in the format of (feature_name,
# featurizer_args)
# feature_name is used to determine the featurizer to use, see FEATURE_MAP in
# energy_load/GEFCom2017_D_Prob_MT_hourly/common/feature_engineering.py
# featurizer_args is a dictionary of arguments passed to the
# featurizer
feature_config_list = [
("temporal", {"feature_list": ["hour_of_day", "month_of_year"]}),
("annual_fourier", {"n_harmonics": 3}),
("weekly_fourier", {"n_harmonics": 3}),
("previous_year_load_lag", {"input_col_names": "DEMAND", "round_agg_result": True},),
("previous_year_temp_lag", {"input_col_names": "DryBulb", "round_agg_result": True},),
]
if __name__ == "__main__":
opts, args = getopt.getopt(sys.argv[1:], "", ["submission="])
for opt, arg in opts:
if opt == "--submission":
submission_folder = arg
output_data_dir = os.path.join(SUBMISSIONS_DIR, submission_folder, "data")
if not os.path.isdir(output_data_dir):
os.mkdir(output_data_dir)
OUTPUT_DIR = os.path.join(output_data_dir, "features")
if not os.path.isdir(OUTPUT_DIR):
os.mkdir(OUTPUT_DIR)
compute_features(
TRAIN_DATA_DIR,
TEST_DATA_DIR,
OUTPUT_DIR,
DF_CONFIG,
feature_config_list,
filter_by_month=True,
compute_load_ratio=True,
)


@ -6,4 +6,5 @@ dependencies:
- numpy=1.15.1
- pandas=0.23.4
- xlrd=1.1.0
- urllib3=1.21.1
- urllib3=1.21.1
- scikit-learn=0.20.3


@ -0,0 +1,7 @@
pkgs <- c(
'data.table',
'quantreg',
'doParallel'
)
install.packages(pkgs)


@ -5,10 +5,9 @@ localpath.py file.
"""
import os, sys
_CUR_DIR = os.path.dirname(os.path.abspath(__file__))
_SUBMISSIONS_DIR = os.path.dirname(_CUR_DIR)
_BENCHMARK_DIR = os.path.dirname(_SUBMISSIONS_DIR)
TSPERF_DIR = os.path.dirname(os.path.dirname(_BENCHMARK_DIR))
_CURR_DIR = os.path.dirname(os.path.abspath(__file__))
TSPERF_DIR = os.path.dirname(os.path.dirname(os.path.dirname(_CURR_DIR)))
if TSPERF_DIR not in sys.path:
sys.path.insert(0, TSPERF_DIR)


@ -0,0 +1,87 @@
args = commandArgs(trailingOnly=TRUE)
seed_value = args[1]

library('data.table')
library('quantreg')
library('doParallel')

# Set up a parallel backend with one worker per core
n_cores = detectCores()
cl <- parallel::makeCluster(n_cores)
parallel::clusterEvalQ(cl, lapply(c("quantreg", "data.table"), library, character.only = TRUE))
registerDoParallel(cl)

data_dir = 'benchmarks/GEFCom2017_D_Prob_MT_hourly/baseline/data/features'
train_dir = file.path(data_dir, 'train')
test_dir = file.path(data_dir, 'test')
train_file_prefix = 'train_round_'
test_file_prefix = 'test_round_'
output_file = file.path(paste('benchmarks/GEFCom2017_D_Prob_MT_hourly/baseline/submission_seed_', seed_value, '.csv', sep=""))

normalize_columns = list('DEMAND_same_woy_lag', 'DryBulb_same_doy_lag')
quantiles = seq(0.1, 0.9, by = 0.1)
result_all = list()

for (iR in 1:6){
    print(paste('Round', iR))
    train_file = file.path(train_dir, paste(train_file_prefix, iR, '.csv', sep=''))
    test_file = file.path(test_dir, paste(test_file_prefix, iR, '.csv', sep=''))
    train_df = fread(train_file)
    test_df = fread(test_file)
    # Min-max normalize the lag columns, using the training data range
    for (c in normalize_columns){
        min_c = min(train_df[, ..c])
        max_c = max(train_df[, ..c])
        train_df[, c] = (train_df[, ..c] - min_c)/(max_c - min_c)
        test_df[, c] = (test_df[, ..c] - min_c)/(max_c - min_c)
    }
    test_df$average_load_ratio = rowMeans(test_df[, c('recent_load_ratio_10', 'recent_load_ratio_11', 'recent_load_ratio_12',
                                                      'recent_load_ratio_13', 'recent_load_ratio_14', 'recent_load_ratio_15', 'recent_load_ratio_16')], na.rm=TRUE)
    test_df[, load_ratio:=mean(average_load_ratio), by=list(hour_of_day, month_of_year)]
    zones = unique(train_df[, Zone])
    hours = unique(train_df[, hour_of_day])
    all_zones_hours = expand.grid(zones, hours)
    colnames(all_zones_hours) = c('Zone', 'hour_of_day')
    # Fit one quantile regression per (zone, hour) combination, in parallel
    result_all_zones_hours = foreach(i = 1:nrow(all_zones_hours), .combine = rbind) %dopar% {
        z = all_zones_hours[i, 'Zone']
        h = all_zones_hours[i, 'hour_of_day']
        train_df_sub = train_df[Zone == z & hour_of_day == h]
        test_df_sub = test_df[Zone == z & hour_of_day == h]
        result_all_quantiles = list()
        q_counter = 1
        for (tau in quantiles){
            result = data.table(Zone=test_df_sub$Zone, Datetime = test_df_sub$Datetime, Round=iR)
            model = rq(DEMAND ~ DEMAND_same_woy_lag + DryBulb_same_doy_lag +
                           annual_sin_1 + annual_cos_1 + annual_sin_2 + annual_cos_2 + annual_sin_3 + annual_cos_3 +
                           weekly_sin_1 + weekly_cos_1 + weekly_sin_2 + weekly_cos_2 + weekly_sin_3 + weekly_cos_3,
                       data=train_df_sub, tau = tau)
            result$Prediction = predict(model, test_df_sub) * test_df_sub$load_ratio
            result$q = tau
            result_all_quantiles[[q_counter]] = result
            q_counter = q_counter + 1
        }
        rbindlist(result_all_quantiles)
    }
    result_all[[iR]] = result_all_zones_hours
}

result_final = rbindlist(result_all)
# Sort predictions within each (Zone, Datetime, Round) group and reassign the
# quantile labels in increasing order, so quantiles never cross
result_final = result_final[order(Prediction), q:=quantiles, by=c('Zone', 'Datetime', 'Round')]
result_final$Prediction = round(result_final$Prediction)
fwrite(result_final, output_file)


@ -1,15 +1,15 @@
#!/bin/bash
path=energy_load/GEFCom2017_D_Prob_MT_hourly
path=benchmarks/GEFCom2017_D_Prob_MT_hourly
for i in `seq 1 5`;
do
echo "Run $i"
start=`date +%s`
echo 'Creating features...'
python $path/submissions/baseline/feature_engineering.py --submission baseline
python $path/baseline/compute_features.py --submission baseline
echo 'Training and predicting...'
Rscript $path/submissions/baseline/train_predict.R $i
Rscript $path/baseline/train_predict.R $i
end=`date +%s`
echo 'Running time '$((end-start))' seconds'
done
done


@ -423,4 +423,4 @@
},
"nbformat": 4,
"nbformat_minor": 1
}
}


@ -1,21 +1,19 @@
## Download base image
FROM continuumio/anaconda3:4.4.0
ADD TSPerf/energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/baseline/conda_dependencies.yml /tmp/conda_dependencies.yml
ADD TSPerf/energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/baseline/install_R_dependencies.R /tmp/install_R_dependencies.R
FROM rocker/r-base
ADD ./conda_dependencies.yml /tmp
ADD ./install_R_dependencies.R /tmp
WORKDIR /tmp
## Install basic packages
RUN apt-get update && apt-get install -y --no-install-recommends \
RUN apt-get update
RUN apt-get install -y --no-install-recommends \
wget \
zlib1g-dev \
libssl-dev \
libssh2-1-dev \
libcurl4-openssl-dev \
libreadline-gplv2-dev \
libncursesw5-dev \
libsqlite3-dev \
tk-dev \
libgdbm-dev \
libc6-dev \
libbz2-dev \
libffi-dev \
@ -23,34 +21,28 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
checkinstall \
ca-certificates \
curl \
lsb-release \
apt-utils \
python3-pip \
vim
## Create and activate conda environment
RUN conda update -y conda
# Install miniconda
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
RUN bash ~/miniconda.sh -b -p $HOME/miniconda
ENV PATH="/root/miniconda/bin:${PATH}"
## Create conda environment
RUN conda update -y conda
RUN conda env create --file conda_dependencies.yml
## Install R
ENV R_BASE_VERSION 3.5.1
RUN apt-get install -y aptitude
RUN echo "deb http://http.debian.net/debian sid main" > /etc/apt/sources.list.d/debian-unstable.list \
&& aptitude install -y debian-keyring debian-archive-keyring
RUN apt-get remove -y binutils
RUN apt-get update \
&& apt-get install -t unstable -y --no-install-recommends \
r-base=${R_BASE_VERSION}-*
# Install prerequisites of R packages
RUN apt-get install -y \
gfortran \
liblapack-dev \
liblapack3 \
libopenblas-base \
libopenblas-dev
libopenblas-dev \
g++
## Mount R dependency file into the docker container and install dependencies
# Use a MRAN snapshot URL to download packages archived on a specific date
RUN echo 'options(repos = list(CRAN = "http://mran.revolutionanalytics.com/snapshot/2018-09-01/"))' >> /etc/R/Rprofile.site
@ -59,7 +51,7 @@ RUN Rscript install_R_dependencies.R
RUN rm install_R_dependencies.R
RUN rm conda_dependencies.yml
RUN mkdir /TSPerf
WORKDIR /TSPerf
RUN mkdir /Forecasting
WORKDIR /Forecasting
ENTRYPOINT ["/bin/bash"]
ENTRYPOINT ["/bin/bash"]


@ -12,7 +12,7 @@
**Submission name:** Quantile Regression Neural Network
**Submission path:** energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/fnn
**Submission path:** benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn
## Implementation description
@ -36,14 +36,14 @@ The data of January - April of 2016 were used as validation dataset for some min
### Description of implementation scripts
Train and Predict:
* `feature_engineering.py`: Python script for computing features and generating feature files.
* `compute_features.py`: Python script for computing features and generating feature files.
* `train_predict.R`: R script that trains Quantile Regression Neural Network models and predicts on each round of test data.
* `train_score_vm.sh`: Bash script that runs `feature_engineering.py` and `train_predict.R` five times to generate five submission files and measure model running time.
* `train_score_vm.sh`: Bash script that runs `compute_features.py` and `train_predict.R` five times to generate five submission files and measure model running time.
Tune hyperparameters using R:
* `cv_settings.json`: JSON script that sets cross validation folds.
* `train_validate.R`: R script that trains Quantile Regression Neural Network models, evaluates the loss on the validation data of each cross-validation round and forecast round for a given set of hyperparameters, and calculates the average loss. This script is used for grid search on the VM.
* `train_validate_vm.sh`: Bash script that runs `feature_engineering.py` and `train_validate.R` multiple times to generate cross validation result files and measure model tuning time.
* `train_validate_vm.sh`: Bash script that runs `compute_features.py` and `train_validate.R` multiple times to generate cross validation result files and measure model tuning time.
Tune hyperparameters using AzureML HyperDrive:
* `cv_settings.json`: JSON script that sets cross validation folds.
@ -53,10 +53,10 @@ Tune hyperparameters using AzureML HyperDrive:
### Steps to reproduce results
0. Follow the instructions [here](#resource-deployment-instructions) to provision a Linux virtual machine and log into the provisioned
1. Follow the instructions [here](#resource-deployment-instructions) to provision a Linux virtual machine and log into the provisioned
VM.
1. Clone the Forecasting repo to the home directory of your machine
2. Clone the Forecasting repo to the home directory of your machine
```bash
cd ~
@ -72,92 +72,91 @@ VM.
* [Git Credential Managers](https://docs.microsoft.com/en-us/vsts/repos/git/set-up-credential-managers?view=vsts)
* [Authenticate with SSH](https://docs.microsoft.com/en-us/vsts/repos/git/use-ssh-keys-to-authenticate?view=vsts)
2. Create a conda environment for running the scripts of data downloading, data preparation, and result evaluation.
3. Create a conda environment for running the scripts of data downloading, data preparation, and result evaluation.
To do this, you need to check if conda has been installed by running the command `conda -V`. If it is installed, you will see the conda version in the terminal. Otherwise, please follow the instructions [here](https://conda.io/docs/user-guide/install/linux.html) to install conda.
Then, you can go to `~/Forecasting` directory in the VM and create a conda environment named `tsperf` by running
```bash
cd ~/Forecasting
conda env create --file ./common/conda_dependencies.yml
conda env create --file tsperf/benchmarking/conda_dependencies.yml
```
3. Download and extract data **on the VM**.
4. Download and extract data **on the VM**.
```bash
source activate tsperf
python energy_load/GEFCom2017_D_Prob_MT_hourly/common/download_data.py
python energy_load/GEFCom2017_D_Prob_MT_hourly/common/extract_data.py
python tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/download_data.py
python tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/extract_data.py
```
4. Prepare Docker container for model training and predicting.
> NOTE: To execute docker commands without
sudo as a non-root user, you need to create a
Unix group and add users to it by following the instructions
5. Prepare Docker container for model training and predicting.
> NOTE: To execute docker commands without sudo as a non-root user, you need to create a Unix group and add users to it by following the instructions
[here](https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user). Otherwise, simply prefix all docker commands with sudo.
4.1 Log into Azure Container Registry (ACR)
5.1 Make sure Docker is installed
You can check if Docker is installed on your VM by running
```bash
sudo docker login --username tsperf --password <ACR Access Key> tsperf.azurecr.io
sudo docker -v
```
You will see the Docker version if Docker is installed. If not, you can install it by following the instructions [here](https://docs.docker.com/install/linux/docker-ce/ubuntu/).
5.2 Build a local Docker image
```bash
sudo docker build -t fnn_image benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn
```
The `<ACR Access Key>` can be found [here](https://github.com/Microsoft/Forecasting/blob/master/common/key.txt).
6. Tune Hyperparameters **within Docker container** or **with AzureML hyperdrive**.
4.2 Pull the Docker image from ACR to your VM
6.1.1 Start a Docker container from the image
```bash
sudo docker pull tsperf.azurecr.io/energy_load/gefcom2017_d_prob_mt_hourly/fnn_image:v1
```
5. Tune Hyperparameters **within Docker container** or **with AzureML hyperdrive**.
5.1.1 Start a Docker container from the image
```bash
sudo docker run -it -v ~/Forecasting:/Forecasting --name fnn_cv_container tsperf.azurecr.io/energy_load/gefcom2017_d_prob_mt_hourly/fnn_image:v1
sudo docker run -it -v ~/Forecasting:/Forecasting --name fnn_cv_container fnn_image
```
Note that option `-v ~/Forecasting:/Forecasting` mounts the `~/Forecasting` folder (the one you cloned) to the container so that you can access the code and data on your VM within the container.
6.1.2 Train and validate
```
source activate tsperf
cd /Forecasting
nohup bash benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn/train_validate_vm.sh >& cv_out.txt &
```
After generating the cross validation results, you can exit the Docker container with the command `exit`.
6.2 Do hyperparameter tuning with AzureML hyperdrive
To tune hyperparameters with AzureML hyperdrive, you don't need to create a local Docker container. You can do the feature engineering on the VM with the following command
```
cd ~/Forecasting
source activate tsperf
python benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn/compute_features.py
```
and then run through the Jupyter notebook `hyperparameter_tuning.ipynb` on the VM, using the conda environment `tsperf` as the Jupyter kernel.
Based on the average pinball loss obtained for each set of hyperparameters, choose the best set and use it in the R script `train_predict.R`.
7. Train and predict **within Docker container**.
7.1 Start a Docker container from the image
```bash
sudo docker run -it -v ~/Forecasting:/Forecasting --name fnn_container fnn_image
```
Note that option `-v ~/Forecasting:/Forecasting` mounts the `~/Forecasting` folder (the one you cloned) to the container so that you can access the code and data on your VM within the container.
7.2 Train and predict
```
source activate tsperf
cd /Forecasting
nohup bash benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn/train_score_vm.sh >& out.txt &
```
The last command will take about 7 hours to complete. You can monitor its progress by checking the out.txt file. You can also disconnect from the VM during the run; after reconnecting, use the command
@@ -168,12 +167,12 @@ Then, you can go to `~/Forecasting` directory in the VM and create a conda envir
to connect to the running container and check the status of the run.
After generating the forecast results, you can exit the Docker container with the command `exit`.
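While a long run like this is in progress, a quick way to keep an eye on it from the VM is, for example:
```bash
tail -f out.txt   # follow the training log written by nohup
sudo docker ps    # verify that fnn_container is still running
```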
8. Model evaluation **on the VM**.
```bash
source activate tsperf
cd ~/Forecasting
bash tsperf/benchmarking/evaluate fnn tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly
```
## Implementation resources
@@ -182,13 +181,13 @@ Then, you can go to `~/Forecasting` directory in the VM and create a conda envir
**Resource location:** East US region
**Hardware:** Standard D8s v3 (8 vcpus, 32 GB memory) Ubuntu Linux VM
**Data storage:** Premium SSD
**Key packages/dependencies:**
* Python
- python==3.7
* R
- r-base==3.5.3
- qrnn==2.0.2
- data.table==1.10.4.3
- rjson==0.2.20 (optional for cv)
@@ -202,43 +201,43 @@ Please follow the instructions below to deploy the Linux DSVM.
## Implementation evaluation
**Quality:**
* Pinball loss run 1: 79.54
* Pinball loss run 2: 78.32
* Pinball loss run 3: 80.06
* Pinball loss run 4: 80.12
* Pinball loss run 5: 80.13
* Median Pinball loss: 80.06
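For reference, the pinball loss at quantile level $\tau$ for an actual value $y$ and forecast $\hat{y}$ is the standard quantile loss; the figures above are its average over all zones, timestamps, and the nine quantiles $\tau \in \{0.1, \dots, 0.9\}$:
$$
L_\tau(y, \hat{y}) =
\begin{cases}
\tau \, (y - \hat{y}), & y \ge \hat{y} \\
(1 - \tau) \, (\hat{y} - y), & y < \hat{y}
\end{cases}
$$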
**Time:**
* Run time 1: 1092 seconds
* Run time 2: 1085 seconds
* Run time 3: 1062 seconds
* Run time 4: 1083 seconds
* Run time 5: 1110 seconds
* Median run time: 1085 seconds
**Cost:**
The hourly cost of the Standard D8s Ubuntu Linux VM in East US Azure region is 0.3840 USD, based on the price at the submission date.
Thus, the total cost is 1085/3600 * 0.3840 = $0.1157.
**Average relative improvement (in %) over GEFCom2017 benchmark model** (measured over the first run)
Round 1: 6.13
Round 2: 19.20
Round 3: 18.86
Round 4: 3.84
Round 5: 2.76
Round 6: 11.10
**Ranking in the qualifying round of GEFCom2017 competition**
4

@@ -0,0 +1,70 @@
"""
This script passes the input arguments of AzureML job to the R script train_validate_aml.R,
and then passes the output of train_validate_aml.R back to AzureML.
"""
import subprocess
import os
import sys
import getopt
import pandas as pd
from datetime import datetime
from azureml.core import Run
import time
start_time = time.time()
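# Handle to the submitted AzureML run, used at the end to log the average pinball loss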
run = Run.get_submitted_run()
base_command = "Rscript train_validate_aml.R"
if __name__ == "__main__":
opts, args = getopt.getopt(
sys.argv[1:], "", ["path=", "cv_path=", "n_hidden_1=", "n_hidden_2=", "iter_max=", "penalty="]
)
for opt, arg in opts:
if opt == "--path":
path = arg
elif opt == "--cv_path":
cv_path = arg
elif opt == "--n_hidden_1":
n_hidden_1 = arg
elif opt == "--n_hidden_2":
n_hidden_2 = arg
elif opt == "--iter_max":
iter_max = arg
elif opt == "--penalty":
penalty = arg
time_stamp = datetime.now().strftime("%Y%m%d%H%M%S")
task = " ".join(
[
base_command,
"--path",
path,
"--cv_path",
cv_path,
"--n_hidden_1",
n_hidden_1,
"--n_hidden_2",
n_hidden_2,
"--iter_max",
iter_max,
"--penalty",
penalty,
"--time_stamp",
time_stamp,
]
)
# subprocess.call blocks until the R script finishes and returns its exit code
exit_code = subprocess.call(task, shell=True)
output_file_name = "cv_output_" + time_stamp + ".csv"
result = pd.read_csv(os.path.join(cv_path, output_file_name))
APL = result["loss"].mean()
print(APL)
print("--- %s seconds ---" % (time.time() - start_time))
run.log("average pinball loss", APL)
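For local debugging outside AzureML, the wrapper can be invoked directly with the same flags it forwards to the R script; note that `Run.get_submitted_run()` assumes the script runs inside an AzureML job, and the file name and argument values below are placeholders:
```bash
python train_validate_aml.py --path <data_dir> --cv_path <cv_output_dir> \
    --n_hidden_1 8 --n_hidden_2 4 --iter_max 2 --penalty 0
```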

@@ -0,0 +1,68 @@
"""
This script uses
tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/feature_engineering.py to
compute a list of features needed by the Feed-forward Neural Network model.
"""
import os
import sys
import getopt
import localpath
from tsperf.benchmarking.GEFCom2017_D_Prob_MT_hourly.feature_engineering import compute_features
SUBMISSIONS_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
DATA_DIR = os.path.join(SUBMISSIONS_DIR, "data")
print("Data directory used: {}".format(DATA_DIR))
OUTPUT_DIR = os.path.join(DATA_DIR, "features")
TRAIN_DATA_DIR = os.path.join(DATA_DIR, "train")
TEST_DATA_DIR = os.path.join(DATA_DIR, "test")
DF_CONFIG = {
"time_col_name": "Datetime",
"ts_id_col_names": "Zone",
"target_col_name": "DEMAND",
"frequency": "H",
"time_format": "%Y-%m-%d %H:%M:%S",
}
# Feature configuration list used to specify the features to be computed by
# compute_features.
# Each feature configuration is a tuple in the format of (feature_name,
# featurizer_args).
# feature_name determines the featurizer to use; see FEATURE_MAP in
# tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/feature_engineering.py
# (a minimal dispatch sketch follows the list below).
# featurizer_args is a dictionary of arguments passed to the featurizer.
feature_config_list = [
("temporal", {"feature_list": ["hour_of_day", "month_of_year"]}),
("annual_fourier", {"n_harmonics": 3}),
("weekly_fourier", {"n_harmonics": 3}),
("previous_year_load_lag", {"input_col_names": "DEMAND", "round_agg_result": True},),
("previous_year_temp_lag", {"input_col_names": "DryBulb", "round_agg_result": True},),
]
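For illustration, the dispatch described in the comment above could look like the minimal sketch below; the class and map here are hypothetical stand-ins for the real featurizers and FEATURE_MAP defined in `tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/feature_engineering.py`:
```python
# Hypothetical sketch of featurizer dispatch, not the actual implementation.
class AnnualFourierFeaturizer:
    def __init__(self, n_harmonics):
        self.n_harmonics = n_harmonics

    def transform(self, df):
        # (illustrative) add annual_sin_*/annual_cos_* columns to df here
        return df


FEATURE_MAP = {"annual_fourier": AnnualFourierFeaturizer}


def apply_features(df, feature_config_list):
    # Look up each featurizer by name and apply it with its arguments.
    for feature_name, featurizer_args in feature_config_list:
        featurizer = FEATURE_MAP[feature_name](**featurizer_args)
        df = featurizer.transform(df)
    return df
```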
if __name__ == "__main__":
opts, args = getopt.getopt(sys.argv[1:], "", ["submission="])
for opt, arg in opts:
if opt == "--submission":
submission_folder = arg
output_data_dir = os.path.join(SUBMISSIONS_DIR, submission_folder, "data")
if not os.path.isdir(output_data_dir):
os.mkdir(output_data_dir)
OUTPUT_DIR = os.path.join(output_data_dir, "features")
if not os.path.isdir(OUTPUT_DIR):
os.mkdir(OUTPUT_DIR)
compute_features(
TRAIN_DATA_DIR,
TEST_DATA_DIR,
OUTPUT_DIR,
DF_CONFIG,
feature_config_list,
filter_by_month=True,
compute_load_ratio=True,
)
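When invoked as a script, the `--submission` flag selects the submission folder whose `data/features` directory receives the output, e.g.
```bash
python benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn/compute_features.py --submission fnn
```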

@@ -6,4 +6,5 @@ dependencies:
- numpy=1.15.1
- pandas=0.23.4
- xlrd=1.1.0
- urllib3=1.21.1
- scikit-learn=0.20.3

@@ -1243,9 +1243,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:tsperf]",
"display_name": "drlnd",
"language": "python",
"name": "conda-env-tsperf-py"
"name": "drlnd"
},
"language_info": {
"codemirror_mode": {

@@ -0,0 +1,7 @@
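# Install the R packages required by the fnn benchmark scripts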
pkgs <- c(
'data.table',
'qrnn',
'doParallel'
)
install.packages(pkgs)

@@ -4,10 +4,9 @@ all the modules in TSPerf. Each submission folder needs its own localpath.py fil
"""
import os, sys
_CURR_DIR = os.path.dirname(os.path.abspath(__file__))
TSPERF_DIR = os.path.dirname(os.path.dirname(os.path.dirname(_CURR_DIR)))
if TSPERF_DIR not in sys.path:
sys.path.insert(0, TSPERF_DIR)

@@ -0,0 +1,107 @@
#!/usr/bin/Rscript
#
# This script trains the Quantile Regression Neural Network model and predicts on each data
# partition per zone and hour at each quantile point.
args = commandArgs(trailingOnly=TRUE)
seed_value = args[1]
library('data.table')
library('qrnn')
library('doParallel')
n_cores = detectCores()
cl <- parallel::makeCluster(n_cores)
parallel::clusterEvalQ(cl, lapply(c("qrnn", "data.table"), library, character.only = TRUE))
registerDoParallel(cl)
# Specify data directory
data_dir = 'benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn/data/features'
train_dir = file.path(data_dir, 'train')
test_dir = file.path(data_dir, 'test')
train_file_prefix = 'train_round_'
test_file_prefix = 'test_round_'
output_file = file.path(paste('benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn/submission_seed_', seed_value, '.csv', sep=""))
# Data and forecast parameters
normalize_columns = list('DEMAND_same_woy_lag', 'DryBulb_same_doy_lag')
quantiles = seq(0.1, 0.9, by = 0.1)
# Train and predict
result_all = list()
for (iR in 1:6){
print(paste('Round', iR))
train_file = file.path(train_dir, paste(train_file_prefix, iR, '.csv', sep=''))
test_file = file.path(test_dir, paste(test_file_prefix, iR, '.csv', sep=''))
train_df = fread(train_file)
test_df = fread(test_file)
for (c in normalize_columns){
min_c = min(train_df[, ..c])
max_c = max(train_df[, ..c])
train_df[, c] = (train_df[, ..c] - min_c)/(max_c - min_c)
test_df[, c] = (test_df[, ..c] - min_c)/(max_c - min_c)
}
zones = unique(train_df[, Zone])
hours = unique(train_df[, hour_of_day])
all_zones_hours = expand.grid(zones, hours)
colnames(all_zones_hours) = c('Zone', 'hour_of_day')
test_df$average_load_ratio = rowMeans(test_df[,c('recent_load_ratio_10', 'recent_load_ratio_11', 'recent_load_ratio_12',
'recent_load_ratio_13', 'recent_load_ratio_14', 'recent_load_ratio_15', 'recent_load_ratio_16')], na.rm=TRUE)
test_df[, load_ratio:=mean(average_load_ratio), by=list(hour_of_day, month_of_year)]
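# Fit one QRNN model per (Zone, hour_of_day) pair, in parallel across the cluster workers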
result_all_zones_hours = foreach(i = 1:nrow(all_zones_hours), .combine = rbind) %dopar%{
set.seed(seed_value)
z = all_zones_hours[i, 'Zone']
h = all_zones_hours[i, 'hour_of_day']
train_df_sub = train_df[Zone == z & hour_of_day == h]
test_df_sub = test_df[Zone == z & hour_of_day == h]
train_x <- as.matrix(train_df_sub[, c('DEMAND_same_woy_lag', 'DryBulb_same_doy_lag',
'annual_sin_1', 'annual_cos_1', 'annual_sin_2', 'annual_cos_2', 'annual_sin_3', 'annual_cos_3',
'weekly_sin_1', 'weekly_cos_1', 'weekly_sin_2', 'weekly_cos_2', 'weekly_sin_3', 'weekly_cos_3'),
drop=FALSE])
train_y <- as.matrix(train_df_sub[, c('DEMAND'), drop=FALSE])
test_x <- as.matrix(test_df_sub[, c('DEMAND_same_woy_lag', 'DryBulb_same_doy_lag',
'annual_sin_1', 'annual_cos_1', 'annual_sin_2', 'annual_cos_2', 'annual_sin_3', 'annual_cos_3',
'weekly_sin_1', 'weekly_cos_1', 'weekly_sin_2', 'weekly_cos_2', 'weekly_sin_3', 'weekly_cos_3'),
drop=FALSE])
result_all_quantiles = list()
q_counter = 1
for (tau in quantiles){
result = data.table(Zone=test_df_sub$Zone, Datetime = test_df_sub$Datetime, Round=iR)
model = qrnn2.fit(x=train_x, y=train_y,
n.hidden=8, n.hidden2=4,
tau=tau, Th=tanh,
iter.max=1,
penalty=0)
result$Prediction = qrnn2.predict(model, x=test_x) * test_df_sub$load_ratio
result$q = tau
result_all_quantiles[[q_counter]] = result
q_counter = q_counter + 1
}
rbindlist(result_all_quantiles)
}
result_all[[iR]] = result_all_zones_hours
}
result_final = rbindlist(result_all)
# Sort predictions within each (Zone, Datetime, Round) group and reassign the quantile labels so they are monotonically non-decreasing
result_final = result_final[order(Prediction), q:=quantiles, by=c('Zone', 'Datetime', 'Round')]
result_final$Prediction = round(result_final$Prediction)
fwrite(result_final, output_file)
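The script takes the random seed as its only argument and writes `submission_seed_<seed>.csv`; from the repository root, for example:
```bash
Rscript benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn/train_predict.R 1
```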

@@ -1,14 +1,14 @@
#!/bin/bash
path=benchmarks/GEFCom2017_D_Prob_MT_hourly
for i in `seq 1 5`;
do
echo "Run $i"
start=`date +%s`
echo 'Creating features...'
python $path/fnn/compute_features.py --submission fnn
echo 'Training and predicting...'
Rscript $path/fnn/train_predict.R $i
end=`date +%s`
echo 'Running time '$((end-start))' seconds'

@@ -20,7 +20,7 @@ parallel::clusterEvalQ(cl, lapply(c("qrnn", "data.table"), library, character.on
registerDoParallel(cl)
# Specify data directory
data_dir = 'benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn/data/features'
train_dir = file.path(data_dir, 'train')
train_file_prefix = 'train_round_'
@@ -45,10 +45,10 @@ for (j in 1:length(parameter_names)){
output_file_name = paste(output_file_name, parameter_names[j], parameter_values[j], sep="_")
}
output_file = file.path(paste('benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn/', output_file_name, sep=""))
# Define cross validation split settings
cv_file = file.path(paste('benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn/', 'cv_settings.json', sep=""))
cv_settings = fromJSON(file=cv_file)
# Parameters of model
@@ -58,13 +58,13 @@ iter.max = as.integer(param_grid[parameter_set, 'iter.max'])
penalty = as.integer(param_grid[parameter_set, 'penalty'])
# Data and forecast parameters
features = c('DEMAND_same_woy_lag', 'DryBulb_same_doy_lag',
'annual_sin_1', 'annual_cos_1', 'annual_sin_2',
'annual_cos_2', 'annual_sin_3', 'annual_cos_3',
'weekly_sin_1', 'weekly_cos_1', 'weekly_sin_2',
'weekly_cos_2', 'weekly_sin_3', 'weekly_cos_3')
normalize_columns = list('DEMAND_same_woy_lag', 'DryBulb_same_doy_lag')
quantiles = seq(0.1, 0.9, by = 0.1)
subset_columns_train = c(features, 'DEMAND')
subset_columns_validation = c(features, 'DEMAND', 'Zone', 'Datetime', 'LoadRatio')
@@ -97,7 +97,7 @@ for (i in 1:length(cv_settings)){
validation_data = cvdata_df[Datetime >= validation_range[1] & Datetime <= validation_range[2]]
zones = unique(validation_data$Zone)
hours = unique(validation_data$hour_of_day)
for (c in normalize_columns){
min_c = min(train_data[, ..c])
@@ -106,9 +106,9 @@ for (i in 1:length(cv_settings)){
validation_data[, c] = (validation_data[, ..c] - min_c)/(max_c - min_c)
}
validation_data$AverageLoadRatio = rowMeans(validation_data[,c('recent_load_ratio_10', 'recent_load_ratio_11', 'recent_load_ratio_12',
'recent_load_ratio_13', 'recent_load_ratio_14', 'recent_load_ratio_15', 'recent_load_ratio_16')], na.rm=TRUE)
validation_data[, LoadRatio:=mean(AverageLoadRatio), by=list(hour_of_day, month_of_year)]
result_all_zones = foreach(z = zones, .combine = rbind) %dopar% {
print(paste('Zone', z))
@@ -117,8 +117,8 @@ for (i in 1:length(cv_settings)){
hour_counter = 1
for (h in hours){
train_df_sub = train_data[Zone == z & Hour == h, ..subset_columns_train]
validation_df_sub = validation_data[Zone == z & Hour == h, ..subset_columns_validation]
train_df_sub = train_data[Zone == z & hour_of_day == h, ..subset_columns_train]
validation_df_sub = validation_data[Zone == z & hour_of_day == h, ..subset_columns_validation]
result = data.table(Zone=validation_df_sub$Zone, Datetime=validation_df_sub$Datetime, Round=iR, CVRound=i)
@@ -165,7 +165,7 @@ print(paste('Average Pinball Loss:', average_PL))
output_file_name = paste(output_file_name, 'APL', average_PL, sep="_")
output_file_name = paste(output_file_name, '.csv', sep="")
output_file = file.path(paste('benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn/', output_file_name, sep=""))
fwrite(result_final, output_file)

@@ -59,7 +59,7 @@ cv_settings = fromJSON(file=cv_file)
# Data and forecast parameters
normalize_columns = list('DEMAND_same_woy_lag', 'DryBulb_same_doy_lag')
quantiles = seq(0.1, 0.9, by = 0.1)
@@ -101,26 +101,26 @@ for (i in 1:length(cv_settings)){
validation_data[, c] = (validation_data[, ..c] - min_c)/(max_c - min_c)
}
validation_data$average_load_ratio = rowMeans(validation_data[, c('recent_load_ratio_10', 'recent_load_ratio_11', 'recent_load_ratio_12',
'recent_load_ratio_13', 'recent_load_ratio_14', 'recent_load_ratio_15', 'recent_load_ratio_16')], na.rm=TRUE)
validation_data[, load_ratio:=mean(average_load_ratio), by=list(hour_of_day, month_of_year)]
result_all_zones = foreach(z = zones, .combine = rbind) %dopar% {
print(paste('Zone', z))
features = c('DEMAND_same_woy_lag', 'DryBulb_same_doy_lag',
'annual_sin_1', 'annual_cos_1', 'annual_sin_2',
'annual_cos_2', 'annual_sin_3', 'annual_cos_3',
'weekly_sin_1', 'weekly_cos_1', 'weekly_sin_2',
'weekly_cos_2', 'weekly_sin_3', 'weekly_cos_3')
subset_columns_train = c(features, 'DEMAND')
subset_columns_validation = c(features, 'DEMAND', 'Zone', 'Datetime', 'load_ratio')
result_all_hours = list()
hour_counter = 1
for (h in hours){
train_df_sub = train_data[Zone == z & Hour == h, ..subset_columns_train]
validation_df_sub = validation_data[Zone == z & Hour == h, ..subset_columns_validation]
train_df_sub = train_data[Zone == z & hour_of_day == h, ..subset_columns_train]
validation_df_sub = validation_data[Zone == z & hour_of_day == h, ..subset_columns_validation]
result = data.table(Zone=validation_df_sub$Zone, Datetime=validation_df_sub$Datetime, Round=iR, CVRound=i)
@@ -140,7 +140,7 @@ for (i in 1:length(cv_settings)){
iter.max=iter.max,
penalty=penalty)
result$Prediction = qrnn2.predict(model, x=validation_x) * validation_df_sub$load_ratio
result$DEMAND = validation_df_sub$DEMAND
result$loss = pinball_loss(tau, validation_df_sub$DEMAND, result$Prediction)
result$q = tau

@@ -1,14 +1,14 @@
#!/bin/bash
path=benchmarks/GEFCom2017_D_Prob_MT_hourly
for i in `seq 1 40`;
do
echo "Parameter Set $i"
start=`date +%s`
echo 'Creating features...'
python $path/fnn/compute_features.py --submission fnn
echo 'Training and validation...'
Rscript $path/fnn/train_validate.R $i
end=`date +%s`
echo 'Running time '$((end-start))' seconds'

@@ -1,5 +1,5 @@
## Download base image
FROM continuumio/anaconda3:5.3.0
ADD ./conda_dependencies.yml /tmp
WORKDIR /tmp
@@ -14,7 +14,6 @@ RUN apt-get install -y --no-install-recommends \
libreadline-gplv2-dev \
libncursesw5-dev \
libsqlite3-dev \
libgdbm-dev \
libc6-dev \
libbz2-dev \
@@ -35,7 +34,7 @@ RUN conda env create --file conda_dependencies.yml
RUN rm conda_dependencies.yml
RUN mkdir /Forecasting
WORKDIR /Forecasting
ENTRYPOINT ["/bin/bash"]

@@ -12,7 +12,7 @@
**Submission name:** Quantile Random Forest
**Submission path:** benchmarks/GEFCom2017_D_Prob_MT_hourly/qrf
## Implementation description
@@ -42,81 +42,78 @@ We used 2 validation time frames, the first one in January-April 2015, the secon
### Description of implementation scripts
* `compute_features.py`: Python script for computing features and generating feature files.
* `train_score.py`: Python script that trains Quantile Random Forest models and predicts on each round of test data.
* `train_score_vm.sh`: Bash script that runs `compute_features.py` and `train_score.py` five times to generate five submission files and measure model running time.
### Steps to reproduce results
1. Follow the instructions [here](#resource-deployment-instructions) to provision a Linux Data Science Virtual Machine and log into it.
2. Clone the Forecasting repo to the home directory of your machine
```bash
cd ~
git clone https://github.com/Microsoft/Forecasting.git
```
Use one of the following options to securely connect to the Git repo:
* [Personal Access Tokens](https://help.github.com/articles/creating-a-personal-access-token-for-the-command-line/)
For this method, the clone command becomes
```bash
git clone https://<username>:<personal access token>@github.com/Microsoft/Forecasting.git
```
* [Git Credential Managers](https://github.com/Microsoft/Git-Credential-Manager-for-Windows)
* [Authenticate with SSH](https://help.github.com/articles/connecting-to-github-with-ssh/)
3. Create a conda environment for running the scripts of data downloading, data preparation, and result evaluation.
To do this, first check whether conda is installed by running the command `conda -V`. If it is installed, you will see the conda version in the terminal. Otherwise, please follow the instructions [here](https://conda.io/docs/user-guide/install/linux.html) to install conda.
Then, go to the `~/Forecasting` directory on the VM and create a conda environment named `tsperf` by running
```bash
cd ~/Forecasting
conda env create --file tsperf/benchmarking/conda_dependencies.yml
```
4. Download and extract data **on the VM**.
```bash
source activate tsperf
python tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/download_data.py
python tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/extract_data.py
```
5. Prepare Docker container for model training and predicting.
5.1 Make sure Docker is installed
You can check if Docker is installed on your VM by running
```bash
sudo docker -v
```
You will see the Docker version if Docker is installed. If not, you can install it by following the instructions [here](https://docs.docker.com/install/linux/docker-ce/ubuntu/). Note that if you want to execute Docker commands without sudo as a non-root user, you need to create a Unix group and add users to it by following the instructions [here](https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user).
5.2 Build a local Docker image
```bash
sudo docker build -t qrf_image benchmarks/GEFCom2017_D_Prob_MT_hourly/qrf
```
6. Train and predict **within Docker container**
6.1 Start a Docker container from the image
```bash
sudo docker run -it -v ~/Forecasting:/Forecasting --name qrf_container qrf_image
```
Note that option `-v ~/Forecasting:/Forecasting` mounts the `~/Forecasting` folder (the one you cloned) to the container so that you can access the code and data on your VM within the container.
6.2 Train and predict
```
source activate tsperf
cd /Forecasting
nohup bash benchmarks/GEFCom2017_D_Prob_MT_hourly/qrf/train_score_vm.sh >& out.txt &
```
The last command will take about 31 hours to complete. You can monitor its progress by checking the out.txt file. You can also disconnect from the VM during the run; after reconnecting, use the command
@@ -127,12 +124,12 @@ Then, you can go to `~/Forecasting` directory in the VM and create a conda envir
to connect to the running container and check the status of the run.
After generating the forecast results, you can exit the Docker container with the command `exit`.
7. Model evaluation **on the VM**
```bash
source activate tsperf
cd ~/Forecasting
bash tsperf/benchmarking/evaluate qrf tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly
```
## Implementation resources
@@ -141,7 +138,7 @@ Then, you can go to `~/Forecasting` directory in the VM and create a conda envir
**Resource location:** East US region
**Hardware:** F72s v2 (72 vcpus, 144 GB memory) Ubuntu Linux VM
**Data storage:** Standard SSD
**Key packages/dependencies:**
* Python
@@ -162,43 +159,43 @@ Please follow the instructions below to deploy the Linux DSVM.
## Implementation evaluation
**Quality:**
* Pinball loss run 1: 76.29
* Pinball loss run 2: 76.29
* Pinball loss run 3: 76.18
* Pinball loss run 4: 76.23
* Pinball loss run 5: 76.38
* Median Pinball loss: 76.29
**Time:**
* Run time 1: 20119 seconds
* Run time 2: 20489 seconds
* Run time 3: 20616 seconds
* Run time 4: 20297 seconds
* Run time 5: 20322 seconds
* Median run time: 20322 seconds (5.65 hours)
**Cost:**
The hourly cost of the F72s v2 Ubuntu Linux VM in East US Azure region is 3.045 USD, based on the price at the submission date.
Thus, the total cost is 20322/3600 * 3.045 = 17.19 USD.
**Average relative improvement (in %) over GEFCom2017 benchmark model** (measured over the first run)
Round 1: 16.89
Round 2: 14.93
Round 3: 12.34
Round 4: 14.95
Round 5: 16.19
Round 6: -0.32
**Ranking in the qualifying round of GEFCom2017 competition**
3

@@ -0,0 +1,94 @@
"""
This script uses
tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/feature_engineering.py to
compute a list of features needed by the Quantile Random Forest model.
"""
import os
import sys
import getopt
import localpath
from tsperf.benchmarking.GEFCom2017_D_Prob_MT_hourly.feature_engineering import compute_features
SUBMISSIONS_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
DATA_DIR = os.path.join(SUBMISSIONS_DIR, "data")
print("Data directory used: {}".format(DATA_DIR))
OUTPUT_DIR = os.path.join(DATA_DIR, "features")
TRAIN_DATA_DIR = os.path.join(DATA_DIR, "train")
TEST_DATA_DIR = os.path.join(DATA_DIR, "test")
DF_CONFIG = {
"time_col_name": "Datetime",
"ts_id_col_names": "Zone",
"target_col_name": "DEMAND",
"frequency": "H",
"time_format": "%Y-%m-%d %H:%M:%S",
}
HOLIDAY_COLNAME = "Holiday"
# Feature configuration list used to specify the features to be computed by
# compute_features.
# Each feature configuration is a tuple in the format of (feature_name,
# featurizer_args).
# feature_name determines the featurizer to use; see FEATURE_MAP in
# tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/feature_engineering.py.
# featurizer_args is a dictionary of arguments passed to the featurizer.
feature_config_list = [
(
"temporal",
{
"feature_list": [
"hour_of_day",
"day_of_week",
"day_of_month",
"normalized_hour_of_year",
"week_of_year",
"month_of_year",
]
},
),
("annual_fourier", {"n_harmonics": 3}),
("weekly_fourier", {"n_harmonics": 3}),
("daily_fourier", {"n_harmonics": 2}),
("normalized_date", {}),
("normalized_datehour", {}),
("normalized_year", {}),
("day_type", {"holiday_col_name": HOLIDAY_COLNAME}),
("previous_year_load_lag", {"input_col_names": "DEMAND", "round_agg_result": True},),
("previous_year_temp_lag", {"input_col_names": ["DryBulb", "DewPnt"], "round_agg_result": True},),
(
"recent_load_lag",
{"input_col_names": "DEMAND", "start_week": 10, "window_size": 4, "agg_count": 8, "round_agg_result": True,},
),
(
"recent_temp_lag",
{
"input_col_names": ["DryBulb", "DewPnt"],
"start_week": 10,
"window_size": 4,
"agg_count": 8,
"round_agg_result": True,
},
),
]
if __name__ == "__main__":
opts, args = getopt.getopt(sys.argv[1:], "", ["submission="])
for opt, arg in opts:
if opt == "--submission":
submission_folder = arg
output_data_dir = os.path.join(SUBMISSIONS_DIR, submission_folder, "data")
if not os.path.isdir(output_data_dir):
os.mkdir(output_data_dir)
OUTPUT_DIR = os.path.join(output_data_dir, "features")
if not os.path.isdir(OUTPUT_DIR):
os.mkdir(OUTPUT_DIR)
compute_features(
TRAIN_DATA_DIR, TEST_DATA_DIR, OUTPUT_DIR, DF_CONFIG, feature_config_list, filter_by_month=False,
)

@@ -9,3 +9,4 @@ dependencies:
- urllib3=1.21.1
- scikit-garden=0.1.3
- joblib=0.12.5
- scikit-learn=0.20.3

@@ -13,6 +13,7 @@ from skgarden.quantile.tree import DecisionTreeQuantileRegressor
from skgarden.quantile.ensemble import generate_sample_indices
from ensemble_parallel_utils import weighted_percentile_vectorized
class BaseForestQuantileRegressor(ForestRegressor):
"""Training and scoring of Quantile Regression Random Forest
@@ -34,6 +35,7 @@ class BaseForestQuantileRegressor(ForestRegressor):
a weight of zero when estimator j is fit, then the value is -1.
"""
def fit(self, X, y):
"""Builds a forest from the training set (X, y).
@@ -68,8 +70,7 @@ class BaseForestQuantileRegressor(ForestRegressor):
Returns self.
"""
# apply method requires X to be of dtype np.float32
X, y = check_X_y(X, y, accept_sparse="csc", dtype=np.float32, multi_output=False)
super(BaseForestQuantileRegressor, self).fit(X, y)
self.y_train_ = y
@@ -78,8 +79,7 @@ class BaseForestQuantileRegressor(ForestRegressor):
for i, est in enumerate(self.estimators_):
if self.bootstrap:
bootstrap_indices = generate_sample_indices(est.random_state, len(y))
else:
bootstrap_indices = np.arange(len(y))
@@ -87,8 +87,7 @@ class BaseForestQuantileRegressor(ForestRegressor):
y_train_leaves = est.y_train_leaves_
for curr_leaf in np.unique(y_train_leaves):
y_ind = y_train_leaves == curr_leaf
self.y_weights_[i, y_ind] = est_weights[y_ind] / np.sum(est_weights[y_ind])
self.y_train_leaves_[i, bootstrap_indices] = y_train_leaves[bootstrap_indices]
@@ -167,21 +166,24 @@ class RandomForestQuantileRegressor(BaseForestQuantileRegressor):
oob_prediction_ : array of shape = [n_samples]
Prediction computed with out-of-bag estimate on the training set.
"""
def __init__(
self,
n_estimators=10,
criterion="mse",
max_depth=None,
min_samples_split=2,
min_samples_leaf=1,
min_weight_fraction_leaf=0.0,
max_features="auto",
max_leaf_nodes=None,
bootstrap=True,
oob_score=False,
n_jobs=1,
random_state=None,
verbose=0,
warm_start=False,
):
"""Initialize RandomForestQuantileRegressor class
Args:
@@ -271,16 +273,23 @@ class RandomForestQuantileRegressor(BaseForestQuantileRegressor):
super(RandomForestQuantileRegressor, self).__init__(
base_estimator=DecisionTreeQuantileRegressor(),
n_estimators=n_estimators,
estimator_params=("criterion", "max_depth", "min_samples_split",
"min_samples_leaf", "min_weight_fraction_leaf",
"max_features", "max_leaf_nodes",
"random_state"),
estimator_params=(
"criterion",
"max_depth",
"min_samples_split",
"min_samples_leaf",
"min_weight_fraction_leaf",
"max_features",
"max_leaf_nodes",
"random_state",
),
bootstrap=bootstrap,
oob_score=oob_score,
n_jobs=n_jobs,
random_state=random_state,
verbose=verbose,
warm_start=warm_start,
)
self.criterion = criterion
self.max_depth = max_depth
@@ -289,5 +298,3 @@ class RandomForestQuantileRegressor(BaseForestQuantileRegressor):
self.min_weight_fraction_leaf = min_weight_fraction_leaf
self.max_features = max_features
self.max_leaf_nodes = max_leaf_nodes

@@ -3,6 +3,7 @@
import numpy as np
def weighted_percentile_vectorized(a, quantiles, weights=None, sorter=None):
"""Returns the weighted percentile of a at q given weights.
@@ -69,8 +70,7 @@ def weighted_percentile_vectorized(a, quantiles, weights=None, sorter=None):
percentiles = np.zeros_like(quantiles)
for i, q in enumerate(quantiles):
if q > 100 or q < 0:
raise ValueError("q should be in-between 0 and 100, "
"got %d" % q)
raise ValueError("q should be in-between 0 and 100, " "got %d" % q)
start = np.searchsorted(partial_sum, q) - 1
if start == len(sorted_cum_weights) - 1:

@@ -0,0 +1,12 @@
"""
This script inserts the TSPerf directory into sys.path, so that scripts can import
all the modules in TSPerf. Each submission folder needs its own localpath.py file.
"""
import os, sys
_CURR_DIR = os.path.dirname(os.path.abspath(__file__))
TSPERF_DIR = os.path.dirname(os.path.dirname(os.path.dirname(_CURR_DIR)))
if TSPERF_DIR not in sys.path:
sys.path.insert(0, TSPERF_DIR)

@@ -0,0 +1,80 @@
# This script performs training and scoring with Quantile Random Forest model
from os.path import join
import argparse
import pandas as pd
from numpy import arange
from ensemble_parallel import RandomForestQuantileRegressor
# get seed value
parser = argparse.ArgumentParser()
parser.add_argument(
"--data-folder", type=str, dest="data_folder", help="data folder mounting point",
)
parser.add_argument(
"--output-folder", type=str, dest="output_folder", help="output folder mounting point",
)
parser.add_argument("--seed", type=int, dest="seed", help="random seed")
args = parser.parse_args()
# initialize location of input and output files
data_dir = join(args.data_folder, "features")
train_dir = join(data_dir, "train")
test_dir = join(data_dir, "test")
output_file = join(args.output_folder, "submission_seed_{}.csv".format(args.seed))
# do 6 rounds of forecasting, at each round output 9 quantiles
n_rounds = 6
quantiles = arange(0.1, 1, 0.1)
# schema of the output
y_test = pd.DataFrame(columns=["Datetime", "Zone", "Round", "q", "Prediction"])
for i in range(1, n_rounds + 1):
print("Round {}".format(i))
# read training and test files for the current round
train_file = join(train_dir, "train_round_{}.csv".format(i))
train_df = pd.read_csv(train_file)
test_file = join(test_dir, "test_round_{}.csv".format(i))
test_df = pd.read_csv(test_file)
# train and test for each hour separately
for hour in arange(0, 24):
print(hour)
# select training sets
train_df_hour = train_df[(train_df["hour_of_day"] == hour)]
# create one-hot encoding of Zone
# (scikit-garden works only with numerical columns)
train_df_hour = pd.get_dummies(train_df_hour, columns=["Zone"])
# remove columns that are not useful (Datetime) or are not
# available in the test set (DEMAND, DryBulb, DewPnt)
X_train = train_df_hour.drop(columns=["Datetime", "DEMAND", "DryBulb", "DewPnt"]).values
y_train = train_df_hour["DEMAND"].values
# train a model
rfqr = RandomForestQuantileRegressor(
random_state=args.seed, n_jobs=-1, n_estimators=1000, max_features="sqrt", max_depth=12,
)
rfqr.fit(X_train, y_train)
# select test set
test_df_hour = test_df[test_df["hour_of_day"] == hour]
y_test_baseline = test_df_hour[["Datetime", "Zone"]]
test_df_cat = pd.get_dummies(test_df_hour, columns=["Zone"])
X_test = test_df_cat.drop(columns=["Datetime"]).values
# generate forecast for each quantile
percentiles = rfqr.predict(X_test, quantiles * 100)
for j, quantile in enumerate(quantiles):
y_test_round_quantile = y_test_baseline.copy(deep=True)
y_test_round_quantile["Round"] = i
y_test_round_quantile["q"] = quantile
y_test_round_quantile["Prediction"] = percentiles[:, j]
y_test = pd.concat([y_test, y_test_round_quantile])
# store forecasts
y_test.to_csv(output_file, index=False)
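Matching the call in `train_score_vm.sh` below, a single run from the repository root looks like:
```bash
python benchmarks/GEFCom2017_D_Prob_MT_hourly/qrf/train_score.py \
    --data-folder benchmarks/GEFCom2017_D_Prob_MT_hourly/qrf/data \
    --output-folder benchmarks/GEFCom2017_D_Prob_MT_hourly/qrf --seed 1
```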

@@ -0,0 +1,15 @@
#!/bin/bash
path=benchmarks/GEFCom2017_D_Prob_MT_hourly
for i in `seq 1 5`;
do
echo "Run $i"
start=`date +%s`
echo 'Creating features...'
python $path/qrf/compute_features.py --submission qrf
echo 'Training and predicting...'
python $path/qrf/train_score.py --data-folder $path/qrf/data --output-folder $path/qrf --seed $i
end=`date +%s`
echo 'Running time '$((end-start))' seconds'
done
echo 'Training and scoring are completed'

@@ -87,27 +87,25 @@ to check if conda has been installed by running command `conda -V`. If it is in
`/test` under the data directory, respectively. After running the above command, you can deactivate the conda environment by running
`source deactivate`.
5. Make sure Docker is installed
You can check if Docker is installed on your VM by running
```bash
sudo docker -v
```
You will see the Docker version if Docker is installed. If not, you can install it by following the instructions [here](https://docs.docker.com/install/linux/docker-ce/ubuntu/). Note that if you want to execute Docker commands without sudo as a non-root user, you need to create a Unix group and add users to it by following the instructions [here](https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user).
6. Build a local Docker image by running the following command from `~/Forecasting` directory
```bash
sudo docker build -t baseline_image:v1 ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/ARIMA
```
7. Choose a name for a new Docker container (e.g. arima_container) and create it using command:
```bash
sudo docker run -it -v ~/Forecasting:/Forecasting --name arima_container baseline_image:v1
```
Note that option `-v ~/Forecasting:/Forecasting` allows you to mount `~/Forecasting` folder (the one you cloned) to the container so that you will have
@@ -145,7 +143,7 @@ to check if conda has been installed by running command `conda -V`. If it is in
**Data storage:** Premium SSD
**Docker image:** tsperf.azurecr.io/retail_sales/orangejuice_pt_3weeks_weekly/baseline_image:v1
**Dockerfile:** [retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/ARIMA/Dockerfile](https://github.com/Microsoft/Forecasting/blob/master/retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/ARIMA/Dockerfile)
**Key packages/dependencies:**
* R

@@ -94,27 +94,25 @@ to check if conda has been installed by running command `conda -V`. If it is in
`/test` under the data directory, respectively. After running the above command, you can deactivate the conda environment by running
`source deactivate`.
5. Make sure Docker is installed
You can check if Docker is installed on your VM by running
```bash
sudo docker -v
```
You will see the Docker version if Docker is installed. If not, you can install it by following the instructions [here](https://docs.docker.com/install/linux/docker-ce/ubuntu/). Note that if you want to execute Docker commands without sudo as a non-root user, you need to create a Unix group and add users to it by following the instructions [here](https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user).
6. Build a local Docker image by running the following command from `~/Forecasting` directory
```bash
sudo docker build -t dcnn_image:v1 ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/DilatedCNN
```
7. Choose a name for a new Docker container (e.g. dcnn_container) and create it using command:
```bash
sudo docker run -it -v ~/Forecasting:/Forecasting --runtime=nvidia --name dcnn_container dcnn_image:v1
```
Note that option `-v ~/Forecasting:/Forecasting` allows you to mount `~/Forecasting` folder (the one you cloned) to the container so that you will have
@@ -152,7 +150,7 @@ to check if conda has been installed by running command `conda -V`. If it is in
**Data storage:** Standard HDD
**Docker image:** tsperf.azurecr.io/retail_sales/orangejuice_pt_3weeks_weekly/dcnn_image:v1
**Dockerfile:** [retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/DilatedCNN/Dockerfile](https://github.com/Microsoft/Forecasting/blob/master/retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/DilatedCNN/Dockerfile)
**Key packages/dependencies:**
* Python

@@ -1,445 +1,445 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tuning Hyperparameters of Dilated CNN Model with AML SDK and HyperDrive\n",
"\n",
"This notebook performs hyperparameter tuning of Dilated CNN model with AML SDK and HyperDrive. It selects the best model by cross validation using the training data in the first forecast round. Specifically, it splits the training data into sub-training data and validation data. Then, it trains Dilated CNN models with different sets of hyperparameters using the sub-training data and evaluate the accuracy of each model with the validation data. The set of hyperparameters which yield the best validation accuracy will be used to train models and forecast sales across all 12 forecast rounds.\n",
"\n",
"## Prerequisites\n",
"To run this notebook, you need to install AML SDK and its widget extension in your environment by running the following commands in a terminal. Before running the commands, you need to activate your environment by executing `source activate <your env>` in a Linux VM. \n",
"`pip3 install --upgrade azureml-sdk[notebooks,automl]` \n",
"`jupyter nbextension install --py --user azureml.widgets` \n",
"`jupyter nbextension enable --py --user azureml.widgets` \n",
"\n",
"To add the environment to your Jupyter kernels, you can do `python3 -m ipykernel install --name <your env>`. Besides, you need to create an Azure ML workspace and download its configuration file (`config.json`) by following the [configuration.ipynb](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml\n",
"from azureml.core import Workspace, Run\n",
"\n",
"# Check core SDK version number\n",
"print(\"Azure ML SDK Version: \", azureml.core.VERSION)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.telemetry import set_diagnostics_collection\n",
"\n",
"# Opt-in diagnostics for better experience of future releases\n",
"set_diagnostics_collection(send_diagnostics=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialize Workspace & Create an Azure ML Experiment\n",
"\n",
"Initialize a [Machine Learning Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the workspace you created in the Prerequisites step. `Workspace.from_config()` below creates a workspace object from the details stored in `config.json` that you have downloaded."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.workspace import Workspace\n",
"\n",
"ws = Workspace.from_config()\n",
"print('Workspace name: ' + ws.name, \n",
" 'Azure region: ' + ws.location, \n",
" 'Resource group: ' + ws.resource_group, sep = '\\n')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Experiment\n",
"\n",
"exp = Experiment(workspace=ws, name='tune_dcnn')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Validate Script Locally"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.runconfig import RunConfiguration\n",
"\n",
"# Configure local, user managed environment\n",
"run_config_user_managed = RunConfiguration()\n",
"run_config_user_managed.environment.python.user_managed_dependencies = True\n",
"run_config_user_managed.environment.python.interpreter_path = '/usr/bin/python3.5'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import ScriptRunConfig\n",
"\n",
"# Please update data-folder argument before submitting the job\n",
"src = ScriptRunConfig(source_directory='./', \n",
" script='train_validate.py', \n",
" arguments=['--data-folder', \n",
" '/home/chenhui/TSPerf/retail_sales/OrangeJuice_Pt_3Weeks_Weekly/data/', \n",
" '--dropout-rate', '0.2'],\n",
" run_config=run_config_user_managed)\n",
"run_local = exp.submit(src)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check job status\n",
"run_local.get_status()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check results\n",
"while(run_local.get_status() != 'Completed'): {}\n",
"run_local.get_details()\n",
"run_local.get_metrics()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Run Script on Remote Compute Target"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a GPU cluster as compute target"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.compute import ComputeTarget, AmlCompute\n",
"from azureml.core.compute_target import ComputeTargetException\n",
"\n",
"# Choose a name for your cluster\n",
"cluster_name = \"gpucluster\"\n",
"\n",
"try:\n",
" # Look for the existing cluster by name\n",
" compute_target = ComputeTarget(workspace=ws, name=cluster_name)\n",
" if type(compute_target) is AmlCompute:\n",
" print('Found existing compute target {}.'.format(cluster_name))\n",
" else:\n",
" print('{} exists but it is not an AML Compute target. Please choose a different name.'.format(cluster_name))\n",
"except ComputeTargetException:\n",
" print('Creating a new compute target...')\n",
" compute_config = AmlCompute.provisioning_configuration(vm_size=\"STANDARD_NC6\", # GPU-based VM\n",
" #vm_priority='lowpriority', # optional\n",
" min_nodes=0, \n",
" max_nodes=4,\n",
" idle_seconds_before_scaledown=3600)\n",
" # Create the cluster\n",
" compute_target = ComputeTarget.create(ws, cluster_name, compute_config)\n",
" # Can poll for a minimum number of nodes and for a specific timeout. \n",
" # if no min node count is provided it uses the scale settings for the cluster\n",
" compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)\n",
" # Get a detailed status for the current cluster. \n",
" print(compute_target.serialize())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# If you have created the compute target, you should see one entry named 'gpucluster' of type AmlCompute \n",
"# in the workspace's compute_targets property.\n",
"compute_targets = ws.compute_targets\n",
"for name, ct in compute_targets.items():\n",
" print(name, ct.type, ct.provisioning_state)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure Docker environment"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.runconfig import EnvironmentDefinition\n",
"from azureml.core.conda_dependencies import CondaDependencies\n",
"\n",
"env = EnvironmentDefinition()\n",
"env.python.user_managed_dependencies = False\n",
"env.python.conda_dependencies = CondaDependencies.create(conda_packages=['pandas', 'numpy', 'scipy', 'scikit-learn', 'tensorflow-gpu', 'keras', 'joblib'],\n",
" python_version='3.6.2')\n",
"env.python.conda_dependencies.add_channel('conda-forge')\n",
"env.docker.enabled=True"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Upload data to default datastore\n",
"\n",
"Upload the Orange Juice dataset to the workspace's default datastore, which will later be mounted on the cluster for model training and validation. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ds = ws.get_default_datastore()\n",
"print(ds.datastore_type, ds.account_name, ds.container_name)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"path_on_datastore = 'data'\n",
"ds.upload(src_dir='../../data', target_path=path_on_datastore, overwrite=True, show_progress=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Get data reference object for the data path\n",
"ds_data = ds.path(path_on_datastore)\n",
"print(ds_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create estimator\n",
"Next, we will check if the remote compute target is successfully created by submitting a job to the target. This compute target will be used by HyperDrive to tune the hyperparameters later. You may skip this part of code and directly jump into [Tune Hyperparameters using HyperDrive](#tune-hyperparameters-using-hyperdrive)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.runconfig import EnvironmentDefinition\n",
"from azureml.train.estimator import Estimator\n",
"\n",
"script_folder = './'\n",
"script_params = {\n",
" '--data-folder': ds_data.as_mount(),\n",
" '--dropout-rate': 0.2\n",
"}\n",
"est = Estimator(source_directory=script_folder,\n",
" script_params=script_params,\n",
" compute_target=compute_target,\n",
" use_docker=True,\n",
" entry_script='train_validate.py',\n",
" environment_definition=env)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit job"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Submit job to compute target\n",
"run_remote = exp.submit(config=est)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Check job status"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.widgets import RunDetails\n",
"\n",
"RunDetails(run_remote).show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"run_remote.get_details()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Get metric value after the job finishes \n",
"while(run_remote.get_status() != 'Completed'): {}\n",
"run_remote.get_metrics()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='tune-hyperparameters-using-hyperdrive'></a>\n",
"## Tune Hyperparameters using HyperDrive"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.train.hyperdrive import *\n",
"\n",
"script_folder = './'\n",
"script_params = {\n",
" '--data-folder': ds_data.as_mount() \n",
"}\n",
"est = Estimator(source_directory=script_folder,\n",
" script_params=script_params,\n",
" compute_target=compute_target,\n",
" use_docker=True,\n",
" entry_script='train_validate.py',\n",
" environment_definition=env)\n",
"ps = BayesianParameterSampling({\n",
" '--seq-len': quniform(5, 40, 1),\n",
" '--dropout-rate': uniform(0, 0.4),\n",
" '--batch-size': choice(32, 64),\n",
" '--learning-rate': choice(1e-4, 1e-3, 5e-3, 1e-2, 1.5e-2, 2e-2, 3e-2, 5e-2, 1e-1),\n",
" '--epochs': quniform(2, 80, 1)\n",
"})\n",
"htc = HyperDriveRunConfig(estimator=est, \n",
" hyperparameter_sampling=ps, \n",
" primary_metric_name='MAPE', \n",
" primary_metric_goal=PrimaryMetricGoal.MINIMIZE, \n",
" max_total_runs=200,\n",
" max_concurrent_runs=4)\n",
"htr = exp.submit(config=htc)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"RunDetails(htr).show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"while(htr.get_status() != 'Completed'): {}\n",
"htr.get_metrics()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"best_run = htr.get_best_run_by_primary_metric()\n",
"parameter_values = best_run.get_details()['runDefinition']['Arguments']\n",
"print(parameter_values)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

@@ -0,0 +1,88 @@
# coding: utf-8

# Create input features for the Dilated Convolutional Neural Network (CNN) model.

import os
import sys
import math
import datetime
import numpy as np
import pandas as pd

# Append TSPerf path to sys.path
tsperf_dir = "."
if tsperf_dir not in sys.path:
    sys.path.append(tsperf_dir)

# Import TSPerf components
from utils import *
import retail_sales.OrangeJuice_Pt_3Weeks_Weekly.common.benchmark_settings as bs


def make_features(pred_round, train_dir, pred_steps, offset, store_list, brand_list):
    """Create a dataframe of the input features.

    Args:
        pred_round (Integer): Prediction round
        train_dir (String): Path of the training data directory
        pred_steps (Integer): Number of prediction steps
        offset (Integer): Length of training data skipped in the retraining
        store_list (Numpy Array): List of all the store IDs
        brand_list (Numpy Array): List of all the brand IDs

    Returns:
        data_filled (Dataframe): Dataframe including the input features
        data_scaled (Dataframe): Dataframe including the normalized features
    """
    # Load training data
    train_df = pd.read_csv(os.path.join(train_dir, "train_round_" + str(pred_round + 1) + ".csv"))
    train_df["move"] = train_df["logmove"].apply(lambda x: round(math.exp(x)))
    train_df = train_df[["store", "brand", "week", "move"]]
    # Create a dataframe to hold all necessary data
    week_list = range(bs.TRAIN_START_WEEK + offset, bs.TEST_END_WEEK_LIST[pred_round] + 1)
    d = {"store": store_list, "brand": brand_list, "week": week_list}
    data_grid = df_from_cartesian_product(d)
    data_filled = pd.merge(data_grid, train_df, how="left", on=["store", "brand", "week"])
    # Get future price, deal, and advertisement info
    aux_df = pd.read_csv(os.path.join(train_dir, "aux_round_" + str(pred_round + 1) + ".csv"))
    data_filled = pd.merge(data_filled, aux_df, how="left", on=["store", "brand", "week"])
    # Create relative price feature
    price_cols = [
        "price1",
        "price2",
        "price3",
        "price4",
        "price5",
        "price6",
        "price7",
        "price8",
        "price9",
        "price10",
        "price11",
    ]
    data_filled["price"] = data_filled.apply(lambda x: x.loc["price" + str(int(x.loc["brand"]))], axis=1)
    data_filled["avg_price"] = data_filled[price_cols].sum(axis=1).apply(lambda x: x / len(price_cols))
    data_filled["price_ratio"] = data_filled["price"] / data_filled["avg_price"]
    data_filled.drop(price_cols, axis=1, inplace=True)
    # Fill missing values
    data_filled = data_filled.groupby(["store", "brand"]).apply(
        lambda x: x.fillna(method="ffill").fillna(method="bfill")
    )
    # Create datetime features
    data_filled["week_start"] = data_filled["week"].apply(
        lambda x: bs.FIRST_WEEK_START + datetime.timedelta(days=(x - 1) * 7)
    )
    data_filled["month"] = data_filled["week_start"].apply(lambda x: x.month)
    data_filled["week_of_month"] = data_filled["week_start"].apply(lambda x: week_of_month(x))
    data_filled.drop("week_start", axis=1, inplace=True)
    # Normalize the dataframe of features
    cols_normalize = data_filled.columns.difference(["store", "brand", "week"])
    data_scaled, min_max_scaler = normalize_dataframe(data_filled, cols_normalize)
    return data_filled, data_scaled
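
For illustration, a minimal sketch of how `make_features` can be invoked for the first forecast round; the training-data path is an assumption based on this repository's layout, and the argument values mirror the call made in the training script below (PRED_STEPS=2 and offset=0 in round 1).

```python
# Hypothetical driver for make_features(); the path below is an assumed example.
import pandas as pd
from make_features import make_features

TRAIN_DIR = "retail_sales/OrangeJuice_Pt_3Weeks_Weekly/data/train"
train_df = pd.read_csv(TRAIN_DIR + "/train_round_1.csv")
store_list = train_df["store"].unique()
brand_list = train_df["brand"].unique()

# Round 1 (pred_round=0), 2 prediction steps, no retraining offset
data_filled, data_scaled = make_features(0, TRAIN_DIR, 2, 0, store_list, brand_list)
print(data_filled.head())
```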


@@ -0,0 +1,223 @@
# coding: utf-8

# Train and score a Dilated Convolutional Neural Network (CNN) model using Keras package with TensorFlow backend.

import os
import sys
import keras
import random
import argparse
import numpy as np
import pandas as pd
import tensorflow as tf
from keras import optimizers
from keras.layers import *
from keras.models import Model, load_model
from keras.callbacks import ModelCheckpoint

# Append TSPerf path to sys.path (assume we run the script from TSPerf directory)
tsperf_dir = "."
if tsperf_dir not in sys.path:
    sys.path.append(tsperf_dir)

# Import TSPerf components
from utils import *
from make_features import make_features
import retail_sales.OrangeJuice_Pt_3Weeks_Weekly.common.benchmark_settings as bs


# Model definition
def create_dcnn_model(seq_len, kernel_size=2, n_filters=3, n_input_series=1, n_outputs=1):
    """Create a Dilated CNN model.

    Args:
        seq_len (Integer): Input sequence length
        kernel_size (Integer): Kernel size of each convolutional layer
        n_filters (Integer): Number of filters in each convolutional layer
        n_outputs (Integer): Number of outputs in the last layer

    Returns:
        Keras Model object
    """
    # Sequential input
    seq_in = Input(shape=(seq_len, n_input_series))
    # Categorical input
    cat_fea_in = Input(shape=(2,), dtype="uint8")
    store_id = Lambda(lambda x: x[:, 0, None])(cat_fea_in)
    brand_id = Lambda(lambda x: x[:, 1, None])(cat_fea_in)
    store_embed = Embedding(MAX_STORE_ID + 1, 7, input_length=1)(store_id)
    brand_embed = Embedding(MAX_BRAND_ID + 1, 4, input_length=1)(brand_id)
    # Dilated convolutional layers
    c1 = Conv1D(filters=n_filters, kernel_size=kernel_size, dilation_rate=1, padding="causal", activation="relu")(
        seq_in
    )
    c2 = Conv1D(filters=n_filters, kernel_size=kernel_size, dilation_rate=2, padding="causal", activation="relu")(c1)
    c3 = Conv1D(filters=n_filters, kernel_size=kernel_size, dilation_rate=4, padding="causal", activation="relu")(c2)
    # Skip connections
    c4 = concatenate([c1, c3])
    # Output of convolutional layers
    conv_out = Conv1D(8, 1, activation="relu")(c4)
    conv_out = Dropout(args.dropout_rate)(conv_out)
    conv_out = Flatten()(conv_out)
    # Concatenate with categorical features
    x = concatenate([conv_out, Flatten()(store_embed), Flatten()(brand_embed)])
    x = Dense(16, activation="relu")(x)
    output = Dense(n_outputs, activation="linear")(x)
    # Define model interface, loss function, and optimizer
    model = Model(inputs=[seq_in, cat_fea_in], outputs=output)
    return model


if __name__ == "__main__":
    # Parse input arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--seed", type=int, dest="seed", default=1, help="random seed")
    parser.add_argument("--seq-len", type=int, dest="seq_len", default=15, help="length of the input sequence")
    parser.add_argument("--dropout-rate", type=float, dest="dropout_rate", default=0.01, help="dropout ratio")
    parser.add_argument("--batch-size", type=int, dest="batch_size", default=64, help="mini batch size for training")
    parser.add_argument("--learning-rate", type=float, dest="learning_rate", default=0.015, help="learning rate")
    parser.add_argument("--epochs", type=int, dest="epochs", default=25, help="# of epochs")
    args = parser.parse_args()

    # Fix random seeds
    np.random.seed(args.seed)
    random.seed(args.seed)
    tf.set_random_seed(args.seed)

    # Data paths
    DATA_DIR = os.path.join(tsperf_dir, "retail_sales", "OrangeJuice_Pt_3Weeks_Weekly", "data")
    SUBMISSION_DIR = os.path.join(
        tsperf_dir, "retail_sales", "OrangeJuice_Pt_3Weeks_Weekly", "submissions", "DilatedCNN"
    )
    TRAIN_DIR = os.path.join(DATA_DIR, "train")

    # Dataset parameters
    MAX_STORE_ID = 137
    MAX_BRAND_ID = 11

    # Parameters of the model
    PRED_HORIZON = 3
    PRED_STEPS = 2
    SEQ_LEN = args.seq_len
    DYNAMIC_FEATURES = ["deal", "feat", "month", "week_of_month", "price", "price_ratio"]
    STATIC_FEATURES = ["store", "brand"]

    # Get unique stores and brands
    train_df = pd.read_csv(os.path.join(TRAIN_DIR, "train_round_1.csv"))
    store_list = train_df["store"].unique()
    brand_list = train_df["brand"].unique()
    store_brand = [(x, y) for x in store_list for y in brand_list]

    # Train and predict for all forecast rounds
    pred_all = []
    file_name = os.path.join(SUBMISSION_DIR, "dcnn_model.h5")
    for r in range(bs.NUM_ROUNDS):
        print("---- Round " + str(r + 1) + " ----")
        offset = 0 if r == 0 else 40 + r * PRED_STEPS
        # Create features
        data_filled, data_scaled = make_features(r, TRAIN_DIR, PRED_STEPS, offset, store_list, brand_list)
        # Create sequence array for 'move'
        start_timestep = 0
        end_timestep = bs.TRAIN_END_WEEK_LIST[r] - bs.TRAIN_START_WEEK - PRED_HORIZON
        train_input1 = gen_sequence_array(
            data_scaled, store_brand, SEQ_LEN, ["move"], start_timestep, end_timestep - offset
        )
        # Create sequence array for other dynamic features
        start_timestep = PRED_HORIZON
        end_timestep = bs.TRAIN_END_WEEK_LIST[r] - bs.TRAIN_START_WEEK
        train_input2 = gen_sequence_array(
            data_scaled, store_brand, SEQ_LEN, DYNAMIC_FEATURES, start_timestep, end_timestep - offset
        )
        seq_in = np.concatenate([train_input1, train_input2], axis=2)
        # Create array of static features
        total_timesteps = bs.TRAIN_END_WEEK_LIST[r] - bs.TRAIN_START_WEEK - SEQ_LEN - PRED_HORIZON + 2
        cat_fea_in = static_feature_array(data_filled, total_timesteps - offset, STATIC_FEATURES)
        # Create training output
        start_timestep = SEQ_LEN + PRED_HORIZON - PRED_STEPS
        end_timestep = bs.TRAIN_END_WEEK_LIST[r] - bs.TRAIN_START_WEEK
        train_output = gen_sequence_array(
            data_filled, store_brand, PRED_STEPS, ["move"], start_timestep, end_timestep - offset
        )
        train_output = np.squeeze(train_output)
        # Create and train model
        if r == 0:
            model = create_dcnn_model(
                seq_len=SEQ_LEN, n_filters=2, n_input_series=1 + len(DYNAMIC_FEATURES), n_outputs=PRED_STEPS
            )
            adam = optimizers.Adam(lr=args.learning_rate)
            model.compile(loss="mape", optimizer=adam, metrics=["mape"])
            # Define checkpoint and fit model
            checkpoint = ModelCheckpoint(file_name, monitor="loss", save_best_only=True, mode="min", verbose=0)
            callbacks_list = [checkpoint]
            history = model.fit(
                [seq_in, cat_fea_in],
                train_output,
                epochs=args.epochs,
                batch_size=args.batch_size,
                callbacks=callbacks_list,
                verbose=0,
            )
        else:
            model = load_model(file_name)
            checkpoint = ModelCheckpoint(file_name, monitor="loss", save_best_only=True, mode="min", verbose=0)
            callbacks_list = [checkpoint]
            history = model.fit(
                [seq_in, cat_fea_in],
                train_output,
                epochs=1,
                batch_size=args.batch_size,
                callbacks=callbacks_list,
                verbose=0,
            )
        # Get inputs for prediction
        start_timestep = bs.TEST_START_WEEK_LIST[r] - bs.TRAIN_START_WEEK - SEQ_LEN - PRED_HORIZON + PRED_STEPS
        end_timestep = bs.TEST_START_WEEK_LIST[r] - bs.TRAIN_START_WEEK + PRED_STEPS - 1 - PRED_HORIZON
        test_input1 = gen_sequence_array(
            data_scaled, store_brand, SEQ_LEN, ["move"], start_timestep - offset, end_timestep - offset
        )
        start_timestep = bs.TEST_END_WEEK_LIST[r] - bs.TRAIN_START_WEEK - SEQ_LEN + 1
        end_timestep = bs.TEST_END_WEEK_LIST[r] - bs.TRAIN_START_WEEK
        test_input2 = gen_sequence_array(
            data_scaled, store_brand, SEQ_LEN, DYNAMIC_FEATURES, start_timestep - offset, end_timestep - offset
        )
        seq_in = np.concatenate([test_input1, test_input2], axis=2)
        total_timesteps = 1
        cat_fea_in = static_feature_array(data_filled, total_timesteps, STATIC_FEATURES)
        # Make prediction
        pred = np.round(model.predict([seq_in, cat_fea_in]))
        # Create dataframe for submission
        exp_output = data_filled[data_filled.week >= bs.TEST_START_WEEK_LIST[r]].reset_index(drop=True)
        exp_output = exp_output[["store", "brand", "week"]]
        pred_df = (
            exp_output.sort_values(["store", "brand", "week"]).loc[:, ["store", "brand", "week"]].reset_index(drop=True)
        )
        pred_df["weeks_ahead"] = pred_df["week"] - bs.TRAIN_END_WEEK_LIST[r]
        pred_df["round"] = r + 1
        pred_df["prediction"] = np.reshape(pred, (pred.size, 1))
        pred_all.append(pred_df)

    # Generate submission
    submission = pd.concat(pred_all, axis=0).reset_index(drop=True)
    submission = submission[["round", "store", "brand", "week", "weeks_ahead", "prediction"]]
    filename = "submission_seed_" + str(args.seed) + ".csv"
    submission.to_csv(os.path.join(SUBMISSION_DIR, filename), index=False)
    print("Done")


@@ -0,0 +1,212 @@
# coding: utf-8

# Perform cross validation of a Dilated Convolutional Neural Network (CNN) model on the training data of the 1st forecast round.

import os
import sys
import math
import keras
import argparse
import datetime
import numpy as np
import pandas as pd
from utils import *
from keras.layers import *
from keras.models import Model
from keras import optimizers
from keras.utils import multi_gpu_model
from azureml.core import Run


# Model definition
def create_dcnn_model(seq_len, kernel_size=2, n_filters=3, n_input_series=1, n_outputs=1):
    """Create a Dilated CNN model.

    Args:
        seq_len (Integer): Input sequence length
        kernel_size (Integer): Kernel size of each convolutional layer
        n_filters (Integer): Number of filters in each convolutional layer
        n_outputs (Integer): Number of outputs in the last layer

    Returns:
        Keras Model object
    """
    # Sequential input
    seq_in = Input(shape=(seq_len, n_input_series))
    # Categorical input
    cat_fea_in = Input(shape=(2,), dtype="uint8")
    store_id = Lambda(lambda x: x[:, 0, None])(cat_fea_in)
    brand_id = Lambda(lambda x: x[:, 1, None])(cat_fea_in)
    store_embed = Embedding(MAX_STORE_ID + 1, 7, input_length=1)(store_id)
    brand_embed = Embedding(MAX_BRAND_ID + 1, 4, input_length=1)(brand_id)
    # Dilated convolutional layers
    c1 = Conv1D(filters=n_filters, kernel_size=kernel_size, dilation_rate=1, padding="causal", activation="relu")(
        seq_in
    )
    c2 = Conv1D(filters=n_filters, kernel_size=kernel_size, dilation_rate=2, padding="causal", activation="relu")(c1)
    c3 = Conv1D(filters=n_filters, kernel_size=kernel_size, dilation_rate=4, padding="causal", activation="relu")(c2)
    # Skip connections
    c4 = concatenate([c1, c3])
    # Output of convolutional layers
    conv_out = Conv1D(8, 1, activation="relu")(c4)
    conv_out = Dropout(args.dropout_rate)(conv_out)
    conv_out = Flatten()(conv_out)
    # Concatenate with categorical features
    x = concatenate([conv_out, Flatten()(store_embed), Flatten()(brand_embed)])
    x = Dense(16, activation="relu")(x)
    output = Dense(n_outputs, activation="linear")(x)
    # Define model interface, loss function, and optimizer
    model = Model(inputs=[seq_in, cat_fea_in], outputs=output)
    return model


if __name__ == "__main__":
    # Parse input arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-folder", type=str, dest="data_folder", help="data folder mounting point")
    parser.add_argument("--seq-len", type=int, dest="seq_len", default=20, help="length of the input sequence")
    parser.add_argument("--batch-size", type=int, dest="batch_size", default=64, help="mini batch size for training")
    parser.add_argument("--dropout-rate", type=float, dest="dropout_rate", default=0.10, help="dropout ratio")
    parser.add_argument("--learning-rate", type=float, dest="learning_rate", default=0.01, help="learning rate")
    parser.add_argument("--epochs", type=int, dest="epochs", default=30, help="# of epochs")
    args = parser.parse_args()
    args.dropout_rate = round(args.dropout_rate, 2)
    print(args)

    # Start an Azure ML run
    run = Run.get_context()

    # Data paths
    DATA_DIR = args.data_folder
    TRAIN_DIR = os.path.join(DATA_DIR, "train")

    # Data and forecast problem parameters
    MAX_STORE_ID = 137
    MAX_BRAND_ID = 11
    PRED_HORIZON = 3
    PRED_STEPS = 2
    TRAIN_START_WEEK = 40
    TRAIN_END_WEEK_LIST = list(range(135, 159, 2))
    TEST_START_WEEK_LIST = list(range(137, 161, 2))
    TEST_END_WEEK_LIST = list(range(138, 162, 2))
    # The start datetime of the first week in the record
    FIRST_WEEK_START = pd.to_datetime("1989-09-14 00:00:00")

    # Input sequence length and feature names
    SEQ_LEN = args.seq_len
    DYNAMIC_FEATURES = ["deal", "feat", "month", "week_of_month", "price", "price_ratio"]
    STATIC_FEATURES = ["store", "brand"]

    # Get unique stores and brands
    train_df = pd.read_csv(os.path.join(TRAIN_DIR, "train_round_1.csv"))
    store_list = train_df["store"].unique()
    brand_list = train_df["brand"].unique()
    store_brand = [(x, y) for x in store_list for y in brand_list]

    # Train and validate the model using only the first round data
    r = 0
    print("---- Round " + str(r + 1) + " ----")
    # Load training data
    train_df = pd.read_csv(os.path.join(TRAIN_DIR, "train_round_" + str(r + 1) + ".csv"))
    train_df["move"] = train_df["logmove"].apply(lambda x: round(math.exp(x)))
    train_df = train_df[["store", "brand", "week", "move"]]

    # Create a dataframe to hold all necessary data
    week_list = range(TRAIN_START_WEEK, TEST_END_WEEK_LIST[r] + 1)
    d = {"store": store_list, "brand": brand_list, "week": week_list}
    data_grid = df_from_cartesian_product(d)
    data_filled = pd.merge(data_grid, train_df, how="left", on=["store", "brand", "week"])

    # Get future price, deal, and advertisement info
    aux_df = pd.read_csv(os.path.join(TRAIN_DIR, "aux_round_" + str(r + 1) + ".csv"))
    data_filled = pd.merge(data_filled, aux_df, how="left", on=["store", "brand", "week"])

    # Create relative price feature
    price_cols = [
        "price1",
        "price2",
        "price3",
        "price4",
        "price5",
        "price6",
        "price7",
        "price8",
        "price9",
        "price10",
        "price11",
    ]
    data_filled["price"] = data_filled.apply(lambda x: x.loc["price" + str(int(x.loc["brand"]))], axis=1)
    data_filled["avg_price"] = data_filled[price_cols].sum(axis=1).apply(lambda x: x / len(price_cols))
    data_filled["price_ratio"] = data_filled.apply(lambda x: x["price"] / x["avg_price"], axis=1)

    # Fill missing values
    data_filled = data_filled.groupby(["store", "brand"]).apply(
        lambda x: x.fillna(method="ffill").fillna(method="bfill")
    )

    # Create datetime features
    data_filled["week_start"] = data_filled["week"].apply(
        lambda x: FIRST_WEEK_START + datetime.timedelta(days=(x - 1) * 7)
    )
    data_filled["day"] = data_filled["week_start"].apply(lambda x: x.day)
    data_filled["week_of_month"] = data_filled["week_start"].apply(lambda x: week_of_month(x))
    data_filled["month"] = data_filled["week_start"].apply(lambda x: x.month)
    data_filled.drop("week_start", axis=1, inplace=True)

    # Normalize the dataframe of features
    cols_normalize = data_filled.columns.difference(["store", "brand", "week"])
    data_scaled, min_max_scaler = normalize_dataframe(data_filled, cols_normalize)

    # Create sequence array for 'move'
    start_timestep = 0
    end_timestep = TRAIN_END_WEEK_LIST[r] - TRAIN_START_WEEK - PRED_HORIZON
    train_input1 = gen_sequence_array(data_scaled, store_brand, SEQ_LEN, ["move"], start_timestep, end_timestep)

    # Create sequence array for other dynamic features
    start_timestep = PRED_HORIZON
    end_timestep = TRAIN_END_WEEK_LIST[r] - TRAIN_START_WEEK
    train_input2 = gen_sequence_array(data_scaled, store_brand, SEQ_LEN, DYNAMIC_FEATURES, start_timestep, end_timestep)
    seq_in = np.concatenate((train_input1, train_input2), axis=2)

    # Create array of static features
    total_timesteps = TRAIN_END_WEEK_LIST[r] - TRAIN_START_WEEK - SEQ_LEN - PRED_HORIZON + 2
    cat_fea_in = static_feature_array(data_filled, total_timesteps, STATIC_FEATURES)

    # Create training output
    start_timestep = SEQ_LEN + PRED_HORIZON - PRED_STEPS
    end_timestep = TRAIN_END_WEEK_LIST[r] - TRAIN_START_WEEK
    train_output = gen_sequence_array(data_filled, store_brand, PRED_STEPS, ["move"], start_timestep, end_timestep)
    train_output = np.squeeze(train_output)

    # Create model
    model = create_dcnn_model(
        seq_len=SEQ_LEN, n_filters=2, n_input_series=1 + len(DYNAMIC_FEATURES), n_outputs=PRED_STEPS
    )

    # Convert to GPU model
    try:
        model = multi_gpu_model(model)
        print("Training using multiple GPUs...")
    except:
        print("Training using single GPU or CPU...")
    adam = optimizers.Adam(lr=args.learning_rate)
    model.compile(loss="mape", optimizer=adam, metrics=["mape", "mae"])

    # Model training and validation
    history = model.fit(
        [seq_in, cat_fea_in], train_output, epochs=args.epochs, batch_size=args.batch_size, validation_split=0.05
    )
    val_loss = history.history["val_loss"][-1]
    print("Validation loss is {}".format(val_loss))

    # Log the validation loss/MAPE
    run.log("MAPE", np.float(val_loss))


@@ -1,11 +1,12 @@
# coding: utf-8

# Utility functions for building the Dilated Convolutional Neural Network (CNN) model.

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def week_of_month(dt):
    """Get the week of the month for the specified date.
@@ -14,14 +15,16 @@ def week_of_month(dt):
    Returns:
        wom (Integer): Week of the month of the input date
    """
    from math import ceil

    first_day = dt.replace(day=1)
    dom = dt.day
    adjusted_dom = dom + first_day.weekday()
    wom = int(ceil(adjusted_dom / 7.0))
    return wom

def df_from_cartesian_product(dict_in):
    """Generate a Pandas dataframe from Cartesian product of lists.
@@ -33,11 +36,13 @@ def df_from_cartesian_product(dict_in):
    """
    from collections import OrderedDict
    from itertools import product

    od = OrderedDict(sorted(dict_in.items()))
    cart = list(product(*od.values()))
    df = pd.DataFrame(cart, columns=od.keys())
    return df

def gen_sequence(df, seq_len, seq_cols, start_timestep=0, end_timestep=None):
    """Reshape features into an array of dimension (time steps, features).
@@ -54,9 +59,12 @@ def gen_sequence(df, seq_len, seq_cols, start_timestep=0, end_timestep=None):
    data_array = df[seq_cols].values
    if end_timestep is None:
        end_timestep = df.shape[0]
    for start, stop in zip(
        range(start_timestep, end_timestep - seq_len + 2), range(start_timestep + seq_len, end_timestep + 2)
    ):
        yield data_array[start:stop, :]

def gen_sequence_array(df_all, store_brand, seq_len, seq_cols, start_timestep=0, end_timestep=None):
    """Combine the feature sequences for all the combinations of (store, brand) into a 3d array.
@@ -70,11 +78,22 @@ def gen_sequence_array(df_all, store_brand, seq_len, seq_cols, start_timestep=0,
    Returns:
        seq_array (Numpy Array): An array of the feature sequences of all stores and brands
    """
    seq_gen = (
        list(
            gen_sequence(
                df_all[(df_all["store"] == cur_store) & (df_all["brand"] == cur_brand)],
                seq_len,
                seq_cols,
                start_timestep,
                end_timestep,
            )
        )
        for cur_store, cur_brand in store_brand
    )
    seq_array = np.concatenate(list(seq_gen)).astype(np.float32)
    return seq_array

def static_feature_array(df_all, total_timesteps, seq_cols):
    """Generate an array which encodes all the static features.
@@ -86,10 +105,11 @@ def static_feature_array(df_all, total_timesteps, seq_cols):
    Return:
        fea_array (Numpy Array): An array of static features of all stores and brands
    """
    fea_df = df_all.groupby(["store", "brand"]).apply(lambda x: x.iloc[:total_timesteps, :]).reset_index(drop=True)
    fea_array = fea_df[seq_cols].values
    return fea_array

def normalize_dataframe(df, seq_cols, scaler=MinMaxScaler()):
    """Normalize a subset of columns of a dataframe.
@@ -102,7 +122,6 @@ def normalize_dataframe(df, seq_cols, scaler=MinMaxScaler()):
        df_scaled (Dataframe): Normalized dataframe
    """
    cols_fixed = df.columns.difference(seq_cols)
    df_scaled = pd.DataFrame(scaler.fit_transform(df[seq_cols]), columns=seq_cols, index=df.index)
    df_scaled = pd.concat([df[cols_fixed], df_scaled], axis=1)
    return df_scaled, scaler
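
As a worked example of `week_of_month`, take the benchmark's `FIRST_WEEK_START` of 1989-09-14: September 1, 1989 fell on a Friday (`weekday() == 4`), so `adjusted_dom = 14 + 4 = 18` and `ceil(18 / 7) = 3`.

```python
import pandas as pd
from utils import week_of_month

# 1989-09-14 is the start of the first week in the OrangeJuice benchmark
print(week_of_month(pd.to_datetime("1989-09-14")))  # 3
```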


@@ -84,27 +84,25 @@ to check if conda has been installed by running command `conda -V`. If it is in
   `/test` under the data directory, respectively. After running the above command, you can deactivate the conda environment by running
   `source deactivate`.

5. Make sure Docker is installed

   You can check if Docker is installed on your VM by running

   ```bash
   sudo docker -v
   ```

   You will see the Docker version if Docker is installed. If not, you can install it by following the instructions [here](https://docs.docker.com/install/linux/docker-ce/ubuntu/). Note that if you want to execute Docker commands without sudo as a non-root user, you need to create a Unix group and add users to it by following the instructions [here](https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user).

6. Build a local Docker image by running the following command from the `~/Forecasting` directory

   ```bash
   sudo docker build -t baseline_image:v1 ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/ETS
   ```

7. Choose a name for a new Docker container (e.g. ets_container) and create it using the command:

   ```bash
   sudo docker run -it -v ~/Forecasting:/Forecasting --name ets_container baseline_image:v1
   ```
Note that option `-v ~/Forecasting:/Forecasting` allows you to mount `~/Forecasting` folder (the one you cloned) to the container so that you will have
@@ -142,7 +140,7 @@ to check if conda has been installed by running command `conda -V`. If it is in
**Data storage:** Premium SSD
**Docker image:** tsperf.azurecr.io/retail_sales/orangejuice_pt_3weeks_weekly/baseline_image:v1
**Dockerfile:** [retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/ETS/Dockerfile](https://github.com/Microsoft/Forecasting/blob/master/retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/ETS/Dockerfile)
**Key packages/dependencies:**
* R


@@ -94,28 +94,26 @@ to check if conda has been installed by running command `conda -V`. If it is in
   `/test` under the data directory, respectively. After running the above command, you can deactivate the conda environment by running
   `source deactivate`.

5. Make sure Docker is installed

   You can check if Docker is installed on your VM by running

   ```bash
   sudo docker -v
   ```

   You will see the Docker version if Docker is installed. If not, you can install it by following the instructions [here](https://docs.docker.com/install/linux/docker-ce/ubuntu/). Note that if you want to execute Docker commands without sudo as a non-root user, you need to create a Unix group and add users to it by following the instructions [here](https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user).

6. Build a local Docker image by running the following command from the `~/Forecasting` directory

   ```bash
   sudo docker build -t lightgbm_image:v1 ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/LightGBM
   ```

7. Choose a name for a new Docker container (e.g. lightgbm_container) and create it using the command:

   ```bash
   cd ~/Forecasting
   sudo docker run -it -v ~/Forecasting:/Forecasting --name lightgbm_container lightgbm_image:v1
   ```
Note that option `-v ~/Forecasting:/Forecasting` allows you to mount `~/Forecasting` folder (the one you cloned) to the container so that you will have
@@ -153,7 +151,7 @@ to check if conda has been installed by running command `conda -V`. If it is in
**Data storage:** Premium SSD
**Docker image:** tsperf.azurecr.io/retail_sales/orangejuice_pt_3weeks_weekly/baseline_image:v1
**Dockerfile:** [retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/LightGBM/Dockerfile](https://github.com/Microsoft/Forecasting/blob/master/retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/LightGBM/Dockerfile)
**Key packages/dependencies:**
* Python


@@ -1,449 +1,449 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tuning Hyperparameters of LightGBM Model with AML SDK and HyperDrive\n",
"\n",
"This notebook performs hyperparameter tuning of LightGBM model with AML SDK and HyperDrive. It selects the best model by cross validation using the training data in the first forecast round. Specifically, it splits the training data into sub-training data and validation data. Then, it trains LightGBM models with different sets of hyperparameters using the sub-training data and evaluate the accuracy of each model with the validation data. The set of hyperparameters which yield the best validation accuracy will be used to train models and forecast sales across all 12 forecast rounds.\n",
"\n",
"## Prerequisites\n",
"To run this notebook, you need to install AML SDK and its widget extension in your environment by running the following commands in a terminal. Before running the commands, you need to activate your environment by executing `source activate <your env>` in a Linux VM. \n",
"`pip3 install --upgrade azureml-sdk[notebooks,automl]` \n",
"`jupyter nbextension install --py --user azureml.widgets` \n",
"`jupyter nbextension enable --py --user azureml.widgets` \n",
"\n",
"To add the environment to your Jupyter kernels, you can do `python3 -m ipykernel install --name <your env>`. Besides, you need to create an Azure ML workspace and download its configuration file (`config.json`) by following the [configuration.ipynb](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml\n",
"from azureml.core import Workspace, Run\n",
"\n",
"# Check core SDK version number\n",
"print(\"Azure ML SDK Version: \", azureml.core.VERSION)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.telemetry import set_diagnostics_collection\n",
"\n",
"# Opt-in diagnostics for better experience of future releases\n",
"set_diagnostics_collection(send_diagnostics=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialize Workspace & Create an Azure ML Experiment\n",
"\n",
"Initialize a [Machine Learning Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the workspace you created in the Prerequisites step. `Workspace.from_config()` below creates a workspace object from the details stored in `config.json` that you have downloaded."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.workspace import Workspace\n",
"\n",
"ws = Workspace.from_config()\n",
"print('Workspace name: ' + ws.name, \n",
" 'Azure region: ' + ws.location, \n",
" 'Resource group: ' + ws.resource_group, sep = '\\n')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Experiment\n",
"\n",
"exp = Experiment(workspace=ws, name='tune_lgbm')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Validate Script Locally"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.runconfig import RunConfiguration\n",
"\n",
"# Configure local, user managed environment\n",
"run_config_user_managed = RunConfiguration()\n",
"run_config_user_managed.environment.python.user_managed_dependencies = True\n",
"run_config_user_managed.environment.python.interpreter_path = '/usr/bin/python3.5'"
]
},
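{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the interpreter path above is hardcoded for the benchmark VM. As an optional sketch (assuming you run this notebook from the same environment that has the dependencies installed), you can point the run configuration at the current interpreter instead:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"\n",
"# Optional: use the interpreter of the environment running this notebook\n",
"run_config_user_managed.environment.python.interpreter_path = sys.executable"
]
},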
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import ScriptRunConfig\n",
"\n",
"# Please update data-folder argument before submitting the job\n",
"src = ScriptRunConfig(source_directory='./', \n",
" script='train_validate.py', \n",
" arguments=['--data-folder', \n",
" '/home/chenhui/TSPerf/retail_sales/OrangeJuice_Pt_3Weeks_Weekly/data/', \n",
" '--bagging-fraction', '0.8'],\n",
" run_config=run_config_user_managed)\n",
"run_local = exp.submit(src)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check job status\n",
"run_local.get_status()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check results\n",
"while(run_local.get_status() != 'Completed'): {}\n",
"run_local.get_details()\n",
"run_local.get_metrics()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Run Script on Remote Compute Target"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a CPU cluster as compute target"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.compute import ComputeTarget, AmlCompute\n",
"from azureml.core.compute_target import ComputeTargetException\n",
"\n",
"# Choose a name for your cluster\n",
"cluster_name = \"cpucluster\"\n",
"\n",
"try:\n",
" # Look for the existing cluster by name\n",
" compute_target = ComputeTarget(workspace=ws, name=cluster_name)\n",
" if type(compute_target) is AmlCompute:\n",
" print('Found existing compute target {}.'.format(cluster_name))\n",
" else:\n",
" print('{} exists but it is not an AML Compute target. Please choose a different name.'.format(cluster_name))\n",
"except ComputeTargetException:\n",
" print('Creating a new compute target...')\n",
" compute_config = AmlCompute.provisioning_configuration(vm_size=\"STANDARD_D14_v2\", # CPU-based VM\n",
" #vm_priority='lowpriority', # optional\n",
" min_nodes=0, \n",
" max_nodes=4,\n",
" idle_seconds_before_scaledown=3600)\n",
" # Create the cluster\n",
" compute_target = ComputeTarget.create(ws, cluster_name, compute_config)\n",
" # Can poll for a minimum number of nodes and for a specific timeout. \n",
" # if no min node count is provided it uses the scale settings for the cluster\n",
" compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)\n",
" # Get a detailed status for the current cluster. \n",
" print(compute_target.serialize())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# If you have created the compute target, you should see one entry named 'cpucluster' of type AmlCompute \n",
"# in the workspace's compute_targets property.\n",
"compute_targets = ws.compute_targets\n",
"for name, ct in compute_targets.items():\n",
" print(name, ct.type, ct.provisioning_state)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure Docker environment"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.runconfig import EnvironmentDefinition\n",
"from azureml.core.conda_dependencies import CondaDependencies\n",
"\n",
"env = EnvironmentDefinition()\n",
"env.python.user_managed_dependencies = False\n",
"env.python.conda_dependencies = CondaDependencies.create(conda_packages=['pandas', 'numpy', 'scipy', 'scikit-learn', 'lightgbm', 'joblib'],\n",
" python_version='3.6.2')\n",
"env.python.conda_dependencies.add_channel('conda-forge')\n",
"env.docker.enabled=True"
]
},
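{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (an optional sketch; `serialize_to_string` is the CondaDependencies helper for rendering the environment as YAML), you can print the conda environment that will be baked into the Docker image:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Print the conda environment specification as YAML\n",
"print(env.python.conda_dependencies.serialize_to_string())"
]
},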
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Upload data to default datastore\n",
"\n",
"Upload the Orange Juice dataset to the workspace's default datastore, which will later be mounted on the cluster for model training and validation. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ds = ws.get_default_datastore()\n",
"print(ds.datastore_type, ds.account_name, ds.container_name)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"path_on_datastore = 'data'\n",
"ds.upload(src_dir='../../data', target_path=path_on_datastore, overwrite=True, show_progress=True)"
]
},
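{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, you can verify the upload by pulling the files back from the datastore into a scratch folder (a minimal sketch; `./data_check` is an arbitrary local path used only for this check):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: download the uploaded files to verify they landed in the datastore\n",
"ds.download(target_path='./data_check', prefix=path_on_datastore, show_progress=True)"
]
},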
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Get data reference object for the data path\n",
"ds_data = ds.path(path_on_datastore)\n",
"print(ds_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create estimator\n",
"Next, we will check if the remote compute target is successfully created by submitting a job to the target. This compute target will be used by HyperDrive to tune the hyperparameters later. You may skip this part of code and directly jump into [Tune Hyperparameters using HyperDrive](#tune-hyperparameters-using-hyperdrive)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.runconfig import EnvironmentDefinition\n",
"from azureml.train.estimator import Estimator\n",
"\n",
"script_folder = './'\n",
"script_params = {\n",
" '--data-folder': ds_data.as_mount(),\n",
" '--bagging-fraction': 0.8\n",
"}\n",
"est = Estimator(source_directory=script_folder,\n",
" script_params=script_params,\n",
" compute_target=compute_target,\n",
" use_docker=True,\n",
" entry_script='train_validate.py',\n",
" environment_definition=env)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit job"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Submit job to compute target\n",
"run_remote = exp.submit(config=est)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Check job status"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.widgets import RunDetails\n",
"\n",
"RunDetails(run_remote).show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"run_remote.get_details()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Get metric value after the job finishes \n",
"while(run_remote.get_status() != 'Completed'): {}\n",
"run_remote.get_metrics()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='tune-hyperparameters-using-hyperdrive'></a>\n",
"## Tune Hyperparameters using HyperDrive"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.train.hyperdrive import *\n",
"\n",
"script_folder = './'\n",
"script_params = {\n",
" '--data-folder': ds_data.as_mount() \n",
"}\n",
"est = Estimator(source_directory=script_folder,\n",
" script_params=script_params,\n",
" compute_target=compute_target,\n",
" use_docker=True,\n",
" entry_script='train_validate.py',\n",
" environment_definition=env)\n",
"ps = BayesianParameterSampling({\n",
" '--num-leaves': quniform(8, 128, 1),\n",
" '--min-data-in-leaf': quniform(20, 500, 10),\n",
" '--learning-rate': choice(1e-4, 1e-3, 5e-3, 1e-2, 1.5e-2, 2e-2, 3e-2, 5e-2, 1e-1),\n",
" '--feature-fraction': uniform(0.2, 1), \n",
" '--bagging-fraction': uniform(0.1, 1), \n",
" '--bagging-freq': quniform(1, 20, 1), \n",
" '--max-rounds': quniform(50, 2000, 10),\n",
" '--max-lag': quniform(3, 40, 1), \n",
" '--window-size': quniform(3, 40, 1), \n",
"})\n",
"htc = HyperDriveRunConfig(estimator=est, \n",
" hyperparameter_sampling=ps, \n",
" primary_metric_name='MAPE', \n",
" primary_metric_goal=PrimaryMetricGoal.MINIMIZE, \n",
" max_total_runs=200,\n",
" max_concurrent_runs=4)\n",
"htr = exp.submit(config=htc)"
]
},
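{
"cell_type": "markdown",
"metadata": {},
"source": [
"While the sweep is running (or after it finishes), the individual child runs and their logged MAPE values can be listed directly (a small optional sketch using the generic `Run.get_children` API; a child that has not reported a metric yet simply shows `None`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# List child runs of the HyperDrive sweep with their validation MAPE\n",
"for child in htr.get_children():\n",
"    print(child.id, child.get_metrics().get('MAPE'))"
]
},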
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"RunDetails(htr).show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"while(htr.get_status() != 'Completed'): {}\n",
"htr.get_metrics()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"best_run = htr.get_best_run_by_primary_metric()\n",
"parameter_values = best_run.get_details()['runDefinition']['Arguments']\n",
"print(parameter_values)"
]
}
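,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since the argument list alternates between flag and value, it can be folded into a dictionary for readability (a small convenience sketch; `parameter_dict` is a name introduced here, not part of the benchmark code):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Pair up flags and values, e.g. ['--num-leaves', '64', ...] -> {'--num-leaves': '64', ...}\n",
"parameter_dict = dict(zip(parameter_values[::2], parameter_values[1::2]))\n",
"print(parameter_dict)"
]
}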
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}


@@ -1,6 +1,6 @@
# coding: utf-8
# Create input features for the boosted decision tree model.
# Create input features for the boosted decision tree model.
import os
import sys
@@ -9,9 +9,9 @@ import itertools
import datetime
import numpy as np
import pandas as pd
import lightgbm as lgb
import lightgbm as lgb
# Append TSPerf path to sys.path
# Append TSPerf path to sys.path
tsperf_dir = os.getcwd()
if tsperf_dir not in sys.path:
sys.path.append(tsperf_dir)
@@ -20,6 +20,7 @@ if tsperf_dir not in sys.path:
from utils import *
import retail_sales.OrangeJuice_Pt_3Weeks_Weekly.common.benchmark_settings as bs
def lagged_features(df, lags):
"""Create lagged features based on time series data.
@@ -33,11 +34,12 @@ def lagged_features(df, lags):
df_list = []
for lag in lags:
df_shifted = df.shift(lag)
df_shifted.columns = [x + '_lag' + str(lag) for x in df_shifted.columns]
df_shifted.columns = [x + "_lag" + str(lag) for x in df_shifted.columns]
df_list.append(df_shifted)
fea = pd.concat(df_list, axis=1)
return fea
def moving_averages(df, start_step, window_size=None):
"""Compute averages of every feature over moving time windows.
@@ -49,12 +51,13 @@ def moving_averages(df, start_step, window_size=None):
Returns:
fea (Dataframe): Dataframe consisting of the moving averages
"""
if window_size == None: # Use a large window to compute average over all historical data
if window_size == None: # Use a large window to compute average over all historical data
window_size = df.shape[0]
fea = df.shift(start_step).rolling(min_periods=1, center=False, window=window_size).mean()
fea.columns = fea.columns + '_mean'
fea.columns = fea.columns + "_mean"
return fea
def combine_features(df, lag_fea, lags, window_size, used_columns):
"""Combine different features for a certain store-brand.
@@ -73,6 +76,7 @@ def combine_features(df, lag_fea, lags, window_size, used_columns):
fea_all = pd.concat([df[used_columns], lagged_fea, moving_avg], axis=1)
return fea_all
def make_features(pred_round, train_dir, lags, window_size, offset, used_columns, store_list, brand_list):
"""Create a dataframe of the input features.
@@ -88,46 +92,59 @@ def make_features(pred_round, train_dir, lags, window_size, offset, used_columns
Returns:
features (Dataframe): Dataframe including all the input features and target variable
"""
"""
# Load training data
train_df = pd.read_csv(os.path.join(train_dir, 'train_round_'+str(pred_round+1)+'.csv'))
train_df['move'] = train_df['logmove'].apply(lambda x: round(math.exp(x)))
train_df = train_df[['store', 'brand', 'week', 'move']]
train_df = pd.read_csv(os.path.join(train_dir, "train_round_" + str(pred_round + 1) + ".csv"))
train_df["move"] = train_df["logmove"].apply(lambda x: round(math.exp(x)))
train_df = train_df[["store", "brand", "week", "move"]]
# Create a dataframe to hold all necessary data
week_list = range(bs.TRAIN_START_WEEK + offset, bs.TEST_END_WEEK_LIST[pred_round]+1)
d = {'store': store_list,
'brand': brand_list,
'week': week_list}
week_list = range(bs.TRAIN_START_WEEK + offset, bs.TEST_END_WEEK_LIST[pred_round] + 1)
d = {"store": store_list, "brand": brand_list, "week": week_list}
data_grid = df_from_cartesian_product(d)
data_filled = pd.merge(data_grid, train_df, how='left',
on=['store', 'brand', 'week'])
data_filled = pd.merge(data_grid, train_df, how="left", on=["store", "brand", "week"])
# Get future price, deal, and advertisement info
aux_df = pd.read_csv(os.path.join(train_dir, 'aux_round_'+str(pred_round+1)+'.csv'))
data_filled = pd.merge(data_filled, aux_df, how='left',
on=['store', 'brand', 'week'])
aux_df = pd.read_csv(os.path.join(train_dir, "aux_round_" + str(pred_round + 1) + ".csv"))
data_filled = pd.merge(data_filled, aux_df, how="left", on=["store", "brand", "week"])
# Create relative price feature
price_cols = ['price1', 'price2', 'price3', 'price4', 'price5', 'price6', 'price7', 'price8', \
'price9', 'price10', 'price11']
data_filled['price'] = data_filled.apply(lambda x: x.loc['price' + str(int(x.loc['brand']))], axis=1)
data_filled['avg_price'] = data_filled[price_cols].sum(axis=1).apply(lambda x: x / len(price_cols))
data_filled['price_ratio'] = data_filled['price'] / data_filled['avg_price']
data_filled.drop(price_cols, axis=1, inplace=True)
price_cols = [
"price1",
"price2",
"price3",
"price4",
"price5",
"price6",
"price7",
"price8",
"price9",
"price10",
"price11",
]
data_filled["price"] = data_filled.apply(lambda x: x.loc["price" + str(int(x.loc["brand"]))], axis=1)
data_filled["avg_price"] = data_filled[price_cols].sum(axis=1).apply(lambda x: x / len(price_cols))
data_filled["price_ratio"] = data_filled["price"] / data_filled["avg_price"]
data_filled.drop(price_cols, axis=1, inplace=True)
# Fill missing values
data_filled = data_filled.groupby(['store', 'brand']).apply(lambda x: x.fillna(method='ffill').fillna(method='bfill'))
data_filled = data_filled.groupby(["store", "brand"]).apply(
lambda x: x.fillna(method="ffill").fillna(method="bfill")
)
# Create datetime features
data_filled['week_start'] = data_filled['week'].apply(lambda x: bs.FIRST_WEEK_START + datetime.timedelta(days=(x-1)*7))
data_filled['year'] = data_filled['week_start'].apply(lambda x: x.year)
data_filled['month'] = data_filled['week_start'].apply(lambda x: x.month)
data_filled['week_of_month'] = data_filled['week_start'].apply(lambda x: week_of_month(x))
data_filled['day'] = data_filled['week_start'].apply(lambda x: x.day)
data_filled.drop('week_start', axis=1, inplace=True)
data_filled["week_start"] = data_filled["week"].apply(
lambda x: bs.FIRST_WEEK_START + datetime.timedelta(days=(x - 1) * 7)
)
data_filled["year"] = data_filled["week_start"].apply(lambda x: x.year)
data_filled["month"] = data_filled["week_start"].apply(lambda x: x.month)
data_filled["week_of_month"] = data_filled["week_start"].apply(lambda x: week_of_month(x))
data_filled["day"] = data_filled["week_start"].apply(lambda x: x.day)
data_filled.drop("week_start", axis=1, inplace=True)
# Create other features (lagged features, moving averages, etc.)
features = data_filled.groupby(['store','brand']).apply(lambda x: combine_features(x, ['move'], lags, window_size, used_columns))
features = data_filled.groupby(["store", "brand"]).apply(
lambda x: combine_features(x, ["move"], lags, window_size, used_columns)
)
return features
return features
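
For quick reference, a minimal usage sketch of the two helpers above, assuming a toy single-column dataframe (`lagged_features` and `moving_averages` are the functions defined in this file):

import pandas as pd

df = pd.DataFrame({"move": [10, 12, 9, 14, 11]})
lag_fea = lagged_features(df, lags=[1, 2])                  # columns: move_lag1, move_lag2
avg_fea = moving_averages(df, start_step=2, window_size=3)  # column: move_mean
print(pd.concat([df, lag_fea, avg_fea], axis=1))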


@@ -0,0 +1,201 @@
# coding: utf-8
# Create input features for the boosted decision tree model.
import os
import sys
import math
import datetime
import pandas as pd
from sklearn.pipeline import Pipeline
from common.features.lag import LagFeaturizer
from common.features.rolling_window import RollingWindowFeaturizer
from common.features.stats import PopularityFeaturizer
from common.features.temporal import TemporalFeaturizer
# Append TSPerf path to sys.path
tsperf_dir = os.getcwd()
if tsperf_dir not in sys.path:
sys.path.append(tsperf_dir)
# Import TSPerf components
from utils import df_from_cartesian_product
import retail_sales.OrangeJuice_Pt_3Weeks_Weekly.common.benchmark_settings as bs
pd.set_option("display.max_columns", None)
def oj_preprocess(df, aux_df, week_list, store_list, brand_list, train_df=None):
df["move"] = df["logmove"].apply(lambda x: round(math.exp(x)))
df = df[["store", "brand", "week", "move"]].copy()
# Create a dataframe to hold all necessary data
d = {"store": store_list, "brand": brand_list, "week": week_list}
data_grid = df_from_cartesian_product(d)
data_filled = pd.merge(data_grid, df, how="left", on=["store", "brand", "week"])
# Get future price, deal, and advertisement info
data_filled = pd.merge(data_filled, aux_df, how="left", on=["store", "brand", "week"])
# Fill missing values
if train_df is not None:
data_filled = pd.concat([train_df, data_filled])
forecast_creation_time = train_df["week_start"].max()
data_filled = data_filled.groupby(["store", "brand"]).apply(
lambda x: x.fillna(method="ffill").fillna(method="bfill")
)
data_filled["week_start"] = data_filled["week"].apply(
lambda x: bs.FIRST_WEEK_START + datetime.timedelta(days=(x - 1) * 7)
)
if train_df is not None:
data_filled = data_filled.loc[data_filled["week_start"] > forecast_creation_time].copy()
return data_filled
def make_features(
pred_round, train_dir, lags, window_size, offset, used_columns, store_list, brand_list,
):
"""Create a dataframe of the input features.
Args:
pred_round (Integer): Prediction round
train_dir (String): Path of the training data directory
lags (Numpy Array): Numpy array including all the lags
window_size (Integer): Maximum step for computing the moving average
offset (Integer): Length of training data skipped in the retraining
used_columns (List): A list of names of columns used in model training
(including target variable)
store_list (Numpy Array): List of all the store IDs
brand_list (Numpy Array): List of all the brand IDs
Returns:
features (Dataframe): Dataframe including all the input features and
target variable
"""
# Load training data
train_df = pd.read_csv(os.path.join(train_dir, "train_round_" + str(pred_round + 1) + ".csv"))
aux_df = pd.read_csv(os.path.join(train_dir, "aux_round_" + str(pred_round + 1) + ".csv"))
week_list = range(bs.TRAIN_START_WEEK + offset, bs.TEST_END_WEEK_LIST[pred_round] + 1)
train_df_preprocessed = oj_preprocess(train_df, aux_df, week_list, store_list, brand_list)
df_config = {
"time_col_name": "week_start",
"ts_id_col_names": ["brand", "store"],
"target_col_name": "move",
"frequency": "W",
"time_format": "%Y-%m-%d",
}
temporal_featurizer = TemporalFeaturizer(df_config=df_config, feature_list=["month_of_year", "week_of_month"])
popularity_featurizer = PopularityFeaturizer(
df_config=df_config,
id_col_name="brand",
data_format="wide",
feature_col_name="price",
wide_col_names=[
"price1",
"price2",
"price3",
"price4",
"price5",
"price6",
"price7",
"price8",
"price9",
"price10",
"price11",
],
output_col_name="price_ratio",
return_feature_col=True,
)
lag_featurizer = LagFeaturizer(df_config=df_config, input_col_names="move", lags=lags, future_value_available=True,)
moving_average_featurizer = RollingWindowFeaturizer(
df_config=df_config,
input_col_names="move",
window_size=window_size,
window_args={"min_periods": 1, "center": False},
future_value_available=True,
rolling_gap=2,
)
feature_engineering_pipeline = Pipeline(
[
("temporal", temporal_featurizer),
("popularity", popularity_featurizer),
("lag", lag_featurizer),
("moving_average", moving_average_featurizer),
]
)
features = feature_engineering_pipeline.transform(train_df_preprocessed)
# Temporary code for result verification
features.rename(
mapper={
"move_lag_2": "move_lag2",
"move_lag_3": "move_lag3",
"move_lag_4": "move_lag4",
"move_lag_5": "move_lag5",
"move_lag_6": "move_lag6",
"move_lag_7": "move_lag7",
"move_lag_8": "move_lag8",
"move_lag_9": "move_lag9",
"move_lag_10": "move_lag10",
"move_lag_11": "move_lag11",
"move_lag_12": "move_lag12",
"move_lag_13": "move_lag13",
"move_lag_14": "move_lag14",
"move_lag_15": "move_lag15",
"move_lag_16": "move_lag16",
"move_lag_17": "move_lag17",
"move_lag_18": "move_lag18",
"move_lag_19": "move_lag19",
"month_of_year": "month",
},
axis=1,
inplace=True,
)
features = features[
[
"store",
"brand",
"week",
"week_of_month",
"month",
"deal",
"feat",
"move",
"price",
"price_ratio",
"move_lag2",
"move_lag3",
"move_lag4",
"move_lag5",
"move_lag6",
"move_lag7",
"move_lag8",
"move_lag9",
"move_lag10",
"move_lag11",
"move_lag12",
"move_lag13",
"move_lag14",
"move_lag15",
"move_lag16",
"move_lag17",
"move_lag18",
"move_lag19",
"move_mean",
]
]
return features
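
For reference, a hedged sketch of how `make_features` above would be invoked, mirroring the defaults of the training script later in this PR (the path, `store_list`, and `brand_list` are assumptions taken from the round-1 training file):

import os
import numpy as np
import pandas as pd

TRAIN_DIR = "retail_sales/OrangeJuice_Pt_3Weeks_Weekly/data/train"  # assumed path
train_df = pd.read_csv(os.path.join(TRAIN_DIR, "train_round_1.csv"))
store_list = train_df["store"].unique()
brand_list = train_df["brand"].unique()
lags = np.arange(2, 20)  # max_lag = 19, as in the training script
used_columns = ["store", "brand", "week", "week_of_month", "month",
                "deal", "feat", "move", "price", "price_ratio"]
features = make_features(0, TRAIN_DIR, lags, 40, 0, used_columns, store_list, brand_list)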


@@ -0,0 +1,137 @@
# coding: utf-8
# Train and score a boosted decision tree model using [LightGBM Python package](https://github.com/Microsoft/LightGBM) from Microsoft,
# which is a fast, distributed, high performance gradient boosting framework based on decision tree algorithms.
import os
import sys
import argparse
import numpy as np
import pandas as pd
import lightgbm as lgb
import warnings
warnings.filterwarnings("ignore")
# Append TSPerf path to sys.path
tsperf_dir = os.getcwd()
if tsperf_dir not in sys.path:
sys.path.append(tsperf_dir)
from make_features import make_features
import retail_sales.OrangeJuice_Pt_3Weeks_Weekly.common.benchmark_settings as bs
def make_predictions(df, model):
"""Predict sales with the trained GBM model.
Args:
df (Dataframe): Dataframe including all needed features
model (Model): Trained GBM model
Returns:
Dataframe including the predicted sales of every store-brand
"""
predictions = pd.DataFrame({"move": model.predict(df.drop("move", axis=1))})
predictions["move"] = predictions["move"].apply(lambda x: round(x))
return pd.concat([df[["brand", "store", "week"]].reset_index(drop=True), predictions], axis=1)
if __name__ == "__main__":
# Parse input arguments
parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=int, dest="seed", default=1, help="Random seed of GBM model")
parser.add_argument("--num-leaves", type=int, dest="num_leaves", default=124, help="# of leaves of the tree")
parser.add_argument(
"--min-data-in-leaf", type=int, dest="min_data_in_leaf", default=340, help="minimum # of samples in each leaf"
)
parser.add_argument("--learning-rate", type=float, dest="learning_rate", default=0.1, help="learning rate")
parser.add_argument(
"--feature-fraction",
type=float,
dest="feature_fraction",
default=0.65,
help="ratio of features used in each iteration",
)
parser.add_argument(
"--bagging-fraction",
type=float,
dest="bagging_fraction",
default=0.87,
help="ratio of samples used in each iteration",
)
parser.add_argument("--bagging-freq", type=int, dest="bagging_freq", default=19, help="bagging frequency")
parser.add_argument("--max-rounds", type=int, dest="max_rounds", default=940, help="# of boosting iterations")
parser.add_argument("--max-lag", type=int, dest="max_lag", default=19, help="max lag of unit sales")
parser.add_argument(
"--window-size", type=int, dest="window_size", default=40, help="window size of moving average of unit sales"
)
args = parser.parse_args()
print(args)
# Data paths
DATA_DIR = os.path.join(tsperf_dir, "retail_sales", "OrangeJuice_Pt_3Weeks_Weekly", "data")
SUBMISSION_DIR = os.path.join(tsperf_dir, "retail_sales", "OrangeJuice_Pt_3Weeks_Weekly", "submissions", "LightGBM")
TRAIN_DIR = os.path.join(DATA_DIR, "train")
# Parameters of GBM model
params = {
"objective": "mape",
"num_leaves": args.num_leaves,
"min_data_in_leaf": args.min_data_in_leaf,
"learning_rate": args.learning_rate,
"feature_fraction": args.feature_fraction,
"bagging_fraction": args.bagging_fraction,
"bagging_freq": args.bagging_freq,
"num_rounds": args.max_rounds,
"early_stopping_rounds": 125,
"num_threads": 4,
"seed": args.seed,
}
# Lags and categorical features
lags = np.arange(2, args.max_lag + 1)
used_columns = ["store", "brand", "week", "week_of_month", "month", "deal", "feat", "move", "price", "price_ratio"]
categ_fea = ["store", "brand", "deal"]
# Get unique stores and brands
train_df = pd.read_csv(os.path.join(TRAIN_DIR, "train_round_1.csv"))
store_list = train_df["store"].unique()
brand_list = train_df["brand"].unique()
# Train and predict for all forecast rounds
pred_all = []
metric_all = []
for r in range(bs.NUM_ROUNDS):
print("---- Round " + str(r + 1) + " ----")
# Create features
features = make_features(r, TRAIN_DIR, lags, args.window_size, 0, used_columns, store_list, brand_list)
train_fea = features[features.week <= bs.TRAIN_END_WEEK_LIST[r]].reset_index(drop=True)
# Drop rows with NaN values
train_fea.dropna(inplace=True)
# Create training set
dtrain = lgb.Dataset(train_fea.drop("move", axis=1, inplace=False), label=train_fea["move"])
if r % 3 == 0:
# Train GBM model
print("Training model...")
bst = lgb.train(params, dtrain, valid_sets=[dtrain], categorical_feature=categ_fea, verbose_eval=False)
# Generate forecasts
print("Making predictions...")
test_fea = features[features.week >= bs.TEST_START_WEEK_LIST[r]].reset_index(drop=True)
pred = make_predictions(test_fea, bst).sort_values(by=["store", "brand", "week"]).reset_index(drop=True)
# Additional columns required by the submission format
pred["round"] = r + 1
pred["weeks_ahead"] = pred["week"] - bs.TRAIN_END_WEEK_LIST[r]
# Keep the predictions
pred_all.append(pred)
# Generate submission
submission = pd.concat(pred_all, axis=0)
submission.rename(columns={"move": "prediction"}, inplace=True)
submission = submission[["round", "store", "brand", "week", "weeks_ahead", "prediction"]]
filename = "submission_seed_" + str(args.seed) + ".csv"
submission.to_csv(os.path.join(SUBMISSION_DIR, filename), index=False)
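
A hedged post-hoc sketch for scoring a generated submission against actual sales, using MAPE as elsewhere in this benchmark (the `actuals.csv` file and its `actual` column are hypothetical placeholders for the ground-truth data):

import pandas as pd

submission = pd.read_csv("submission_seed_1.csv")
actuals = pd.read_csv("actuals.csv")  # hypothetical file with columns: round, store, brand, week, actual
merged = submission.merge(actuals, on=["round", "store", "brand", "week"])
mape = ((merged["prediction"] - merged["actual"]).abs() / merged["actual"]).mean() * 100
print("MAPE: {:.2f}%".format(mape))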


@@ -0,0 +1,241 @@
# coding: utf-8
# Perform cross validation of a boosted decision tree model on the training data of the 1st forecast round.
import os
import sys
import math
import argparse
import datetime
import itertools
import numpy as np
import pandas as pd
import lightgbm as lgb
from azureml.core import Run
from sklearn.model_selection import train_test_split
from utils import week_of_month, df_from_cartesian_product
def lagged_features(df, lags):
"""Create lagged features based on time series data.
Args:
df (Dataframe): Input time series data sorted by time
lags (List): Lag lengths
Returns:
fea (Dataframe): Lagged features
"""
df_list = []
for lag in lags:
df_shifted = df.shift(lag)
df_shifted.columns = [x + "_lag" + str(lag) for x in df_shifted.columns]
df_list.append(df_shifted)
fea = pd.concat(df_list, axis=1)
return fea
def moving_averages(df, start_step, window_size=None):
"""Compute averages of every feature over moving time windows.
Args:
df (Dataframe): Input features as a dataframe
start_step (Integer): Starting time step of rolling mean
window_size (Integer): Windows size of rolling mean
Returns:
fea (Dataframe): Dataframe consisting of the moving averages
"""
if window_size == None: # Use a large window to compute average over all historical data
window_size = df.shape[0]
fea = df.shift(start_step).rolling(min_periods=1, center=False, window=window_size).mean()
fea.columns = fea.columns + "_mean"
return fea
def combine_features(df, lag_fea, lags, window_size, used_columns):
"""Combine different features for a certain store-brand.
Args:
df (Dataframe): Time series data of a certain store-brand
lag_fea (List): A list of column names for creating lagged features
lags (Numpy Array): Numpy array including all the lags
window_size (Integer): Windows size of rolling mean
used_columns (List): A list of names of columns used in model training (including target variable)
Returns:
fea_all (Dataframe): Dataframe including all features for the specific store-brand
"""
lagged_fea = lagged_features(df[lag_fea], lags)
moving_avg = moving_averages(df[lag_fea], 2, window_size)
fea_all = pd.concat([df[used_columns], lagged_fea, moving_avg], axis=1)
return fea_all
def make_predictions(df, model):
"""Predict sales with the trained GBM model.
Args:
df (Dataframe): Dataframe including all needed features
model (Model): Trained GBM model
Returns:
Dataframe including the predicted sales of a certain store-brand
"""
predictions = pd.DataFrame({"move": model.predict(df.drop("move", axis=1))})
predictions["move"] = predictions["move"].apply(lambda x: round(x))
return pd.concat([df[["brand", "store", "week"]].reset_index(drop=True), predictions], axis=1)
if __name__ == "__main__":
# Parse input arguments
parser = argparse.ArgumentParser()
parser.add_argument("--data-folder", type=str, dest="data_folder", default=".", help="data folder mounting point")
parser.add_argument("--num-leaves", type=int, dest="num_leaves", default=64, help="# of leaves of the tree")
parser.add_argument(
"--min-data-in-leaf", type=int, dest="min_data_in_leaf", default=50, help="minimum # of samples in each leaf"
)
parser.add_argument("--learning-rate", type=float, dest="learning_rate", default=0.001, help="learning rate")
parser.add_argument(
"--feature-fraction",
type=float,
dest="feature_fraction",
default=1.0,
help="ratio of features used in each iteration",
)
parser.add_argument(
"--bagging-fraction",
type=float,
dest="bagging_fraction",
default=1.0,
help="ratio of samples used in each iteration",
)
parser.add_argument("--bagging-freq", type=int, dest="bagging_freq", default=1, help="bagging frequency")
parser.add_argument("--max-rounds", type=int, dest="max_rounds", default=400, help="# of boosting iterations")
parser.add_argument("--max-lag", type=int, dest="max_lag", default=10, help="max lag of unit sales")
parser.add_argument(
"--window-size", type=int, dest="window_size", default=10, help="window size of moving average of unit sales"
)
args = parser.parse_args()
args.feature_fraction = round(args.feature_fraction, 2)
args.bagging_fraction = round(args.bagging_fraction, 2)
print(args)
# Start an Azure ML run
run = Run.get_context()
# Data paths
DATA_DIR = args.data_folder
TRAIN_DIR = os.path.join(DATA_DIR, "train")
# Data and forecast problem parameters
TRAIN_START_WEEK = 40
TRAIN_END_WEEK_LIST = list(range(135, 159, 2))
TEST_START_WEEK_LIST = list(range(137, 161, 2))
TEST_END_WEEK_LIST = list(range(138, 162, 2))
# The start datetime of the first week in the record
FIRST_WEEK_START = pd.to_datetime("1989-09-14 00:00:00")
# Parameters of GBM model
params = {
"objective": "mape",
"num_leaves": args.num_leaves,
"min_data_in_leaf": args.min_data_in_leaf,
"learning_rate": args.learning_rate,
"feature_fraction": args.feature_fraction,
"bagging_fraction": args.bagging_fraction,
"bagging_freq": args.bagging_freq,
"num_rounds": args.max_rounds,
"early_stopping_rounds": 125,
"num_threads": 16,
}
# Lags and used column names
lags = np.arange(2, args.max_lag + 1)
used_columns = ["store", "brand", "week", "week_of_month", "month", "deal", "feat", "move", "price", "price_ratio"]
categ_fea = ["store", "brand", "deal"]
# Train and validate the model using only the first round data
r = 0
print("---- Round " + str(r + 1) + " ----")
# Load training data
train_df = pd.read_csv(os.path.join(TRAIN_DIR, "train_round_" + str(r + 1) + ".csv"))
train_df["move"] = train_df["logmove"].apply(lambda x: round(math.exp(x)))
train_df = train_df[["store", "brand", "week", "move"]]
# Create a dataframe to hold all necessary data
store_list = train_df["store"].unique()
brand_list = train_df["brand"].unique()
week_list = range(TRAIN_START_WEEK, TEST_END_WEEK_LIST[r] + 1)
d = {"store": store_list, "brand": brand_list, "week": week_list}
data_grid = df_from_cartesian_product(d)
data_filled = pd.merge(data_grid, train_df, how="left", on=["store", "brand", "week"])
# Get future price, deal, and advertisement info
aux_df = pd.read_csv(os.path.join(TRAIN_DIR, "aux_round_" + str(r + 1) + ".csv"))
data_filled = pd.merge(data_filled, aux_df, how="left", on=["store", "brand", "week"])
# Create relative price feature
price_cols = [
"price1",
"price2",
"price3",
"price4",
"price5",
"price6",
"price7",
"price8",
"price9",
"price10",
"price11",
]
data_filled["price"] = data_filled.apply(lambda x: x.loc["price" + str(int(x.loc["brand"]))], axis=1)
data_filled["avg_price"] = data_filled[price_cols].sum(axis=1).apply(lambda x: x / len(price_cols))
data_filled["price_ratio"] = data_filled["price"] / data_filled["avg_price"]
data_filled.drop(price_cols, axis=1, inplace=True)
# Fill missing values
data_filled = data_filled.groupby(["store", "brand"]).apply(
lambda x: x.fillna(method="ffill").fillna(method="bfill")
)
# Create datetime features
data_filled["week_start"] = data_filled["week"].apply(
lambda x: FIRST_WEEK_START + datetime.timedelta(days=(x - 1) * 7)
)
data_filled["year"] = data_filled["week_start"].apply(lambda x: x.year)
data_filled["month"] = data_filled["week_start"].apply(lambda x: x.month)
data_filled["week_of_month"] = data_filled["week_start"].apply(lambda x: week_of_month(x))
data_filled["day"] = data_filled["week_start"].apply(lambda x: x.day)
data_filled.drop("week_start", axis=1, inplace=True)
# Create other features (lagged features, moving averages, etc.)
features = data_filled.groupby(["store", "brand"]).apply(
lambda x: combine_features(x, ["move"], lags, args.window_size, used_columns)
)
train_fea = features[features.week <= TRAIN_END_WEEK_LIST[r]].reset_index(drop=True)
# Drop rows with NaN values
train_fea.dropna(inplace=True)
# Model training and validation
# Create a training/validation split
train_fea, valid_fea, train_label, valid_label = train_test_split(
train_fea.drop("move", axis=1, inplace=False), train_fea["move"], test_size=0.05, random_state=1
)
dtrain = lgb.Dataset(train_fea, train_label)
dvalid = lgb.Dataset(valid_fea, valid_label)
# A dictionary to record training results
evals_result = {}
# Train GBM model
bst = lgb.train(
params, dtrain, valid_sets=[dtrain, dvalid], categorical_feature=categ_fea, evals_result=evals_result
)
# Get final training loss & validation loss
train_loss = evals_result["training"]["mape"][-1]
valid_loss = evals_result["valid_1"]["mape"][-1]
print("Final training loss is {}".format(train_loss))
print("Final validation loss is {}".format(valid_loss))
# Log the validation loss/MAPE
run.log("MAPE", np.float(valid_loss) * 100)


@@ -1,9 +1,10 @@
# coding: utf-8
# Utility functions for building the boosted decision tree model.
# Utility functions for building the boosted decision tree model.
import pandas as pd
def week_of_month(dt):
"""Get the week of the month for the specified date.
@@ -12,15 +13,17 @@ def week_of_month(dt):
Returns:
wom (Integer): Week of the month of the input date
"""
"""
from math import ceil
first_day = dt.replace(day=1)
dom = dt.day
adjusted_dom = dom + first_day.weekday()
wom = int(ceil(adjusted_dom/7.0))
wom = int(ceil(adjusted_dom / 7.0))
return wom
def df_from_cartesian_product(dict_in):
def df_from_cartesian_product(dict_in):
"""Generate a Pandas dataframe from Cartesian product of lists.
Args:
@@ -31,7 +34,8 @@ def df_from_cartesian_product(dict_in):
"""
from collections import OrderedDict
from itertools import product
od = OrderedDict(sorted(dict_in.items()))
cart = list(product(*od.values()))
df = pd.DataFrame(cart, columns=od.keys())
return df
return df
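
A minimal usage sketch of the two utilities above (the example date is arbitrary):

from datetime import datetime

print(week_of_month(datetime(1990, 6, 14)))   # -> 3 (3rd week of June 1990)
d = {"store": [1, 2], "brand": [1, 2, 3], "week": [40, 41]}
print(df_from_cartesian_product(d))           # 12 rows: every store-brand-week combination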
