First Release of Forecasting Repo (#181)
* Handled edge case where ts_id_col_names is None
* Split long line into separate lines
* Added notebook template
* Added a test yml file
* Added yml file for python unit test pipeline
* Minor update
* Minor update
* Minor update
* Minor update
* Removed triggers
* Removed triggers
* Created a base ts estimator and inherited BaseTSFeaturizer from the BaseTSEstimator.
* Refactored featurizer class hierarchy.
* Added week of month method.
* add script to source entire
* formatting
* source only test files
* Inherit temporal featurizers from BaseTSFeaturizer.
* Minor update.
* Replaced max_test_timestamp with max_horizon
* Refactored rolling window featurizers.
* Renamed hour_of_year feature to normalized_hour_of_year
* Inherit all normalizers from base normalizer class.
* address review comments for the PR of contributing
* minor update
* address review comments for PR of r test pipeline
* add a test yml file
* Remove checking target column existence, because testing data may not have the target column.
* Create setter and getter of ts_id_col_names.
* Fixed bug caused by unexpected behavior of pandas.shift
* Some code cleanup.
* Updated some featurizer names.
* Some minor changes in df_config and feature configs.
* Some minor changes in feature names.
* Added usage examples in docstring.
* Computation time update after feature engineering refactoring.
* Removed setting frequency.
* Added docstring to convert_to_tsdf function.
* Removed frequency in convert_to_tsdf call.
* Fixed week_of_month function.
* Added popularity featurizer
* Added utility function for checking Iterable but not string.
* Updated LightGBM feature engineering code to use new feature engineering classes.
* Improved checking whether input column names are Iterable and convert to list.
* Made future_value_available a read-only property.
* Minor docstring update.
* Removed extra space in docstring examples.
* Made some methods staticmethods.
* Minor QRF result update after feature engineering code change.
* Removed calling of validate_file and added catching of the exception
* Update python_unit_tests_base.yml for Azure Pipelines [skip ci] Updated path of the test results
* Test if the download link is wrong
* Fixed minor format issues.
* Fixed minor format issues.
* Fixed formatting issues.
* Fixed line length.
* Removed data files before downloading and checked dimensions of energy data
* Removed the change made for testing
* Changed folder structure of tests and added table to show build status
* Added missing files
* Updated based on review comments
* new folder structure
* add repo metrics
* remove prototypes folder
* add models placeholder
* adjust featurizers to the new structure of folders
* changes in README and evaluation files
* adjust data download to new folders
* delete unnecessary files
* energy load baseline model with new folders
* delete data files
* fix links in benchmarks file
* fix bug
* adjust GBM, QRF and FNN submissions to the new folder structure
* Replace pd.to_timedelta with pd.offsets.
* Added get_offset_by_frequency helper function.
* fix small bugs
* fix small bugs
* Update TSCVSplitter.
* refactored high-level folders
* added a placeholder folder for PR/issue templates
* added subfolders under notebooks/
* updated tests folder
* renamed notebooks/ to examples/
* Update to CONTRIBUTING instructions (#34)
* style checking and formatting files
* git hook installation guide
* issue and PR templates
* minor change
* working with github instructions
* added specific issue templates
* addressed PR comments
* addressed Chenhui's comment
* addressing Chenhui's comments
* conda environment file (#36)
* conda environment file
* updated environment file
* updated instructions for installing conda env
* Vapaunic/lib (#37)
* initial core for forecasting library
* syncing with new structure
* __init__ files in modules
* renamed lib directory
* Added legal headers and some formatting of py files
* restructured benchmarking directory in lib
* fixed imports, warnings, legal headers
* more import fixes and legal headers
* updated instructions with package installation
* barebones library README
* moved energy benchmark to contrib
* formatting changes plus more legal headers
* Added license to the setup
* moved .swp file to contrib, not sure we need to keep it at all
* added missing headers and a brief snippet to README file
* minor wording change in readme
* Chenhui/cpu unit test pipeline (#38)
* address review comments
* added full conda path
* minor change
* added conda to PATH
* added build status in README
* removed energy data prep placeholder notebook
* moved out data energy explore notebook into contrib
* moved data download script to tools/
* Added getting started section to readme
* Added rbase and rbayesm to conda environment
* modified data download script
* added instructions for data download
* renamed data download script
* fixing issues with test pipeline
* parsing issue in yml file
* cleaning up ci test yaml file for more diagnostic info
* fixed a missing argument in instructions
* removed retail directory under dataset module
* moved feature_engineering.py to the feature engineering module
* moved evaluate.py to evaluation module
* combined benchmark settings into a single file
* moved download script to the package and modified the tests
* modified instructions
* fixed the build pipeline yml
* fix to the pipeline yml
* fix to the pipeline yml
* moved serve_folds into ojdata.py
* removed data_schema.py file as all content moved to ojdata.py
* fixed split_train_test in ojdata.py
* moved retail_data_schema into ojdata.py
* moved all oj utilities to ojdata.py
* removed paths from benchmark_settings
* fixed up a docstring
* quick fix for a typo
* removed benchmark_settings
* parameterized experiment settings
* refactored experiment settings
* Fixed docstrings
* addressed Chenhui's comment around round file naming
* renamed experiment to forecast settings
* Chenhui/light gbm quick start (#40)
* initial example notebook for lightgbm
* reduced to one round forecast
* added text
* added text
* added text
* moved week_of_month to feature engineering utils
* moved df_from_cartesian_product to feature utils
* moved functions to feature utils
* moved functions to feature utils
* added lightgbm model utils
* updated plots
* added text and renamed predict function
* reduced print out frequency in model training
* moved data visualization code to utils
* added text
* updated plot function and added docstring
* renamed the notebook
* updated text
* added NOTICE file, currently empty as we're not redistributing any packages
* Chenhui/add scrapbook (#43)
* added scrapbook support
* Added gitpython to environment.yml file
* added git_repo_path function to utils
* updated notebook
* added test for lightgbm notebook
* included testing of notebooks
* resolve test error
* resolve test error
* added kernel name
* updated kernel name
* trying installing bayesm from cmd
* trying installing bayesm from cmd
* trying installing bayesm from cmd
* excluded notebook test
* excluded notebook test
* added lapack.so link fix
* included notebook tests
* excluded files for notebook test

Co-authored-by: vapaunic <15053814+vapaunic@users.noreply.github.com>

* added integration test
* added initial data prep notebook
* updated notebook
* updated notebook
* updated notebook
* updated url
* init
* model parameters
* removed blank quick start notebooks
* removed blank modeling notebooks
* removed blank evaluation notebooks
* Removed blank model selection notebooks
* removed blank o16n notebooks
* removed outdated text from contrib/README
* removed outdated swp file
* updating .gitignore
* removed change log, as we don't plan to maintain this
* Excluding irrelevant directories
* fix settings
* separated out the setup guide
* fix settings
* simplemodel init
* typo
* add rproj file
* Renaming forecasting_lib to fclib (#59)
* renamed forecasting_lib directory
* modified references to forecasting_lib
* Vapaunic/envname (#61)
* renamed conda env
* modified setup instructions
* minor change in contributing guide
* keep top-level gitignore only
* formatting fixes
* Chenhui/add automl example (#62)
* added multiple linear models and example notebook for AutoML
* removed commented code
* address review comments
* minor update to the notebook
* minor update to the notebook
* added text
* changed types in lightgbm to be consistent with the rest of the code
* modified docstrings in multiple_linear_regression.py
* updated ci yaml files
* changed import statement in conftest.py
* updated gitpython version to the latest

Co-authored-by: vapaunic <15053814+vapaunic@users.noreply.github.com>

* Vapaunic/split bug (#65)
* fixed a yield bug
* removed two blank files
* modified split data function to auto-calculate the splits based on the parameters
* removed forecast_settings module
* removed unused parameter
* modified splitting function to use non-overlapping testing
* tested the split function after the update
* minor fix
* defaults changed in split function
* modified lightgbm example with new split function
* modified automl example (needs verification)
* modified data explore notebook
* quick fix
* updated data preparation notebook
* changed defaults in split function
* Addressed changes in lightgbm
* addressed issues in automl notebook
* fixed typo in lightgbm plot
* first images of time series split
* updated the pictures
* updated evaluation periods (#66)
* Chenhui/env setup script (#67)
* added a shell script for setting up environment
* changed yaml to yml
* added comments and updated SETUP.md
* modified data preparation notebook with images
* moved r exploration notebook to contrib directory
* modified data explore notebook, updated info about the data, and removed reference to TSPerf
* addressed review feedback and fixed the explore notebook
* Chenhui/multiround lightgbm (#68)
* added initial multiround notebook for lightgbm
* updated data splitting
* updated text
* updated week list
* addressed review comments
* added pyramid-automl to conda file
* first draft of arima notebook
* replace pyramid with pmdarima
* Added a complete function
* minor typo
* forecasting across many stores/brands
* complete arima notebook
* renamed data preparation/exploration notebooks
* added git clone to setup
* addressed PR comments
* typo
* Arima to ARIMA
* fixed docstring in plot function
* fixed a bug in MAPE calculation and added plotting
* fixed a bug in predict
* modeling arima on log scale
* Fixing AML Example Notebook (#84)
* Cleaning notebook output, adding get_or_create workspace call, and fixing get_or_create AmlCompute
* Add regression-based models (#64)
* modelling updates
* code tweak
* rebuild
* update mape
* update mape 2
* new forecasting structure
* update eval
* rebuild dataprep
* rebuild with profit
* rm profit
* add plot
* typo
* tidy up
* expand readme
* oops
* clarified setup guide (#94)
* Update SETUP (#95) minor fix
* Cleaned up unused files and directories (#96)
* removed non-used files
* moved docs into a docs/ dir
* fixed broken links
* Chenhui/dilated cnn example and utils (#76)
* added initial model util file for DCNN
* initial notebook
* added feature utils for DCNN
* updated evaluation and visualization
* removed plot function
* replaced PRED_HORIZON, PRED_STEPS by HORIZON, GAP
* removed log dir if it exists
* updated model utils
* generalized categorical features in dcnn model util
* generalized network definition
* update training code
* format with blackcellmagic
* address review comments and added README
* Chenhui/add ci tests (#146)
* Update conda env with versions (#99)
* 💥
* revert
* minor changes

Co-authored-by: Chenhui Hu <chenhhu@microsoft.com>

* Adding missing Jupyter Extension (#90)
* Update environment.yml
* specified version

Co-authored-by: Chenhui Hu <chenhhu@microsoft.com>

* fix links to examples/ (#104)
* Chenhui/rename notebooks and update automl notebook (#106)
* removed unused module
* added outputs in automl notebook
* fixed a notebook name
* Arima multi-round notebook (#91)
* working arima model
* final auto arima example
* added tqdm to requirements
* addressed review comments
* Revert "Chenhui/rename notebooks and update automl notebook (#106)" (#107)
  This reverts commit 032c91d9bfa389f22ae1f1f2150913a4f063bd18 [formerly 15d25213dc].

Co-authored-by: Chenhui Hu <chenhhu@microsoft.com>

* Fixing data download issue (#109)
* removed dependency on __file__ from data download, doesn't work in jupyter
* changed aux to auxdata
* fixed data download function
* fixed path
* auxdata -> auxi
* adding tl;dr directions for setup to README.md (#88)
* adding tl;dr directions for setup to README.md
* added a bit more text
* Cleaned up obsolete (tsperf) code in fclib (#112)
* moved out tsperf files from evaluation module
* moved out tsperf tuning code
* removed more unused files
* Addressing documentation related issues (#111)
* Added conda activate to the setup readme
* added instructions for starting jupyter to setup
* minor
* deleted duplicate instructions
* addressed PR comments
* Chenhui/rename notebooks and updated AutoML example (#108)
* removed unused module
* added outputs in automl notebook
* fixed a notebook name
* updated pytest file
* address review comments
* reran notebook with blackcellmagic
* adding pylint (#93)
* adding tl;dr directions for setup to README.md
* removing pylint hook and pylint_junit from the env file
* removed pylint config file
* Chenhui/update example folder (#115)
* restructure examples folder
* updated readme
* added readme
* minor update
* removed R folder
* minor change
* fixed a broken link
* another broken link
* fixing notebook tests
* Chenhui/fix aux file path (#118)
* fixed figure links
* changed to auxi_i.csv
* minor change
* [MINOR] Small changes to Arima notebooks (#121)
* fixed a broken link
* minor text changes
* Documentation (#120)
* added target audience section
* added intro on forecasting
* Added fclib documentation
* improved examples readme
* address comments
* added info about the dataset
* added items to be ignored (#123)
* added items to be ignored
* added *.log and score.py
* Chenhui/toplevel readme (#127)
* added content table
* added references
* added external repo links
* minor update
* Chenhui/tune deploy lgbm (#122)
* added notebook and utils
* updated readme links
* fix data path
* updated text
* group imports
* minor update
* using azureml utils to create workspace and compute (#126)
* using azureml utils to create workspace and compute
* group imports
* Download ojdata directly from github (#128)
* new function to download and load oj data directly from bayesm repo
* removed bayesm
* new R function to only load the data
* removed download R function
* minor fix
* added documentation to load_oj_data.R
* added requests to requirements
* fixed a syntax error (#130)
* fix setup.md link (#129)
* fix setup.md link
* mention related use cases
* Vapaunic/cgbuild (#133)
* added files to generate reqs.txt and the ci yml file
* Added notice generation task
* Checking if notice is there
* Update component_governance.yml for Azure Pipelines
* check in notice file
* Update component_governance.yml for Azure Pipelines
* fixed heading
* Chenhui/windows setup (#131)
* initial test
* added batch script and instructions
* align image to center
* adjust image size
* added text
* adjust image size
* address comments
* Readds R material (#116)
* redo R stuff in new dirs
* dirname fixup
* add Rproj file
* rebuild
* fixups
* roxygenise
* copyright notice
* dataprep
* updated yaml
* more updates
* more tweaks
* reg models
* update reg models
* more updates
* reword
* rendered prophet html
* name fix
* add lintr file
* move stuff
* renamed use case folder (#138)
* renamed use case folder
* dirname change
* updated readme
* added notebooks
* fix ci test
* Vapaunic/featutils (#137)
* moved feature engineering module to contrib
* removed lag submod
* cleaned up feature engineering
* rebuild R notebooks (#139)
* Chenhui/toplevel readme (#140)
* added content table
* added references
* added external repo links
* minor update
* updated setup instructions
* added text
* align text
* removed duplicated Content section
* address review comments
* Chenhui/hyperdrive example update (#142)
* removed blackcellmagic
* removed utils under aml_scripts and updated notebook
* added notebook path
* added ci test of lightgbm multi round example
* make forecast round as parameter
* Make -Agent Name
* resolve duplicated function name
* increased time limit and reduce number of rounds
* increase time limit
* added parameters tag to multiround lightgbm and dilatedcnn
* README change (#147)
* minor change
* hide tags
* hide tags
* added parameters tag
* Revert "Chenhui/add ci tests (#146)" (#149)
  This reverts commit de7a19cfa7637476b9ebfc92f5c18a26a8eca4da [formerly f8bd22733c].
* Chenhui/add ci tests (#150)
* Update conda env with versions (#99)
* 💥
* revert
* minor changes

Co-authored-by: Chenhui Hu <chenhhu@microsoft.com>

* Adding missing Jupyter Extension (#90)
* Update environment.yml
* specified version

Co-authored-by: Chenhui Hu <chenhhu@microsoft.com>

* fix links to examples/ (#104)
* Chenhui/rename notebooks and update automl notebook (#106)
* removed unused module
* added outputs in automl notebook
* fixed a notebook name
* Arima multi-round notebook (#91)
* working arima model
* final auto arima example
* added tqdm to requirements
* addressed review comments
* Revert "Chenhui/rename notebooks and update automl notebook (#106)" (#107)
  This reverts commit 032c91d9bfa389f22ae1f1f2150913a4f063bd18 [formerly 15d25213dc].

Co-authored-by: Chenhui Hu <chenhhu@microsoft.com>

* Fixing data download issue (#109)
* removed dependency on __file__ from data download, doesn't work in jupyter
* changed aux to auxdata
* fixed data download function
* fixed path
* auxdata -> auxi
* adding tl;dr directions for setup to README.md (#88)
* adding tl;dr directions for setup to README.md
* added a bit more text
* Cleaned up obsolete (tsperf) code in fclib (#112)
* moved out tsperf files from evaluation module
* moved out tsperf tuning code
* removed more unused files
* Addressing documentation related issues (#111)
* Added conda activate to the setup readme
* added instructions for starting jupyter to setup
* minor
* deleted duplicate instructions
* addressed PR comments
* Chenhui/rename notebooks and updated AutoML example (#108)
* removed unused module
* added outputs in automl notebook
* fixed a notebook name
* updated pytest file
* address review comments
* reran notebook with blackcellmagic
* adding pylint (#93)
* adding tl;dr directions for setup to README.md
* removing pylint hook and pylint_junit from the env file
* removed pylint config file
* Chenhui/update example folder (#115)
* restructure examples folder
* updated readme
* added readme
* minor update
* removed R folder
* minor change
* fixed a broken link
* another broken link
* fixing notebook tests
* Chenhui/fix aux file path (#118)
* fixed figure links
* changed to auxi_i.csv
* minor change
* [MINOR] Small changes to Arima notebooks (#121)
* fixed a broken link
* minor text changes
* Documentation (#120)
* added target audience section
* added intro on forecasting
* Added fclib documentation
* improved examples readme
* address comments
* added info about the dataset
* added items to be ignored (#123)
* added items to be ignored
* added *.log and score.py
* Chenhui/toplevel readme (#127)
* added content table
* added references
* added external repo links
* minor update
* Chenhui/tune deploy lgbm (#122)
* added notebook and utils
* updated readme links
* fix data path
* updated text
* group imports
* minor update
* using azureml utils to create workspace and compute (#126)
* using azureml utils to create workspace and compute
* group imports
* Download ojdata directly from github (#128)
* new function to download and load oj data directly from bayesm repo
* removed bayesm
* new R function to only load the data
* removed download R function
* minor fix
* added documentation to load_oj_data.R
* added requests to requirements
* fixed a syntax error (#130)
* fix setup.md link (#129)
* fix setup.md link
* mention related use cases
* Vapaunic/cgbuild (#133)
* added files to generate reqs.txt and the ci yml file
* Added notice generation task
* Checking if notice is there
* Update component_governance.yml for Azure Pipelines
* check in notice file
* Update component_governance.yml for Azure Pipelines
* fixed heading
* Chenhui/windows setup (#131)
* initial test
* added batch script and instructions
* align image to center
* adjust image size
* added text
* adjust image size
* address comments
* Readds R material (#116)
* redo R stuff in new dirs
* dirname fixup
* add Rproj file
* rebuild
* fixups
* roxygenise
* copyright notice
* dataprep
* updated yaml
* more updates
* more tweaks
* reg models
* update reg models
* more updates
* reword
* rendered prophet html
* name fix
* add lintr file
* move stuff
* renamed use case folder (#138)
* renamed use case folder
* dirname change
* updated readme
* added notebooks
* fix ci test
* Vapaunic/featutils (#137)
* moved feature engineering module to contrib
* removed lag submod
* cleaned up feature engineering
* rebuild R notebooks (#139)
* Chenhui/toplevel readme (#140)
* added content table
* added references
* added external repo links
* minor update
* updated setup instructions
* added text
* align text
* removed duplicated Content section
* address review comments
* Chenhui/hyperdrive example update (#142)
* removed blackcellmagic
* removed utils under aml_scripts and updated notebook
* added notebook path
* added ci test of lightgbm multi round example
* make forecast round as parameter
* Make -Agent Name
* resolve duplicated function name
* increased time limit and reduce number of rounds
* increase time limit
* added parameters tag to multiround lightgbm and dilatedcnn
* README change (#147)
* minor change
* hide tags
* hide tags
* added parameters tag
* Revert "Chenhui/add ci tests (#150)" (#151)
  This reverts commit 357453234088f2ebb8453bd8cd77527a1c6c2130 [formerly 21846168a7].
* Chenhui/Add CI tests for notebooks
  This reverts commit 8a99549da8b9096b65130fd2f6634e2a217b2dd9 [formerly 89e986fe2c].
* minor update
* Added CI tests for example notebooks
* Update component governance pipeline
* Update component governance pipeline
* add ignored items
* Readds R material (#116)
* Chenhui/windows setup (#131)
* Vapaunic/featutils (#137)
* Chenhui/add CI tests for notebooks
* Vapaunic/arimaint (#154)
* modified conftests to add arima
* added tests
* modified notebooks with parameters
* Chenhui/code improvments (#157)
* updated docstring
* pinned package versions
* minor improvements
* minor improvement
* modified metrics to take any iterable (#158)
* improvement: using Ray to parallelize arima fitting (#159)
* using Ray to parallelize arima fitting
* added ray as dependency
* text about ray, disable warnings, and minor stuff
* scipy 1.4.1 or above
* reverting scipy, azuremlsdk issue
* minor mod

Co-authored-by: Vanja Paunic <15053814+vapaunic@users.noreply.github.com>

* chenhui/improve ray output (#166)
* modified arima multiround to run with ray (#167)
* Chenhui/improve doc (#168)
* minor changes
* remove redundancy
* updated text
* improved text in model tuning and deployment notebook
* clarify the data used
* updated text
* added description of the script
* add explanation of gaps in the curve
* add explanation of gaps in the curve
* updated text
* fix typos
* improve documentation and format
* Addressing a few issues around package dependencies (#169)
* synchronizing utils with other OSS AI repos
* exclude xlrd, leftover from tsperf
* exclude urllib3, leftover from tsperf
* moving tqdm to fclib as only used by lib at the moment
* included fclib dependencies in requirements.txt
* lower bounded package versions that we don't need specific versions of
* lower bound gitpython
* Chenhui/improve checking of run completion (#170)
* Chenhui/added ray dashboard (#171)
* Chenhui/update diagram (#172)
* update multiround training diagram
* minor change
* update diagram and minor change
* Addressing doc related issues (#173)
* taking out inventory optimization link
* pulled contributing out of docs
* Chenhui/ray windows (#177)
* add util to check if module exists
* use ray if available or use sequential training
* updated text
* updated text
* reduce code redundancy
* Chenhui/setup scripts (#178)
* move ray to linux setup script
* remove duplicated azureml-sdk to avoid errors
* add ray to ci yaml files
* update azureml-sdk
* update manual setup instructions
* minor change
* Chenhui/content table (#179)
* update readme
* minor change
* minor update
* Chenhui/multiround arima (#180)
* use ray if it is installed
* update text and reran notebook
* add reference
* Chenhui/dilatedcnn windows (#184)
* resolve format issues
* update log path and tensorboard path
* remove subprocess import
* fix path
* change env name to resolve pipeline failures
* Chenhui/hyperdrive windows (#185)
* resolve format issues
* update log path and tensorboard path
* remove subprocess import
* fetch common utils from chenhui/dilatedcnn_windows
* update notebook
* removed explain module and added notebooks module
* get updated ci yml files
* updated kernel name
* Chenhui/enhancement (#186)
* modified module_path
* updated tensorboard section
* rerun notebook
* only submit local run if python path is found
* minor change and rerun notebook
* updated content section (#187)
* updated content section
* minor change
* address comments
* add links

Co-authored-by: Hong Lu <honglu@microsoft.com>
Co-authored-by: ZhouFang928 <ZhouFang928@users.noreply.github.com>
Co-authored-by: pechyony <pechyony@outlook.com>
Co-authored-by: Ubuntu <chenhui@chhdsvmnc6.hyjxgt1qggauhj0g0g2jh3guwb.bx.internal.cloudapp.net>
Co-authored-by: vapaunic <15053814+vapaunic@users.noreply.github.com>
Co-authored-by: Hong Ooi <hongooi@microsoft.com>
Co-authored-by: Daniel Ciborowski <dciborow@microsoft.com>
Co-authored-by: Markus Cozowicz <marcozo@microsoft.com>

Former-commit-id: 6098ecf68c
Parent: a409804093
Commit: 0607fd568f
@@ -0,0 +1,21 @@
+[flake8]
+max-line-length = 120
+max-complexity = 18
+select = B,C,E,F,W,T4,B9
+ignore =
+    # slice notation whitespace, invalid
+    E203
+    # too many leading ‘#’ for block comment
+    E266
+    # module level import not at top of file
+    E402
+    # line break before binary operator
+    W503
+    # blank line contains whitespace
+    W293
+    # line too long
+    E501
+    # trailing white spaces
+    W291
+    # missing white space after ,
+    E231
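For readers adapting the flake8 settings above, flake8 reads INI-style configuration, so the values can be inspected with Python's stdlib `configparser`. A minimal sketch (the flattened single-line `ignore` string below mirrors the codes listed above; the embedded text is illustrative, not the project file itself):

```python
import configparser

# Illustrative flake8-style config mirroring the settings above;
# the multi-line commented "ignore" list is flattened into one line here.
CONFIG_TEXT = """
[flake8]
max-line-length = 120
max-complexity = 18
select = B,C,E,F,W,T4,B9
ignore = E203, E266, E402, W503, W293, E501, W291, E231
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)
flake8 = parser["flake8"]

max_len = int(flake8["max-line-length"])
ignored = {code.strip() for code in flake8["ignore"].split(",")}

print(max_len)            # 120
print("E501" in ignored)  # True
```

This is handy when tooling needs to stay in sync with the linter, e.g. setting an editor's ruler to the same `max-line-length`.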
@@ -0,0 +1,25 @@
+### Description
+<!--- Describe your issue/bug/request in detail -->
+
+
+### In which platform does it happen?
+<!--- Describe the platform where the issue is happening (use a list if needed) -->
+<!--- For example: -->
+<!--- * Azure Ubuntu Data Science Virtual Machine. -->
+<!--- * Other platforms. -->
+
+
+### How do we replicate the issue?
+<!--- Please be specific as possible (use a list if needed). -->
+<!--- For example: -->
+<!--- * Create a conda environment for gpu -->
+<!--- * Run unit test `test_timer.py` -->
+<!--- * ... -->
+
+
+### Expected behavior (i.e. solution)
+<!--- For example: -->
+<!--- * The tests for the timer should pass successfully. -->
+
+
+### Other Comments
@@ -0,0 +1,27 @@
+---
+name: Bug report
+about: Create a report to help us improve
+title: "[BUG] "
+labels: 'bug'
+assignees: ''
+
+---
+
+### Description
+<!--- Describe your bug in detail -->
+
+
+### How do we replicate the bug?
+<!--- Please be specific as possible (use a list if needed). -->
+<!--- For example: -->
+<!--- * Create a conda environment for gpu -->
+<!--- * Run unit test `test_timer.py` -->
+<!--- * ... -->
+
+
+### Expected behavior (i.e. solution)
+<!--- For example: -->
+<!--- * The tests for the timer should pass successfully. -->
+
+
+### Other Comments
@@ -0,0 +1,19 @@
+---
+name: Feature request
+about: Suggest an idea for this project
+title: "[FEATURE] "
+labels: 'enhancement'
+assignees: ''
+
+---
+
+### Description
+<!--- Describe your expected feature in detail -->
+
+
+### Expected behavior with the suggested feature
+<!--- For example: -->
+<!--- * Adding algorithm xxx will help people understand more about xxx use case scenarios. -->
+
+
+### Other Comments
@@ -0,0 +1,14 @@
+---
+name: General ask
+about: Technical/non-technical asks about the repo
+title: "[ASK] "
+labels: ''
+assignees: ''
+
+---
+
+### Description
+<!--- Describe your general ask in detail -->
+
+
+### Other Comments
@@ -0,0 +1,15 @@
+### Description
+<!--- Describe your changes in detail -->
+<!--- Why is this change required? What problem does it solve? -->
+
+
+### Related Issues
+<!--- If it fixes an open issue, please link to the issue here. -->
+
+
+### Checklist:
+<!--- Go over all the following points, and put an `x` in all the boxes that apply. -->
+<!--- If you're unsure about any of these, don't hesitate to ask. We're here to help! -->
+- [ ] My code follows the code style of this project, as detailed in our [contribution guidelines](../CONTRIBUTING.md).
+- [ ] I have added tests.
+- [ ] I have updated the documentation accordingly.
@@ -1,5 +1,28 @@
-**/__pycache__
-**/.ipynb_checkpoints
-
-data/*
-energy_load/GEFCom2017-D_Prob_MT_hourly/data/*
+**/__pycache__
+**/.ipynb_checkpoints
+*.egg-info/
+.vscode/
+*.pkl
+*.h5
+
+# Data
+ojdata/*
+*.Rdata
+
+# AML Config
+aml_config/
+.azureml/
+.config/
+
+# Pytests
+.pytest_cache/
+
+# File for model deployment
+score.py
+
+# Environments
+myenv.yml
+
+# Logs
+logs/
+*.log
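As a rough illustration of how the ignore patterns above behave, Python's stdlib `fnmatch` approximates gitignore-style globbing. This is only a sketch: real gitignore matching adds rules (directory-only patterns, negation, `**`) that `fnmatch` does not implement, and the file paths below are made up:

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Hypothetical repo paths and a few patterns from the ignore list above.
paths = ["logs/run1.log", "model.pkl", "notebooks/demo.ipynb", "score.py"]
patterns = ["*.log", "*.pkl", "score.py"]

def roughly_ignored(path, pats):
    # Patterns without a slash match against the basename, as in gitignore.
    name = PurePosixPath(path).name
    return any(fnmatch(name, pat) for pat in pats)

ignored = [p for p in paths if roughly_ignored(p, patterns)]
print(ignored)  # ['logs/run1.log', 'model.pkl', 'score.py']
```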
@@ -0,0 +1,18 @@
+linters: with_defaults(
+    infix_spaces_linter = NULL,
+    spaces_left_parentheses_linter = NULL,
+    open_curly_linter = NULL,
+    line_length_linter = NULL,
+    camel_case_linter = NULL,
+    object_name_linter = NULL,
+    object_usage_linter = NULL,
+    object_length_linter = NULL,
+    trailing_blank_lines_linter = NULL,
+    absolute_paths_linter = NULL,
+    commented_code_linter = NULL,
+    implicit_integer_linter = NULL,
+    extraction_operator_linter = NULL,
+    single_quotes_linter = NULL,
+    pipe_continuation_linter = NULL,
+    cyclocomp_linter = NULL
+    )
@@ -0,0 +1,17 @@
+repos:
+  - repo: https://github.com/psf/black
+    rev: stable
+    hooks:
+      - id: black
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v1.2.3
+    hooks:
+      - id: flake8
+  - repo: local
+    hooks:
+      - id: jupytext
+        name: jupytext
+        entry: jupytext --from ipynb --pipe black --check flake8
+        pass_filenames: true
+        files: .ipynb
+        language: python
@@ -0,0 +1,139 @@
# Contribution Guidelines

Contributions are welcome! Here are a few things to know:

* [Setup](./SETUP.md)
* [Microsoft Contributor License Agreement](#microsoft-contributor-license-agreement)
* [Steps to Contributing](#steps-to-contributing)
* [Coding Guidelines](#coding-guidelines)
* [Code of Conduct](#code-of-conduct)


## Setup
To get started, navigate to the [Setup Guide](./SETUP.md), which lists instructions on how to set up your environment and dependencies.

## Microsoft Contributor License Agreement

Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.


## Steps to Contributing

Here are the basic steps to get started with your first contribution. Please reach out with any questions.
1. Use [open issues](https://github.com/Microsoft/Forecasting/issues) to discuss the proposed changes. Create an issue describing the changes if necessary to collect feedback. Also, please use the provided labels to tag issues so everyone can easily sort issues of interest.
2. [Fork the repo](https://help.github.com/articles/fork-a-repo/) so you can make and test local changes.
3. Create a new branch for the issue. We suggest prefixing the branch with your username followed by a descriptive title, e.g. chenhui/python_test_pipeline.
4. Make code changes.
5. Ensure unit tests pass and code style / formatting is consistent (see the [wiki](https://github.com/Microsoft/Recommenders/wiki/Coding-Guidelines#python-and-docstrings-style) for more details).
6. We use the [pre-commit](https://pre-commit.com/) package to run our pre-commit hooks, with the [black](https://github.com/ambv/black) formatter and [flake8](https://pypi.org/project/flake8/) for linting on each commit. To set up pre-commit on your machine, follow the steps below; note that you only need to run these steps the first time you use `pre-commit` for this project.

    * Update your conda environment; `pre-commit` is part of the yaml file, or just run
    ```
    $ pip install pre-commit
    ```
    * Set up `pre-commit` by running the following command; this will put pre-commit under your .git/hooks directory.
    ```
    $ pre-commit install
    ```
    > Note: Git hooks to install are specified in the pre-commit configuration file `.pre-commit-config.yaml`. Settings used by `black` and `flake8` are specified in the `pyproject.toml` and `.flake8` files, respectively.
    * When you've made changes to local files and are ready to commit, run
    ```
    $ git commit -m "message"
    ```
    * Each time you commit, git runs the pre-commit hooks on any Python files that are being committed and are part of the git index. If `black` modifies/formats a file, or if `flake8` finds any linting errors, the commit will not succeed. You will need to stage the file again if `black` changed it, or fix the issues identified by `flake8` and stage it again.

    * To run pre-commit on all files, just run
    ```
    $ pre-commit run --all-files
    ```

7. Create a pull request (PR) against the __`staging`__ branch.

We use the `staging` branch to land all new features, so please remember to create the pull request against `staging`. See the next section for more detail about [working with GitHub](#working-with-github).

Once the features included in a milestone are complete, we will merge `staging` into the `master` branch and make a release. See the wiki for more detail about our [merge strategy](https://github.com/Microsoft/Forecasting/wiki/Strategy-to-merge-the-code-to-master-branch).

### Working with GitHub

1. All development is done in a branch off from `staging` and named following this convention: `<user>/<topic>`.
To create a new branch, run this command:
```shell
$ git checkout -b <user>/<topic>
```

When done making the changes locally, push your branch to the server, but make sure to sync with the remote first.

```
$ git pull origin staging
$ git push origin <your branch>
```

2. To merge a new branch into the `staging` branch, please open a pull request.

3. The person who opens a PR should complete it, once it has been reviewed and all comments addressed.

4. We use *Squash and Merge* when completing PRs, to maintain a clean merge history on the repo.

5. When a branch is merged into `staging`, it must be deleted from the remote repository.

```shell
# Delete local branch
$ git branch -d <your branch>

# Delete remote branch
$ git push origin --delete <your branch>
```


## Coding Guidelines

We strive to maintain high-quality code to make it easy to understand, use, and extend. We also work hard to maintain a friendly and constructive environment. We've found that having clear expectations on the development process and consistent style helps to ensure everyone can contribute and collaborate effectively.

Please review the [coding guidelines](https://github.com/Microsoft/Recommenders/wiki/Coding-Guidelines) wiki page to see more details about the expectations for development approach and style.


## Code of Conduct

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).

For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

Apart from the official Code of Conduct developed by Microsoft, in the Forecasting team we adopt the following behaviors, to ensure a great working environment:

#### Do not point fingers
Let's be constructive.

<details>
<summary><em>Click here to see some examples</em></summary>

"This method is missing docstrings" instead of "YOU forgot to put docstrings".

</details>

#### Provide code feedback based on evidence

When making code reviews, try to support your ideas with evidence (papers, library documentation, StackOverflow, etc.) rather than personal preferences.

<details>
<summary><em>Click here to see some examples</em></summary>

"When reviewing this code, I saw that the Python implementation of the metrics is based on classes; however, [scikit-learn](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics) and [tensorflow](https://www.tensorflow.org/api_docs/python/tf/metrics) use functions. We should follow the standard in the industry."

</details>


#### Ask questions, do not give answers
Try to be empathetic.

<details>
<summary><em>Click here to see some examples</em></summary>

* Would it make more sense if ...?
* Have you considered this ... ?

</details>

@@ -0,0 +1,21 @@
MIT License

Copyright (c) Microsoft Corporation. All rights reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

@@ -0,0 +1,17 @@
NOTICES AND INFORMATION
Do Not Translate or Localize

This software incorporates material from third parties.
Microsoft makes certain open source code available at https://3rdpartysource.microsoft.com,
or you may send a check or money order for US $5.00, including the product name,
the open source component name, platform, and version number, to:

Source Code Compliance Team
Microsoft Corporation
One Microsoft Way
Redmond, WA 98052
USA

Notwithstanding any other terms, you may reverse engineer this software to the extent
required to debug changes to any libraries licensed under the GNU Lesser General Public License.

README.md
@@ -1,69 +1,98 @@
# TSPerf
# Forecasting Best Practices

TSPerf is a repository of time-series forecasting models with a comprehensive comparison of their performance over provided benchmark data sets, implemented on Azure. Model implementations are compared by forecasting accuracy, training and scoring time, and cost on Azure compute. Each implementation includes all the necessary instructions and tools that ensure its reproducibility. We envision TSPerf becoming a central repository of time-series forecasting that provides wide coverage of time-series algorithms, from the very simple to the state of the art in the industry. The roadmap of TSPerf can be found [here](docs/roadmap.md).
Time series forecasting is one of the most important topics in data science. Almost every business needs to predict the future in order to make better decisions and allocate resources more effectively.

This repository provides examples and best practice guidelines for building forecasting solutions. The goal of this repository is to build a comprehensive set of tools and examples that leverage recent advances in forecasting algorithms to build solutions and operationalize them. Rather than creating implementations from scratch, we draw from existing state-of-the-art libraries and build additional utilities around processing and featurizing the data, optimizing and evaluating models, and scaling up to the cloud.

The examples and best practices are provided as [Python Jupyter notebooks and R markdown files](examples) and [a library of utility functions](fclib). We hope that these examples and utilities can significantly reduce the "time to market" by simplifying the experience from defining the business problem to the development of solutions. In addition, the example notebooks serve as guidelines and showcase best practices and usage of the tools in a wide variety of languages.

The following table summarizes benchmarks that are currently included in TSPerf.

Benchmark | Dataset | Benchmark directory
--------------------------------------------|------------------------|---------------------------------------------
Probabilistic electricity load forecasting | GEFCom2017 | `energy_load/GEFCom2017-D_Prob_MT_Hourly`
Retail sales forecasting | Orange Juice dataset | `retail_sales/OrangeJuice_Pt_3Weeks_Weekly`

A complete documentation of TSPerf, along with the instructions for submitting and reviewing implementations, can be found [here](./docs/tsperf_rules.md). The tables below show the performance of implementations that have been developed so far. Source code of the implementations and instructions for reproducing their performance can be found in the submission folders, which are linked in the first column.

## Probabilistic energy forecasting performance board

The following table lists the current submissions for energy forecasting and their respective performances.

Submission Name | Pinball Loss | Training and Scoring Time (sec) | Training and Scoring Cost ($) | Architecture | Framework | Algorithm | Uni/Multivariate | External Feature Support
----------------|--------------|---------------------------------|-------------------------------|--------------|-----------|-----------|------------------|-------------------------
[Baseline](energy_load%2FGEFCom2017_D_Prob_MT_hourly%2Fsubmissions%2Fbaseline) | 84.11 | 444 | 0.0474 | Linux DSVM (Standard D8s v3 - Premium SSD) | quantreg package of R | Linear Quantile Regression | Multivariate | Yes
[GBM](energy_load%2FGEFCom2017_D_Prob_MT_hourly%2Fsubmissions%2FGBM) | 78.71 | 888 | 0.0947 | Linux DSVM (Standard D8s v3 - Premium SSD) | gbm package of R | Gradient Boosting Decision Tree | Multivariate | Yes
[QRF](energy_load%2FGEFCom2017_D_Prob_MT_hourly%2Fsubmissions%2Fqrf) | 76.48 | 22709 | 19.03 | Linux DSVM (F72s v2 - Premium SSD) | scikit-garden package of Python | Quantile Regression Forest | Multivariate | Yes
[FNN](energy_load%2FGEFCom2017_D_Prob_MT_hourly%2Fsubmissions%2Ffnn) | 79.27 | 4604 | 0.4911 | Linux DSVM (Standard D8s v3 - Premium SSD) | qrnn package of R | Quantile Regression Neural Network | Multivariate | Yes

The following chart compares the submissions' performance on accuracy in Pinball Loss vs. Training and Scoring cost in $:

![EnergyPBLvsTime](./docs/images/Energy-Cost.png)
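
The energy board above scores submissions by pinball loss. As a reference, here is a minimal pure-Python sketch of the metric for a single quantile; this is illustrative only, not the evaluation code used by TSPerf:

```python
def pinball_loss(actual, predicted, quantile):
    """Average pinball loss of a single-quantile forecast (lower is better).

    Per point: quantile * (y - f) when under-forecasting (y >= f),
    and (1 - quantile) * (f - y) when over-forecasting.
    """
    total = 0.0
    for y, f in zip(actual, predicted):
        total += quantile * (y - f) if y >= f else (1 - quantile) * (f - y)
    return total / len(actual)

# At the median (quantile 0.5) both error directions are penalized equally
print(pinball_loss([100, 120, 90], [110, 100, 90], 0.5))  # → 5.0
```

The benchmark averages this quantity over all requested quantiles and forecast points; at quantiles above 0.5 under-forecasting is penalized more heavily than over-forecasting, which is what makes the metric suitable for probabilistic forecasts.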

## Retail sales forecasting performance board

The following table lists the current submissions for retail forecasting and their respective performances.

Submission Name | MAPE (%) | Training and Scoring Time (sec) | Training and Scoring Cost ($) | Architecture | Framework | Algorithm | Uni/Multivariate | External Feature Support
----------------|----------|---------------------------------|-------------------------------|--------------|-----------|-----------|------------------|-------------------------
[Baseline](retail_sales%2FOrangeJuice_Pt_3Weeks_Weekly%2Fsubmissions%2Fbaseline) | 109.67 | 114.06 | 0.003 | Linux DSVM (Standard D2s v3 - Premium SSD) | forecast package of R | Naive Forecast | Univariate | No
[AutoARIMA](retail_sales%2FOrangeJuice_Pt_3Weeks_Weekly%2Fsubmissions%2FARIMA) | 70.80 | 265.94 | 0.0071 | Linux DSVM (Standard D2s v3 - Premium SSD) | forecast package of R | Auto ARIMA | Multivariate | Yes
[ETS](retail_sales%2FOrangeJuice_Pt_3Weeks_Weekly%2Fsubmissions%2FETS) | 70.99 | 277 | 0.01 | Linux DSVM (Standard D2s v3 - Premium SSD) | forecast package of R | ETS | Multivariate | No
[MeanForecast](retail_sales%2FOrangeJuice_Pt_3Weeks_Weekly%2Fsubmissions%2FMeanForecast) | 70.74 | 69.88 | 0.002 | Linux DSVM (Standard D2s v3 - Premium SSD) | forecast package of R | Mean forecast | Univariate | No
[SeasonalNaive](retail_sales%2FOrangeJuice_Pt_3Weeks_Weekly%2Fsubmissions%2FSeasonalNaive) | 165.06 | 160.45 | 0.004 | Linux DSVM (Standard D2s v3 - Premium SSD) | forecast package of R | Seasonal Naive | Univariate | No
[LightGBM](retail_sales%2FOrangeJuice_Pt_3Weeks_Weekly%2Fsubmissions%2FLightGBM) | 36.28 | 625.10 | 0.0167 | Linux DSVM (Standard D2s v3 - Premium SSD) | lightGBM package of Python | Gradient Boosting Decision Tree | Multivariate | Yes
[DilatedCNN](retail_sales%2FOrangeJuice_Pt_3Weeks_Weekly%2Fsubmissions%2FDilatedCNN) | 37.09 | 413 | 0.1032 | Ubuntu VM (NC6 - Standard HDD) | Keras and Tensorflow | Python + Dilated convolutional neural network | Multivariate | Yes
[RNN Encoder-Decoder](retail_sales%2FOrangeJuice_Pt_3Weeks_Weekly%2Fsubmissions%2FRNN) | 37.68 | 669 | 0.2 | Ubuntu VM (NC6 - Standard HDD) | Tensorflow | Python + Encoder-decoder architecture of recurrent neural network | Multivariate | Yes

The following chart compares the submissions' performance on accuracy in %MAPE vs. Training and Scoring cost in $:

![RetailMAPEvsCost](./docs/images/Retail-Cost.png)
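
The retail board uses MAPE, which is straightforward to compute. A minimal sketch, assuming all actual values are nonzero (true of the sales quantities on this board); again illustrative only, not the benchmark's evaluation code:

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent (lower is better)."""
    errors = [abs(y - f) / abs(y) for y, f in zip(actual, predicted)]
    return 100 * sum(errors) / len(errors)

# Each forecast is off by 50% of the actual, so MAPE is 50%
print(mape([100, 200], [150, 100]))  # → 50.0
```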
## Content

The following is a summary of models and methods for developing forecasting solutions covered in this repository. The [examples](examples) are organized according to use cases. Currently, we focus on a retail sales forecasting use case as it is widely used in [assortment planning](https://repository.upenn.edu/cgi/viewcontent.cgi?article=1569&context=edissertations), [inventory optimization](https://en.wikipedia.org/wiki/Inventory_optimization), and [price optimization](https://en.wikipedia.org/wiki/Price_optimization). To enable high-throughput forecasting scenarios, we have included examples for forecasting multiple time series with distributed training techniques such as Ray in Python, the parallel package in R, and multi-threading in LightGBM.

| Model | Language | Description |
|-------|----------|-------------|
| [Auto ARIMA](examples/grocery_sales/python/00_quick_start/autoarima_single_round.ipynb) | Python | Auto Regressive Integrated Moving Average (ARIMA) model that is automatically selected |
| [Linear Regression](examples/grocery_sales/python/00_quick_start/azure_automl_single_round.ipynb) | Python | Linear regression model trained on lagged features of the target variable and external features |
| [LightGBM](examples/grocery_sales/python/00_quick_start/lightgbm_single_round.ipynb) | Python | Gradient boosting decision tree implemented with the LightGBM package for high accuracy and fast speed |
| [DilatedCNN](examples/grocery_sales/python/02_model/dilatedcnn_multi_round.ipynb) | Python | Dilated Convolutional Neural Network that captures long-range temporal flow with dilated causal connections |
| [Mean Forecast](examples/grocery_sales/R/02_basic_models.Rmd) | R | Simple forecasting method based on the historical mean |
| [ARIMA](examples/grocery_sales/R/02a_reg_models.Rmd) | R | ARIMA model with or without external features |
| [ETS](examples/grocery_sales/R/02_basic_models.Rmd) | R | Exponential Smoothing algorithm with additive errors |
| [Prophet](examples/grocery_sales/R/02b_prophet_models.Rmd) | R | Automated forecasting procedure based on an additive model with non-linear trends |
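
The simplest methods in the tables above (mean forecast, seasonal naive) require no model fitting at all, which is why they make useful baselines. A rough pure-Python sketch of the two ideas (function names are ours for illustration and are not part of `fclib`):

```python
def mean_forecast(history, horizon):
    """Forecast every future step as the historical mean."""
    m = sum(history) / len(history)
    return [m] * horizon

def seasonal_naive(history, horizon, season_length):
    """Repeat the observation from exactly one season earlier."""
    return [history[-season_length + (h % season_length)] for h in range(horizon)]

sales = [10, 20, 30, 12, 22, 32]       # two "seasons" of length 3
print(mean_forecast(sales, 2))         # → [21.0, 21.0]
print(seasonal_naive(sales, 4, 3))     # → [12, 22, 32, 12]
```

Any learned model is expected to beat these baselines; when it does not, that usually signals a data or feature engineering problem rather than a modeling one.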

The repository also comes with AzureML-themed notebooks and best practices recipes to accelerate the development of scalable, production-grade forecasting solutions on Azure. In particular, we have the following examples for forecasting with Azure AutoML as well as tuning and deploying a forecasting model on Azure.

| Method | Language | Description |
|--------|----------|-------------|
| [Azure AutoML](examples/grocery_sales/python/00_quick_start/azure_automl_single_round.ipynb) | Python | AzureML service that automates the model development process and identifies the best machine learning pipeline |
| [HyperDrive](examples/grocery_sales/python/03_model_tune_deploy/azure_hyperdrive_lightgbm.ipynb) | Python | AzureML service for tuning hyperparameters of machine learning models in parallel on the cloud |
| [AzureML Web Service](examples/grocery_sales/python/03_model_tune_deploy/azure_hyperdrive_lightgbm.ipynb) | Python | AzureML service for deploying a model as a web service on Azure Container Instances |


## Getting Started in Python

To quickly get started with the repository on your local machine, use the following commands.

1. Install Anaconda with Python >= 3.6. [Miniconda](https://conda.io/miniconda.html) is a quick way to get started.

2. Clone the repository
    ```
    git clone https://github.com/microsoft/forecasting
    cd forecasting/
    ```

3. Run the setup scripts to create a conda environment. Please execute one of the following commands from the root of the Forecasting repo, based on your operating system.

    - Linux
    ```
    ./tools/environment_setup.sh
    ```

    - Windows
    ```
    tools\environment_setup.bat
    ```

    Note that for Windows you need to run the batch script from the Anaconda Prompt. The script creates a conda environment `forecasting_env` and installs the forecasting utility library `fclib`.

4. Start the Jupyter notebook server
    ```
    jupyter notebook
    ```

5. Run the [LightGBM single-round](examples/oj_retail/python/00_quick_start/lightgbm_single_round.ipynb) notebook under the `00_quick_start` folder. Make sure that the selected Jupyter kernel is `forecasting_env`.

If you have any issues with the above setup, or want to find more detailed instructions on how to set up your environment and run examples provided in the repository, on a local or remote machine, please navigate to the [Setup Guide](./docs/SETUP.md).

## Getting Started in R

We assume you already have R installed on your machine. If not, simply follow the [instructions on CRAN](https://cloud.r-project.org/) to download and install R.

The recommended editor is [RStudio](https://rstudio.com), which supports interactive editing and previewing of R notebooks. However, you can use any editor or IDE that supports RMarkdown. In particular, [Visual Studio Code](https://code.visualstudio.com) with the [R extension](https://marketplace.visualstudio.com/items?itemName=Ikuyadeu.r) can be used to edit and render the notebook files. The rendered `.nb.html` files can be viewed in any modern web browser.

The examples use the [Tidyverts](https://tidyverts.org) family of packages, a modern framework for time series analysis that builds on the widely used [Tidyverse](https://tidyverse.org) family. The Tidyverts framework is still under active development, so it's recommended that you update your packages regularly to get the latest bug fixes and features.

## Target Audience
Our target audience for this repository includes data scientists and machine learning engineers with varying levels of knowledge in forecasting, as our content is source-only and targets custom machine learning modelling. The utilities and examples provided are intended to be solution accelerators for real-world forecasting problems.

## Contributing
We hope that the open source community will contribute to the content and bring in the latest SOTA algorithms. This project welcomes contributions and suggestions. Before contributing, please see our [Contributing Guide](CONTRIBUTING.md).

## Reference

The following is a list of related repositories that you may find helpful.

| | |
|-|-|
| [Deep Learning for Time Series Forecasting](https://github.com/Azure/DeepLearningForTimeSeriesForecasting) | A collection of examples for using deep neural networks for time series forecasting with Keras. |
| [Microsoft AI Github](https://github.com/microsoft/ai) | Find other best practice projects and Azure AI designed patterns in our central repository. |


## Build Status
| Build | Branch | Status |
|-------|--------|--------|
| **Linux CPU** | master | [![Build Status](https://dev.azure.com/best-practices/forecasting/_apis/build/status/cpu_unit_tests_linux?branchName=master)](https://dev.azure.com/best-practices/forecasting/_build/latest?definitionId=128&branchName=master) |
| **Linux CPU** | staging | [![Build Status](https://dev.azure.com/best-practices/forecasting/_apis/build/status/cpu_unit_tests_linux?branchName=staging)](https://dev.azure.com/best-practices/forecasting/_build/latest?definitionId=128&branchName=staging) |

@@ -0,0 +1,36 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.

#' Creates a local background cluster for parallel computations
#'
#' @param ncores The number of nodes (cores) for the cluster. The default is 2 less than the number of physical cores.
#' @param libs The packages to load on each node, as a character vector.
#' @param useXDR For most platforms, this can be left at its default `FALSE` value.
#' @return
#' A cluster object.
make_cluster <- function(ncores=NULL, libs=character(0), useXDR=FALSE)
{
    if(is.null(ncores))
        ncores <- max(2, parallel::detectCores(logical=FALSE) - 2)
    cl <- parallel::makeCluster(ncores, type="PSOCK", useXDR=useXDR)
    res <- try(parallel::clusterCall(
        cl,
        function(libs)
        {
            for(lib in libs) library(lib, character.only=TRUE)
        },
        libs
    ), silent=TRUE)
    if(inherits(res, "try-error"))
        parallel::stopCluster(cl)
    else cl
}


#' Deletes a local background cluster
#'
#' @param cl The cluster object, as returned from `make_cluster`.
destroy_cluster <- function(cl)
{
    try(parallel::stopCluster(cl), silent=TRUE)
}

@@ -0,0 +1,50 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.

#' Computes forecast values on a dataset
#'
#' @param mable A mable (model table) as returned by `fabletools::model`.
#' @param newdata The dataset for which to compute forecasts.
#' @param ... Further arguments to `fabletools::forecast`.
#' @return
#' A tsibble, with one column per model type in `mable`, and one column named `.response` containing the response variable from `newdata`.
get_forecasts <- function(mable, newdata, ...)
{
    fcast <- forecast(mable, new_data=newdata, ...)
    keyvars <- key_vars(fcast)
    keyvars <- keyvars[-length(keyvars)]
    indexvar <- index_var(fcast)
    fcastvar <- as.character(attr(fcast, "response")[[1]])
    fcast <- fcast %>%
        as_tibble() %>%
        pivot_wider(
            id_cols=all_of(c(keyvars, indexvar)),
            names_from=.model,
            values_from=all_of(fcastvar))
    select(newdata, !!keyvars, !!indexvar, !!fcastvar) %>%
        rename(.response=!!fcastvar) %>%
        inner_join(fcast)
}


#' Evaluates quality of forecasts given a criterion
#'
#' @param fcast_df A tsibble as returned from `get_forecasts`.
#' @param gof A goodness-of-fit function. The default is to use `fabletools::MAPE`, which computes the mean absolute percentage error.
#' @return
#' A single-row data frame with the computed goodness-of-fit statistic for each model.
eval_forecasts <- function(fcast_df, gof=fabletools::MAPE)
{
    if(!is.function(gof))
        gof <- get(gof, mode="function")
    resp <- fcast_df$.response
    keyvars <- key_vars(fcast_df)
    indexvar <- index_var(fcast_df)
    fcast_df %>%
        as_tibble() %>%
        select(-all_of(c(keyvars, indexvar, ".response"))) %>%
        summarise_all(
            function(x, .actual) gof(x - .actual, .actual=.actual),
            .actual=resp
        )
}

@@ -0,0 +1,25 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.

#' Loads serialised objects relating to a given forecasting example into the current workspace
#'
#' @param example The particular forecasting example.
#' @param file The name of the file (with extension).
#' @return
#' This function is run for its side effect, namely loading the given file into the global environment.
load_objects <- function(example, file)
{
    examp_dir <- here::here("examples", example, "R")
    load(file.path(examp_dir, file), envir=globalenv())
}

#' Saves R objects for a forecasting example to a file
#'
#' @param ... Objects to save, as unquoted names.
#' @param example The particular forecasting example.
#' @param file The name of the file (with extension).
save_objects <- function(..., example, file)
{
    examp_dir <- here::here("examples", example, "R")
    save(..., file=file.path(examp_dir, file))
}
(Binary file not shown: image, 125 KiB)
(Binary file not shown: image, 52 KiB)
|
@@ -0,0 +1 @@
[Our Code of Conduct](https://opensource.microsoft.com/codeofconduct/faq/)

@ -1,112 +0,0 @@
#!/usr/bin/env python
# coding: utf-8

import csvtomd
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np


### Generating performance charts
#################################################

# Function to plot a performance chart
def plot_perf(x, y, df):
    # extract submission name from submission URL
    labels = df.apply(lambda x: x['Submission Name'][1:].split(']')[0], axis=1)

    fig = plt.scatter(x=df[x], y=df[y], label=labels, s=150, alpha=0.5,
                      c=['b', 'g', 'r', 'c', 'm', 'y', 'k'])
    plt.xlabel(x)
    plt.ylabel(y)
    plt.title(y + ' by ' + x)
    offset = (max(df[y]) - min(df[y])) / 50
    for i, name in enumerate(labels):
        ax = df[x][i]
        ay = df[y][i] + offset * (-2.5 + i % 5)
        plt.text(ax, ay, name, fontsize=10)

    return fig


### Printing the Readme.md file
############################################
readmefile = '../Readme.md'
# Write header
print('# TSPerf\n', file=open(readmefile, "w"))

print('TSPerf is a collection of implementations of time-series forecasting algorithms in Azure cloud and comparison of their performance over benchmark datasets. \
Algorithm implementations are compared by model accuracy, training and scoring time and cost. Each implementation includes all the necessary \
instructions and tools that ensure its reproducibility.', file=open(readmefile, "a"))

print('The following table summarizes benchmarks that are currently included in TSPerf.\n', file=open(readmefile, "a"))

# Read the benchmark table from the CSV file and convert it to a table in md format
with open('Benchmarks.csv', 'r') as f:
    table = csvtomd.csv_to_table(f, ',')
print(csvtomd.md_table(table), file=open(readmefile, "a"))
print('\n\n\n', file=open(readmefile, "a"))

print('A complete documentation of TSPerf, along with the instructions for submitting and reviewing implementations, \
can be found [here](./docs/tsperf_rules.md). The tables below show performance of implementations that are developed so far. Source code of \
implementations and instructions for reproducing their performance can be found in submission folders, which are linked in the first column.\n', file=open(readmefile, "a"))

### Write the Energy section
#============================

print('## Probabilistic energy forecasting performance board\n\n', file=open(readmefile, "a"))
print('The following table lists the current submissions for the energy forecasting and their respective performances.\n\n', file=open(readmefile, "a"))

# Read the energy performance board from the CSV file and convert it to a table in md format
with open('TSPerfBoard-Energy.csv', 'r') as f:
    table = csvtomd.csv_to_table(f, ',')
print(csvtomd.md_table(table), file=open(readmefile, "a"))

# Read Energy Performance Board CSV file
df = pd.read_csv('TSPerfBoard-Energy.csv', engine='python')

# Plot 'Pinball Loss' by 'Training and Scoring Cost($)' chart
fig4 = plt.figure(figsize=(12, 8), dpi=80, facecolor='w', edgecolor='k')  # this sets the plotting area size
fig4 = plot_perf('Training and Scoring Cost($)', 'Pinball Loss', df)
plt.savefig('../docs/images/Energy-Cost.png')

# Insert the performance charts
print('\n\nThe following chart compares the submissions performance on accuracy in Pinball Loss vs. Training and Scoring cost in $:\n\n ', file=open(readmefile, "a"))
print('![EnergyPBLvsTime](./docs/images/Energy-Cost.png)', file=open(readmefile, "a"))
print('\n\n\n', file=open(readmefile, "a"))

# Print the retail sales forecasting section
#========================================
print('## Retail sales forecasting performance board\n\n', file=open(readmefile, "a"))
print('The following table lists the current submissions for the retail forecasting and their respective performances.\n\n', file=open(readmefile, "a"))

# Read the retail performance board from the CSV file and convert it to a table in md format
with open('TSPerfBoard-Retail.csv', 'r') as f:
    table = csvtomd.csv_to_table(f, ',')
print(csvtomd.md_table(table), file=open(readmefile, "a"))
print('\n\n\n', file=open(readmefile, "a"))

# Read Retail Performance Board CSV file
df = pd.read_csv('TSPerfBoard-Retail.csv', engine='python')

# Plot MAPE (%) by Training and Scoring Cost ($) chart
fig2 = plt.figure(figsize=(12, 8), dpi=80, facecolor='w', edgecolor='k')  # this sets the plotting area size
fig2 = plot_perf('Training and Scoring Cost ($)', 'MAPE (%)', df)
plt.savefig('../docs/images/Retail-Cost.png')

# Insert the performance charts
print('\n\nThe following chart compares the submissions performance on accuracy in %MAPE vs. Training and Scoring cost in $:\n\n ', file=open(readmefile, "a"))
print('![EnergyPBLvsTime](./docs/images/Retail-Cost.png)', file=open(readmefile, "a"))
print('\n\n\n', file=open(readmefile, "a"))


print('A new Readme.md file has been generated successfully.')
@ -1,17 +0,0 @@
name: tsperf
channels:
  - defaults
  - r
  - conda-forge
dependencies:
  - python=3.6
  - numpy=1.15.0
  - pandas=0.23.4
  - xlrd=1.1.0
  - urllib3=1.21.1
  - jupyter=1.0.0
  - r-essentials=3.5.1
  - matplotlib=2.2.3
  - pip:
    - csvtomd==0.3.0
@ -1 +0,0 @@
5/cVuditI8OEN7ADztEWg6k+91MTQVbt
common/utils.py
@ -1,118 +0,0 @@
import datetime
import pandas as pd
from dateutil.relativedelta import relativedelta

ALLOWED_TIME_COLUMN_TYPES = [pd.Timestamp, pd.DatetimeIndex,
                             datetime.datetime, datetime.date]


def is_datetime_like(x):
    """Function that checks if a data frame column x is of a datetime type."""
    return any(isinstance(x, col_type)
               for col_type in ALLOWED_TIME_COLUMN_TYPES)

def get_datetime_col(df, datetime_colname):
    """
    Helper function for extracting the datetime column as datetime type from
    a data frame.

    Args:
        df: pandas DataFrame containing the column to convert
        datetime_colname: name of the column to be converted

    Returns:
        pandas.Series: converted column

    Raises:
        Exception: if datetime_colname does not exist in the dataframe df.
        Exception: if datetime_colname cannot be converted to datetime type.
    """
    if datetime_colname in df.index.names:
        datetime_col = df.index.get_level_values(datetime_colname)
    elif datetime_colname in df.columns:
        datetime_col = df[datetime_colname]
    else:
        raise Exception('Column or index {0} does not exist in the data '
                        'frame'.format(datetime_colname))

    if not is_datetime_like(datetime_col):
        try:
            datetime_col = pd.to_datetime(df[datetime_colname])
        except Exception:
            raise Exception('Column or index {0} cannot be converted to '
                            'datetime type.'.format(datetime_colname))
    return datetime_col

def get_month_day_range(date):
    """
    Returns the first date and last date of the month of the given date.
    """
    # Replace the day in the original timestamp with day 1
    first_day = date + relativedelta(day=1)
    # Replace the day in the original timestamp with day 1,
    # add a month to get to the first day of the next month,
    # then subtract one day to get the last day of the current month
    last_day = date + relativedelta(day=1, months=1, days=-1, hours=23)
    return first_day, last_day

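The `relativedelta` call above mixes an absolute replacement (`day=1`) with relative offsets (`months`, `days`, `hours`), which is easy to misread. A minimal standalone sketch (the function body is repeated here purely for illustration) shows the intended behavior:

```python
import datetime
from dateutil.relativedelta import relativedelta

def month_day_range(date):
    # day=1 is an absolute replacement; months/days/hours are relative offsets
    first_day = date + relativedelta(day=1)
    last_day = date + relativedelta(day=1, months=1, days=-1, hours=23)
    return first_day, last_day

first, last = month_day_range(datetime.datetime(2017, 2, 15))
print(first)  # 2017-02-01 00:00:00
print(last)   # 2017-02-28 23:00:00
```

Absolute fields are applied before the relative ones, so the month rollover happens after the day has been reset to 1, which is what makes the "first day of next month minus one day" trick work even for months of different lengths.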
def split_train_validation(df, fct_horizon, datetime_colname):
    """
    Splits the input dataframe into train and validation folds based on the
    forecast creation time (fct) and forecast horizon specified by fct_horizon.

    Args:
        df: The input data frame to split.
        fct_horizon: list of tuples in the format of
            (fct, (forecast_horizon_start, forecast_horizon_end))
        datetime_colname: name of the datetime column

    Note: df[datetime_colname] needs to be a datetime type.
    """
    i_round = 0
    for fct, horizon in fct_horizon:
        i_round += 1
        train = df.loc[df[datetime_colname] < fct, ].copy()
        validation = df.loc[(df[datetime_colname] >= horizon[0]) &
                            (df[datetime_colname] <= horizon[1]), ].copy()

        yield i_round, train, validation

def add_datetime(input_datetime, unit, add_count):
    """
    Function to add a specified number of units of time (years, months, weeks,
    days, hours, or minutes) to the input datetime.

    Args:
        input_datetime: datetime to be added to
        unit: unit of time, valid values: 'year', 'month', 'week',
            'day', 'hour', 'minute'.
        add_count: number of units to add

    Returns:
        New datetime after adding the time difference to the input datetime.

    Raises:
        Exception: if an invalid unit is provided. Valid units are:
            'year', 'month', 'week', 'day', 'hour', 'minute'.
    """
    if unit == 'year':
        new_datetime = input_datetime + relativedelta(years=add_count)
    elif unit == 'month':
        new_datetime = input_datetime + relativedelta(months=add_count)
    elif unit == 'week':
        new_datetime = input_datetime + relativedelta(weeks=add_count)
    elif unit == 'day':
        new_datetime = input_datetime + relativedelta(days=add_count)
    elif unit == 'hour':
        new_datetime = input_datetime + relativedelta(hours=add_count)
    elif unit == 'minute':
        new_datetime = input_datetime + relativedelta(minutes=add_count)
    else:
        raise Exception('Invalid backtest step unit, {}, provided. Valid '
                        'step units are year, month, week, day, hour, and minute'
                        .format(unit))
    return new_datetime
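To make the `(fct, (forecast_horizon_start, forecast_horizon_end))` contract of `split_train_validation` concrete, here is a small sketch with made-up daily timestamps (the generator body from the file above is repeated for illustration):

```python
import pandas as pd

def split_train_validation(df, fct_horizon, datetime_colname):
    # train: everything strictly before the forecast creation time (fct);
    # validation: rows falling inside the closed forecast-horizon window
    i_round = 0
    for fct, horizon in fct_horizon:
        i_round += 1
        train = df.loc[df[datetime_colname] < fct, ].copy()
        validation = df.loc[(df[datetime_colname] >= horizon[0]) &
                            (df[datetime_colname] <= horizon[1]), ].copy()
        yield i_round, train, validation

df = pd.DataFrame({"Datetime": pd.date_range("2017-01-01", periods=10, freq="D"),
                   "DEMAND": range(10)})
# one backtest round: forecast created Jan 5, horizon Jan 7 through Jan 9
fct_horizon = [(pd.Timestamp("2017-01-05"),
                (pd.Timestamp("2017-01-07"), pd.Timestamp("2017-01-09")))]
for i_round, train, validation in split_train_validation(df, fct_horizon, "Datetime"):
    print(i_round, len(train), len(validation))  # 1 4 3
```

Leaving a gap between `fct` and the start of the horizon mirrors how operational forecasts are issued some lead time before the period being predicted.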
@ -0,0 +1,3 @@
# Contrib

Independent or incubating algorithms and utilities are candidates for the `contrib` folder. This folder houses contributions that may not fit easily into the core repository, or that need time for refactoring and for the necessary tests to be added.
@ -1,8 +1,6 @@
## Download base image
FROM continuumio/anaconda3:4.4.0
#ADD TSPerf/energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/GBM/conda_dependencies.yml /tmp/conda_dependencies.yml
FROM rocker/r-base
ADD ./conda_dependencies.yml /tmp
#ADD TSPerf/energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/GBM/install_R_dependencies.R /tmp/install_R_dependencies.R
ADD ./install_R_dependencies.R /tmp
WORKDIR /tmp

@ -13,12 +11,9 @@ RUN apt-get install -y --no-install-recommends \
    zlib1g-dev \
    libssl-dev \
    libssh2-1-dev \
    libcurl4-openssl-dev \
    libreadline-gplv2-dev \
    libncursesw5-dev \
    libsqlite3-dev \
    tk-dev \
    libgdbm-dev \
    libc6-dev \
    libbz2-dev \
    libffi-dev \
@ -26,26 +21,20 @@ RUN apt-get install -y --no-install-recommends \
    build-essential \
    checkinstall \
    ca-certificates \
    curl \
    lsb-release \
    apt-utils \
    python3-pip \
    vim
    vim

## Create and activate conda environment
# Install miniconda
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
RUN bash ~/miniconda.sh -b -p $HOME/miniconda
ENV PATH="/root/miniconda/bin:${PATH}"

## Create conda environment
RUN conda update -y conda
RUN conda env create --file conda_dependencies.yml

## Install R
ENV R_BASE_VERSION 3.5.1
RUN apt-get install -y aptitude
RUN echo "deb http://http.debian.net/debian sid main" > /etc/apt/sources.list.d/debian-unstable.list \
    && aptitude install -y debian-keyring debian-archive-keyring
RUN apt-get remove -y binutils
RUN apt-get update \
    && apt-get install -t unstable -y --no-install-recommends \
    r-base=${R_BASE_VERSION}-*

# Install prerequisites of R packages
RUN apt-get install -y \
    gfortran \

@ -62,7 +51,7 @@ RUN Rscript install_R_dependencies.R
RUN rm install_R_dependencies.R
RUN rm conda_dependencies.yml

RUN mkdir /TSPerf
WORKDIR /TSPerf
RUN mkdir /Forecasting
WORKDIR /Forecasting

ENTRYPOINT ["/bin/bash"]
@ -12,7 +12,7 @@

**Submission name:** GBM

**Submission path:** energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/GBM
**Submission path:** benchmarks/GEFCom2017_D_Prob_MT_hourly/GBM


## Implementation description

@ -35,105 +35,105 @@ The data of January - April of 2016 were used as validation dataset for some min

### Description of implementation scripts

* `feature_engineering.py`: Python script for computing features and generating feature files.
* `compute_features.py`: Python script for computing features and generating feature files.
* `train_predict.R`: R script that trains Gradient Boosting Machine model for quantile regression task and predicts on each round of test data.
* `train_score_vm.sh`: Bash script that runs `feature_engineering.py` and `train_predict.R` five times to generate five submission files and measure model running time.
* `train_score_vm.sh`: Bash script that runs `compute_features.py` and `train_predict.R` five times to generate five submission files and measure model running time.

### Steps to reproduce results

0. Follow the instructions [here](#resource-deployment-instructions) to provision a Linux virtual machine and log into the provisioned
1. Follow the instructions [here](#resource-deployment-instructions) to provision a Linux virtual machine and log into the provisioned
VM.

1. Clone the Forecasting repository to the home directory of your machine
2. Clone the Forecasting repository to the home directory of your machine

```bash
cd ~
git clone https://github.com/Microsoft/Forecasting.git
```
Use one of the following options to securely connect to the Git repo:
* [Personal Access Tokens](https://help.github.com/articles/creating-a-personal-access-token-for-the-command-line/)
For this method, the clone command becomes
Use one of the following options to securely connect to the Git repo:
* [Personal Access Tokens](https://help.github.com/articles/creating-a-personal-access-token-for-the-command-line/)
For this method, the clone command becomes
```bash
git clone https://<username>:<personal access token>@github.com/Microsoft/Forecasting.git
```
* [Git Credential Managers](https://github.com/Microsoft/Git-Credential-Manager-for-Windows)
* [Authenticate with SSH](https://help.github.com/articles/connecting-to-github-with-ssh/)

2. Create a conda environment for running the scripts of data downloading, data preparation, and result evaluation.
* [Git Credential Managers](https://github.com/Microsoft/Git-Credential-Manager-for-Windows)
* [Authenticate with SSH](https://help.github.com/articles/connecting-to-github-with-ssh/)

3. Create a conda environment for running the scripts of data downloading, data preparation, and result evaluation.
To do this, you need to check if conda has been installed by running command `conda -V`. If it is installed, you will see the conda version in the terminal. Otherwise, please follow the instructions [here](https://conda.io/docs/user-guide/install/linux.html) to install conda.
From the `~/Forecasting` directory on the VM create a conda environment named `tsperf` by running:
From the `~/Forecasting` directory on the VM create a conda environment named `tsperf` by running:

```bash
conda env create --file ./common/conda_dependencies.yml
```
```bash
conda env create --file tsperf/benchmarking/conda_dependencies.yml
```

3. Download and extract data **on the VM**.
4. Download and extract data **on the VM**.

```bash
source activate tsperf
python energy_load/GEFCom2017_D_Prob_MT_hourly/common/download_data.py
python energy_load/GEFCom2017_D_Prob_MT_hourly/common/extract_data.py
python tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/download_data.py
python tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/extract_data.py
```

4. Prepare Docker container for model training and predicting.
5. Prepare Docker container for model training and predicting.

> NOTE: To execute docker commands without sudo as a non-root user, you need to create a Unix group and add users to it by following the instructions [here](https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user). Otherwise, simply prefix all docker commands with sudo.

4.1 Log into Azure Container Registry (ACR)
4.1 Make sure Docker is installed

You can check if Docker is installed on your VM by running

```bash
sudo docker login --username tsperf --password <ACR Access Key> tsperf.azurecr.io
sudo docker -v
```
You will see the Docker version if Docker is installed. If not, you can install it by following the instructions [here](https://docs.docker.com/install/linux/docker-ce/ubuntu/).

4.2 Build a local Docker image

```bash
sudo docker build -t gbm_image benchmarks/GEFCom2017_D_Prob_MT_hourly/GBM
```

The `<ACR Access Key>` can be found [here](https://github.com/Microsoft/Forecasting/blob/master/common/key.txt).
6. Train and predict **within Docker container**

4.2 Pull the Docker image from ACR to your VM
6.1 Start a Docker container from the image

```bash
sudo docker pull tsperf.azurecr.io/energy_load/gefcom2017_d_prob_mt_hourly/gbm_image:v1
```

5. Train and predict **within Docker container**

5.1 Start a Docker container from the image

```bash
sudo docker run -it -v ~/Forecasting:/Forecasting --name gbm_container tsperf.azurecr.io/energy_load/gefcom2017_d_prob_mt_hourly/gbm_image:v1
sudo docker run -it -v ~/Forecasting:/Forecasting --name gbm_container gbm_image
```

Note that option `-v ~/Forecasting:/Forecasting` mounts the `~/Forecasting` folder (the one you cloned) to the container so that you can access the code and data on your VM within the container.

5.2 Train and predict
6.2 Train and predict

```
source activate tsperf
cd /Forecasting
bash ./energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/GBM/train_score_vm.sh > out.txt &
bash benchmarks/GEFCom2017_D_Prob_MT_hourly/GBM/train_score_vm.sh > out.txt &
```
After generating the forecast results, you can exit the Docker container with command `exit`.

6. Model evaluation **on the VM**
7. Model evaluation **on the VM**

```bash
source activate tsperf
cd ~/Forecasting
bash ./common/evaluate submissions/GBM energy_load/GEFCom2017_D_Prob_MT_hourly
bash tsperf/benchmarking/evaluate GBM tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly
```

## Implementation resources

**Platform:** Azure Cloud
**Resource location:** East US region
**Hardware:** Standard D8s v3 (8 vcpus, 32 GB memory) Ubuntu Linux VM
**Data storage:** Premium SSD
**Docker image:** tsperf.azurecr.io/energy_load/gefcom2017_d_prob_mt_hourly/gbm_image
**Hardware:** Standard D8s v3 (8 vcpus, 32 GB memory) Ubuntu Linux VM
**Data storage:** Premium SSD
**Dockerfile:** [energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/GBM/Dockerfile](https://github.com/Microsoft/Forecasting/blob/master/energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/GBM/Dockerfile)

**Key packages/dependencies:**
* Python
  - python==3.6
  - python==3.7
* R
  - r-base==3.5.1
  - r-base==3.5.3
  - gbm==2.1.3
  - data.table==1.11.4
@ -145,34 +145,34 @@ Please follow the instructions below to deploy the Linux DSVM.
## Implementation evaluation
**Quality:**

* Pinball loss run 1: 78.71
* Pinball loss run 2: 78.72
* Pinball loss run 3: 78.69
* Pinball loss run 4: 78.71
* Pinball loss run 5: 78.71
* Pinball loss run 1: 78.85
* Pinball loss run 2: 78.84
* Pinball loss run 3: 78.86
* Pinball loss run 4: 78.76
* Pinball loss run 5: 78.82

Median Pinball loss: **78.71**
Median Pinball loss: **78.84**

**Time:**

* Run time 1: 878 seconds
* Run time 2: 888 seconds
* Run time 3: 894 seconds
* Run time 4: 894 seconds
* Run time 5: 878 seconds
* Run time 1: 268 seconds
* Run time 2: 269 seconds
* Run time 3: 269 seconds
* Run time 4: 269 seconds
* Run time 5: 266 seconds

Median run time: **888 seconds**
Median run time: **269 seconds**

**Cost:**
The hourly cost of the Standard D8s Ubuntu Linux VM in East US Azure region is 0.3840 USD, based on the price at the submission date. Thus, the total cost is `888/3600 * 0.3840 = $0.0947`.
The hourly cost of the Standard D8s Ubuntu Linux VM in East US Azure region is 0.3840 USD, based on the price at the submission date. Thus, the total cost is `269/3600 * 0.3840 = $0.0287`.
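The cost figure above is simple proportional billing (seconds of runtime converted to hours, times the hourly VM rate) and can be checked in a couple of lines:

```python
hourly_cost_usd = 0.3840   # Standard D8s v3, East US, price at submission date
median_run_time_s = 269    # median of the five runs

total_cost = median_run_time_s / 3600 * hourly_cost_usd
print(round(total_cost, 4))  # 0.0287
```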
**Average relative improvement (in %) over GEFCom2017 benchmark model** (measured over the first run)
Round 1: 9.57
Round 2: 18.17
Round 3: 17.83
Round 4: 8.58
Round 5: 7.54
Round 6: 6.96
Round 1: 9.55
Round 2: 18.24
Round 3: 17.90
Round 4: 8.27
Round 5: 7.22
Round 6: 6.80

**Ranking in the qualifying round of GEFCom2017 competition**
4
@ -0,0 +1,66 @@
"""
This script uses
energy_load/GEFCom2017_D_Prob_MT_hourly/common/feature_engineering.py to
compute a list of features needed by the Gradient Boosting Machines model.
"""
import os
import sys
import getopt

import localpath

from tsperf.benchmarking.GEFCom2017_D_Prob_MT_hourly.feature_engineering import compute_features

SUBMISSIONS_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
DATA_DIR = os.path.join(SUBMISSIONS_DIR, "data")
print("Data directory used: {}".format(DATA_DIR))

OUTPUT_DIR = os.path.join(DATA_DIR, "features")
TRAIN_DATA_DIR = os.path.join(DATA_DIR, "train")
TEST_DATA_DIR = os.path.join(DATA_DIR, "test")

DF_CONFIG = {
    "time_col_name": "Datetime",
    "ts_id_col_names": "Zone",
    "target_col_name": "DEMAND",
    "frequency": "H",
    "time_format": "%Y-%m-%d %H:%M:%S",
}

# Feature configuration list used to specify the features to be computed by
# compute_features.
# Each feature configuration is a tuple in the format of (feature_name,
# featurizer_args):
# feature_name is used to determine the featurizer to use, see FEATURE_MAP in
# energy_load/GEFCom2017_D_Prob_MT_hourly/common/feature_engineering.py;
# featurizer_args is a dictionary of arguments passed to the featurizer.
feature_config_list = [
    ("temporal", {"feature_list": ["hour_of_day", "month_of_year"]}),
    ("annual_fourier", {"n_harmonics": 3}),
    ("weekly_fourier", {"n_harmonics": 3}),
    ("previous_year_load_lag", {"input_col_names": "DEMAND", "round_agg_result": True}),
    ("previous_year_temp_lag", {"input_col_names": "DryBulb", "round_agg_result": True}),
]

if __name__ == "__main__":
    opts, args = getopt.getopt(sys.argv[1:], "", ["submission="])
    for opt, arg in opts:
        if opt == "--submission":
            submission_folder = arg
            output_data_dir = os.path.join(SUBMISSIONS_DIR, submission_folder, "data")
            if not os.path.isdir(output_data_dir):
                os.mkdir(output_data_dir)
            OUTPUT_DIR = os.path.join(output_data_dir, "features")
            if not os.path.isdir(OUTPUT_DIR):
                os.mkdir(OUTPUT_DIR)

    compute_features(
        TRAIN_DATA_DIR,
        TEST_DATA_DIR,
        OUTPUT_DIR,
        DF_CONFIG,
        feature_config_list,
        filter_by_month=True,
        compute_load_ratio=True,
    )
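The `(feature_name, featurizer_args)` convention above relies on a `FEATURE_MAP`-style lookup inside `compute_features`. The following is a hypothetical sketch of that dispatch; the registry and featurizers here are illustrative stand-ins, not the repository's actual implementations:

```python
# Illustrative registry mapping feature names to featurizer callables;
# the real FEATURE_MAP lives in the repository's feature_engineering module.
FEATURE_MAP = {
    "temporal": lambda feature_list: "temporal:" + ",".join(feature_list),
    "annual_fourier": lambda n_harmonics: "annual_fourier:{}".format(n_harmonics),
}

feature_config_list = [
    ("temporal", {"feature_list": ["hour_of_day", "month_of_year"]}),
    ("annual_fourier", {"n_harmonics": 3}),
]

# Each tuple unpacks into a lookup key plus keyword arguments for the featurizer
for feature_name, featurizer_args in feature_config_list:
    featurizer = FEATURE_MAP[feature_name]
    print(featurizer(**featurizer_args))
# temporal:hour_of_day,month_of_year
# annual_fourier:3
```

Keeping the configuration as plain data means a submission can change its feature set without touching the featurizer code.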
@ -6,4 +6,5 @@ dependencies:
  - numpy=1.15.1
  - pandas=0.23.4
  - xlrd=1.1.0
  - urllib3=1.21.1
  - urllib3=1.21.1
  - scikit-learn=0.20.3
@ -0,0 +1,7 @@
pkgs <- c(
    'data.table',
    'gbm',
    'doParallel'
)

install.packages(pkgs)
@ -3,10 +3,9 @@ This script inserts the TSPerf directory into sys.path, so that scripts can impo
"""

import os, sys
_CUR_DIR = os.path.dirname(os.path.abspath(__file__))
_SUBMISSIONS_DIR = os.path.dirname(_CUR_DIR)
_BENCHMARK_DIR = os.path.dirname(_SUBMISSIONS_DIR)
TSPERF_DIR = os.path.dirname(os.path.dirname(_BENCHMARK_DIR))

_CURR_DIR = os.path.dirname(os.path.abspath(__file__))
TSPERF_DIR = os.path.dirname(os.path.dirname(os.path.dirname(_CURR_DIR)))

if TSPERF_DIR not in sys.path:
    sys.path.insert(0, TSPERF_DIR)
@ -0,0 +1,101 @@
args = commandArgs(trailingOnly=TRUE)
seed_value = args[1]

library('data.table')
library('gbm')
library('doParallel')

n_cores = detectCores()

cl <- parallel::makeCluster(n_cores)
parallel::clusterEvalQ(cl, lapply(c("gbm", "data.table"), library, character.only = TRUE))
registerDoParallel(cl)

data_dir = 'benchmarks/GEFCom2017_D_Prob_MT_hourly/GBM/data/features'

train_dir = file.path(data_dir, 'train')
test_dir = file.path(data_dir, 'test')

train_file_prefix = 'train_round_'
test_file_prefix = 'test_round_'

output_file = file.path(paste('benchmarks/GEFCom2017_D_Prob_MT_hourly/GBM/submission_seed_', seed_value, '.csv', sep=""))

normalize_columns = list('DEMAND_same_woy_lag', 'DryBulb_same_doy_lag')

quantiles = seq(0.1, 0.9, by = 0.1)

result_all = list()
N_ROUNDS = 6
for (iR in 1:N_ROUNDS){
  print(paste('Round', iR))
  train_file = file.path(train_dir, paste(train_file_prefix, iR, '.csv', sep=''))
  test_file = file.path(test_dir, paste(test_file_prefix, iR, '.csv', sep=''))

  train_df = fread(train_file)
  test_df = fread(test_file)

  for (c in normalize_columns){
    min_c = min(train_df[, ..c])
    max_c = max(train_df[, ..c])
    train_df[, c] = (train_df[, ..c] - min_c)/(max_c - min_c)
    test_df[, c] = (test_df[, ..c] - min_c)/(max_c - min_c)
  }

  zones = unique(train_df[, Zone])
  hours = unique(train_df[, hour_of_day])
  all_zones_hours = expand.grid(zones, hours)
  colnames(all_zones_hours) = c('Zone', 'hour_of_day')

  test_df$average_load_ratio = rowMeans(test_df[, c('recent_load_ratio_10', 'recent_load_ratio_11', 'recent_load_ratio_12',
    'recent_load_ratio_13', 'recent_load_ratio_14', 'recent_load_ratio_15', 'recent_load_ratio_16')], na.rm=TRUE)
  test_df[, load_ratio:=mean(average_load_ratio), by=list(hour_of_day, month_of_year)]

  ntrees = 1000
  shrinkage = 0.005

  result_all_zones_hours = foreach(i = 1:nrow(all_zones_hours), .combine = rbind) %dopar%{
    set.seed(seed_value)

    z = all_zones_hours[i, 'Zone']
    h = all_zones_hours[i, 'hour_of_day']
    train_df_sub = train_df[Zone == z & hour_of_day == h]
    test_df_sub = test_df[Zone == z & hour_of_day == h]

    result_all_quantiles = list()
    q_counter = 1
    for (tau in quantiles) {
      result = data.table(Zone=test_df_sub$Zone, Datetime = test_df_sub$Datetime, Round=iR)

      gbmModel = gbm(formula = DEMAND ~ DEMAND_same_woy_lag + DryBulb_same_doy_lag +
                       annual_sin_1 + annual_cos_1 + annual_sin_2 + annual_cos_2 + annual_sin_3 + annual_cos_3 +
                       weekly_sin_1 + weekly_cos_1 + weekly_sin_2 + weekly_cos_2 + weekly_sin_3 + weekly_cos_3,
                     distribution = list(name = "quantile", alpha = tau),
                     data = train_df_sub,
                     n.trees = ntrees,
                     shrinkage = shrinkage)

      gbmPredictions = predict(object = gbmModel,
                               newdata = test_df_sub,
                               n.trees = ntrees,
                               type = "response") * test_df_sub$load_ratio

      result$Prediction = gbmPredictions
      result$q = tau

      result_all_quantiles[[q_counter]] = result
      q_counter = q_counter + 1
    }
    rbindlist(result_all_quantiles)
  }
  result_all[[iR]] = result_all_zones_hours
}

result_final = rbindlist(result_all)
# Sort the quantiles
result_final = result_final[order(Prediction), q:=quantiles, by=c('Zone', 'Datetime', 'Round')]
result_final$Prediction = round(result_final$Prediction)

fwrite(result_final, output_file)
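The last step of the R script sorts the predictions within each (Zone, Datetime, Round) group before reassigning quantile levels; since each quantile model is fit independently, this guards against quantile crossing (e.g. the 0.3-quantile forecast exceeding the 0.4-quantile forecast). A Python sketch of the same idea, with made-up numbers:

```python
import numpy as np

# Independently fitted per-quantile predictions for one (Zone, Datetime, Round)
# group; note the 0.1- and 0.4-quantile values cross their neighbors.
quantiles = np.round(np.arange(0.1, 1.0, 0.1), 1)
predictions = np.array([105.0, 98.0, 110.0, 120.0, 115.0, 130.0, 125.0, 140.0, 135.0])

# Sorting makes the forecast monotone in the quantile level
sorted_predictions = np.sort(predictions)
print(sorted_predictions[0], sorted_predictions[-1])  # 98.0 140.0
```

This rearrangement never worsens, and usually improves, the total pinball loss across the quantile levels.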
@ -1,14 +1,14 @@
#!/bin/bash
path=energy_load/GEFCom2017_D_Prob_MT_hourly
path=benchmarks/GEFCom2017_D_Prob_MT_hourly
for i in `seq 1 5`;
do
    echo "Run $i"
    start=`date +%s`
    echo 'Creating features...'
    python $path/submissions/fnn/feature_engineering.py --submission fnn
    python $path/GBM/compute_features.py --submission GBM

    echo 'Training and predicting...'
    Rscript $path/submissions/fnn/train_predict.R $i
    Rscript $path/GBM/train_predict.R $i

    end=`date +%s`
    echo 'Running time '$((end-start))' seconds'
@ -22,7 +22,7 @@ The table below summarizes the benchmark problem definition:
| **Forecast granularity** | hourly |
| **Forecast type** | probabilistic, 9 quantiles: 10th, 20th, ..., 90th percentiles |

A template of the submission file can be found [here](https://github.com/Microsoft/Forecasting/blob/master/energy_load/GEFCom2017_D_Prob_MT_hourly/reference/submission.csv)
A template of the submission file can be found [here](https://github.com/Microsoft/Forecasting/blob/master/benchmarks/GEFCom2017_D_Prob_MT_hourly/sample_submission.csv)

# Data
### Dataset attribution

@ -31,8 +31,7 @@ A template of the submission file can be found [here](https://github.com/Microso
### Dataset description

1. The data files can be downloaded from the ISO New England website via the
[zonal information page of the energy, load and demand reports](https://www
.iso-ne.com/isoexpress/web/reports/load-and-demand/-/tree/zone-info). If you
[zonal information page of the energy, load and demand reports](https://www.iso-ne.com/isoexpress/web/reports/load-and-demand/-/tree/zone-info). If you
are outside the United States, you may need a VPN to access the data. Use columns
A, B, D, M and N in the worksheets of "YYYY SMD Hourly Data" files, where YYYY
represents the year. Detailed information of each column can be found in the
@@ -68,6 +67,45 @@ using the available training data:
| 5 | 2011-01-01 01:00:00 | 2017-01-31 00:00:00 | 2017-03-01 01:00:00 | 2017-03-31 00:00:00 |
| 6 | 2011-01-01 01:00:00 | 2017-01-31 00:00:00 | 2017-04-01 01:00:00 | 2017-04-30 00:00:00 |

+### Feature engineering
+
+A common feature engineering script, common/feature_engineering.py, is provided for use by individual submissions.
+Below is an example of using this script.
+The feature configuration list specifies the features to be computed by the compute_features function.
+Each feature configuration is a tuple in the format of (feature_name, featurizer_args).
+* feature_name is used to determine the featurizer to use, see FEATURE_MAP in
+common/feature_engineering.py.
+* featurizer_args is a dictionary of arguments passed to the featurizer.
+
+```python
+from energy_load.GEFCom2017_D_Prob_MT_hourly.common.feature_engineering \
+    import compute_features
+
+DF_CONFIG = {
+    'time_col_name': 'Datetime',
+    'grain_col_name': 'Zone',
+    'value_col_name': 'DEMAND',
+    'frequency': 'hourly',
+    'time_format': '%Y-%m-%d %H:%M:%S'
+}
+
+feature_config_list = \
+    [('temporal', {'feature_list': ['hour_of_day', 'month_of_year']}),
+     ('annual_fourier', {'n_harmonics': 3}),
+     ('weekly_fourier', {'n_harmonics': 3}),
+     ('previous_year_load_lag',
+      {'input_col_name': 'DEMAND', 'output_col_name': 'load_lag'}),
+     ('previous_year_dry_bulb_lag',
+      {'input_col_name': 'DryBulb', 'output_col_name': 'dry_bulb_lag'})]
+
+TRAIN_DATA_DIR = './data/train'
+TEST_DATA_DIR = './data/test'
+OUTPUT_DIR = './data/features'
+
+compute_features(TRAIN_DATA_DIR, TEST_DATA_DIR, OUTPUT_DIR, DF_CONFIG,
+                 feature_config_list,
+                 filter_by_month=True)
+```
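For intuition, the `annual_fourier` and `weekly_fourier` entries above produce sine/cosine terms of the timestamp at a few harmonics of a period. A minimal self-contained sketch of the idea (the function name, arguments, and the choice of the series start as phase reference are illustrative, not the repo's featurizer):

```python
import numpy as np
import pandas as pd

def fourier_features(timestamps, period_hours, n_harmonics, prefix):
    # Hours elapsed since the start of the series, used as the phase variable.
    hours = (timestamps - timestamps.min()) / pd.Timedelta(hours=1)
    out = pd.DataFrame(index=timestamps)
    for k in range(1, n_harmonics + 1):
        angle = 2 * np.pi * k * hours / period_hours
        out["{}_sin_{}".format(prefix, k)] = np.sin(angle)
        out["{}_cos_{}".format(prefix, k)] = np.cos(angle)
    return out

# One week of hourly timestamps, three harmonics of the weekly period.
idx = pd.date_range("2017-01-01", periods=24 * 7, freq="h")
weekly = fourier_features(idx, period_hours=24 * 7, n_harmonics=3, prefix="weekly")
```

With `n_harmonics=3` this yields six columns per period (`weekly_sin_1` through `weekly_cos_3`), matching the feature names used by the quantile regression formula in `train_predict.R`.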
# Model Evaluation

**Evaluation metric**: Pinball loss
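Pinball loss at quantile level q weights under-forecasts by q and over-forecasts by (1 - q), so the benchmark score is the average of this quantity over all quantiles, zones, and timestamps. A minimal sketch (a hypothetical helper, not the repo's evaluation script):

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    # Under-forecast (y_true > y_pred) costs q per unit;
    # over-forecast costs (1 - q) per unit.
    diff = y_true - y_pred
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

# Forecasting the 0.9 quantile 10 units too low costs 0.9 * 10 = 9.
loss = pinball_loss(np.array([100.0]), np.array([90.0]), 0.9)
```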
@@ -1,24 +1,19 @@
## Download base image
-FROM continuumio/anaconda3:4.4.0
-# ADD TSPerf/energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/fnn/conda_dependencies.yml /tmp/conda_dependencies.yml
-# ADD TSPerf/energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/fnn/install_R_dependencies.R /tmp/install_R_dependencies.R
-ADD ./conda_dependencies.yml /tmp/conda_dependencies.yml
-ADD ./install_R_dependencies.R /tmp/install_R_dependencies.R
+FROM rocker/r-base
+ADD ./conda_dependencies.yml /tmp
+ADD ./install_R_dependencies.R /tmp
WORKDIR /tmp

## Install basic packages
RUN apt-get update
RUN apt-get install -y --no-install-recommends \
    wget \
    zlib1g-dev \
    libssl-dev \
    libssh2-1-dev \
    libcurl4-openssl-dev \
    libreadline-gplv2-dev \
    libncursesw5-dev \
    libsqlite3-dev \
    tk-dev \
    libgdbm-dev \
    libc6-dev \
    libbz2-dev \
    libffi-dev \
@@ -26,42 +21,37 @@ RUN apt-get install -y --no-install-recommends \
    build-essential \
    checkinstall \
    ca-certificates \
    curl \
    lsb-release \
    apt-utils \
    python3-pip \
    vim

-## Create and activate conda environment
+# Install miniconda
+RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
+RUN bash ~/miniconda.sh -b -p $HOME/miniconda
+ENV PATH="/root/miniconda/bin:${PATH}"
+
+## Create conda environment
RUN conda update -y conda
RUN conda env create --file conda_dependencies.yml

## Install R
ENV R_BASE_VERSION 3.5.1
RUN apt-get install -y aptitude
RUN echo "deb http://http.debian.net/debian sid main" > /etc/apt/sources.list.d/debian-unstable.list \
    && aptitude install -y debian-keyring debian-archive-keyring
RUN apt-get remove -y binutils
RUN apt-get update \
    && apt-get install -t unstable -y --no-install-recommends \
    r-base=${R_BASE_VERSION}-*

# Install prerequisites of R packages
RUN apt-get install -y \
    gfortran \
    liblapack-dev \
    liblapack3 \
    libopenblas-base \
-    libopenblas-dev
+    libopenblas-dev \
+    g++
## Mount R dependency file into the docker container and install dependencies
# Use a MRAN snapshot URL to download packages archived on a specific date
RUN echo 'options(repos = list(CRAN = "http://mran.revolutionanalytics.com/snapshot/2018-09-01/"))' >> /etc/R/Rprofile.site
RUN Rscript install_R_dependencies.R

-RUN rm conda_dependencies.yml
RUN rm install_R_dependencies.R
+RUN rm conda_dependencies.yml

-RUN mkdir /TSPerf
-WORKDIR /TSPerf
+RUN mkdir /Forecasting
+WORKDIR /Forecasting

ENTRYPOINT ["/bin/bash"]
@@ -12,7 +12,7 @@
**Submission name:** baseline

-**Submission path:** energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/baseline
+**Submission path:** benchmarks/GEFCom2017_D_Prob_MT_hourly/baseline


## Implementation description
@@ -36,16 +36,16 @@ No parameter tuning was done.

### Description of implementation scripts

-* `feature_engineering.py`: Python script for computing features and generating feature files.
+* `compute_features.py`: Python script for computing features and generating feature files.
* `train_predict.R`: R script that trains Quantile Regression models and predicts on each round of test data.
-* `train_score_vm.sh`: Bash script that runs `feature_engineering.py` and `train_predict.R` five times to generate five submission files and measure model running time.
+* `train_score_vm.sh`: Bash script that runs `compute_features.py` and `train_predict.R` five times to generate five submission files and measure model running time.

### Steps to reproduce results

-0. Follow the instructions [here](#resource-deployment-instructions) to provision a Linux virtual machine and log into the provisioned
+1. Follow the instructions [here](#resource-deployment-instructions) to provision a Linux virtual machine and log into the provisioned
VM.

-1. Clone the Forecasting repo to the home directory of your machine
+2. Clone the Forecasting repo to the home directory of your machine

```bash
cd ~
@@ -60,81 +60,80 @@ VM.
* [Git Credential Managers](https://github.com/Microsoft/Git-Credential-Manager-for-Windows)
* [Authenticate with SSH](https://help.github.com/articles/connecting-to-github-with-ssh/)


-2. Create a conda environment for running the scripts of data downloading, data preparation, and result evaluation.
+3. Create a conda environment for running the scripts of data downloading, data preparation, and result evaluation.
To do this, you need to check if conda has been installed by running the command `conda -V`. If it is installed, you will see the conda version in the terminal. Otherwise, please follow the instructions [here](https://conda.io/docs/user-guide/install/linux.html) to install conda.
Then, you can go to the `~/Forecasting` directory in the VM and create a conda environment named `tsperf` by running

```bash
cd ~/Forecasting
-conda env create --file ./common/conda_dependencies.yml
+conda env create --file tsperf/benchmarking/conda_dependencies.yml
```

-3. Download and extract data **on the VM**.
+4. Download and extract data **on the VM**.

```bash
source activate tsperf
-python energy_load/GEFCom2017_D_Prob_MT_hourly/common/download_data.py
-python energy_load/GEFCom2017_D_Prob_MT_hourly/common/extract_data.py
+python tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/download_data.py
+python tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/extract_data.py
```

-4. Prepare Docker container for model training and predicting.
-4.1 Log into Azure Container Registry (ACR)
+5. Prepare Docker container for model training and predicting.
+
+5.1 Make sure Docker is installed
+
+You can check if Docker is installed on your VM by running

```bash
-sudo docker login --username tsperf --password <ACR Access Key> tsperf.azurecr.io
+sudo docker -v
```
+You will see the Docker version if Docker is installed. If not, you can install it by following the instructions [here](https://docs.docker.com/install/linux/docker-ce/ubuntu/). Note that if you want to execute Docker commands without sudo as a non-root user, you need to create a Unix group and add users to it by following the instructions [here](https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user).
+
+5.2 Build a local Docker image
+
+```bash
+sudo docker build -t baseline_image benchmarks/GEFCom2017_D_Prob_MT_hourly/baseline
+```

-The `<ACR Access Key>` can be found [here](https://github.com/Microsoft/Forecasting/blob/master/common/key.txt).
-If you want to execute docker commands without sudo as a non-root user, you need to create a Unix group and add users to it by following the instructions [here](https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user).
+6. Train and predict **within Docker container**

-4.2 Pull the Docker image from ACR to your VM
+6.1 Start a Docker container from the image

```bash
-sudo docker pull tsperf.azurecr.io/energy_load/gefcom2017_d_prob_mt_hourly/baseline_image
-```
-
-5. Train and predict **within Docker container**
-5.1 Start a Docker container from the image
-
-```bash
-sudo docker run -it -v ~/Forecasting:/Forecasting --name baseline_container tsperf.azurecr.io/energy_load/gefcom2017_d_prob_mt_hourly/baseline_image
+sudo docker run -it -v ~/Forecasting:/Forecasting --name baseline_container baseline_image
```

Note that option `-v ~/Forecasting:/Forecasting` mounts the `~/Forecasting` folder (the one you cloned) to the container so that you can access the code and data on your VM within the container.

-5.2 Train and predict
+6.2 Train and predict

```
source activate tsperf
cd /Forecasting
-bash ./energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/baseline/train_score_vm.sh
+bash benchmarks/GEFCom2017_D_Prob_MT_hourly/baseline/train_score_vm.sh
```
After generating the forecast results, you can exit the Docker container with the command `exit`.

-6. Model evaluation **on the VM**
+7. Model evaluation **on the VM**

```bash
source activate tsperf
cd ~/Forecasting
-bash ./common/evaluate submissions/baseline energy_load/GEFCom2017_D_Prob_MT_hourly
+bash tsperf/benchmarking/evaluate baseline tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly
```

## Implementation resources

**Platform:** Azure Cloud
**Resource location:** East US region
**Hardware:** Standard D8s v3 (8 vcpus, 32 GB memory) Ubuntu Linux VM
**Data storage:** Premium SSD
**Docker image:** tsperf.azurecr.io/energy_load/gefcom2017_d_prob_mt_hourly/baseline_image
**Dockerfile:** [energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/baseline/Dockerfile](https://github.com/Microsoft/Forecasting/blob/master/energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/baseline/Dockerfile)

**Key packages/dependencies:**
* Python
-  - python==3.6
+  - python==3.7
* R
-  - r-base==3.5.1
+  - r-base==3.5.3
  - quantreg==5.34
  - data.table==1.10.4.3
@@ -147,43 +146,43 @@ Please follow the instructions below to deploy the Linux DSVM.
**Quality:**
Note there is no randomness in this baseline model, so the model quality is the same for all five runs.

-* Pinball loss run 1: 84.11
+* Pinball loss run 1: 84.12

-* Pinball loss run 2: 84.11
+* Pinball loss run 2: 84.12

-* Pinball loss run 3: 84.11
+* Pinball loss run 3: 84.12

-* Pinball loss run 4: 84.11
+* Pinball loss run 4: 84.12

-* Pinball loss run 5: 84.11
+* Pinball loss run 5: 84.12

-* Median Pinball loss: 84.11
+* Median Pinball loss: 84.12

**Time:**

-* Run time 1: 425 seconds
+* Run time 1: 188 seconds

-* Run time 2: 462 seconds
+* Run time 2: 185 seconds

-* Run time 3: 441 seconds
+* Run time 3: 185 seconds

-* Run time 4: 458 seconds
+* Run time 4: 189 seconds

-* Run time 5: 444 seconds
+* Run time 5: 189 seconds

-* Median run time: **444 seconds**
+* Median run time: **188 seconds**

**Cost:**
The hourly cost of the Standard D8s Ubuntu Linux VM in the East US Azure region is 0.3840 USD, based on the price at the submission date.
-Thus, the total cost is 444/3600 * 0.3840 = $0.0474.
+Thus, the total cost is 188/3600 * 0.3840 = $0.0201.

**Average relative improvement (in %) over GEFCom2017 benchmark model** (measured over the first run)
Round 1: -6.67
-Round 2: 20.25
-Round 3: 20.04
+Round 2: 20.26
+Round 3: 20.05
Round 4: -5.61
Round 5: -6.45
-Round 6: 11.22
+Round 6: 11.21

**Ranking in the qualifying round of GEFCom2017 competition**
10
@@ -0,0 +1,67 @@
"""
This script uses
tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/feature_engineering.py to
compute a list of features needed by the Quantile Regression model.
"""
import os
import sys
import getopt

import localpath

from tsperf.benchmarking.GEFCom2017_D_Prob_MT_hourly.feature_engineering import compute_features

SUBMISSIONS_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
DATA_DIR = os.path.join(SUBMISSIONS_DIR, "data")
print("Data directory used: {}".format(DATA_DIR))

OUTPUT_DIR = os.path.join(DATA_DIR, "features")
TRAIN_DATA_DIR = os.path.join(DATA_DIR, "train")
TEST_DATA_DIR = os.path.join(DATA_DIR, "test")

DF_CONFIG = {
    "time_col_name": "Datetime",
    "ts_id_col_names": "Zone",
    "target_col_name": "DEMAND",
    "frequency": "H",
    "time_format": "%Y-%m-%d %H:%M:%S",
}

# Feature configuration list used to specify the features to be computed by
# compute_features.
# Each feature configuration is a tuple in the format of (feature_name,
# featurizer_args).
# feature_name is used to determine the featurizer to use, see FEATURE_MAP in
# energy_load/GEFCom2017_D_Prob_MT_hourly/common/feature_engineering.py.
# featurizer_args is a dictionary of arguments passed to the featurizer.
feature_config_list = [
    ("temporal", {"feature_list": ["hour_of_day", "month_of_year"]}),
    ("annual_fourier", {"n_harmonics": 3}),
    ("weekly_fourier", {"n_harmonics": 3}),
    ("previous_year_load_lag", {"input_col_names": "DEMAND", "round_agg_result": True}),
    ("previous_year_temp_lag", {"input_col_names": "DryBulb", "round_agg_result": True}),
]


if __name__ == "__main__":
    opts, args = getopt.getopt(sys.argv[1:], "", ["submission="])
    for opt, arg in opts:
        if opt == "--submission":
            submission_folder = arg
            output_data_dir = os.path.join(SUBMISSIONS_DIR, submission_folder, "data")
            if not os.path.isdir(output_data_dir):
                os.mkdir(output_data_dir)
            OUTPUT_DIR = os.path.join(output_data_dir, "features")
            if not os.path.isdir(OUTPUT_DIR):
                os.mkdir(OUTPUT_DIR)

    compute_features(
        TRAIN_DATA_DIR,
        TEST_DATA_DIR,
        OUTPUT_DIR,
        DF_CONFIG,
        feature_config_list,
        filter_by_month=True,
        compute_load_ratio=True,
    )
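The `previous_year_load_lag` and `previous_year_temp_lag` entries above look up values observed roughly one year earlier. A rough sketch of the idea (a hypothetical helper, not the real featurizer in feature_engineering.py, which handles multiple lags and aggregation); shifting by 364 days keeps the day of week aligned:

```python
import pandas as pd

def previous_year_lag(df, time_col, value_col, days=364):
    # Join each timestamp with the value observed `days` earlier
    # (364 days = 52 weeks, so the day of week stays aligned).
    lagged = df[[time_col, value_col]].copy()
    lagged[time_col] = lagged[time_col] + pd.Timedelta(days=days)
    lagged = lagged.rename(columns={value_col: value_col + "_lag"})
    return df.merge(lagged, on=time_col, how="left")

t = pd.to_datetime(["2016-01-01 00:00", "2016-12-30 00:00"])
df = pd.DataFrame({"Datetime": t, "DEMAND": [100.0, 200.0]})
out = previous_year_lag(df, "Datetime", "DEMAND")
```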
@@ -6,4 +6,5 @@ dependencies:
- numpy=1.15.1
- pandas=0.23.4
- xlrd=1.1.0
- urllib3=1.21.1
+- scikit-learn=0.20.3
@@ -0,0 +1,7 @@
pkgs <- c(
  'data.table',
  'quantreg',
  'doParallel'
)

install.packages(pkgs)
@@ -5,10 +5,9 @@ localpath.py file.
"""

import os, sys
-_CUR_DIR = os.path.dirname(os.path.abspath(__file__))
-_SUBMISSIONS_DIR = os.path.dirname(_CUR_DIR)
-_BENCHMARK_DIR = os.path.dirname(_SUBMISSIONS_DIR)
-TSPERF_DIR = os.path.dirname(os.path.dirname(_BENCHMARK_DIR))
+_CURR_DIR = os.path.dirname(os.path.abspath(__file__))
+TSPERF_DIR = os.path.dirname(os.path.dirname(os.path.dirname(_CURR_DIR)))

if TSPERF_DIR not in sys.path:
    sys.path.insert(0, TSPERF_DIR)
@@ -0,0 +1,87 @@
args = commandArgs(trailingOnly=TRUE)
seed_value = args[1]
library('data.table')
library('quantreg')
library('doParallel')

n_cores = detectCores()

cl <- parallel::makeCluster(n_cores)
parallel::clusterEvalQ(cl, lapply(c("quantreg", "data.table"), library, character.only = TRUE))
registerDoParallel(cl)


data_dir = 'benchmarks/GEFCom2017_D_Prob_MT_hourly/baseline/data/features'
train_dir = file.path(data_dir, 'train')
test_dir = file.path(data_dir, 'test')

train_file_prefix = 'train_round_'
test_file_prefix = 'test_round_'

output_file = file.path(paste('benchmarks/GEFCom2017_D_Prob_MT_hourly/baseline/submission_seed_', seed_value, '.csv', sep=""))

normalize_columns = list('DEMAND_same_woy_lag', 'DryBulb_same_doy_lag')

quantiles = seq(0.1, 0.9, by = 0.1)

result_all = list()
for (iR in 1:6){
  print(paste('Round', iR))
  train_file = file.path(train_dir, paste(train_file_prefix, iR, '.csv', sep=''))
  test_file = file.path(test_dir, paste(test_file_prefix, iR, '.csv', sep=''))

  train_df = fread(train_file)
  test_df = fread(test_file)

  for (c in normalize_columns){
    min_c = min(train_df[, ..c])
    max_c = max(train_df[, ..c])
    train_df[, c] = (train_df[, ..c] - min_c)/(max_c - min_c)
    test_df[, c] = (test_df[, ..c] - min_c)/(max_c - min_c)
  }
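The normalization loop above scales the lag columns with min/max statistics computed on the training split only, so no test-set information leaks into the features. The same idea in Python (a hypothetical helper mirroring the R loop):

```python
import pandas as pd

def minmax_scale_train_test(train, test, cols):
    # Min and max come from the training data only; the test split is
    # scaled with the training statistics to avoid leakage.
    for c in cols:
        lo, hi = train[c].min(), train[c].max()
        train[c] = (train[c] - lo) / (hi - lo)
        test[c] = (test[c] - lo) / (hi - lo)
    return train, test

train = pd.DataFrame({"DEMAND_same_woy_lag": [0.0, 10.0]})
test = pd.DataFrame({"DEMAND_same_woy_lag": [5.0]})
train, test = minmax_scale_train_test(train, test, ["DEMAND_same_woy_lag"])
```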


  test_df$average_load_ratio = rowMeans(test_df[, c('recent_load_ratio_10', 'recent_load_ratio_11', 'recent_load_ratio_12',
                                                    'recent_load_ratio_13', 'recent_load_ratio_14', 'recent_load_ratio_15', 'recent_load_ratio_16')], na.rm=TRUE)
  test_df[, load_ratio:=mean(average_load_ratio), by=list(hour_of_day, month_of_year)]


  zones = unique(train_df[, Zone])
  hours = unique(train_df[, hour_of_day])
  all_zones_hours = expand.grid(zones, hours)
  colnames(all_zones_hours) = c('Zone', 'hour_of_day')

  result_all_zones_hours = foreach(i = 1:nrow(all_zones_hours), .combine = rbind) %dopar% {
    z = all_zones_hours[i, 'Zone']
    h = all_zones_hours[i, 'hour_of_day']

    train_df_sub = train_df[Zone == z & hour_of_day == h]
    test_df_sub = test_df[Zone == z & hour_of_day == h]

    result_all_quantiles = list()
    q_counter = 1
    for (tau in quantiles){
      result = data.table(Zone=test_df_sub$Zone, Datetime=test_df_sub$Datetime, Round=iR)

      model = rq(DEMAND ~ DEMAND_same_woy_lag + DryBulb_same_doy_lag +
                   annual_sin_1 + annual_cos_1 + annual_sin_2 + annual_cos_2 + annual_sin_3 + annual_cos_3 +
                   weekly_sin_1 + weekly_cos_1 + weekly_sin_2 + weekly_cos_2 + weekly_sin_3 + weekly_cos_3,
                 data=train_df_sub, tau=tau)

      result$Prediction = predict(model, test_df_sub) * test_df_sub$load_ratio
      result$q = tau

      result_all_quantiles[[q_counter]] = result
      q_counter = q_counter + 1
    }
    rbindlist(result_all_quantiles)
  }
  result_all[[iR]] = result_all_zones_hours
}

result_final = rbindlist(result_all)
# Sort the quantiles
result_final = result_final[order(Prediction), q:=quantiles, by=c('Zone', 'Datetime', 'Round')]
result_final$Prediction = round(result_final$Prediction)

fwrite(result_final, output_file)
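The `# Sort the quantiles` step in the R script reassigns quantile levels so that, within each (Zone, Datetime, Round) group, predictions are non-decreasing in q, a standard fix for quantile crossing. An equivalent pandas sketch (a hypothetical helper, not part of the repo):

```python
import numpy as np
import pandas as pd

def sort_quantiles(df, group_cols, pred_col="Prediction"):
    # Rows within each group are assumed ordered by ascending quantile level q;
    # sorting the predictions within the group removes quantile crossing.
    df = df.copy()
    df[pred_col] = df.groupby(group_cols)[pred_col].transform(
        lambda s: np.sort(s.to_numpy())
    )
    return df

df = pd.DataFrame({
    "Zone": ["A", "A", "A"],
    "q": [0.25, 0.5, 0.75],
    "Prediction": [5.0, 3.0, 4.0],
})
fixed = sort_quantiles(df, ["Zone"])
```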

@@ -1,15 +1,15 @@
#!/bin/bash
-path=energy_load/GEFCom2017_D_Prob_MT_hourly
+path=benchmarks/GEFCom2017_D_Prob_MT_hourly
for i in `seq 1 5`;
do
  echo "Run $i"
  start=`date +%s`
  echo 'Creating features...'
-  python $path/submissions/baseline/feature_engineering.py --submission baseline
+  python $path/baseline/compute_features.py --submission baseline

  echo 'Training and predicting...'
-  Rscript $path/submissions/baseline/train_predict.R $i
+  Rscript $path/baseline/train_predict.R $i

  end=`date +%s`
  echo 'Running time '$((end-start))' seconds'
done
@@ -423,4 +423,4 @@
},
"nbformat": 4,
"nbformat_minor": 1
}
@@ -1,21 +1,19 @@
## Download base image
-FROM continuumio/anaconda3:4.4.0
-ADD TSPerf/energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/baseline/conda_dependencies.yml /tmp/conda_dependencies.yml
-ADD TSPerf/energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/baseline/install_R_dependencies.R /tmp/install_R_dependencies.R
+FROM rocker/r-base
+ADD ./conda_dependencies.yml /tmp
+ADD ./install_R_dependencies.R /tmp
WORKDIR /tmp

## Install basic packages
-RUN apt-get update && apt-get install -y --no-install-recommends \
+RUN apt-get update
+RUN apt-get install -y --no-install-recommends \
    wget \
    zlib1g-dev \
    libssl-dev \
    libssh2-1-dev \
    libcurl4-openssl-dev \
    libreadline-gplv2-dev \
    libncursesw5-dev \
    libsqlite3-dev \
    tk-dev \
    libgdbm-dev \
    libc6-dev \
    libbz2-dev \
    libffi-dev \
@@ -23,34 +21,28 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    checkinstall \
    ca-certificates \
    curl \
    lsb-release \
    apt-utils \
    python3-pip \
    vim

-## Create and activate conda environment
-RUN conda update -y conda
+# Install miniconda
+RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
+RUN bash ~/miniconda.sh -b -p $HOME/miniconda
+ENV PATH="/root/miniconda/bin:${PATH}"
+
+## Create conda environment
+RUN conda update -y conda
RUN conda env create --file conda_dependencies.yml

## Install R
ENV R_BASE_VERSION 3.5.1
RUN apt-get install -y aptitude
RUN echo "deb http://http.debian.net/debian sid main" > /etc/apt/sources.list.d/debian-unstable.list \
    && aptitude install -y debian-keyring debian-archive-keyring
RUN apt-get remove -y binutils
RUN apt-get update \
    && apt-get install -t unstable -y --no-install-recommends \
    r-base=${R_BASE_VERSION}-*

# Install prerequisites of R packages
RUN apt-get install -y \
    gfortran \
    liblapack-dev \
    liblapack3 \
    libopenblas-base \
-    libopenblas-dev
+    libopenblas-dev \
+    g++
## Mount R dependency file into the docker container and install dependencies
# Use a MRAN snapshot URL to download packages archived on a specific date
RUN echo 'options(repos = list(CRAN = "http://mran.revolutionanalytics.com/snapshot/2018-09-01/"))' >> /etc/R/Rprofile.site
@@ -59,7 +51,7 @@ RUN Rscript install_R_dependencies.R
RUN rm install_R_dependencies.R
RUN rm conda_dependencies.yml

-RUN mkdir /TSPerf
-WORKDIR /TSPerf
+RUN mkdir /Forecasting
+WORKDIR /Forecasting

ENTRYPOINT ["/bin/bash"]
@@ -12,7 +12,7 @@
**Submission name:** Quantile Regression Neural Network

-**Submission path:** energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/fnn
+**Submission path:** benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn


## Implementation description
@@ -36,14 +36,14 @@ The data of January - April of 2016 were used as validation dataset for some min
### Description of implementation scripts

Train and Predict:
-* `feature_engineering.py`: Python script for computing features and generating feature files.
+* `compute_features.py`: Python script for computing features and generating feature files.
* `train_predict.R`: R script that trains Quantile Regression Neural Network models and predicts on each round of test data.
-* `train_score_vm.sh`: Bash script that runs `feature_engineering.py` and `train_predict.R` five times to generate five submission files and measure model running time.
+* `train_score_vm.sh`: Bash script that runs `compute_features.py` and `train_predict.R` five times to generate five submission files and measure model running time.

Tune hyperparameters using R:
* `cv_settings.json`: JSON script that sets cross validation folds.
* `train_validate.R`: R script that trains Quantile Regression Neural Network models, evaluates the loss on the validation data of each cross validation round and forecast round for a set of hyperparameters, and calculates the average loss. This script is used for grid search on the VM.
-* `train_validate_vm.sh`: Bash script that runs `feature_engineering.py` and `train_validate.R` multiple times to generate cross validation result files and measure model tuning time.
+* `train_validate_vm.sh`: Bash script that runs `compute_features.py` and `train_validate.R` multiple times to generate cross validation result files and measure model tuning time.

Tune hyperparameters using AzureML HyperDrive:
* `cv_settings.json`: JSON script that sets cross validation folds.
@@ -53,10 +53,10 @@ Tune hyperparameters using AzureML HyperDrive:

### Steps to reproduce results

-0. Follow the instructions [here](#resource-deployment-instructions) to provision a Linux virtual machine and log into the provisioned
+1. Follow the instructions [here](#resource-deployment-instructions) to provision a Linux virtual machine and log into the provisioned
VM.

-1. Clone the Forecasting repo to the home directory of your machine
+2. Clone the Forecasting repo to the home directory of your machine

```bash
cd ~
@@ -72,92 +72,91 @@ VM.
* [Git Credential Managers](https://docs.microsoft.com/en-us/vsts/repos/git/set-up-credential-managers?view=vsts)
* [Authenticate with SSH](https://docs.microsoft.com/en-us/vsts/repos/git/use-ssh-keys-to-authenticate?view=vsts)


-2. Create a conda environment for running the scripts of data downloading, data preparation, and result evaluation.
+3. Create a conda environment for running the scripts of data downloading, data preparation, and result evaluation.
To do this, you need to check if conda has been installed by running the command `conda -V`. If it is installed, you will see the conda version in the terminal. Otherwise, please follow the instructions [here](https://conda.io/docs/user-guide/install/linux.html) to install conda.
Then, you can go to the `~/Forecasting` directory in the VM and create a conda environment named `tsperf` by running

```bash
cd ~/Forecasting
-conda env create --file ./common/conda_dependencies.yml
+conda env create --file tsperf/benchmarking/conda_dependencies.yml
```

-3. Download and extract data **on the VM**.
+4. Download and extract data **on the VM**.

```bash
source activate tsperf
-python energy_load/GEFCom2017_D_Prob_MT_hourly/common/download_data.py
-python energy_load/GEFCom2017_D_Prob_MT_hourly/common/extract_data.py
+python tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/download_data.py
+python tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/extract_data.py
```

-4. Prepare Docker container for model training and predicting.
+5. Prepare Docker container for model training and predicting.
+
> NOTE: To execute docker commands without sudo as a non-root user, you need to create a Unix group and add users to it by following the instructions
[here](https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user). Otherwise, simply prefix all docker commands with sudo.

-4.1 Log into Azure Container Registry (ACR)
+5.1 Make sure Docker is installed
+
+You can check if Docker is installed on your VM by running

```bash
-sudo docker login --username tsperf --password <ACR Access Key> tsperf.azurecr.io
+sudo docker -v
```
+You will see the Docker version if Docker is installed. If not, you can install it by following the instructions [here](https://docs.docker.com/install/linux/docker-ce/ubuntu/).
+
+5.2 Build a local Docker image
+
+```bash
+sudo docker build -t fnn_image benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn
+```

-The `<ACR Access Key>` can be found [here](https://github.com/Microsoft/Forecasting/blob/master/common/key.txt).
+6. Tune Hyperparameters **within Docker container** or **with AzureML hyperdrive**.

-4.2 Pull the Docker image from ACR to your VM
+6.1.1 Start a Docker container from the image

```bash
-sudo docker pull tsperf.azurecr.io/energy_load/gefcom2017_d_prob_mt_hourly/fnn_image:v1
-```
-
-5. Tune Hyperparameters **within Docker container** or **with AzureML hyperdrive**.
-
-5.1.1 Start a Docker container from the image
-
-```bash
-sudo docker run -it -v ~/Forecasting:/Forecasting --name fnn_cv_container tsperf.azurecr.io/energy_load/gefcom2017_d_prob_mt_hourly/fnn_image:v1
+sudo docker run -it -v ~/Forecasting:/Forecasting --name fnn_cv_container fnn_image
```

Note that option `-v ~/Forecasting:/Forecasting` mounts the `~/Forecasting` folder (the one you cloned) to the container so that you can access the code and data on your VM within the container.
|
||||
|
||||
5.1.2 Train and validate
|
||||
6.1.2 Train and validate
|
||||
|
||||
```
|
||||
source activate tsperf
|
||||
cd /Forecasting
|
||||
nohup bash ./energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/fnn/train_validate_vm.sh >& cv_out.txt &
|
||||
nohup bash benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn/train_validate_vm.sh >& cv_out.txt &
|
||||
```
|
||||
After generating the cross validation results, you can exit the Docker container by command `exit`.
|
||||
|
||||
5.2 Do hyperparameter tuning with AzureML hyperdrive
|
||||
6.2 Do hyperparameter tuning with AzureML hyperdrive
|
||||
|
||||
To tune hyperparameters with AzureML hyperdrive, you don't need to create a local Docker container. You can do feature engineering on the VM by the command
|
||||
|
||||
```
|
||||
cd ~/Forecasting
|
||||
source activate tsperf
|
||||
python energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/fnn/feature_engineering.py
|
||||
python benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn/compute_features.py
|
||||
```
|
||||
and then run through the jupyter notebook `hyperparameter_tuning.ipynb` on the VM with the conda env `tsperf` as the jupyter kernel.
|
||||
|
||||
Based on the average pinball loss obtained at each set of hyperparameters, you can choose the best set of hyperparameters and use it in the Rscript of `train_predict.R`.
|
||||
|
||||
7. Train and predict **within Docker container**.

7.1 Start a Docker container from the image

```bash
sudo docker run -it -v ~/Forecasting:/Forecasting --name fnn_container fnn_image
```

Note that option `-v ~/Forecasting:/Forecasting` mounts the `~/Forecasting` folder (the one you cloned) to the container so that you can access the code and data on your VM within the container.
7.2 Train and predict

```bash
source activate tsperf
cd /Forecasting
nohup bash benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn/train_score_vm.sh >& out.txt &
```

The last command will take about 7 hours to complete. You can monitor its progress by checking the out.txt file. You can also disconnect from the VM during the run. After reconnecting to the VM, use the command

@ -168,12 +167,12 @@ Then, you can go to `~/Forecasting` directory in the VM and create a conda envir

to connect to the running container and check the status of the run.
After generating the forecast results, you can exit the Docker container with the command `exit`.
8. Model evaluation **on the VM**.

```bash
source activate tsperf
cd ~/Forecasting
bash tsperf/benchmarking/evaluate fnn tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly
```
## Implementation resources

@ -182,13 +181,13 @@ Then, you can go to `~/Forecasting` directory in the VM and create a conda envir

**Resource location:** East US region
**Hardware:** Standard D8s v3 (8 vcpus, 32 GB memory) Ubuntu Linux VM
**Data storage:** Premium SSD
**Docker image:** tsperf.azurecr.io/energy_load/gefcom2017_d_prob_mt_hourly/fnn_image:v1
**Dockerfile:** [energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/fnn/Dockerfile](https://github.com/Microsoft/Forecasting/blob/master/energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/fnn/Dockerfile)

**Key packages/dependencies:**
* Python
  - python==3.7
* R
  - r-base==3.5.3
  - qrnn==2.0.2
  - data.table==1.10.4.3
  - rjson==0.2.20 (optional for cv)
@ -202,43 +201,43 @@ Please follow the instructions below to deploy the Linux DSVM.

## Implementation evaluation
**Quality:**

* Pinball loss run 1: 79.54
* Pinball loss run 2: 78.32
* Pinball loss run 3: 80.06
* Pinball loss run 4: 80.12
* Pinball loss run 5: 80.13
* Median Pinball loss: 80.06

**Time:**

* Run time 1: 1092 seconds
* Run time 2: 1085 seconds
* Run time 3: 1062 seconds
* Run time 4: 1083 seconds
* Run time 5: 1110 seconds
* Median run time: 1085 seconds

**Cost:**
The hourly cost of the Standard D8s Ubuntu Linux VM in East US Azure region is 0.3840 USD, based on the price at the submission date.
Thus, the total cost is 1085/3600 * 0.3840 = $0.1157.
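The cost figure above is simply the median run time (in hours) multiplied by the hourly VM price. As a quick check:

```python
# Standard D8s v3, East US, price at the submission date (from the text above)
hourly_price_usd = 0.3840
median_run_time_sec = 1085

total_cost_usd = median_run_time_sec / 3600 * hourly_price_usd
print(round(total_cost_usd, 4))  # 0.1157
```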
**Average relative improvement (in %) over GEFCom2017 benchmark model** (measured over the first run)
Round 1: 6.13
Round 2: 19.20
Round 3: 18.86
Round 4: 3.84
Round 5: 2.76
Round 6: 11.10

**Ranking in the qualifying round of GEFCom2017 competition**
4
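The relative-improvement figures above express the percent reduction in pinball loss compared with the GEFCom2017 benchmark model. A sketch of the computation, with made-up loss values (the actual benchmark losses are not listed in this document):

```python
def relative_improvement(benchmark_loss, model_loss):
    """Percent reduction in pinball loss relative to the benchmark model."""
    return (benchmark_loss - model_loss) / benchmark_loss * 100

# e.g. a model loss of 75.0 against a hypothetical benchmark loss of 80.0
improvement = relative_improvement(80.0, 75.0)
print(round(improvement, 2))  # 6.25
```

A negative value (as in one QRF round below) means the model did worse than the benchmark in that round.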
@ -0,0 +1,70 @@
"""
This script passes the input arguments of the AzureML job to the R script train_validate_aml.R,
and then passes the output of train_validate_aml.R back to AzureML.
"""

import subprocess
import os
import sys
import getopt
import pandas as pd
from datetime import datetime
from azureml.core import Run
import time

start_time = time.time()
run = Run.get_submitted_run()

base_command = "Rscript train_validate_aml.R"

if __name__ == "__main__":
    opts, args = getopt.getopt(
        sys.argv[1:], "", ["path=", "cv_path=", "n_hidden_1=", "n_hidden_2=", "iter_max=", "penalty="]
    )
    for opt, arg in opts:
        if opt == "--path":
            path = arg
        elif opt == "--cv_path":
            cv_path = arg
        elif opt == "--n_hidden_1":
            n_hidden_1 = arg
        elif opt == "--n_hidden_2":
            n_hidden_2 = arg
        elif opt == "--iter_max":
            iter_max = arg
        elif opt == "--penalty":
            penalty = arg
    time_stamp = datetime.now().strftime("%Y%m%d%H%M%S")
    task = " ".join(
        [
            base_command,
            "--path",
            path,
            "--cv_path",
            cv_path,
            "--n_hidden_1",
            n_hidden_1,
            "--n_hidden_2",
            n_hidden_2,
            "--iter_max",
            iter_max,
            "--penalty",
            penalty,
            "--time_stamp",
            time_stamp,
        ]
    )
    process = subprocess.call(task, shell=True)

    output_file_name = "cv_output_" + time_stamp + ".csv"
    result = pd.read_csv(os.path.join(cv_path, output_file_name))

    APL = result["loss"].mean()

    print(APL)
    print("--- %s seconds ---" % (time.time() - start_time))

    run.log("average pinball loss", APL)
@ -0,0 +1,68 @@
"""
This script uses
energy_load/GEFCom2017_D_Prob_MT_hourly/common/feature_engineering.py to
compute a list of features needed by the Feed-forward Neural Network model.
"""

import os
import sys
import getopt

import localpath

from tsperf.benchmarking.GEFCom2017_D_Prob_MT_hourly.feature_engineering import compute_features

SUBMISSIONS_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
DATA_DIR = os.path.join(SUBMISSIONS_DIR, "data")
print("Data directory used: {}".format(DATA_DIR))

OUTPUT_DIR = os.path.join(DATA_DIR, "features")
TRAIN_DATA_DIR = os.path.join(DATA_DIR, "train")
TEST_DATA_DIR = os.path.join(DATA_DIR, "test")

DF_CONFIG = {
    "time_col_name": "Datetime",
    "ts_id_col_names": "Zone",
    "target_col_name": "DEMAND",
    "frequency": "H",
    "time_format": "%Y-%m-%d %H:%M:%S",
}

# Feature configuration list used to specify the features to be computed by
# compute_features.
# Each feature configuration is a tuple in the format of (feature_name,
# featurizer_args).
# feature_name is used to determine the featurizer to use, see FEATURE_MAP in
# energy_load/GEFCom2017_D_Prob_MT_hourly/common/feature_engineering.py.
# featurizer_args is a dictionary of arguments passed to the featurizer.
feature_config_list = [
    ("temporal", {"feature_list": ["hour_of_day", "month_of_year"]}),
    ("annual_fourier", {"n_harmonics": 3}),
    ("weekly_fourier", {"n_harmonics": 3}),
    ("previous_year_load_lag", {"input_col_names": "DEMAND", "round_agg_result": True}),
    ("previous_year_temp_lag", {"input_col_names": "DryBulb", "round_agg_result": True}),
]


if __name__ == "__main__":
    opts, args = getopt.getopt(sys.argv[1:], "", ["submission="])
    for opt, arg in opts:
        if opt == "--submission":
            submission_folder = arg
            output_data_dir = os.path.join(SUBMISSIONS_DIR, submission_folder, "data")
            if not os.path.isdir(output_data_dir):
                os.mkdir(output_data_dir)
            OUTPUT_DIR = os.path.join(output_data_dir, "features")
            if not os.path.isdir(OUTPUT_DIR):
                os.mkdir(OUTPUT_DIR)

    compute_features(
        TRAIN_DATA_DIR,
        TEST_DATA_DIR,
        OUTPUT_DIR,
        DF_CONFIG,
        feature_config_list,
        filter_by_month=True,
        compute_load_ratio=True,
    )
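The `annual_fourier` and `weekly_fourier` entries in the configuration above generate pairs of seasonal sine/cosine harmonics. A minimal sketch of what such a featurizer computes — the column names `weekly_sin_k`/`weekly_cos_k` match those consumed by the R scripts, but this implementation is illustrative, not the repo's:

```python
import numpy as np
import pandas as pd

def fourier_features(timestamps, period_hours, n_harmonics, prefix):
    """Return sin/cos harmonics of the given period for hourly timestamps."""
    hours = (timestamps - timestamps.min()) / pd.Timedelta(hours=1)
    out = {}
    for k in range(1, n_harmonics + 1):
        angle = 2 * np.pi * k * hours / period_hours
        out["{}_sin_{}".format(prefix, k)] = np.sin(angle)
        out["{}_cos_{}".format(prefix, k)] = np.cos(angle)
    return pd.DataFrame(out, index=timestamps)

# One week of hourly timestamps -> 3 weekly harmonics = 6 feature columns
ts = pd.date_range("2016-01-01", periods=24 * 7, freq="H")
weekly = fourier_features(ts, period_hours=24 * 7, n_harmonics=3, prefix="weekly")
```

With `n_harmonics=3` for both the annual and weekly periods, this yields the twelve `annual_*`/`weekly_*` columns used as model inputs later in this document.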
@ -6,4 +6,5 @@ dependencies:
- numpy=1.15.1
- pandas=0.23.4
- xlrd=1.1.0
- urllib3=1.21.1
- scikit-learn=0.20.3
@ -1243,9 +1243,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:tsperf]",
"language": "python",
"name": "conda-env-tsperf-py"
},
"language_info": {
"codemirror_mode": {
@ -0,0 +1,7 @@
pkgs <- c(
    'data.table',
    'qrnn',
    'doParallel'
)

install.packages(pkgs)
@ -4,10 +4,9 @@ all the modules in TSPerf. Each submission folder needs its own localpath.py fil
"""

import os, sys

_CURR_DIR = os.path.dirname(os.path.abspath(__file__))
TSPERF_DIR = os.path.dirname(os.path.dirname(os.path.dirname(_CURR_DIR)))

if TSPERF_DIR not in sys.path:
    sys.path.insert(0, TSPERF_DIR)
@ -0,0 +1,107 @@
#!/usr/bin/Rscript
#
# This script trains the Quantile Regression Neural Network model and predicts on each data
# partition per zone and hour at each quantile point.

args = commandArgs(trailingOnly=TRUE)
seed_value = args[1]

library('data.table')
library('qrnn')
library('doParallel')

n_cores = detectCores()

cl <- parallel::makeCluster(n_cores)
parallel::clusterEvalQ(cl, lapply(c("qrnn", "data.table"), library, character.only = TRUE))
registerDoParallel(cl)

# Specify data directory
data_dir = 'benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn/data/features'
train_dir = file.path(data_dir, 'train')
test_dir = file.path(data_dir, 'test')

train_file_prefix = 'train_round_'
test_file_prefix = 'test_round_'

output_file = file.path(paste('benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn/submission_seed_', seed_value, '.csv', sep=""))

# Data and forecast parameters
normalize_columns = list('DEMAND_same_woy_lag', 'DryBulb_same_doy_lag')
quantiles = seq(0.1, 0.9, by = 0.1)

# Train and predict
result_all = list()
for (iR in 1:6){
  print(paste('Round', iR))
  train_file = file.path(train_dir, paste(train_file_prefix, iR, '.csv', sep=''))
  test_file = file.path(test_dir, paste(test_file_prefix, iR, '.csv', sep=''))

  train_df = fread(train_file)
  test_df = fread(test_file)

  for (c in normalize_columns){
    min_c = min(train_df[, ..c])
    max_c = max(train_df[, ..c])
    train_df[, c] = (train_df[, ..c] - min_c)/(max_c - min_c)
    test_df[, c] = (test_df[, ..c] - min_c)/(max_c - min_c)
  }

  zones = unique(train_df[, Zone])
  hours = unique(train_df[, hour_of_day])
  all_zones_hours = expand.grid(zones, hours)
  colnames(all_zones_hours) = c('Zone', 'hour_of_day')

  test_df$average_load_ratio = rowMeans(test_df[, c('recent_load_ratio_10', 'recent_load_ratio_11', 'recent_load_ratio_12',
                                                    'recent_load_ratio_13', 'recent_load_ratio_14', 'recent_load_ratio_15',
                                                    'recent_load_ratio_16')], na.rm=TRUE)

  test_df[, load_ratio:=mean(average_load_ratio), by=list(hour_of_day, month_of_year)]

  result_all_zones_hours = foreach(i = 1:nrow(all_zones_hours), .combine = rbind) %dopar% {
    set.seed(seed_value)
    z = all_zones_hours[i, 'Zone']
    h = all_zones_hours[i, 'hour_of_day']
    train_df_sub = train_df[Zone == z & hour_of_day == h]
    test_df_sub = test_df[Zone == z & hour_of_day == h]

    train_x <- as.matrix(train_df_sub[, c('DEMAND_same_woy_lag', 'DryBulb_same_doy_lag',
                                          'annual_sin_1', 'annual_cos_1', 'annual_sin_2', 'annual_cos_2',
                                          'annual_sin_3', 'annual_cos_3',
                                          'weekly_sin_1', 'weekly_cos_1', 'weekly_sin_2', 'weekly_cos_2',
                                          'weekly_sin_3', 'weekly_cos_3'),
                          drop=FALSE])
    train_y <- as.matrix(train_df_sub[, c('DEMAND'), drop=FALSE])

    test_x <- as.matrix(test_df_sub[, c('DEMAND_same_woy_lag', 'DryBulb_same_doy_lag',
                                        'annual_sin_1', 'annual_cos_1', 'annual_sin_2', 'annual_cos_2',
                                        'annual_sin_3', 'annual_cos_3',
                                        'weekly_sin_1', 'weekly_cos_1', 'weekly_sin_2', 'weekly_cos_2',
                                        'weekly_sin_3', 'weekly_cos_3'),
                         drop=FALSE])

    result_all_quantiles = list()
    q_counter = 1
    for (tau in quantiles){
      result = data.table(Zone=test_df_sub$Zone, Datetime=test_df_sub$Datetime, Round=iR)

      model = qrnn2.fit(x=train_x, y=train_y,
                        n.hidden=8, n.hidden2=4,
                        tau=tau, Th=tanh,
                        iter.max=1,
                        penalty=0)

      result$Prediction = qrnn2.predict(model, x=test_x) * test_df_sub$load_ratio
      result$q = tau

      result_all_quantiles[[q_counter]] = result
      q_counter = q_counter + 1
    }
    rbindlist(result_all_quantiles)
  }
  result_all[[iR]] = result_all_zones_hours
}

result_final = rbindlist(result_all)
# Sort the quantiles
result_final = result_final[order(Prediction), q:=quantiles, by=c('Zone', 'Datetime', 'Round')]
result_final$Prediction = round(result_final$Prediction)

fwrite(result_final, output_file)
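The "Sort the quantiles" step at the end of the script above reassigns the quantile labels to the nine predictions of each (Zone, Datetime, Round) group in ascending order of the predicted value, which removes any quantile crossing in the final forecast. The same idea in Python, with made-up numbers (illustrative, not the repo's code):

```python
import numpy as np
import pandas as pd

quantiles = np.round(np.arange(0.1, 1.0, 0.1), 1)

def sort_quantiles(group):
    """Relabel predictions with quantiles 0.1..0.9 in ascending order,
    so predictions are non-decreasing in the quantile level."""
    group = group.sort_values("Prediction").copy()
    group["q"] = quantiles[: len(group)]
    return group

df = pd.DataFrame({
    "Zone": ["CT"] * 9,
    "Datetime": ["2017-01-01 00:00:00"] * 9,
    "Round": [1] * 9,
    "Prediction": [3100.0, 3000.0, 3050.0, 3200.0, 3150.0,
                   3300.0, 3250.0, 3400.0, 3350.0],
    "q": quantiles,
})
fixed = df.groupby(["Zone", "Datetime", "Round"], group_keys=False).apply(sort_quantiles)
# after relabeling, predictions increase monotonically with q
```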
@ -1,14 +1,14 @@
#!/bin/bash
path=benchmarks/GEFCom2017_D_Prob_MT_hourly
for i in `seq 1 5`;
do
    echo "Run $i"
    start=`date +%s`
    echo 'Creating features...'
    python $path/fnn/compute_features.py --submission fnn

    echo 'Training and predicting...'
    Rscript $path/fnn/train_predict.R $i

    end=`date +%s`
    echo 'Running time '$((end-start))' seconds'
@ -20,7 +20,7 @@ parallel::clusterEvalQ(cl, lapply(c("qrnn", "data.table"), library, character.on
registerDoParallel(cl)

# Specify data directory
data_dir = 'benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn/data/features'
train_dir = file.path(data_dir, 'train')

train_file_prefix = 'train_round_'
@ -45,10 +45,10 @@ for (j in 1:length(parameter_names)){
  output_file_name = paste(output_file_name, parameter_names[j], parameter_values[j], sep="_")
}

output_file = file.path(paste('benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn/', output_file_name, sep=""))

# Define cross validation split settings
cv_file = file.path(paste('benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn/', 'cv_settings.json', sep=""))
cv_settings = fromJSON(file=cv_file)

# Parameters of model
@ -58,13 +58,13 @@ iter.max = as.integer(param_grid[parameter_set, 'iter.max'])
penalty = as.integer(param_grid[parameter_set, 'penalty'])

# Data and forecast parameters
features = c('DEMAND_same_woy_lag', 'DryBulb_same_doy_lag',
             'annual_sin_1', 'annual_cos_1', 'annual_sin_2',
             'annual_cos_2', 'annual_sin_3', 'annual_cos_3',
             'weekly_sin_1', 'weekly_cos_1', 'weekly_sin_2',
             'weekly_cos_2', 'weekly_sin_3', 'weekly_cos_3')

normalize_columns = list('DEMAND_same_woy_lag', 'DryBulb_same_doy_lag')
quantiles = seq(0.1, 0.9, by = 0.1)
subset_columns_train = c(features, 'DEMAND')
subset_columns_validation = c(features, 'DEMAND', 'Zone', 'Datetime', 'LoadRatio')
@ -97,7 +97,7 @@ for (i in 1:length(cv_settings)){
  validation_data = cvdata_df[Datetime >= validation_range[1] & Datetime <= validation_range[2]]

  zones = unique(validation_data$Zone)
  hours = unique(validation_data$hour_of_day)

  for (c in normalize_columns){
    min_c = min(train_data[, ..c])
@ -106,9 +106,9 @@ for (i in 1:length(cv_settings)){
    validation_data[, c] = (validation_data[, ..c] - min_c)/(max_c - min_c)
  }

  validation_data$AverageLoadRatio = rowMeans(validation_data[, c('recent_load_ratio_10', 'recent_load_ratio_11', 'recent_load_ratio_12',
                                                                  'recent_load_ratio_13', 'recent_load_ratio_14', 'recent_load_ratio_15',
                                                                  'recent_load_ratio_16')], na.rm=TRUE)
  validation_data[, LoadRatio:=mean(AverageLoadRatio), by=list(hour_of_day, month_of_year)]

  result_all_zones = foreach(z = zones, .combine = rbind) %dopar% {
    print(paste('Zone', z))
@ -117,8 +117,8 @@ for (i in 1:length(cv_settings)){
    hour_counter = 1

    for (h in hours){
      train_df_sub = train_data[Zone == z & hour_of_day == h, ..subset_columns_train]
      validation_df_sub = validation_data[Zone == z & hour_of_day == h, ..subset_columns_validation]

      result = data.table(Zone=validation_df_sub$Zone, Datetime=validation_df_sub$Datetime, Round=iR, CVRound=i)
@ -165,7 +165,7 @@ print(paste('Average Pinball Loss:', average_PL))
output_file_name = paste(output_file_name, 'APL', average_PL, sep="_")
output_file_name = paste(output_file_name, '.csv', sep="")

output_file = file.path(paste('benchmarks/GEFCom2017_D_Prob_MT_hourly/fnn/', output_file_name, sep=""))

fwrite(result_final, output_file)
@ -59,7 +59,7 @@ cv_settings = fromJSON(file=cv_file)

# Data and forecast parameters
normalize_columns = list('DEMAND_same_woy_lag', 'DryBulb_same_doy_lag')
quantiles = seq(0.1, 0.9, by = 0.1)
@ -101,26 +101,26 @@ for (i in 1:length(cv_settings)){
    validation_data[, c] = (validation_data[, ..c] - min_c)/(max_c - min_c)
  }

  validation_data$average_load_ratio = rowMeans(validation_data[, c('recent_load_ratio_10', 'recent_load_ratio_11', 'recent_load_ratio_12',
                                                                    'recent_load_ratio_13', 'recent_load_ratio_14', 'recent_load_ratio_15',
                                                                    'recent_load_ratio_16')], na.rm=TRUE)
  validation_data[, load_ratio:=mean(average_load_ratio), by=list(hour_of_day, month_of_year)]

  result_all_zones = foreach(z = zones, .combine = rbind) %dopar% {
    print(paste('Zone', z))

    features = c('DEMAND_same_woy_lag', 'DryBulb_same_doy_lag',
                 'annual_sin_1', 'annual_cos_1', 'annual_sin_2',
                 'annual_cos_2', 'annual_sin_3', 'annual_cos_3',
                 'weekly_sin_1', 'weekly_cos_1', 'weekly_sin_2',
                 'weekly_cos_2', 'weekly_sin_3', 'weekly_cos_3')
    subset_columns_train = c(features, 'DEMAND')
    subset_columns_validation = c(features, 'DEMAND', 'Zone', 'Datetime', 'load_ratio')

    result_all_hours = list()
    hour_counter = 1
    for (h in hours){
      train_df_sub = train_data[Zone == z & hour_of_day == h, ..subset_columns_train]
      validation_df_sub = validation_data[Zone == z & hour_of_day == h, ..subset_columns_validation]

      result = data.table(Zone=validation_df_sub$Zone, Datetime=validation_df_sub$Datetime, Round=iR, CVRound=i)
@ -140,7 +140,7 @@ for (i in 1:length(cv_settings)){
                        iter.max=iter.max,
                        penalty=penalty)

      result$Prediction = qrnn2.predict(model, x=validation_x) * validation_df_sub$load_ratio
      result$DEMAND = validation_df_sub$DEMAND
      result$loss = pinball_loss(tau, validation_df_sub$DEMAND, result$Prediction)
      result$q = tau
@ -1,14 +1,14 @@
#!/bin/bash
path=benchmarks/GEFCom2017_D_Prob_MT_hourly
for i in `seq 1 40`;
do
    echo "Parameter Set $i"
    start=`date +%s`
    echo 'Creating features...'
    python $path/fnn/compute_features.py --submission fnn

    echo 'Training and validation...'
    Rscript $path/fnn/train_validate.R $i

    end=`date +%s`
    echo 'Running time '$((end-start))' seconds'
@ -1,5 +1,5 @@
## Download base image
FROM continuumio/anaconda3:5.3.0
ADD ./conda_dependencies.yml /tmp
WORKDIR /tmp

@ -14,7 +14,6 @@ RUN apt-get install -y --no-install-recommends \
    libreadline-gplv2-dev \
    libncursesw5-dev \
    libsqlite3-dev \
    tk-dev \
    libgdbm-dev \
    libc6-dev \
    libbz2-dev \

@ -35,7 +34,7 @@ RUN conda env create --file conda_dependencies.yml

RUN rm conda_dependencies.yml

RUN mkdir /Forecasting
WORKDIR /Forecasting

ENTRYPOINT ["/bin/bash"]
@ -12,7 +12,7 @@

**Submission name:** Quantile Random Forest

**Submission path:** benchmarks/GEFCom2017_D_Prob_MT_hourly/qrf


## Implementation description
@ -42,81 +42,78 @@ We used 2 validation time frames, the first one in January-April 2015, the secon

### Description of implementation scripts

* `compute_features.py`: Python script for computing features and generating feature files.
* `train_score.py`: Python script that trains Quantile Random Forest models and predicts on each round of test data.
* `train_score_vm.sh`: Bash script that runs `compute_features.py` and `train_score.py` five times to generate five submission files and measure model running time.

### Steps to reproduce results

1. Follow the instructions [here](#resource-deployment-instructions) to provision a Linux Data Science Virtual Machine and log into it.

2. Clone the Forecasting repo to the home directory of your machine

```bash
cd ~
git clone https://github.com/Microsoft/Forecasting.git
```
Use one of the following options to securely connect to the Git repo:
* [Personal Access Tokens](https://help.github.com/articles/creating-a-personal-access-token-for-the-command-line/)
For this method, the clone command becomes
```bash
git clone https://<username>:<personal access token>@github.com/Microsoft/Forecasting.git
```
* [Git Credential Managers](https://github.com/Microsoft/Git-Credential-Manager-for-Windows)
* [Authenticate with SSH](https://help.github.com/articles/connecting-to-github-with-ssh/)

3. Create a conda environment for running the scripts of data downloading, data preparation, and result evaluation.
To do this, you need to check if conda has been installed by running the command `conda -V`. If it is installed, you will see the conda version in the terminal. Otherwise, please follow the instructions [here](https://conda.io/docs/user-guide/install/linux.html) to install conda.
Then, you can go to the `~/Forecasting` directory in the VM and create a conda environment named `tsperf` by running

```bash
cd ~/Forecasting
conda env create --file tsperf/benchmarking/conda_dependencies.yml
```

4. Download and extract data **on the VM**.

```bash
source activate tsperf
python tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/download_data.py
python tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly/extract_data.py
```

5. Prepare Docker container for model training and predicting.

5.1 Make sure Docker is installed

You can check if Docker is installed on your VM by running

```bash
sudo docker -v
```
You will see the Docker version if Docker is installed. If not, you can install it by following the instructions [here](https://docs.docker.com/install/linux/docker-ce/ubuntu/). Note that if you want to execute Docker commands without sudo as a non-root user, you need to create a Unix group and add users to it by following the instructions [here](https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user).

5.2 Build a local Docker image

```bash
sudo docker build -t qrf_image benchmarks/GEFCom2017_D_Prob_MT_hourly/qrf
```

6. Train and predict **within Docker container**

6.1 Start a Docker container from the image

```bash
sudo docker run -it -v ~/Forecasting:/Forecasting --name qrf_container qrf_image
```

Note that option `-v ~/Forecasting:/Forecasting` mounts the `~/Forecasting` folder (the one you cloned) to the container so that you can access the code and data on your VM within the container.

6.2 Train and predict

```bash
source activate tsperf
cd /Forecasting
nohup bash benchmarks/GEFCom2017_D_Prob_MT_hourly/qrf/train_score_vm.sh >& out.txt &
```
The last command will take about 31 hours to complete. You can monitor its progress by checking the out.txt file. You can also disconnect from the VM during the run. After reconnecting to the VM, use the command
@ -127,12 +124,12 @@ Then, you can go to `~/Forecasting` directory in the VM and create a conda envir

to connect to the running container and check the status of the run.
After generating the forecast results, you can exit the Docker container with the command `exit`.

7. Model evaluation **on the VM**

```bash
source activate tsperf
cd ~/Forecasting
bash tsperf/benchmarking/evaluate qrf tsperf/benchmarking/GEFCom2017_D_Prob_MT_hourly
```

## Implementation resources
@ -141,7 +138,7 @@ Then, you can go to `~/Forecasting` directory in the VM and create a conda envir

**Resource location:** East US region
**Hardware:** F72s v2 (72 vcpus, 144 GB memory) Ubuntu Linux VM
**Data storage:** Standard SSD
**Docker image:** tsperf.azurecr.io/energy_load/gefcom2017_d_prob_mt_hourly/qrf_image
**Dockerfile:** [energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/qrf/Dockerfile](https://github.com/Microsoft/Forecasting/blob/master/energy_load/GEFCom2017_D_Prob_MT_hourly/submissions/qrf/Dockerfile)

**Key packages/dependencies:**
* Python
@ -162,43 +159,43 @@ Please follow the instructions below to deploy the Linux DSVM.
|
|||
## Implementation evaluation
|
||||
**Quality:**
|
||||
|
||||
* Pinball loss run 1: 76.48
|
||||
* Pinball loss run 1: 76.29
|
||||
|
||||
* Pinball loss run 2: 76.49
|
||||
* Pinball loss run 2: 76.29
|
||||
|
||||
* Pinball loss run 3: 76.43
|
||||
* Pinball loss run 3: 76.18
|
||||
|
||||
* Pinball loss run 4: 76.47
|
||||
* Pinball loss run 4: 76.23
|
||||
|
||||
* Pinball loss run 5: 76.6
|
||||
* Pinball loss run 5: 76.38
|
||||
|
||||
* Median Pinball loss: 76.48
|
||||
* Median Pinball loss: 76.29
|
||||
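The quality numbers above are pinball losses averaged over all forecasts. The official metric is produced by the evaluation script described below; purely as a reference, a minimal sketch of the per-point quantile (pinball) loss:

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball loss of a single quantile forecast y_pred at quantile level q in (0, 1)."""
    diff = y_true - y_pred
    # Under-forecasts are penalized by q, over-forecasts by (1 - q)
    return q * diff if diff >= 0 else (q - 1) * diff

# Example: a 0.9-quantile forecast penalizes under-forecasting more heavily
print(pinball_loss(10.0, 8.0, 0.9))  # 1.8
print(pinball_loss(8.0, 10.0, 0.9))  # 0.2
```

The reported benchmark scores are this loss averaged over all zones, rounds, timestamps, and the nine quantiles 0.1, ..., 0.9.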
|
||||
**Time:**
|
||||
|
||||
* Run time 1: 22289 seconds
|
||||
* Run time 1: 20119 seconds
|
||||
|
||||
* Run time 2: 22493 seconds
|
||||
* Run time 2: 20489 seconds
|
||||
|
||||
* Run time 3: 22859 seconds
|
||||
* Run time 3: 20616 seconds
|
||||
|
||||
* Run time 4: 22709 seconds
|
||||
* Run time 4: 20297 seconds
|
||||
|
||||
* Run time 5: 23197 seconds
|
||||
* Run time 5: 20322 seconds
|
||||
|
||||
* Median run time: 22709 seconds (6.3 hours)
|
||||
* Median run time: 20322 seconds (5.65 hours)
|
||||
|
||||
**Cost:**
|
||||
The hourly cost of the F72s v2 Ubuntu Linux VM in East US Azure region is 3.045 USD, based on the price at the submission date.
|
||||
Thus, the total cost is 22709/3600 * 3.045 = 19.21 USD.
|
||||
Thus, the total cost is 20322/3600 * 3.045 = 17.19 USD.
|
||||
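The total cost follows directly from the median run time and the hourly VM price; a quick sketch of the arithmetic:

```python
# Total cost = (median run time in hours) * (hourly VM price)
hourly_price_usd = 3.045      # F72s v2, East US, price at the submission date
median_run_time_s = 20322     # median of the five runs, in seconds

total_cost = median_run_time_s / 3600 * hourly_price_usd
print(round(total_cost, 2))   # 17.19
```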
|
||||
**Average relative improvement (in %) over GEFCom2017 benchmark model** (measured over the first run)
|
||||
Round 1: 16.84
|
||||
Round 2: 14.98
|
||||
Round 3: 12.08
|
||||
Round 4: 14.97
|
||||
Round 5: 16.16
|
||||
Round 6: -2.52
|
||||
Round 1: 16.89
|
||||
Round 2: 14.93
|
||||
Round 3: 12.34
|
||||
Round 4: 14.95
|
||||
Round 5: 16.19
|
||||
Round 6: -0.32
|
||||
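The per-round figures above compare this model's pinball loss against the GEFCom2017 benchmark model. The benchmark losses themselves are not listed here, so the following is only an illustrative sketch of the formula:

```python
def relative_improvement(benchmark_loss, model_loss):
    """Percent improvement of the model over the benchmark (positive means better)."""
    return (benchmark_loss - model_loss) / benchmark_loss * 100

# Hypothetical example: a model loss of 80 against a benchmark loss of 100
print(relative_improvement(100.0, 80.0))  # 20.0
```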
|
||||
**Ranking in the qualifying round of GEFCom2017 competition**
|
||||
3
|
|
@ -0,0 +1,94 @@
|
|||
"""
|
||||
This script uses
|
||||
energy_load/GEFCom2017_D_Prob_MT_hourly/common/feature_engineering.py to
|
||||
compute a list of features needed by the Quantile Regression model.
|
||||
"""
|
||||
import os
|
||||
import sys
|
||||
import getopt
|
||||
|
||||
import localpath
|
||||
|
||||
from tsperf.benchmarking.GEFCom2017_D_Prob_MT_hourly.feature_engineering import compute_features
|
||||
|
||||
SUBMISSIONS_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
|
||||
DATA_DIR = os.path.join(SUBMISSIONS_DIR, "data")
|
||||
print("Data directory used: {}".format(DATA_DIR))
|
||||
|
||||
OUTPUT_DIR = os.path.join(DATA_DIR, "features")
|
||||
TRAIN_DATA_DIR = os.path.join(DATA_DIR, "train")
|
||||
TEST_DATA_DIR = os.path.join(DATA_DIR, "test")
|
||||
|
||||
DF_CONFIG = {
|
||||
"time_col_name": "Datetime",
|
||||
"ts_id_col_names": "Zone",
|
||||
"target_col_name": "DEMAND",
|
||||
"frequency": "H",
|
||||
"time_format": "%Y-%m-%d %H:%M:%S",
|
||||
}
|
||||
|
||||
HOLIDAY_COLNAME = "Holiday"
|
||||
|
||||
# Feature configuration list used to specify the features to be computed by
|
||||
# compute_features.
|
||||
# Each feature configuration is a tuple in the format of (feature_name,
|
||||
# featurizer_args)
|
||||
# feature_name is used to determine the featurizer to use, see FEATURE_MAP in
|
||||
# energy_load/GEFCom2017_D_Prob_MT_hourly/common/feature_engineering.py
|
||||
# featurizer_args is a dictionary of arguments passed to the
|
||||
# featurizer
|
||||
feature_config_list = [
|
||||
(
|
||||
"temporal",
|
||||
{
|
||||
"feature_list": [
|
||||
"hour_of_day",
|
||||
"day_of_week",
|
||||
"day_of_month",
|
||||
"normalized_hour_of_year",
|
||||
"week_of_year",
|
||||
"month_of_year",
|
||||
]
|
||||
},
|
||||
),
|
||||
("annual_fourier", {"n_harmonics": 3}),
|
||||
("weekly_fourier", {"n_harmonics": 3}),
|
||||
("daily_fourier", {"n_harmonics": 2}),
|
||||
("normalized_date", {}),
|
||||
("normalized_datehour", {}),
|
||||
("normalized_year", {}),
|
||||
("day_type", {"holiday_col_name": HOLIDAY_COLNAME}),
|
||||
("previous_year_load_lag", {"input_col_names": "DEMAND", "round_agg_result": True},),
|
||||
("previous_year_temp_lag", {"input_col_names": ["DryBulb", "DewPnt"], "round_agg_result": True},),
|
||||
(
|
||||
"recent_load_lag",
|
||||
{"input_col_names": "DEMAND", "start_week": 10, "window_size": 4, "agg_count": 8, "round_agg_result": True,},
|
||||
),
|
||||
(
|
||||
"recent_temp_lag",
|
||||
{
|
||||
"input_col_names": ["DryBulb", "DewPnt"],
|
||||
"start_week": 10,
|
||||
"window_size": 4,
|
||||
"agg_count": 8,
|
||||
"round_agg_result": True,
|
||||
},
|
||||
),
|
||||
]
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
opts, args = getopt.getopt(sys.argv[1:], "", ["submission="])
|
||||
for opt, arg in opts:
|
||||
if opt == "--submission":
|
||||
submission_folder = arg
|
||||
output_data_dir = os.path.join(SUBMISSIONS_DIR, submission_folder, "data")
|
||||
if not os.path.isdir(output_data_dir):
|
||||
os.mkdir(output_data_dir)
|
||||
OUTPUT_DIR = os.path.join(output_data_dir, "features")
|
||||
if not os.path.isdir(OUTPUT_DIR):
|
||||
os.mkdir(OUTPUT_DIR)
|
||||
|
||||
compute_features(
|
||||
TRAIN_DATA_DIR, TEST_DATA_DIR, OUTPUT_DIR, DF_CONFIG, feature_config_list, filter_by_month=False,
|
||||
)
|
|
@ -9,3 +9,4 @@ dependencies:
|
|||
- urllib3=1.21.1
|
||||
- scikit-garden=0.1.3
|
||||
- joblib=0.12.5
|
||||
- scikit-learn=0.20.3
|
|
@ -13,6 +13,7 @@ from skgarden.quantile.tree import DecisionTreeQuantileRegressor
|
|||
from skgarden.quantile.ensemble import generate_sample_indices
|
||||
from ensemble_parallel_utils import weighted_percentile_vectorized
|
||||
|
||||
|
||||
class BaseForestQuantileRegressor(ForestRegressor):
|
||||
"""Training and scoring of Quantile Regression Random Forest
|
||||
|
||||
|
@ -34,6 +35,7 @@ class BaseForestQuantileRegressor(ForestRegressor):
|
|||
a weight of zero when estimator j is fit, then the value is -1.
|
||||
|
||||
"""
|
||||
|
||||
def fit(self, X, y):
|
||||
"""Builds a forest from the training set (X, y).
|
||||
|
||||
|
@ -68,8 +70,7 @@ class BaseForestQuantileRegressor(ForestRegressor):
|
|||
Returns self.
|
||||
"""
|
||||
# apply method requires X to be of dtype np.float32
|
||||
X, y = check_X_y(
|
||||
X, y, accept_sparse="csc", dtype=np.float32, multi_output=False)
|
||||
X, y = check_X_y(X, y, accept_sparse="csc", dtype=np.float32, multi_output=False)
|
||||
super(BaseForestQuantileRegressor, self).fit(X, y)
|
||||
|
||||
self.y_train_ = y
|
||||
|
@ -78,8 +79,7 @@ class BaseForestQuantileRegressor(ForestRegressor):
|
|||
|
||||
for i, est in enumerate(self.estimators_):
|
||||
if self.bootstrap:
|
||||
bootstrap_indices = generate_sample_indices(
|
||||
est.random_state, len(y))
|
||||
bootstrap_indices = generate_sample_indices(est.random_state, len(y))
|
||||
else:
|
||||
bootstrap_indices = np.arange(len(y))
|
||||
|
||||
|
@ -87,8 +87,7 @@ class BaseForestQuantileRegressor(ForestRegressor):
|
|||
y_train_leaves = est.y_train_leaves_
|
||||
for curr_leaf in np.unique(y_train_leaves):
|
||||
y_ind = y_train_leaves == curr_leaf
|
||||
self.y_weights_[i, y_ind] = (
|
||||
est_weights[y_ind] / np.sum(est_weights[y_ind]))
|
||||
self.y_weights_[i, y_ind] = est_weights[y_ind] / np.sum(est_weights[y_ind])
|
||||
|
||||
self.y_train_leaves_[i, bootstrap_indices] = y_train_leaves[bootstrap_indices]
|
||||
|
||||
|
@ -167,21 +166,24 @@ class RandomForestQuantileRegressor(BaseForestQuantileRegressor):
|
|||
oob_prediction_ : array of shape = [n_samples]
|
||||
Prediction computed with out-of-bag estimate on the training set.
|
||||
"""
|
||||
def __init__(self,
|
||||
n_estimators=10,
|
||||
criterion='mse',
|
||||
max_depth=None,
|
||||
min_samples_split=2,
|
||||
min_samples_leaf=1,
|
||||
min_weight_fraction_leaf=0.0,
|
||||
max_features='auto',
|
||||
max_leaf_nodes=None,
|
||||
bootstrap=True,
|
||||
oob_score=False,
|
||||
n_jobs=1,
|
||||
random_state=None,
|
||||
verbose=0,
|
||||
warm_start=False):
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
n_estimators=10,
|
||||
criterion="mse",
|
||||
max_depth=None,
|
||||
min_samples_split=2,
|
||||
min_samples_leaf=1,
|
||||
min_weight_fraction_leaf=0.0,
|
||||
max_features="auto",
|
||||
max_leaf_nodes=None,
|
||||
bootstrap=True,
|
||||
oob_score=False,
|
||||
n_jobs=1,
|
||||
random_state=None,
|
||||
verbose=0,
|
||||
warm_start=False,
|
||||
):
|
||||
"""Initialize RandomForestQuantileRegressor class
|
||||
|
||||
Args:
|
||||
|
@ -271,16 +273,23 @@ class RandomForestQuantileRegressor(BaseForestQuantileRegressor):
|
|||
super(RandomForestQuantileRegressor, self).__init__(
|
||||
base_estimator=DecisionTreeQuantileRegressor(),
|
||||
n_estimators=n_estimators,
|
||||
estimator_params=("criterion", "max_depth", "min_samples_split",
|
||||
"min_samples_leaf", "min_weight_fraction_leaf",
|
||||
"max_features", "max_leaf_nodes",
|
||||
"random_state"),
|
||||
estimator_params=(
|
||||
"criterion",
|
||||
"max_depth",
|
||||
"min_samples_split",
|
||||
"min_samples_leaf",
|
||||
"min_weight_fraction_leaf",
|
||||
"max_features",
|
||||
"max_leaf_nodes",
|
||||
"random_state",
|
||||
),
|
||||
bootstrap=bootstrap,
|
||||
oob_score=oob_score,
|
||||
n_jobs=n_jobs,
|
||||
random_state=random_state,
|
||||
verbose=verbose,
|
||||
warm_start=warm_start)
|
||||
warm_start=warm_start,
|
||||
)
|
||||
|
||||
self.criterion = criterion
|
||||
self.max_depth = max_depth
|
||||
|
@ -289,5 +298,3 @@ class RandomForestQuantileRegressor(BaseForestQuantileRegressor):
|
|||
self.min_weight_fraction_leaf = min_weight_fraction_leaf
|
||||
self.max_features = max_features
|
||||
self.max_leaf_nodes = max_leaf_nodes
|
||||
|
||||
|
|
@ -3,6 +3,7 @@
|
|||
|
||||
import numpy as np
|
||||
|
||||
|
||||
def weighted_percentile_vectorized(a, quantiles, weights=None, sorter=None):
|
||||
"""Returns the weighted percentile of a at q given weights.
|
||||
|
||||
|
@ -69,8 +70,7 @@ def weighted_percentile_vectorized(a, quantiles, weights=None, sorter=None):
|
|||
percentiles = np.zeros_like(quantiles)
|
||||
for i, q in enumerate(quantiles):
|
||||
if q > 100 or q < 0:
|
||||
raise ValueError("q should be in-between 0 and 100, "
|
||||
"got %d" % q)
|
||||
raise ValueError("q should be in-between 0 and 100, " "got %d" % q)
|
||||
|
||||
start = np.searchsorted(partial_sum, q) - 1
|
||||
if start == len(sorted_cum_weights) - 1:
|
|
@ -0,0 +1,12 @@
|
|||
"""
|
||||
This script inserts the TSPerf directory into sys.path, so that scripts can import
|
||||
all the modules in TSPerf. Each submission folder needs its own localpath.py file.
|
||||
"""
|
||||
|
||||
import os, sys
|
||||
|
||||
_CURR_DIR = os.path.dirname(os.path.abspath(__file__))
|
||||
TSPERF_DIR = os.path.dirname(os.path.dirname(os.path.dirname(_CURR_DIR)))
|
||||
|
||||
if TSPERF_DIR not in sys.path:
|
||||
sys.path.insert(0, TSPERF_DIR)
|
|
@ -0,0 +1,80 @@
|
|||
# This script performs training and scoring with Quantile Random Forest model
|
||||
|
||||
from os.path import join
|
||||
import argparse
|
||||
import pandas as pd
|
||||
from numpy import arange
|
||||
from ensemble_parallel import RandomForestQuantileRegressor
|
||||
|
||||
# get seed value
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument(
|
||||
"--data-folder", type=str, dest="data_folder", help="data folder mounting point",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--output-folder", type=str, dest="output_folder", help="output folder mounting point",
|
||||
)
|
||||
parser.add_argument("--seed", type=int, dest="seed", help="random seed")
|
||||
args = parser.parse_args()
|
||||
|
||||
# initialize location of input and output files
|
||||
data_dir = join(args.data_folder, "features")
|
||||
train_dir = join(data_dir, "train")
|
||||
test_dir = join(data_dir, "test")
|
||||
output_file = join(args.output_folder, "submission_seed_{}.csv".format(args.seed))
|
||||
|
||||
# do 6 rounds of forecasting, at each round output 9 quantiles
|
||||
n_rounds = 6
|
||||
quantiles = arange(0.1, 1, 0.1)
|
||||
|
||||
# schema of the output
|
||||
y_test = pd.DataFrame(columns=["Datetime", "Zone", "Round", "q", "Prediction"])
|
||||
|
||||
for i in range(1, n_rounds + 1):
|
||||
print("Round {}".format(i))
|
||||
|
||||
# read training and test files for the current round
|
||||
train_file = join(train_dir, "train_round_{}.csv".format(i))
|
||||
train_df = pd.read_csv(train_file)
|
||||
|
||||
test_file = join(test_dir, "test_round_{}.csv".format(i))
|
||||
test_df = pd.read_csv(test_file)
|
||||
|
||||
# train and test for each hour separately
|
||||
for hour in arange(0, 24):
|
||||
print(hour)
|
||||
|
||||
# select training sets
|
||||
train_df_hour = train_df[(train_df["hour_of_day"] == hour)]
|
||||
# create one-hot encoding of Zone
|
||||
# (scikit-garden works only with numerical columns)
|
||||
train_df_hour = pd.get_dummies(train_df_hour, columns=["Zone"])
|
||||
        # remove columns that are not useful (Datetime) or are not
|
||||
# available in the test set (DEMAND, DryBulb, DewPnt)
|
||||
X_train = train_df_hour.drop(columns=["Datetime", "DEMAND", "DryBulb", "DewPnt"]).values
|
||||
|
||||
y_train = train_df_hour["DEMAND"].values
|
||||
|
||||
# train a model
|
||||
rfqr = RandomForestQuantileRegressor(
|
||||
random_state=args.seed, n_jobs=-1, n_estimators=1000, max_features="sqrt", max_depth=12,
|
||||
)
|
||||
rfqr.fit(X_train, y_train)
|
||||
|
||||
# select test set
|
||||
test_df_hour = test_df[test_df["hour_of_day"] == hour]
|
||||
y_test_baseline = test_df_hour[["Datetime", "Zone"]]
|
||||
test_df_cat = pd.get_dummies(test_df_hour, columns=["Zone"])
|
||||
X_test = test_df_cat.drop(columns=["Datetime"]).values
|
||||
|
||||
# generate forecast for each quantile
|
||||
percentiles = rfqr.predict(X_test, quantiles * 100)
|
||||
for j, quantile in enumerate(quantiles):
|
||||
y_test_round_quantile = y_test_baseline.copy(deep=True)
|
||||
y_test_round_quantile["Round"] = i
|
||||
y_test_round_quantile["q"] = quantile
|
||||
y_test_round_quantile["Prediction"] = percentiles[:, j]
|
||||
y_test = pd.concat([y_test, y_test_round_quantile])
|
||||
|
||||
# store forecasts
|
||||
y_test.to_csv(output_file, index=False)
|
|
@ -0,0 +1,15 @@
|
|||
path=benchmarks/GEFCom2017_D_Prob_MT_hourly
|
||||
for i in `seq 1 5`;
|
||||
do
|
||||
echo "Run $i"
|
||||
start=`date +%s`
|
||||
echo 'Creating features...'
|
||||
python $path/qrf/compute_features.py --submission qrf
|
||||
|
||||
echo 'Training and predicting...'
|
||||
python $path/qrf/train_score.py --data-folder $path/qrf/data --output-folder $path/qrf --seed $i
|
||||
|
||||
end=`date +%s`
|
||||
echo 'Running time '$((end-start))' seconds'
|
||||
done
|
||||
echo 'Training and scoring are completed'
|
|
@ -87,27 +87,25 @@ to check if conda has been installed by runnning command `conda -V`. If it is in
|
|||
`/test` under the data directory, respectively. After running the above command, you can deactivate the conda environment by running
|
||||
`source deactivate`.
|
||||
|
||||
5. Log into Azure Container Registry (ACR):
|
||||
5. Make sure Docker is installed
|
||||
|
||||
You can check if Docker is installed on your VM by running
|
||||
|
||||
```bash
|
||||
sudo docker login --username tsperf --password <ACR Access Key> tsperf.azurecr.io
|
||||
sudo docker -v
|
||||
```
|
||||
You will see the Docker version if Docker is installed. If not, you can install it by following the instructions [here](https://docs.docker.com/install/linux/docker-ce/ubuntu/). Note that if you want to execute Docker commands without sudo as a non-root user, you need to create a Unix group and add users to it by following the instructions [here](https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user).
|
||||
|
||||
The `<ACR Acccess Key>` can be found [here](https://github.com/Microsoft/Forecasting/blob/master/common/key.txt). If want to execute docker commands without
|
||||
sudo as a non-root user, you need to create a
|
||||
Unix group and add users to it by following the instructions
|
||||
[here](https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user).
|
||||
|
||||
6. Pull a Docker image from ACR using the following command
|
||||
6. Build a local Docker image by running the following command from `~/Forecasting` directory
|
||||
|
||||
```bash
|
||||
sudo docker pull tsperf.azurecr.io/retail_sales/orangejuice_pt_3weeks_weekly/baseline_image:v1
|
||||
sudo docker build -t baseline_image:v1 ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/ARIMA
|
||||
```
|
||||
|
||||
7. Choose a name for a new Docker container (e.g. arima_container) and create it using the command:
|
||||
|
||||
```bash
|
||||
sudo docker run -it -v ~/Forecasting:/Forecasting --name arima_container tsperf.azurecr.io/retail_sales/orangejuice_pt_3weeks_weekly/baseline_image:v1
|
||||
sudo docker run -it -v ~/Forecasting:/Forecasting --name arima_container baseline_image:v1
|
||||
```
|
||||
|
||||
Note that option `-v ~/Forecasting:/Forecasting` allows you to mount `~/Forecasting` folder (the one you cloned) to the container so that you will have
|
||||
|
@ -145,7 +143,7 @@ to check if conda has been installed by runnning command `conda -V`. If it is in
|
|||
|
||||
**Data storage:** Premium SSD
|
||||
|
||||
**Docker image:** tsperf.azurecr.io/retail_sales/orangejuice_pt_3weeks_weekly/baseline_image:v1
|
||||
**Dockerfile:** [retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/ARIMA/Dockerfile](https://github.com/Microsoft/Forecasting/blob/master/retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/ARIMA/Dockerfile)
|
||||
|
||||
**Key packages/dependencies:**
|
||||
* R
|
|
@ -94,27 +94,25 @@ to check if conda has been installed by runnning command `conda -V`. If it is in
|
|||
`/test` under the data directory, respectively. After running the above command, you can deactivate the conda environment by running
|
||||
`source deactivate`.
|
||||
|
||||
5. Log into Azure Container Registry (ACR):
|
||||
5. Make sure Docker is installed
|
||||
|
||||
You can check if Docker is installed on your VM by running
|
||||
|
||||
```bash
|
||||
sudo docker login --username tsperf --password <ACR Access Key> tsperf.azurecr.io
|
||||
sudo docker -v
|
||||
```
|
||||
You will see the Docker version if Docker is installed. If not, you can install it by following the instructions [here](https://docs.docker.com/install/linux/docker-ce/ubuntu/). Note that if you want to execute Docker commands without sudo as a non-root user, you need to create a Unix group and add users to it by following the instructions [here](https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user).
|
||||
|
||||
The `<ACR Acccess Key>` can be found [here](https://github.com/Microsoft/Forecasting/blob/master/common/key.txt). If want to execute docker commands without
|
||||
sudo as a non-root user, you need to create a
|
||||
Unix group and add users to it by following the instructions
|
||||
[here](https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user).
|
||||
|
||||
6. Pull a Docker image from ACR using the following command
|
||||
6. Build a local Docker image by running the following command from `~/Forecasting` directory
|
||||
|
||||
```bash
|
||||
sudo docker pull tsperf.azurecr.io/retail_sales/orangejuice_pt_3weeks_weekly/dcnn_image:v1
|
||||
sudo docker build -t dcnn_image:v1 ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/DilatedCNN
|
||||
```
|
||||
|
||||
7. Choose a name for a new Docker container (e.g. dcnn_container) and create it using the command:
|
||||
|
||||
```bash
|
||||
sudo docker run -it -v ~/Forecasting:/Forecasting --runtime=nvidia --name dcnn_container tsperf.azurecr.io/retail_sales/orangejuice_pt_3weeks_weekly/dcnn_image:v1
|
||||
sudo docker run -it -v ~/Forecasting:/Forecasting --runtime=nvidia --name dcnn_container dcnn_image:v1
|
||||
```
|
||||
|
||||
Note that option `-v ~/Forecasting:/Forecasting` allows you to mount `~/Forecasting` folder (the one you cloned) to the container so that you will have
|
||||
|
@ -152,7 +150,7 @@ to check if conda has been installed by runnning command `conda -V`. If it is in
|
|||
|
||||
**Data storage:** Standard HDD
|
||||
|
||||
**Docker image:** tsperf.azurecr.io/retail_sales/orangejuice_pt_3weeks_weekly/dcnn_image:v1
|
||||
**Dockerfile:** [retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/DilatedCNN/Dockerfile](https://github.com/Microsoft/Forecasting/blob/master/retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/DilatedCNN/Dockerfile)
|
||||
|
||||
**Key packages/dependencies:**
|
||||
* Python
|
|
@ -1,445 +1,445 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Tuning Hyperparameters of Dilated CNN Model with AML SDK and HyperDrive\n",
|
||||
"\n",
|
||||
"This notebook performs hyperparameter tuning of Dilated CNN model with AML SDK and HyperDrive. It selects the best model by cross validation using the training data in the first forecast round. Specifically, it splits the training data into sub-training data and validation data. Then, it trains Dilated CNN models with different sets of hyperparameters using the sub-training data and evaluate the accuracy of each model with the validation data. The set of hyperparameters which yield the best validation accuracy will be used to train models and forecast sales across all 12 forecast rounds.\n",
|
||||
"\n",
|
||||
"## Prerequisites\n",
|
||||
"To run this notebook, you need to install AML SDK and its widget extension in your environment by running the following commands in a terminal. Before running the commands, you need to activate your environment by executing `source activate <your env>` in a Linux VM. \n",
|
||||
"`pip3 install --upgrade azureml-sdk[notebooks,automl]` \n",
|
||||
"`jupyter nbextension install --py --user azureml.widgets` \n",
|
||||
"`jupyter nbextension enable --py --user azureml.widgets` \n",
|
||||
"\n",
|
||||
"To add the environment to your Jupyter kernels, you can do `python3 -m ipykernel install --name <your env>`. Besides, you need to create an Azure ML workspace and download its configuration file (`config.json`) by following the [configuration.ipynb](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) notebook."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import azureml\n",
|
||||
"from azureml.core import Workspace, Run\n",
|
||||
"\n",
|
||||
"# Check core SDK version number\n",
|
||||
"print(\"Azure ML SDK Version: \", azureml.core.VERSION)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.telemetry import set_diagnostics_collection\n",
|
||||
"\n",
|
||||
"# Opt-in diagnostics for better experience of future releases\n",
|
||||
"set_diagnostics_collection(send_diagnostics=True)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Initialize Workspace & Create an Azure ML Experiment\n",
|
||||
"\n",
|
||||
"Initialize a [Machine Learning Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the workspace you created in the Prerequisites step. `Workspace.from_config()` below creates a workspace object from the details stored in `config.json` that you have downloaded."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core.workspace import Workspace\n",
|
||||
"\n",
|
||||
"ws = Workspace.from_config()\n",
|
||||
"print('Workspace name: ' + ws.name, \n",
|
||||
" 'Azure region: ' + ws.location, \n",
|
||||
" 'Resource group: ' + ws.resource_group, sep = '\\n')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core import Experiment\n",
|
||||
"\n",
|
||||
"exp = Experiment(workspace=ws, name='tune_dcnn')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Validate Script Locally"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core.runconfig import RunConfiguration\n",
|
||||
"\n",
|
||||
"# Configure local, user managed environment\n",
|
||||
"run_config_user_managed = RunConfiguration()\n",
|
||||
"run_config_user_managed.environment.python.user_managed_dependencies = True\n",
|
||||
"run_config_user_managed.environment.python.interpreter_path = '/usr/bin/python3.5'"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core import ScriptRunConfig\n",
|
||||
"\n",
|
||||
"# Please update data-folder argument before submitting the job\n",
|
||||
"src = ScriptRunConfig(source_directory='./', \n",
|
||||
" script='train_validate.py', \n",
|
||||
" arguments=['--data-folder', \n",
|
||||
" '/home/chenhui/TSPerf/retail_sales/OrangeJuice_Pt_3Weeks_Weekly/data/', \n",
|
||||
" '--dropout-rate', '0.2'],\n",
|
||||
" run_config=run_config_user_managed)\n",
|
||||
"run_local = exp.submit(src)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Check job status\n",
|
||||
"run_local.get_status()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Check results\n",
|
||||
"while(run_local.get_status() != 'Completed'): {}\n",
|
||||
"run_local.get_details()\n",
|
||||
"run_local.get_metrics()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Run Script on Remote Compute Target"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Create a GPU cluster as compute target"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core.compute import ComputeTarget, AmlCompute\n",
|
||||
"from azureml.core.compute_target import ComputeTargetException\n",
|
||||
"\n",
|
||||
"# Choose a name for your cluster\n",
|
||||
"cluster_name = \"gpucluster\"\n",
|
||||
"\n",
|
||||
"try:\n",
|
||||
" # Look for the existing cluster by name\n",
|
||||
" compute_target = ComputeTarget(workspace=ws, name=cluster_name)\n",
|
||||
" if type(compute_target) is AmlCompute:\n",
|
||||
" print('Found existing compute target {}.'.format(cluster_name))\n",
|
||||
" else:\n",
|
||||
" print('{} exists but it is not an AML Compute target. Please choose a different name.'.format(cluster_name))\n",
|
||||
"except ComputeTargetException:\n",
|
||||
" print('Creating a new compute target...')\n",
|
||||
" compute_config = AmlCompute.provisioning_configuration(vm_size=\"STANDARD_NC6\", # GPU-based VM\n",
|
||||
" #vm_priority='lowpriority', # optional\n",
|
||||
" min_nodes=0, \n",
|
||||
" max_nodes=4,\n",
|
||||
" idle_seconds_before_scaledown=3600)\n",
|
||||
" # Create the cluster\n",
|
||||
" compute_target = ComputeTarget.create(ws, cluster_name, compute_config)\n",
|
||||
" # Can poll for a minimum number of nodes and for a specific timeout. \n",
|
||||
" # if no min node count is provided it uses the scale settings for the cluster\n",
|
||||
" compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)\n",
|
||||
" # Get a detailed status for the current cluster. \n",
|
||||
" print(compute_target.serialize())"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# If you have created the compute target, you should see one entry named 'gpucluster' of type AmlCompute \n",
|
||||
"# in the workspace's compute_targets property.\n",
|
||||
"compute_targets = ws.compute_targets\n",
|
||||
"for name, ct in compute_targets.items():\n",
|
||||
" print(name, ct.type, ct.provisioning_state)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Configure Docker environment"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core.runconfig import EnvironmentDefinition\n",
|
||||
"from azureml.core.conda_dependencies import CondaDependencies\n",
|
||||
"\n",
|
||||
"env = EnvironmentDefinition()\n",
|
||||
"env.python.user_managed_dependencies = False\n",
|
||||
"env.python.conda_dependencies = CondaDependencies.create(conda_packages=['pandas', 'numpy', 'scipy', 'scikit-learn', 'tensorflow-gpu', 'keras', 'joblib'],\n",
|
||||
" python_version='3.6.2')\n",
|
||||
"env.python.conda_dependencies.add_channel('conda-forge')\n",
|
||||
"env.docker.enabled=True"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Upload data to default datastore\n",
|
||||
"\n",
|
||||
"Upload the Orange Juice dataset to the workspace's default datastore, which will later be mounted on the cluster for model training and validation. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ds = ws.get_default_datastore()\n",
|
||||
"print(ds.datastore_type, ds.account_name, ds.container_name)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"path_on_datastore = 'data'\n",
|
||||
"ds.upload(src_dir='../../data', target_path=path_on_datastore, overwrite=True, show_progress=True)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Get data reference object for the data path\n",
|
||||
"ds_data = ds.path(path_on_datastore)\n",
|
||||
"print(ds_data)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Create estimator\n",
|
||||
"Next, we will check if the remote compute target is successfully created by submitting a job to the target. This compute target will be used by HyperDrive to tune the hyperparameters later. You may skip this part of code and directly jump into [Tune Hyperparameters using HyperDrive](#tune-hyperparameters-using-hyperdrive)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core.runconfig import EnvironmentDefinition\n",
|
||||
"from azureml.train.estimator import Estimator\n",
|
||||
"\n",
|
||||
"script_folder = './'\n",
|
||||
"script_params = {\n",
|
||||
" '--data-folder': ds_data.as_mount(),\n",
|
||||
" '--dropout-rate': 0.2\n",
|
||||
"}\n",
|
||||
"est = Estimator(source_directory=script_folder,\n",
|
||||
" script_params=script_params,\n",
|
||||
" compute_target=compute_target,\n",
|
||||
" use_docker=True,\n",
|
||||
" entry_script='train_validate.py',\n",
|
||||
" environment_definition=env)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Submit job"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Submit job to compute target\n",
|
||||
"run_remote = exp.submit(config=est)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Check job status"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.widgets import RunDetails\n",
|
||||
"\n",
|
||||
"RunDetails(run_remote).show()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"run_remote.get_details()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Get metric value after the job finishes \n",
|
||||
"while(run_remote.get_status() != 'Completed'): {}\n",
|
||||
"run_remote.get_metrics()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a id='tune-hyperparameters-using-hyperdrive'></a>\n",
|
||||
"## Tune Hyperparameters using HyperDrive"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.train.hyperdrive import *\n",
|
||||
"\n",
|
||||
"script_folder = './'\n",
|
||||
"script_params = {\n",
|
||||
" '--data-folder': ds_data.as_mount() \n",
|
||||
"}\n",
|
||||
"est = Estimator(source_directory=script_folder,\n",
|
||||
" script_params=script_params,\n",
|
||||
" compute_target=compute_target,\n",
|
||||
" use_docker=True,\n",
|
||||
" entry_script='train_validate.py',\n",
|
||||
" environment_definition=env)\n",
|
||||
"ps = BayesianParameterSampling({\n",
|
||||
" '--seq-len': quniform(5, 40, 1),\n",
|
||||
" '--dropout-rate': uniform(0, 0.4),\n",
|
||||
" '--batch-size': choice(32, 64),\n",
|
||||
" '--learning-rate': choice(1e-4, 1e-3, 5e-3, 1e-2, 1.5e-2, 2e-2, 3e-2, 5e-2, 1e-1),\n",
|
||||
" '--epochs': quniform(2, 80, 1)\n",
|
||||
"})\n",
|
||||
"htc = HyperDriveRunConfig(estimator=est, \n",
|
||||
" hyperparameter_sampling=ps, \n",
|
||||
" primary_metric_name='MAPE', \n",
|
||||
" primary_metric_goal=PrimaryMetricGoal.MINIMIZE, \n",
|
||||
" max_total_runs=200,\n",
|
||||
" max_concurrent_runs=4)\n",
|
||||
"htr = exp.submit(config=htc)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"RunDetails(htr).show()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"while(htr.get_status() != 'Completed'): {}\n",
|
||||
"htr.get_metrics()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"best_run = htr.get_best_run_by_primary_metric()\n",
|
||||
"parameter_values = best_run.get_details()['runDefinition']['Arguments']\n",
|
||||
"print(parameter_values)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.5.2"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Tuning Hyperparameters of Dilated CNN Model with AML SDK and HyperDrive\n",
|
||||
"\n",
|
||||
"This notebook performs hyperparameter tuning of Dilated CNN model with AML SDK and HyperDrive. It selects the best model by cross validation using the training data in the first forecast round. Specifically, it splits the training data into sub-training data and validation data. Then, it trains Dilated CNN models with different sets of hyperparameters using the sub-training data and evaluate the accuracy of each model with the validation data. The set of hyperparameters which yield the best validation accuracy will be used to train models and forecast sales across all 12 forecast rounds.\n",
|
||||
"\n",
|
||||
"## Prerequisites\n",
|
||||
"To run this notebook, you need to install AML SDK and its widget extension in your environment by running the following commands in a terminal. Before running the commands, you need to activate your environment by executing `source activate <your env>` in a Linux VM. \n",
|
||||
"`pip3 install --upgrade azureml-sdk[notebooks,automl]` \n",
|
||||
"`jupyter nbextension install --py --user azureml.widgets` \n",
|
||||
"`jupyter nbextension enable --py --user azureml.widgets` \n",
|
||||
"\n",
|
||||
"To add the environment to your Jupyter kernels, you can do `python3 -m ipykernel install --name <your env>`. Besides, you need to create an Azure ML workspace and download its configuration file (`config.json`) by following the [configuration.ipynb](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) notebook."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import azureml\n",
|
||||
"from azureml.core import Workspace, Run\n",
|
||||
"\n",
|
||||
"# Check core SDK version number\n",
|
||||
"print(\"Azure ML SDK Version: \", azureml.core.VERSION)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.telemetry import set_diagnostics_collection\n",
|
||||
"\n",
|
||||
"# Opt-in diagnostics for better experience of future releases\n",
|
||||
"set_diagnostics_collection(send_diagnostics=True)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Initialize Workspace & Create an Azure ML Experiment\n",
|
||||
"\n",
|
||||
"Initialize a [Machine Learning Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the workspace you created in the Prerequisites step. `Workspace.from_config()` below creates a workspace object from the details stored in `config.json` that you have downloaded."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core.workspace import Workspace\n",
|
||||
"\n",
|
||||
"ws = Workspace.from_config()\n",
|
||||
"print('Workspace name: ' + ws.name, \n",
|
||||
" 'Azure region: ' + ws.location, \n",
|
||||
" 'Resource group: ' + ws.resource_group, sep = '\\n')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core import Experiment\n",
|
||||
"\n",
|
||||
"exp = Experiment(workspace=ws, name='tune_dcnn')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Validate Script Locally"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core.runconfig import RunConfiguration\n",
|
||||
"\n",
|
||||
"# Configure local, user managed environment\n",
|
||||
"run_config_user_managed = RunConfiguration()\n",
|
||||
"run_config_user_managed.environment.python.user_managed_dependencies = True\n",
|
||||
"run_config_user_managed.environment.python.interpreter_path = '/usr/bin/python3.5'"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core import ScriptRunConfig\n",
|
||||
"\n",
|
||||
"# Please update data-folder argument before submitting the job\n",
|
||||
"src = ScriptRunConfig(source_directory='./', \n",
|
||||
" script='train_validate.py', \n",
|
||||
" arguments=['--data-folder', \n",
|
||||
" '/home/chenhui/TSPerf/retail_sales/OrangeJuice_Pt_3Weeks_Weekly/data/', \n",
|
||||
" '--dropout-rate', '0.2'],\n",
|
||||
" run_config=run_config_user_managed)\n",
|
||||
"run_local = exp.submit(src)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Check job status\n",
|
||||
"run_local.get_status()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Check results\n",
|
||||
"while(run_local.get_status() != 'Completed'): {}\n",
|
||||
"run_local.get_details()\n",
|
||||
"run_local.get_metrics()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Run Script on Remote Compute Target"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Create a GPU cluster as compute target"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core.compute import ComputeTarget, AmlCompute\n",
|
||||
"from azureml.core.compute_target import ComputeTargetException\n",
|
||||
"\n",
|
||||
"# Choose a name for your cluster\n",
|
||||
"cluster_name = \"gpucluster\"\n",
|
||||
"\n",
|
||||
"try:\n",
|
||||
" # Look for the existing cluster by name\n",
|
||||
" compute_target = ComputeTarget(workspace=ws, name=cluster_name)\n",
|
||||
" if type(compute_target) is AmlCompute:\n",
|
||||
" print('Found existing compute target {}.'.format(cluster_name))\n",
|
||||
" else:\n",
|
||||
" print('{} exists but it is not an AML Compute target. Please choose a different name.'.format(cluster_name))\n",
|
||||
"except ComputeTargetException:\n",
|
||||
" print('Creating a new compute target...')\n",
|
||||
" compute_config = AmlCompute.provisioning_configuration(vm_size=\"STANDARD_NC6\", # GPU-based VM\n",
|
||||
" #vm_priority='lowpriority', # optional\n",
|
||||
" min_nodes=0, \n",
|
||||
" max_nodes=4,\n",
|
||||
" idle_seconds_before_scaledown=3600)\n",
|
||||
" # Create the cluster\n",
|
||||
" compute_target = ComputeTarget.create(ws, cluster_name, compute_config)\n",
|
||||
" # Can poll for a minimum number of nodes and for a specific timeout. \n",
|
||||
" # if no min node count is provided it uses the scale settings for the cluster\n",
|
||||
" compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)\n",
|
||||
" # Get a detailed status for the current cluster. \n",
|
||||
" print(compute_target.serialize())"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# If you have created the compute target, you should see one entry named 'gpucluster' of type AmlCompute \n",
|
||||
"# in the workspace's compute_targets property.\n",
|
||||
"compute_targets = ws.compute_targets\n",
|
||||
"for name, ct in compute_targets.items():\n",
|
||||
" print(name, ct.type, ct.provisioning_state)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Configure Docker environment"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core.runconfig import EnvironmentDefinition\n",
|
||||
"from azureml.core.conda_dependencies import CondaDependencies\n",
|
||||
"\n",
|
||||
"env = EnvironmentDefinition()\n",
|
||||
"env.python.user_managed_dependencies = False\n",
|
||||
"env.python.conda_dependencies = CondaDependencies.create(conda_packages=['pandas', 'numpy', 'scipy', 'scikit-learn', 'tensorflow-gpu', 'keras', 'joblib'],\n",
|
||||
" python_version='3.6.2')\n",
|
||||
"env.python.conda_dependencies.add_channel('conda-forge')\n",
|
||||
"env.docker.enabled=True"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Upload data to default datastore\n",
|
||||
"\n",
|
||||
"Upload the Orange Juice dataset to the workspace's default datastore, which will later be mounted on the cluster for model training and validation. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ds = ws.get_default_datastore()\n",
|
||||
"print(ds.datastore_type, ds.account_name, ds.container_name)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"path_on_datastore = 'data'\n",
|
||||
"ds.upload(src_dir='../../data', target_path=path_on_datastore, overwrite=True, show_progress=True)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Get data reference object for the data path\n",
|
||||
"ds_data = ds.path(path_on_datastore)\n",
|
||||
"print(ds_data)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Create estimator\n",
|
||||
"Next, we will check if the remote compute target is successfully created by submitting a job to the target. This compute target will be used by HyperDrive to tune the hyperparameters later. You may skip this part of code and directly jump into [Tune Hyperparameters using HyperDrive](#tune-hyperparameters-using-hyperdrive)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core.runconfig import EnvironmentDefinition\n",
|
||||
"from azureml.train.estimator import Estimator\n",
|
||||
"\n",
|
||||
"script_folder = './'\n",
|
||||
"script_params = {\n",
|
||||
" '--data-folder': ds_data.as_mount(),\n",
|
||||
" '--dropout-rate': 0.2\n",
|
||||
"}\n",
|
||||
"est = Estimator(source_directory=script_folder,\n",
|
||||
" script_params=script_params,\n",
|
||||
" compute_target=compute_target,\n",
|
||||
" use_docker=True,\n",
|
||||
" entry_script='train_validate.py',\n",
|
||||
" environment_definition=env)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Submit job"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Submit job to compute target\n",
|
||||
"run_remote = exp.submit(config=est)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Check job status"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.widgets import RunDetails\n",
|
||||
"\n",
|
||||
"RunDetails(run_remote).show()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"run_remote.get_details()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Get metric value after the job finishes \n",
|
||||
"while(run_remote.get_status() != 'Completed'): {}\n",
|
||||
"run_remote.get_metrics()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a id='tune-hyperparameters-using-hyperdrive'></a>\n",
|
||||
"## Tune Hyperparameters using HyperDrive"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.train.hyperdrive import *\n",
|
||||
"\n",
|
||||
"script_folder = './'\n",
|
||||
"script_params = {\n",
|
||||
" '--data-folder': ds_data.as_mount() \n",
|
||||
"}\n",
|
||||
"est = Estimator(source_directory=script_folder,\n",
|
||||
" script_params=script_params,\n",
|
||||
" compute_target=compute_target,\n",
|
||||
" use_docker=True,\n",
|
||||
" entry_script='train_validate.py',\n",
|
||||
" environment_definition=env)\n",
|
||||
"ps = BayesianParameterSampling({\n",
|
||||
" '--seq-len': quniform(5, 40, 1),\n",
|
||||
" '--dropout-rate': uniform(0, 0.4),\n",
|
||||
" '--batch-size': choice(32, 64),\n",
|
||||
" '--learning-rate': choice(1e-4, 1e-3, 5e-3, 1e-2, 1.5e-2, 2e-2, 3e-2, 5e-2, 1e-1),\n",
|
||||
" '--epochs': quniform(2, 80, 1)\n",
|
||||
"})\n",
|
||||
"htc = HyperDriveRunConfig(estimator=est, \n",
|
||||
" hyperparameter_sampling=ps, \n",
|
||||
" primary_metric_name='MAPE', \n",
|
||||
" primary_metric_goal=PrimaryMetricGoal.MINIMIZE, \n",
|
||||
" max_total_runs=200,\n",
|
||||
" max_concurrent_runs=4)\n",
|
||||
"htr = exp.submit(config=htc)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"RunDetails(htr).show()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"while(htr.get_status() != 'Completed'): {}\n",
|
||||
"htr.get_metrics()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"best_run = htr.get_best_run_by_primary_metric()\n",
|
||||
"parameter_values = best_run.get_details()['runDefinition']['Arguments']\n",
|
||||
"print(parameter_values)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.5.2"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
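The final cell of the notebook above prints the best run's arguments as a flat list of alternating flags and values (from `get_details()['runDefinition']['Arguments']`). A small hypothetical helper, shown here only as a convenience sketch, can turn that flat list into a dict; the example argument values are illustrative, not actual tuning results.

```python
def args_to_dict(arguments):
    """Pair up consecutive (flag, value) elements of a flat argument list."""
    it = iter(arguments)
    return dict(zip(it, it))

# Illustrative values only
print(args_to_dict(['--seq-len', '12', '--dropout-rate', '0.1']))
# {'--seq-len': '12', '--dropout-rate': '0.1'}
```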
|
|
@@ -0,0 +1,88 @@
|
|||
# coding: utf-8
|
||||
|
||||
# Create input features for the Dilated Convolutional Neural Network (CNN) model.
|
||||
|
||||
import os
|
||||
import sys
|
||||
import math
|
||||
import datetime
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
|
||||
# Append TSPerf path to sys.path
|
||||
tsperf_dir = "."
|
||||
if tsperf_dir not in sys.path:
|
||||
sys.path.append(tsperf_dir)
|
||||
|
||||
# Import TSPerf components
|
||||
from utils import *
|
||||
import retail_sales.OrangeJuice_Pt_3Weeks_Weekly.common.benchmark_settings as bs
|
||||
|
||||
|
||||
def make_features(pred_round, train_dir, pred_steps, offset, store_list, brand_list):
|
||||
"""Create a dataframe of the input features.
|
||||
|
||||
Args:
|
||||
pred_round (Integer): Prediction round
|
||||
train_dir (String): Path of the training data directory
|
||||
pred_steps (Integer): Number of prediction steps
|
||||
offset (Integer): Number of initial weeks of training data skipped during retraining
|
||||
store_list (Numpy Array): List of all the store IDs
|
||||
brand_list (Numpy Array): List of all the brand IDs
|
||||
|
||||
Returns:
|
||||
data_filled (Dataframe): Dataframe including the input features
|
||||
data_scaled (Dataframe): Dataframe including the normalized features
|
||||
"""
|
||||
# Load training data
|
||||
train_df = pd.read_csv(os.path.join(train_dir, "train_round_" + str(pred_round + 1) + ".csv"))
|
||||
train_df["move"] = train_df["logmove"].apply(lambda x: round(math.exp(x)))
|
||||
train_df = train_df[["store", "brand", "week", "move"]]
|
||||
|
||||
# Create a dataframe to hold all necessary data
|
||||
week_list = range(bs.TRAIN_START_WEEK + offset, bs.TEST_END_WEEK_LIST[pred_round] + 1)
|
||||
d = {"store": store_list, "brand": brand_list, "week": week_list}
|
||||
data_grid = df_from_cartesian_product(d)
|
||||
data_filled = pd.merge(data_grid, train_df, how="left", on=["store", "brand", "week"])
|
||||
|
||||
# Get future price, deal, and advertisement info
|
||||
aux_df = pd.read_csv(os.path.join(train_dir, "aux_round_" + str(pred_round + 1) + ".csv"))
|
||||
data_filled = pd.merge(data_filled, aux_df, how="left", on=["store", "brand", "week"])
|
||||
|
||||
# Create relative price feature
|
||||
price_cols = [
|
||||
"price1",
|
||||
"price2",
|
||||
"price3",
|
||||
"price4",
|
||||
"price5",
|
||||
"price6",
|
||||
"price7",
|
||||
"price8",
|
||||
"price9",
|
||||
"price10",
|
||||
"price11",
|
||||
]
|
||||
data_filled["price"] = data_filled.apply(lambda x: x.loc["price" + str(int(x.loc["brand"]))], axis=1)
|
||||
data_filled["avg_price"] = data_filled[price_cols].sum(axis=1).apply(lambda x: x / len(price_cols))
|
||||
data_filled["price_ratio"] = data_filled["price"] / data_filled["avg_price"]
|
||||
data_filled.drop(price_cols, axis=1, inplace=True)
|
||||
|
||||
# Fill missing values
|
||||
data_filled = data_filled.groupby(["store", "brand"]).apply(
|
||||
lambda x: x.fillna(method="ffill").fillna(method="bfill")
|
||||
)
|
||||
|
||||
# Create datetime features
|
||||
data_filled["week_start"] = data_filled["week"].apply(
|
||||
lambda x: bs.FIRST_WEEK_START + datetime.timedelta(days=(x - 1) * 7)
|
||||
)
|
||||
data_filled["month"] = data_filled["week_start"].apply(lambda x: x.month)
|
||||
data_filled["week_of_month"] = data_filled["week_start"].apply(lambda x: week_of_month(x))
|
||||
data_filled.drop("week_start", axis=1, inplace=True)
|
||||
|
||||
# Normalize the dataframe of features
|
||||
cols_normalize = data_filled.columns.difference(["store", "brand", "week"])
|
||||
data_scaled, min_max_scaler = normalize_dataframe(data_filled, cols_normalize)
|
||||
|
||||
return data_filled, data_scaled
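The datetime features above rely on a `week_of_month` helper imported from `utils`, whose implementation is not shown in this diff. A minimal sketch of such a helper, assuming the convention that weeks of the month are counted 1-based from the week containing the 1st (the real `utils` function may differ):

```python
import datetime
import math

def week_of_month(dt):
    """Week of the month (1-based) of date dt, counting weeks from the
    one containing the 1st of the month. Hypothetical sketch."""
    first_day = dt.replace(day=1)
    # Shift the day index by the weekday of the 1st so that week
    # boundaries align with the weekday on which the month starts
    adjusted = dt.day + first_day.weekday()
    return int(math.ceil(adjusted / 7.0))

print(week_of_month(datetime.date(2021, 3, 1)))   # 1 (March 2021 starts on a Monday)
print(week_of_month(datetime.date(2021, 3, 14)))  # 2
```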
|
|
@@ -0,0 +1,223 @@
|
|||
# coding: utf-8
|
||||
|
||||
# Train and score a Dilated Convolutional Neural Network (CNN) model using the Keras package with the TensorFlow backend.
|
||||
|
||||
import os
|
||||
import sys
|
||||
import keras
|
||||
import random
|
||||
import argparse
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
import tensorflow as tf
|
||||
|
||||
from keras import optimizers
|
||||
from keras.layers import *
|
||||
from keras.models import Model, load_model
|
||||
from keras.callbacks import ModelCheckpoint
|
||||
|
||||
# Append TSPerf path to sys.path (assume we run the script from TSPerf directory)
|
||||
tsperf_dir = "."
|
||||
if tsperf_dir not in sys.path:
|
||||
sys.path.append(tsperf_dir)
|
||||
|
||||
# Import TSPerf components
|
||||
from utils import *
|
||||
from make_features import make_features
|
||||
import retail_sales.OrangeJuice_Pt_3Weeks_Weekly.common.benchmark_settings as bs
|
||||
|
||||
# Model definition
|
||||
def create_dcnn_model(seq_len, kernel_size=2, n_filters=3, n_input_series=1, n_outputs=1):
|
||||
"""Create a Dilated CNN model.
|
||||
|
||||
Args:
|
||||
seq_len (Integer): Input sequence length
|
||||
kernel_size (Integer): Kernel size of each convolutional layer
|
||||
n_filters (Integer): Number of filters in each convolutional layer
|
||||
n_input_series (Integer): Number of input time series per sample
n_outputs (Integer): Number of outputs in the last layer
|
||||
|
||||
Returns:
|
||||
Keras Model object
|
||||
"""
|
||||
# Sequential input
|
||||
seq_in = Input(shape=(seq_len, n_input_series))
|
||||
|
||||
# Categorical input
|
||||
cat_fea_in = Input(shape=(2,), dtype="uint8")
|
||||
store_id = Lambda(lambda x: x[:, 0, None])(cat_fea_in)
|
||||
brand_id = Lambda(lambda x: x[:, 1, None])(cat_fea_in)
|
||||
store_embed = Embedding(MAX_STORE_ID + 1, 7, input_length=1)(store_id)
|
||||
brand_embed = Embedding(MAX_BRAND_ID + 1, 4, input_length=1)(brand_id)
|
||||
|
||||
# Dilated convolutional layers
|
||||
c1 = Conv1D(filters=n_filters, kernel_size=kernel_size, dilation_rate=1, padding="causal", activation="relu")(
|
||||
seq_in
|
||||
)
|
||||
c2 = Conv1D(filters=n_filters, kernel_size=kernel_size, dilation_rate=2, padding="causal", activation="relu")(c1)
|
||||
c3 = Conv1D(filters=n_filters, kernel_size=kernel_size, dilation_rate=4, padding="causal", activation="relu")(c2)
|
||||
|
||||
# Skip connections
|
||||
c4 = concatenate([c1, c3])
|
||||
|
||||
# Output of convolutional layers
|
||||
conv_out = Conv1D(8, 1, activation="relu")(c4)
|
||||
# Note: relies on the module-level `args` parsed in the __main__ block below
conv_out = Dropout(args.dropout_rate)(conv_out)
|
||||
conv_out = Flatten()(conv_out)
|
||||
|
||||
# Concatenate with categorical features
|
||||
x = concatenate([conv_out, Flatten()(store_embed), Flatten()(brand_embed)])
|
||||
x = Dense(16, activation="relu")(x)
|
||||
output = Dense(n_outputs, activation="linear")(x)
|
||||
|
||||
# Define model interface, loss function, and optimizer
|
||||
model = Model(inputs=[seq_in, cat_fea_in], outputs=output)
|
||||
|
||||
return model
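With kernel size 2 and dilation rates 1, 2, 4, the stacked causal convolutions above (`c1` → `c2` → `c3`) see a fixed window of past timesteps: each layer extends the context by (kernel_size − 1) × dilation, so the receptive field is 1 + Σ(k − 1)·d. A quick sketch of that arithmetic, assuming the three-layer stack defined above:

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of dilated causal 1D convolutions:
    each layer adds (kernel_size - 1) * dilation timesteps of context."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# The c1 -> c2 -> c3 stack above: kernel 2, dilations 1, 2, 4
print(receptive_field(2, [1, 2, 4]))  # 8
```

So any `--seq-len` of at least 8 lets the deepest convolutional path use real history rather than causal padding; the default of 15 comfortably covers it.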
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
# Parse input arguments
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--seed", type=int, dest="seed", default=1, help="random seed")
|
||||
parser.add_argument("--seq-len", type=int, dest="seq_len", default=15, help="length of the input sequence")
|
||||
parser.add_argument("--dropout-rate", type=float, dest="dropout_rate", default=0.01, help="dropout ratio")
|
||||
parser.add_argument("--batch-size", type=int, dest="batch_size", default=64, help="mini batch size for training")
|
||||
parser.add_argument("--learning-rate", type=float, dest="learning_rate", default=0.015, help="learning rate")
|
||||
parser.add_argument("--epochs", type=int, dest="epochs", default=25, help="# of epochs")
|
||||
args = parser.parse_args()
|
||||
|
||||
# Fix random seeds
|
||||
np.random.seed(args.seed)
|
||||
random.seed(args.seed)
|
||||
tf.set_random_seed(args.seed)
|
||||
|
||||
# Data paths
|
||||
DATA_DIR = os.path.join(tsperf_dir, "retail_sales", "OrangeJuice_Pt_3Weeks_Weekly", "data")
|
||||
SUBMISSION_DIR = os.path.join(
|
||||
tsperf_dir, "retail_sales", "OrangeJuice_Pt_3Weeks_Weekly", "submissions", "DilatedCNN"
|
||||
)
|
||||
TRAIN_DIR = os.path.join(DATA_DIR, "train")
|
||||
|
||||
# Dataset parameters
|
||||
MAX_STORE_ID = 137
|
||||
MAX_BRAND_ID = 11
|
||||
|
||||
# Parameters of the model
|
||||
PRED_HORIZON = 3
|
||||
PRED_STEPS = 2
|
||||
SEQ_LEN = args.seq_len
|
||||
DYNAMIC_FEATURES = ["deal", "feat", "month", "week_of_month", "price", "price_ratio"]
|
||||
STATIC_FEATURES = ["store", "brand"]
|
||||
|
||||
# Get unique stores and brands
|
||||
train_df = pd.read_csv(os.path.join(TRAIN_DIR, "train_round_1.csv"))
|
||||
store_list = train_df["store"].unique()
|
||||
brand_list = train_df["brand"].unique()
|
||||
store_brand = [(x, y) for x in store_list for y in brand_list]
|
||||
|
||||
# Train and predict for all forecast rounds
|
||||
pred_all = []
|
||||
file_name = os.path.join(SUBMISSION_DIR, "dcnn_model.h5")
|
||||
for r in range(bs.NUM_ROUNDS):
|
||||
print("---- Round " + str(r + 1) + " ----")
|
||||
offset = 0 if r == 0 else 40 + r * PRED_STEPS
|
||||
# Create features
|
||||
data_filled, data_scaled = make_features(r, TRAIN_DIR, PRED_STEPS, offset, store_list, brand_list)
|
||||
|
||||
# Create sequence array for 'move'
|
||||
start_timestep = 0
|
||||
end_timestep = bs.TRAIN_END_WEEK_LIST[r] - bs.TRAIN_START_WEEK - PRED_HORIZON
|
||||
train_input1 = gen_sequence_array(
|
||||
data_scaled, store_brand, SEQ_LEN, ["move"], start_timestep, end_timestep - offset
|
||||
)
|
||||
|
||||
# Create sequence array for other dynamic features
|
||||
start_timestep = PRED_HORIZON
|
||||
end_timestep = bs.TRAIN_END_WEEK_LIST[r] - bs.TRAIN_START_WEEK
|
||||
train_input2 = gen_sequence_array(
|
||||
data_scaled, store_brand, SEQ_LEN, DYNAMIC_FEATURES, start_timestep, end_timestep - offset
|
||||
)
|
||||
|
||||
seq_in = np.concatenate([train_input1, train_input2], axis=2)
|
||||
|
||||
# Create array of static features
|
||||
total_timesteps = bs.TRAIN_END_WEEK_LIST[r] - bs.TRAIN_START_WEEK - SEQ_LEN - PRED_HORIZON + 2
|
||||
cat_fea_in = static_feature_array(data_filled, total_timesteps - offset, STATIC_FEATURES)
|
||||
|
||||
# Create training output
|
||||
start_timestep = SEQ_LEN + PRED_HORIZON - PRED_STEPS
|
||||
end_timestep = bs.TRAIN_END_WEEK_LIST[r] - bs.TRAIN_START_WEEK
|
||||
train_output = gen_sequence_array(
|
||||
data_filled, store_brand, PRED_STEPS, ["move"], start_timestep, end_timestep - offset
|
||||
)
|
||||
train_output = np.squeeze(train_output)
|
||||
|
||||
# Create and train model
|
||||
if r == 0:
|
||||
model = create_dcnn_model(
|
||||
seq_len=SEQ_LEN, n_filters=2, n_input_series=1 + len(DYNAMIC_FEATURES), n_outputs=PRED_STEPS
|
||||
)
|
||||
adam = optimizers.Adam(lr=args.learning_rate)
|
||||
model.compile(loss="mape", optimizer=adam, metrics=["mape"])
|
||||
# Define checkpoint and fit model
|
||||
        checkpoint = ModelCheckpoint(file_name, monitor="loss", save_best_only=True, mode="min", verbose=0)
        callbacks_list = [checkpoint]
        history = model.fit(
            [seq_in, cat_fea_in],
            train_output,
            epochs=args.epochs,
            batch_size=args.batch_size,
            callbacks=callbacks_list,
            verbose=0,
        )
    else:
        model = load_model(file_name)
        checkpoint = ModelCheckpoint(file_name, monitor="loss", save_best_only=True, mode="min", verbose=0)
        callbacks_list = [checkpoint]
        history = model.fit(
            [seq_in, cat_fea_in],
            train_output,
            epochs=1,
            batch_size=args.batch_size,
            callbacks=callbacks_list,
            verbose=0,
        )

    # Get inputs for prediction
    start_timestep = bs.TEST_START_WEEK_LIST[r] - bs.TRAIN_START_WEEK - SEQ_LEN - PRED_HORIZON + PRED_STEPS
    end_timestep = bs.TEST_START_WEEK_LIST[r] - bs.TRAIN_START_WEEK + PRED_STEPS - 1 - PRED_HORIZON
    test_input1 = gen_sequence_array(
        data_scaled, store_brand, SEQ_LEN, ["move"], start_timestep - offset, end_timestep - offset
    )

    start_timestep = bs.TEST_END_WEEK_LIST[r] - bs.TRAIN_START_WEEK - SEQ_LEN + 1
    end_timestep = bs.TEST_END_WEEK_LIST[r] - bs.TRAIN_START_WEEK
    test_input2 = gen_sequence_array(
        data_scaled, store_brand, SEQ_LEN, DYNAMIC_FEATURES, start_timestep - offset, end_timestep - offset
    )

    seq_in = np.concatenate([test_input1, test_input2], axis=2)

    total_timesteps = 1
    cat_fea_in = static_feature_array(data_filled, total_timesteps, STATIC_FEATURES)

    # Make prediction
    pred = np.round(model.predict([seq_in, cat_fea_in]))

    # Create dataframe for submission
    exp_output = data_filled[data_filled.week >= bs.TEST_START_WEEK_LIST[r]].reset_index(drop=True)
    exp_output = exp_output[["store", "brand", "week"]]
    pred_df = (
        exp_output.sort_values(["store", "brand", "week"]).loc[:, ["store", "brand", "week"]].reset_index(drop=True)
    )
    pred_df["weeks_ahead"] = pred_df["week"] - bs.TRAIN_END_WEEK_LIST[r]
    pred_df["round"] = r + 1
    pred_df["prediction"] = np.reshape(pred, (pred.size, 1))
    pred_all.append(pred_df)

# Generate submission
submission = pd.concat(pred_all, axis=0).reset_index(drop=True)
submission = submission[["round", "store", "brand", "week", "weeks_ahead", "prediction"]]
filename = "submission_seed_" + str(args.seed) + ".csv"
submission.to_csv(os.path.join(SUBMISSION_DIR, filename), index=False)
print("Done")
@ -0,0 +1,212 @@
# coding: utf-8

# Perform cross validation of a Dilated Convolutional Neural Network (CNN) model on the training data of the 1st forecast round.

import os
import sys
import math
import keras
import argparse
import datetime
import numpy as np
import pandas as pd

from utils import *
from keras.layers import *
from keras.models import Model
from keras import optimizers
from keras.utils import multi_gpu_model
from azureml.core import Run


# Model definition
def create_dcnn_model(seq_len, kernel_size=2, n_filters=3, n_input_series=1, n_outputs=1):
    """Create a Dilated CNN model.

    Args:
        seq_len (Integer): Input sequence length
        kernel_size (Integer): Kernel size of each convolutional layer
        n_filters (Integer): Number of filters in each convolutional layer
        n_input_series (Integer): Number of input time series
        n_outputs (Integer): Number of outputs in the last layer

    Returns:
        Keras Model object
    """
    # Sequential input
    seq_in = Input(shape=(seq_len, n_input_series))

    # Categorical input
    cat_fea_in = Input(shape=(2,), dtype="uint8")
    store_id = Lambda(lambda x: x[:, 0, None])(cat_fea_in)
    brand_id = Lambda(lambda x: x[:, 1, None])(cat_fea_in)
    store_embed = Embedding(MAX_STORE_ID + 1, 7, input_length=1)(store_id)
    brand_embed = Embedding(MAX_BRAND_ID + 1, 4, input_length=1)(brand_id)

    # Dilated convolutional layers
    c1 = Conv1D(filters=n_filters, kernel_size=kernel_size, dilation_rate=1, padding="causal", activation="relu")(
        seq_in
    )
    c2 = Conv1D(filters=n_filters, kernel_size=kernel_size, dilation_rate=2, padding="causal", activation="relu")(c1)
    c3 = Conv1D(filters=n_filters, kernel_size=kernel_size, dilation_rate=4, padding="causal", activation="relu")(c2)

    # Skip connections
    c4 = concatenate([c1, c3])

    # Output of convolutional layers
    conv_out = Conv1D(8, 1, activation="relu")(c4)
    conv_out = Dropout(args.dropout_rate)(conv_out)
    conv_out = Flatten()(conv_out)

    # Concatenate with categorical features
    x = concatenate([conv_out, Flatten()(store_embed), Flatten()(brand_embed)])
    x = Dense(16, activation="relu")(x)
    output = Dense(n_outputs, activation="linear")(x)

    # Define model interface, loss function, and optimizer
    model = Model(inputs=[seq_in, cat_fea_in], outputs=output)

    return model
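The stacked causal convolutions above use dilation rates 1, 2, and 4 with kernel size 2, so each output step can see a window of past inputs. A minimal sketch (not part of the original script; the helper name `receptive_field` is illustrative) of how the receptive field grows with the dilation schedule:

```python
# Receptive field of a stack of causal convolutions: one output step sees
# 1 + sum((kernel_size - 1) * d) input steps across dilation rates d.
def receptive_field(kernel_size, dilation_rates):
    """Number of past time steps visible to a single output time step."""
    return 1 + sum((kernel_size - 1) * d for d in dilation_rates)

# For the model above (kernel_size=2, dilations 1, 2, 4):
print(receptive_field(2, [1, 2, 4]))  # -> 8
```

Doubling the dilation rate at each layer keeps the parameter count small while the receptive field grows exponentially with depth.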


if __name__ == "__main__":
    # Parse input arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-folder", type=str, dest="data_folder", help="data folder mounting point")
    parser.add_argument("--seq-len", type=int, dest="seq_len", default=20, help="length of the input sequence")
    parser.add_argument("--batch-size", type=int, dest="batch_size", default=64, help="mini batch size for training")
    parser.add_argument("--dropout-rate", type=float, dest="dropout_rate", default=0.10, help="dropout ratio")
    parser.add_argument("--learning-rate", type=float, dest="learning_rate", default=0.01, help="learning rate")
    parser.add_argument("--epochs", type=int, dest="epochs", default=30, help="# of epochs")
    args = parser.parse_args()
    args.dropout_rate = round(args.dropout_rate, 2)
    print(args)

    # Start an Azure ML run
    run = Run.get_context()

    # Data paths
    DATA_DIR = args.data_folder
    TRAIN_DIR = os.path.join(DATA_DIR, "train")

    # Data and forecast problem parameters
    MAX_STORE_ID = 137
    MAX_BRAND_ID = 11
    PRED_HORIZON = 3
    PRED_STEPS = 2
    TRAIN_START_WEEK = 40
    TRAIN_END_WEEK_LIST = list(range(135, 159, 2))
    TEST_START_WEEK_LIST = list(range(137, 161, 2))
    TEST_END_WEEK_LIST = list(range(138, 162, 2))
    # The start datetime of the first week in the record
    FIRST_WEEK_START = pd.to_datetime("1989-09-14 00:00:00")

    # Input sequence length and feature names
    SEQ_LEN = args.seq_len
    DYNAMIC_FEATURES = ["deal", "feat", "month", "week_of_month", "price", "price_ratio"]
    STATIC_FEATURES = ["store", "brand"]

    # Get unique stores and brands
    train_df = pd.read_csv(os.path.join(TRAIN_DIR, "train_round_1.csv"))
    store_list = train_df["store"].unique()
    brand_list = train_df["brand"].unique()
    store_brand = [(x, y) for x in store_list for y in brand_list]

    # Train and validate the model using only the first round data
    r = 0
    print("---- Round " + str(r + 1) + " ----")
    # Load training data
    train_df = pd.read_csv(os.path.join(TRAIN_DIR, "train_round_" + str(r + 1) + ".csv"))
    train_df["move"] = train_df["logmove"].apply(lambda x: round(math.exp(x)))
    train_df = train_df[["store", "brand", "week", "move"]]

    # Create a dataframe to hold all necessary data
    week_list = range(TRAIN_START_WEEK, TEST_END_WEEK_LIST[r] + 1)
    d = {"store": store_list, "brand": brand_list, "week": week_list}
    data_grid = df_from_cartesian_product(d)
    data_filled = pd.merge(data_grid, train_df, how="left", on=["store", "brand", "week"])

    # Get future price, deal, and advertisement info
    aux_df = pd.read_csv(os.path.join(TRAIN_DIR, "aux_round_" + str(r + 1) + ".csv"))
    data_filled = pd.merge(data_filled, aux_df, how="left", on=["store", "brand", "week"])

    # Create relative price feature
    price_cols = [
        "price1",
        "price2",
        "price3",
        "price4",
        "price5",
        "price6",
        "price7",
        "price8",
        "price9",
        "price10",
        "price11",
    ]
    data_filled["price"] = data_filled.apply(lambda x: x.loc["price" + str(int(x.loc["brand"]))], axis=1)
    data_filled["avg_price"] = data_filled[price_cols].sum(axis=1).apply(lambda x: x / len(price_cols))
    data_filled["price_ratio"] = data_filled.apply(lambda x: x["price"] / x["avg_price"], axis=1)
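The three statements above look up each row's own-brand price from the wide `price1..price11` columns and express it relative to the average price across brands. A toy sketch with made-up data (two brands, two price columns) shows the same derivation:

```python
# Toy illustration (not the real dataset): derive per-brand price and the
# relative price feature from wide price columns, mirroring the logic above.
import pandas as pd

toy = pd.DataFrame({"brand": [1, 2], "price1": [2.0, 2.0], "price2": [4.0, 4.0]})
price_cols = ["price1", "price2"]
# Each row reads the price column matching its own brand id
toy["price"] = toy.apply(lambda x: x.loc["price" + str(int(x.loc["brand"]))], axis=1)
toy["avg_price"] = toy[price_cols].sum(axis=1) / len(price_cols)
toy["price_ratio"] = toy["price"] / toy["avg_price"]
print(toy[["brand", "price", "price_ratio"]])
```

Brand 1 reads `price1` (2.0) against an average of 3.0, giving a ratio below 1; brand 2 reads `price2` (4.0), giving a ratio above 1.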

    # Fill missing values
    data_filled = data_filled.groupby(["store", "brand"]).apply(
        lambda x: x.fillna(method="ffill").fillna(method="bfill")
    )

    # Create datetime features
    data_filled["week_start"] = data_filled["week"].apply(
        lambda x: FIRST_WEEK_START + datetime.timedelta(days=(x - 1) * 7)
    )
    data_filled["day"] = data_filled["week_start"].apply(lambda x: x.day)
    data_filled["week_of_month"] = data_filled["week_start"].apply(lambda x: week_of_month(x))
    data_filled["month"] = data_filled["week_start"].apply(lambda x: x.month)
    data_filled.drop("week_start", axis=1, inplace=True)

    # Normalize the dataframe of features
    cols_normalize = data_filled.columns.difference(["store", "brand", "week"])
    data_scaled, min_max_scaler = normalize_dataframe(data_filled, cols_normalize)

    # Create sequence array for 'move'
    start_timestep = 0
    end_timestep = TRAIN_END_WEEK_LIST[r] - TRAIN_START_WEEK - PRED_HORIZON
    train_input1 = gen_sequence_array(data_scaled, store_brand, SEQ_LEN, ["move"], start_timestep, end_timestep)

    # Create sequence array for other dynamic features
    start_timestep = PRED_HORIZON
    end_timestep = TRAIN_END_WEEK_LIST[r] - TRAIN_START_WEEK
    train_input2 = gen_sequence_array(data_scaled, store_brand, SEQ_LEN, DYNAMIC_FEATURES, start_timestep, end_timestep)

    seq_in = np.concatenate((train_input1, train_input2), axis=2)

    # Create array of static features
    total_timesteps = TRAIN_END_WEEK_LIST[r] - TRAIN_START_WEEK - SEQ_LEN - PRED_HORIZON + 2
    cat_fea_in = static_feature_array(data_filled, total_timesteps, STATIC_FEATURES)

    # Create training output
    start_timestep = SEQ_LEN + PRED_HORIZON - PRED_STEPS
    end_timestep = TRAIN_END_WEEK_LIST[r] - TRAIN_START_WEEK
    train_output = gen_sequence_array(data_filled, store_brand, PRED_STEPS, ["move"], start_timestep, end_timestep)
    train_output = np.squeeze(train_output)

    # Create model
    model = create_dcnn_model(
        seq_len=SEQ_LEN, n_filters=2, n_input_series=1 + len(DYNAMIC_FEATURES), n_outputs=PRED_STEPS
    )

    # Convert to GPU model
    try:
        model = multi_gpu_model(model)
        print("Training using multiple GPUs...")
    except Exception:
        print("Training using single GPU or CPU...")

    adam = optimizers.Adam(lr=args.learning_rate)
    model.compile(loss="mape", optimizer=adam, metrics=["mape", "mae"])
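The `"mape"` loss selected above is mean absolute percentage error. A small NumPy sketch (illustrative only; Keras additionally guards the denominator with a small epsilon) of the same quantity:

```python
# MAPE on toy numbers: mean(|y_true - y_pred| / y_true) * 100
import numpy as np

def mape(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred) / y_true) * 100.0

# Errors of 10% and 10% average to a MAPE of 10.0
print(mape([100.0, 200.0], [90.0, 220.0]))  # -> 10.0
```

MAPE is scale-free across (store, brand) series with very different sales volumes, which is why it is a natural training objective here.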

    # Model training and validation
    history = model.fit(
        [seq_in, cat_fea_in], train_output, epochs=args.epochs, batch_size=args.batch_size, validation_split=0.05
    )
    val_loss = history.history["val_loss"][-1]
    print("Validation loss is {}".format(val_loss))

    # Log the validation loss/MAPE
    run.log("MAPE", float(val_loss))
@ -1,11 +1,12 @@
# coding: utf-8

# Utility functions for building the Dilated Convolutional Neural Network (CNN) model.

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def week_of_month(dt):
    """Get the week of the month for the specified date.
@ -14,14 +15,16 @@ def week_of_month(dt):

    Returns:
        wom (Integer): Week of the month of the input date
    """
    from math import ceil

    first_day = dt.replace(day=1)
    dom = dt.day
    adjusted_dom = dom + first_day.weekday()
    wom = int(ceil(adjusted_dom / 7.0))
    return wom
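A quick check of the calculation (the function is copied inline so the snippet runs on its own): the day of month is shifted by the weekday of the 1st, then divided into 7-day buckets.

```python
# week_of_month mirrors the utility above: January 1, 2020 (a Wednesday)
# is in week 1 of the month; January 15, 2020 lands in week 3.
import datetime
from math import ceil

def week_of_month(dt):
    first_day = dt.replace(day=1)
    adjusted_dom = dt.day + first_day.weekday()
    return int(ceil(adjusted_dom / 7.0))

print(week_of_month(datetime.date(2020, 1, 1)))   # -> 1
print(week_of_month(datetime.date(2020, 1, 15)))  # -> 3
```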


def df_from_cartesian_product(dict_in):
    """Generate a Pandas dataframe from Cartesian product of lists.
@ -33,11 +36,13 @@ def df_from_cartesian_product(dict_in):
    """
    from collections import OrderedDict
    from itertools import product

    od = OrderedDict(sorted(dict_in.items()))
    cart = list(product(*od.values()))
    df = pd.DataFrame(cart, columns=od.keys())
    return df
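Toy usage (the helper is inlined so the snippet is self-contained): every combination of the input lists becomes one row, with columns ordered by sorted key name.

```python
# Cartesian product of {store, brand, week} lists -> one row per combination.
from collections import OrderedDict
from itertools import product
import pandas as pd

def df_from_cartesian_product(dict_in):
    od = OrderedDict(sorted(dict_in.items()))
    cart = list(product(*od.values()))
    return pd.DataFrame(cart, columns=od.keys())

grid = df_from_cartesian_product({"store": [1, 2], "brand": [5, 6], "week": [40, 41, 42]})
print(len(grid))  # -> 12 rows (2 stores x 2 brands x 3 weeks)
```

In the training scripts this grid is left-joined with the raw sales data so that weeks with no recorded sales still appear as rows (to be filled later).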


def gen_sequence(df, seq_len, seq_cols, start_timestep=0, end_timestep=None):
    """Reshape features into an array of dimension (time steps, features).
@ -54,9 +59,12 @@ def gen_sequence(df, seq_len, seq_cols, start_timestep=0, end_timestep=None):
    data_array = df[seq_cols].values
    if end_timestep is None:
        end_timestep = df.shape[0]
    for start, stop in zip(
        range(start_timestep, end_timestep - seq_len + 2), range(start_timestep + seq_len, end_timestep + 2)
    ):
        yield data_array[start:stop, :]
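A toy run (function reimplemented inline so the snippet stands alone) makes the windowing concrete: with `seq_len=3` over timesteps 0..7, the generator yields overlapping windows `[0:3]`, `[1:4]`, ..., `[5:8]`.

```python
# Sliding windows over a single (store, brand) series.
import pandas as pd

def gen_sequence(df, seq_len, seq_cols, start_timestep=0, end_timestep=None):
    data_array = df[seq_cols].values
    if end_timestep is None:
        end_timestep = df.shape[0]
    for start, stop in zip(
        range(start_timestep, end_timestep - seq_len + 2), range(start_timestep + seq_len, end_timestep + 2)
    ):
        yield data_array[start:stop, :]

df = pd.DataFrame({"move": range(10)})
windows = list(gen_sequence(df, seq_len=3, seq_cols=["move"], start_timestep=0, end_timestep=7))
print(len(windows))        # -> 6 windows
print(windows[0].ravel())  # -> [0 1 2]
```

Note that `end_timestep` is inclusive here (the last window covers rows 5..7), which is why the ranges end at `end_timestep + 2` rather than `+ 1`.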


def gen_sequence_array(df_all, store_brand, seq_len, seq_cols, start_timestep=0, end_timestep=None):
    """Combine feature sequences for all the combinations of (store, brand) into a 3d array.
@ -70,11 +78,22 @@ def gen_sequence_array(df_all, store_brand, seq_len, seq_cols, start_timestep=0,
    Returns:
        seq_array (Numpy Array): An array of the feature sequences of all stores and brands
    """
    seq_gen = (
        list(
            gen_sequence(
                df_all[(df_all["store"] == cur_store) & (df_all["brand"] == cur_brand)],
                seq_len,
                seq_cols,
                start_timestep,
                end_timestep,
            )
        )
        for cur_store, cur_brand in store_brand
    )
    seq_array = np.concatenate(list(seq_gen)).astype(np.float32)
    return seq_array


def static_feature_array(df_all, total_timesteps, seq_cols):
    """Generate an array which encodes all the static features.
@ -86,10 +105,11 @@ def static_feature_array(df_all, total_timesteps, seq_cols):
    Return:
        fea_array (Numpy Array): An array of static features of all stores and brands
    """
    fea_df = df_all.groupby(["store", "brand"]).apply(lambda x: x.iloc[:total_timesteps, :]).reset_index(drop=True)
    fea_array = fea_df[seq_cols].values
    return fea_array
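Toy usage (reimplemented inline so it runs standalone): the first `total_timesteps` rows of each (store, brand) group are kept and their static columns stacked, so the static features line up row-for-row with the sequence windows produced per group.

```python
# Repeat static (store, brand) features once per training window.
import pandas as pd

def static_feature_array(df_all, total_timesteps, seq_cols):
    fea_df = (
        df_all.groupby(["store", "brand"]).apply(lambda x: x.iloc[:total_timesteps, :]).reset_index(drop=True)
    )
    return fea_df[seq_cols].values

df = pd.DataFrame(
    {"store": [1, 1, 1, 2, 2, 2], "brand": [5, 5, 5, 5, 5, 5], "week": [1, 2, 3, 1, 2, 3]}
)
arr = static_feature_array(df, total_timesteps=2, seq_cols=["store", "brand"])
print(arr.shape)  # -> (4, 2): 2 timesteps for each of the 2 (store, brand) groups
```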


def normalize_dataframe(df, seq_cols, scaler=MinMaxScaler()):
    """Normalize a subset of columns of a dataframe.
@ -102,7 +122,6 @@ def normalize_dataframe(df, seq_cols, scaler=MinMaxScaler()):
        df_scaled (Dataframe): Normalized dataframe
    """
    cols_fixed = df.columns.difference(seq_cols)
    df_scaled = pd.DataFrame(scaler.fit_transform(df[seq_cols]), columns=seq_cols, index=df.index)
    df_scaled = pd.concat([df[cols_fixed], df_scaled], axis=1)
    return df_scaled, scaler
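Toy usage (inlined for self-containment): the selected columns are min-max scaled to [0, 1] while the remaining columns pass through untouched, and the fitted scaler is returned so the same transform can later be inverted or applied to test data.

```python
# Min-max scale a subset of columns; identifier columns are left unscaled.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def normalize_dataframe(df, seq_cols, scaler=MinMaxScaler()):
    cols_fixed = df.columns.difference(seq_cols)
    df_scaled = pd.DataFrame(scaler.fit_transform(df[seq_cols]), columns=seq_cols, index=df.index)
    df_scaled = pd.concat([df[cols_fixed], df_scaled], axis=1)
    return df_scaled, scaler

df = pd.DataFrame({"store": [1, 1, 1], "move": [10.0, 20.0, 30.0]})
scaled, scaler = normalize_dataframe(df, ["move"])
print(scaled["move"].tolist())  # -> [0.0, 0.5, 1.0]
```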
@ -84,27 +84,25 @@ to check if conda has been installed by runnning command `conda -V`. If it is in
   `/test` under the data directory, respectively. After running the above command, you can deactivate the conda environment by running
   `source deactivate`.

5. Make sure Docker is installed

   You can check if Docker is installed on your VM by running

   ```bash
   sudo docker -v
   ```
   You will see the Docker version if Docker is installed. If not, you can install it by following the instructions [here](https://docs.docker.com/install/linux/docker-ce/ubuntu/). Note that if you want to execute Docker commands without sudo as a non-root user, you need to create a Unix group and add users to it by following the instructions [here](https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user).

6. Build a local Docker image by running the following command from `~/Forecasting` directory

   ```bash
   sudo docker build -t baseline_image:v1 ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/ETS
   ```

7. Choose a name for a new Docker container (e.g. ets_container) and create it using command:

   ```bash
   sudo docker run -it -v ~/Forecasting:/Forecasting --name ets_container baseline_image:v1
   ```

   Note that option `-v ~/Forecasting:/Forecasting` allows you to mount `~/Forecasting` folder (the one you cloned) to the container so that you will have
@ -142,7 +140,7 @@ to check if conda has been installed by runnning command `conda -V`. If it is in

**Data storage:** Premium SSD

**Dockerfile:** [retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/ETS/Dockerfile](https://github.com/Microsoft/Forecasting/blob/master/retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/ETS/Dockerfile)

**Key packages/dependencies:**
* R
@ -94,28 +94,26 @@ to check if conda has been installed by runnning command `conda -V`. If it is in
   `/test` under the data directory, respectively. After running the above command, you can deactivate the conda environment by running
   `source deactivate`.

5. Make sure Docker is installed

   You can check if Docker is installed on your VM by running

   ```bash
   sudo docker -v
   ```
   You will see the Docker version if Docker is installed. If not, you can install it by following the instructions [here](https://docs.docker.com/install/linux/docker-ce/ubuntu/). Note that if you want to execute Docker commands without sudo as a non-root user, you need to create a Unix group and add users to it by following the instructions [here](https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user).

6. Build a local Docker image by running the following command from `~/Forecasting` directory

   ```bash
   sudo docker build -t lightgbm_image:v1 ./retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/LightGBM
   ```

7. Choose a name for a new Docker container (e.g. lightgbm_container) and create it using command:

   ```bash
   cd ~/Forecasting
   sudo docker run -it -v ~/Forecasting:/Forecasting --name lightgbm_container lightgbm_image:v1
   ```

   Note that option `-v ~/Forecasting:/Forecasting` allows you to mount `~/Forecasting` folder (the one you cloned) to the container so that you will have
@ -153,7 +151,7 @@ to check if conda has been installed by runnning command `conda -V`. If it is in

**Data storage:** Premium SSD

**Dockerfile:** [retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/LightGBM/Dockerfile](https://github.com/Microsoft/Forecasting/blob/master/retail_sales/OrangeJuice_Pt_3Weeks_Weekly/submissions/LightGBM/Dockerfile)

**Key packages/dependencies:**
* Python
@ -1,449 +1,449 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tuning Hyperparameters of LightGBM Model with AML SDK and HyperDrive\n",
"\n",
"This notebook performs hyperparameter tuning of a LightGBM model with AML SDK and HyperDrive. It selects the best model by cross validation using the training data in the first forecast round. Specifically, it splits the training data into sub-training data and validation data. Then, it trains LightGBM models with different sets of hyperparameters using the sub-training data and evaluates the accuracy of each model with the validation data. The set of hyperparameters which yields the best validation accuracy will be used to train models and forecast sales across all 12 forecast rounds.\n",
"\n",
"## Prerequisites\n",
"To run this notebook, you need to install AML SDK and its widget extension in your environment by running the following commands in a terminal. Before running the commands, you need to activate your environment by executing `source activate <your env>` in a Linux VM. \n",
"`pip3 install --upgrade azureml-sdk[notebooks,automl]` \n",
"`jupyter nbextension install --py --user azureml.widgets` \n",
"`jupyter nbextension enable --py --user azureml.widgets` \n",
"\n",
"To add the environment to your Jupyter kernels, you can do `python3 -m ipykernel install --name <your env>`. Besides, you need to create an Azure ML workspace and download its configuration file (`config.json`) by following the [configuration.ipynb](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import azureml\n",
"from azureml.core import Workspace, Run\n",
"\n",
"# Check core SDK version number\n",
"print(\"Azure ML SDK Version: \", azureml.core.VERSION)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.telemetry import set_diagnostics_collection\n",
"\n",
"# Opt-in diagnostics for better experience of future releases\n",
"set_diagnostics_collection(send_diagnostics=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialize Workspace & Create an Azure ML Experiment\n",
"\n",
"Initialize a [Machine Learning Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the workspace you created in the Prerequisites step. `Workspace.from_config()` below creates a workspace object from the details stored in `config.json` that you have downloaded."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.workspace import Workspace\n",
"\n",
"ws = Workspace.from_config()\n",
"print('Workspace name: ' + ws.name, \n",
" 'Azure region: ' + ws.location, \n",
" 'Resource group: ' + ws.resource_group, sep = '\\n')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Experiment\n",
"\n",
"exp = Experiment(workspace=ws, name='tune_lgbm')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Validate Script Locally"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.runconfig import RunConfiguration\n",
"\n",
"# Configure local, user managed environment\n",
"run_config_user_managed = RunConfiguration()\n",
"run_config_user_managed.environment.python.user_managed_dependencies = True\n",
"run_config_user_managed.environment.python.interpreter_path = '/usr/bin/python3.5'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import ScriptRunConfig\n",
"\n",
"# Please update data-folder argument before submitting the job\n",
"src = ScriptRunConfig(source_directory='./', \n",
" script='train_validate.py', \n",
" arguments=['--data-folder', \n",
" '/home/chenhui/TSPerf/retail_sales/OrangeJuice_Pt_3Weeks_Weekly/data/', \n",
" '--bagging-fraction', '0.8'],\n",
" run_config=run_config_user_managed)\n",
"run_local = exp.submit(src)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check job status\n",
"run_local.get_status()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check results\n",
"while(run_local.get_status() != 'Completed'): {}\n",
"run_local.get_details()\n",
"run_local.get_metrics()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Run Script on Remote Compute Target"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a CPU cluster as compute target"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.compute import ComputeTarget, AmlCompute\n",
"from azureml.core.compute_target import ComputeTargetException\n",
"\n",
"# Choose a name for your cluster\n",
"cluster_name = \"cpucluster\"\n",
"\n",
"try:\n",
" # Look for the existing cluster by name\n",
" compute_target = ComputeTarget(workspace=ws, name=cluster_name)\n",
" if type(compute_target) is AmlCompute:\n",
" print('Found existing compute target {}.'.format(cluster_name))\n",
" else:\n",
" print('{} exists but it is not an AML Compute target. Please choose a different name.'.format(cluster_name))\n",
"except ComputeTargetException:\n",
" print('Creating a new compute target...')\n",
" compute_config = AmlCompute.provisioning_configuration(vm_size=\"STANDARD_D14_v2\", # CPU-based VM\n",
" #vm_priority='lowpriority', # optional\n",
" min_nodes=0, \n",
" max_nodes=4,\n",
" idle_seconds_before_scaledown=3600)\n",
" # Create the cluster\n",
" compute_target = ComputeTarget.create(ws, cluster_name, compute_config)\n",
" # Can poll for a minimum number of nodes and for a specific timeout. \n",
" # if no min node count is provided it uses the scale settings for the cluster\n",
" compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)\n",
" # Get a detailed status for the current cluster. \n",
" print(compute_target.serialize())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# If you have created the compute target, you should see one entry named 'cpucluster' of type AmlCompute \n",
"# in the workspace's compute_targets property.\n",
"compute_targets = ws.compute_targets\n",
"for name, ct in compute_targets.items():\n",
" print(name, ct.type, ct.provisioning_state)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure Docker environment"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.runconfig import EnvironmentDefinition\n",
"from azureml.core.conda_dependencies import CondaDependencies\n",
"\n",
"env = EnvironmentDefinition()\n",
"env.python.user_managed_dependencies = False\n",
"env.python.conda_dependencies = CondaDependencies.create(conda_packages=['pandas', 'numpy', 'scipy', 'scikit-learn', 'lightgbm', 'joblib'],\n",
" python_version='3.6.2')\n",
"env.python.conda_dependencies.add_channel('conda-forge')\n",
"env.docker.enabled=True"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Upload data to default datastore\n",
"\n",
"Upload the Orange Juice dataset to the workspace's default datastore, which will later be mounted on the cluster for model training and validation. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ds = ws.get_default_datastore()\n",
"print(ds.datastore_type, ds.account_name, ds.container_name)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"path_on_datastore = 'data'\n",
"ds.upload(src_dir='../../data', target_path=path_on_datastore, overwrite=True, show_progress=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Get data reference object for the data path\n",
"ds_data = ds.path(path_on_datastore)\n",
"print(ds_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create estimator\n",
"Next, we will check if the remote compute target is successfully created by submitting a job to the target. This compute target will be used by HyperDrive to tune the hyperparameters later. You may skip this part of code and directly jump into [Tune Hyperparameters using HyperDrive](#tune-hyperparameters-using-hyperdrive)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.runconfig import EnvironmentDefinition\n",
"from azureml.train.estimator import Estimator\n",
"\n",
"script_folder = './'\n",
"script_params = {\n",
" '--data-folder': ds_data.as_mount(),\n",
" '--bagging-fraction': 0.8\n",
"}\n",
"est = Estimator(source_directory=script_folder,\n",
" script_params=script_params,\n",
" compute_target=compute_target,\n",
" use_docker=True,\n",
" entry_script='train_validate.py',\n",
" environment_definition=env)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit job"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Submit job to compute target\n",
"run_remote = exp.submit(config=est)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Check job status"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.widgets import RunDetails\n",
"\n",
"RunDetails(run_remote).show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"run_remote.get_details()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Get metric value after the job finishes \n",
"while(run_remote.get_status() != 'Completed'): {}\n",
"run_remote.get_metrics()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='tune-hyperparameters-using-hyperdrive'></a>\n",
"## Tune Hyperparameters using HyperDrive"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.train.hyperdrive import *\n",
"\n",
"script_folder = './'\n",
"script_params = {\n",
" '--data-folder': ds_data.as_mount() \n",
"}\n",
"est = Estimator(source_directory=script_folder,\n",
" script_params=script_params,\n",
" compute_target=compute_target,\n",
" use_docker=True,\n",
" entry_script='train_validate.py',\n",
" environment_definition=env)\n",
"ps = BayesianParameterSampling({\n",
" '--num-leaves': quniform(8, 128, 1),\n",
" '--min-data-in-leaf': quniform(20, 500, 10),\n",
" '--learning-rate': choice(1e-4, 1e-3, 5e-3, 1e-2, 1.5e-2, 2e-2, 3e-2, 5e-2, 1e-1),\n",
" '--feature-fraction': uniform(0.2, 1), \n",
" '--bagging-fraction': uniform(0.1, 1), \n",
" '--bagging-freq': quniform(1, 20, 1), \n",
" '--max-rounds': quniform(50, 2000, 10),\n",
" '--max-lag': quniform(3, 40, 1), \n",
" '--window-size': quniform(3, 40, 1), \n",
|
||||
"})\n",
|
||||
"htc = HyperDriveRunConfig(estimator=est, \n",
|
||||
" hyperparameter_sampling=ps, \n",
|
||||
" primary_metric_name='MAPE', \n",
|
||||
" primary_metric_goal=PrimaryMetricGoal.MINIMIZE, \n",
|
||||
" max_total_runs=200,\n",
|
||||
" max_concurrent_runs=4)\n",
|
||||
"htr = exp.submit(config=htc)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"RunDetails(htr).show()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"while(htr.get_status() != 'Completed'): {}\n",
|
||||
"htr.get_metrics()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"best_run = htr.get_best_run_by_primary_metric()\n",
|
||||
"parameter_values = best_run.get_details()['runDefinition']['Arguments']\n",
|
||||
"print(parameter_values)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.5.2"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Tuning Hyperparameters of LightGBM Model with AML SDK and HyperDrive\n",
|
||||
"\n",
|
||||
"This notebook performs hyperparameter tuning of LightGBM model with AML SDK and HyperDrive. It selects the best model by cross validation using the training data in the first forecast round. Specifically, it splits the training data into sub-training data and validation data. Then, it trains LightGBM models with different sets of hyperparameters using the sub-training data and evaluate the accuracy of each model with the validation data. The set of hyperparameters which yield the best validation accuracy will be used to train models and forecast sales across all 12 forecast rounds.\n",
|
||||
"\n",
|
||||
"## Prerequisites\n",
|
||||
"To run this notebook, you need to install AML SDK and its widget extension in your environment by running the following commands in a terminal. Before running the commands, you need to activate your environment by executing `source activate <your env>` in a Linux VM. \n",
|
||||
"`pip3 install --upgrade azureml-sdk[notebooks,automl]` \n",
|
||||
"`jupyter nbextension install --py --user azureml.widgets` \n",
|
||||
"`jupyter nbextension enable --py --user azureml.widgets` \n",
|
||||
"\n",
|
||||
"To add the environment to your Jupyter kernels, you can do `python3 -m ipykernel install --name <your env>`. Besides, you need to create an Azure ML workspace and download its configuration file (`config.json`) by following the [configuration.ipynb](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) notebook."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import azureml\n",
|
||||
"from azureml.core import Workspace, Run\n",
|
||||
"\n",
|
||||
"# Check core SDK version number\n",
|
||||
"print(\"Azure ML SDK Version: \", azureml.core.VERSION)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.telemetry import set_diagnostics_collection\n",
|
||||
"\n",
|
||||
"# Opt-in diagnostics for better experience of future releases\n",
|
||||
"set_diagnostics_collection(send_diagnostics=True)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Initialize Workspace & Create an Azure ML Experiment\n",
|
||||
"\n",
|
||||
"Initialize a [Machine Learning Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the workspace you created in the Prerequisites step. `Workspace.from_config()` below creates a workspace object from the details stored in `config.json` that you have downloaded."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core.workspace import Workspace\n",
|
||||
"\n",
|
||||
"ws = Workspace.from_config()\n",
|
||||
"print('Workspace name: ' + ws.name, \n",
|
||||
" 'Azure region: ' + ws.location, \n",
|
||||
" 'Resource group: ' + ws.resource_group, sep = '\\n')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core import Experiment\n",
|
||||
"\n",
|
||||
"exp = Experiment(workspace=ws, name='tune_lgbm')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Validate Script Locally"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core.runconfig import RunConfiguration\n",
|
||||
"\n",
|
||||
"# Configure local, user managed environment\n",
|
||||
"run_config_user_managed = RunConfiguration()\n",
|
||||
"run_config_user_managed.environment.python.user_managed_dependencies = True\n",
|
||||
"run_config_user_managed.environment.python.interpreter_path = '/usr/bin/python3.5'"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core import ScriptRunConfig\n",
|
||||
"\n",
|
||||
"# Please update data-folder argument before submitting the job\n",
|
||||
"src = ScriptRunConfig(source_directory='./', \n",
|
||||
" script='train_validate.py', \n",
|
||||
" arguments=['--data-folder', \n",
|
||||
" '/home/chenhui/TSPerf/retail_sales/OrangeJuice_Pt_3Weeks_Weekly/data/', \n",
|
||||
" '--bagging-fraction', '0.8'],\n",
|
||||
" run_config=run_config_user_managed)\n",
|
||||
"run_local = exp.submit(src)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Check job status\n",
|
||||
"run_local.get_status()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Check results\n",
|
||||
"while(run_local.get_status() != 'Completed'): {}\n",
|
||||
"run_local.get_details()\n",
|
||||
"run_local.get_metrics()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Run Script on Remote Compute Target"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Create a CPU cluster as compute target"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core.compute import ComputeTarget, AmlCompute\n",
|
||||
"from azureml.core.compute_target import ComputeTargetException\n",
|
||||
"\n",
|
||||
"# Choose a name for your cluster\n",
|
||||
"cluster_name = \"cpucluster\"\n",
|
||||
"\n",
|
||||
"try:\n",
|
||||
" # Look for the existing cluster by name\n",
|
||||
" compute_target = ComputeTarget(workspace=ws, name=cluster_name)\n",
|
||||
" if type(compute_target) is AmlCompute:\n",
|
||||
" print('Found existing compute target {}.'.format(cluster_name))\n",
|
||||
" else:\n",
|
||||
" print('{} exists but it is not an AML Compute target. Please choose a different name.'.format(cluster_name))\n",
|
||||
"except ComputeTargetException:\n",
|
||||
" print('Creating a new compute target...')\n",
|
||||
" compute_config = AmlCompute.provisioning_configuration(vm_size=\"STANDARD_D14_v2\", # CPU-based VM\n",
|
||||
" #vm_priority='lowpriority', # optional\n",
|
||||
" min_nodes=0, \n",
|
||||
" max_nodes=4,\n",
|
||||
" idle_seconds_before_scaledown=3600)\n",
|
||||
" # Create the cluster\n",
|
||||
" compute_target = ComputeTarget.create(ws, cluster_name, compute_config)\n",
|
||||
" # Can poll for a minimum number of nodes and for a specific timeout. \n",
|
||||
" # if no min node count is provided it uses the scale settings for the cluster\n",
|
||||
" compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)\n",
|
||||
" # Get a detailed status for the current cluster. \n",
|
||||
" print(compute_target.serialize())"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# If you have created the compute target, you should see one entry named 'cpucluster' of type AmlCompute \n",
|
||||
"# in the workspace's compute_targets property.\n",
|
||||
"compute_targets = ws.compute_targets\n",
|
||||
"for name, ct in compute_targets.items():\n",
|
||||
" print(name, ct.type, ct.provisioning_state)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Configure Docker environment"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core.runconfig import EnvironmentDefinition\n",
|
||||
"from azureml.core.conda_dependencies import CondaDependencies\n",
|
||||
"\n",
|
||||
"env = EnvironmentDefinition()\n",
|
||||
"env.python.user_managed_dependencies = False\n",
|
||||
"env.python.conda_dependencies = CondaDependencies.create(conda_packages=['pandas', 'numpy', 'scipy', 'scikit-learn', 'lightgbm', 'joblib'],\n",
|
||||
" python_version='3.6.2')\n",
|
||||
"env.python.conda_dependencies.add_channel('conda-forge')\n",
|
||||
"env.docker.enabled=True"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Upload data to default datastore\n",
|
||||
"\n",
|
||||
"Upload the Orange Juice dataset to the workspace's default datastore, which will later be mounted on the cluster for model training and validation. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ds = ws.get_default_datastore()\n",
|
||||
"print(ds.datastore_type, ds.account_name, ds.container_name)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"path_on_datastore = 'data'\n",
|
||||
"ds.upload(src_dir='../../data', target_path=path_on_datastore, overwrite=True, show_progress=True)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Get data reference object for the data path\n",
|
||||
"ds_data = ds.path(path_on_datastore)\n",
|
||||
"print(ds_data)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Create estimator\n",
|
||||
"Next, we will check if the remote compute target is successfully created by submitting a job to the target. This compute target will be used by HyperDrive to tune the hyperparameters later. You may skip this part of code and directly jump into [Tune Hyperparameters using HyperDrive](#tune-hyperparameters-using-hyperdrive)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core.runconfig import EnvironmentDefinition\n",
|
||||
"from azureml.train.estimator import Estimator\n",
|
||||
"\n",
|
||||
"script_folder = './'\n",
|
||||
"script_params = {\n",
|
||||
" '--data-folder': ds_data.as_mount(),\n",
|
||||
" '--bagging-fraction': 0.8\n",
|
||||
"}\n",
|
||||
"est = Estimator(source_directory=script_folder,\n",
|
||||
" script_params=script_params,\n",
|
||||
" compute_target=compute_target,\n",
|
||||
" use_docker=True,\n",
|
||||
" entry_script='train_validate.py',\n",
|
||||
" environment_definition=env)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Submit job"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Submit job to compute target\n",
|
||||
"run_remote = exp.submit(config=est)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Check job status"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.widgets import RunDetails\n",
|
||||
"\n",
|
||||
"RunDetails(run_remote).show()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"run_remote.get_details()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Get metric value after the job finishes \n",
|
||||
"while(run_remote.get_status() != 'Completed'): {}\n",
|
||||
"run_remote.get_metrics()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a id='tune-hyperparameters-using-hyperdrive'></a>\n",
|
||||
"## Tune Hyperparameters using HyperDrive"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.train.hyperdrive import *\n",
|
||||
"\n",
|
||||
"script_folder = './'\n",
|
||||
"script_params = {\n",
|
||||
" '--data-folder': ds_data.as_mount() \n",
|
||||
"}\n",
|
||||
"est = Estimator(source_directory=script_folder,\n",
|
||||
" script_params=script_params,\n",
|
||||
" compute_target=compute_target,\n",
|
||||
" use_docker=True,\n",
|
||||
" entry_script='train_validate.py',\n",
|
||||
" environment_definition=env)\n",
|
||||
"ps = BayesianParameterSampling({\n",
|
||||
" '--num-leaves': quniform(8, 128, 1),\n",
|
||||
" '--min-data-in-leaf': quniform(20, 500, 10),\n",
|
||||
" '--learning-rate': choice(1e-4, 1e-3, 5e-3, 1e-2, 1.5e-2, 2e-2, 3e-2, 5e-2, 1e-1),\n",
|
||||
" '--feature-fraction': uniform(0.2, 1), \n",
|
||||
" '--bagging-fraction': uniform(0.1, 1), \n",
|
||||
" '--bagging-freq': quniform(1, 20, 1), \n",
|
||||
" '--max-rounds': quniform(50, 2000, 10),\n",
|
||||
" '--max-lag': quniform(3, 40, 1), \n",
|
||||
" '--window-size': quniform(3, 40, 1), \n",
|
||||
"})\n",
|
||||
"htc = HyperDriveRunConfig(estimator=est, \n",
|
||||
" hyperparameter_sampling=ps, \n",
|
||||
" primary_metric_name='MAPE', \n",
|
||||
" primary_metric_goal=PrimaryMetricGoal.MINIMIZE, \n",
|
||||
" max_total_runs=200,\n",
|
||||
" max_concurrent_runs=4)\n",
|
||||
"htr = exp.submit(config=htc)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"RunDetails(htr).show()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"while(htr.get_status() != 'Completed'): {}\n",
|
||||
"htr.get_metrics()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"best_run = htr.get_best_run_by_primary_metric()\n",
|
||||
"parameter_values = best_run.get_details()['runDefinition']['Arguments']\n",
|
||||
"print(parameter_values)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.5.2"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
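The HyperDrive configuration in the notebook above minimizes a primary metric named `MAPE`, which is logged by `train_validate.py`. Since that script is not shown here, the following is only a minimal sketch of how a mean absolute percentage error metric is typically computed; the `mape` helper name and exact formula are assumptions.

```python
import numpy as np

def mape(actual, predicted):
    # Hypothetical helper: mean absolute percentage error, in percent.
    # The metric actually logged by train_validate.py may differ in detail.
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)

print(mape([100, 200, 400], [110, 180, 400]))  # ~6.67
```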
|
|
@@ -1,6 +1,6 @@
|
|||
# coding: utf-8
|
||||
|
||||
# Create input features for the boosted decision tree model.
|
||||
|
||||
|
||||
import os
|
||||
import sys
|
||||
|
@@ -9,9 +9,9 @@ import itertools
|
|||
import datetime
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
import lightgbm as lgb
|
||||
|
||||
|
||||
# Append TSPerf path to sys.path
|
||||
|
||||
tsperf_dir = os.getcwd()
|
||||
if tsperf_dir not in sys.path:
|
||||
sys.path.append(tsperf_dir)
|
||||
|
@@ -20,6 +20,7 @@ if tsperf_dir not in sys.path:
|
|||
from utils import *
|
||||
import retail_sales.OrangeJuice_Pt_3Weeks_Weekly.common.benchmark_settings as bs
|
||||
|
||||
|
||||
def lagged_features(df, lags):
|
||||
"""Create lagged features based on time series data.
|
||||
|
||||
|
@@ -33,11 +34,12 @@ def lagged_features(df, lags):
|
|||
df_list = []
|
||||
for lag in lags:
|
||||
df_shifted = df.shift(lag)
|
||||
|
||||
df_shifted.columns = [x + "_lag" + str(lag) for x in df_shifted.columns]
|
||||
df_list.append(df_shifted)
|
||||
fea = pd.concat(df_list, axis=1)
|
||||
return fea
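The shift-based lag logic in `lagged_features` can be illustrated on a toy frame (assumed data, not the Orange Juice dataset): each lag is a shifted copy of the original columns with a `_lag<k>` suffix.

```python
import pandas as pd

# Toy illustration of the lagged_features pattern above.
df = pd.DataFrame({"move": [10, 20, 30, 40]})
fea = pd.concat(
    [df.shift(lag).add_suffix("_lag" + str(lag)) for lag in [1, 2]], axis=1
)
print(list(fea.columns))  # ['move_lag1', 'move_lag2']
```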
|
||||
|
||||
|
||||
def moving_averages(df, start_step, window_size=None):
|
||||
"""Compute averages of every feature over moving time windows.
|
||||
|
||||
|
@@ -49,12 +51,13 @@ def moving_averages(df, start_step, window_size=None):
|
|||
Returns:
|
||||
fea (Dataframe): Dataframe consisting of the moving averages
|
||||
"""
|
||||
if window_size is None:  # Use a large window to compute average over all historical data
|
||||
window_size = df.shape[0]
|
||||
fea = df.shift(start_step).rolling(min_periods=1, center=False, window=window_size).mean()
|
||||
|
||||
fea.columns = fea.columns + "_mean"
|
||||
return fea
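The shift-then-rolling-mean pattern in `moving_averages` can be checked on toy data: values are first shifted by `start_step`, then averaged over a window of up to `window_size` prior steps.

```python
import pandas as pd

# Toy check of the moving_averages pattern: shift by start_step=2,
# then take a rolling mean over a window of 2 with min_periods=1.
df = pd.DataFrame({"move": [1.0, 2.0, 3.0, 4.0]})
fea = df.shift(2).rolling(min_periods=1, center=False, window=2).mean()
print(fea["move"].tolist())  # [nan, nan, 1.0, 1.5]
```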
|
||||
|
||||
|
||||
def combine_features(df, lag_fea, lags, window_size, used_columns):
|
||||
"""Combine different features for a certain store-brand.
|
||||
|
||||
|
@@ -73,6 +76,7 @@ def combine_features(df, lag_fea, lags, window_size, used_columns):
|
|||
fea_all = pd.concat([df[used_columns], lagged_fea, moving_avg], axis=1)
|
||||
return fea_all
|
||||
|
||||
|
||||
def make_features(pred_round, train_dir, lags, window_size, offset, used_columns, store_list, brand_list):
|
||||
"""Create a dataframe of the input features.
|
||||
|
||||
|
@@ -88,46 +92,59 @@ def make_features(pred_round, train_dir, lags, window_size, offset, used_columns
|
|||
|
||||
Returns:
|
||||
features (Dataframe): Dataframe including all the input features and target variable
|
||||
"""
|
||||
"""
|
||||
# Load training data
|
||||
train_df = pd.read_csv(os.path.join(train_dir, "train_round_" + str(pred_round + 1) + ".csv"))
|
||||
train_df["move"] = train_df["logmove"].apply(lambda x: round(math.exp(x)))
|
||||
train_df = train_df[["store", "brand", "week", "move"]]
|
||||
|
||||
# Create a dataframe to hold all necessary data
|
||||
week_list = range(bs.TRAIN_START_WEEK + offset, bs.TEST_END_WEEK_LIST[pred_round] + 1)
|
||||
d = {"store": store_list, "brand": brand_list, "week": week_list}
|
||||
data_grid = df_from_cartesian_product(d)
|
||||
data_filled = pd.merge(data_grid, train_df, how="left", on=["store", "brand", "week"])
|
||||
|
||||
# Get future price, deal, and advertisement info
|
||||
aux_df = pd.read_csv(os.path.join(train_dir, "aux_round_" + str(pred_round + 1) + ".csv"))
|
||||
data_filled = pd.merge(data_filled, aux_df, how="left", on=["store", "brand", "week"])
|
||||
|
||||
# Create relative price feature
|
||||
price_cols = [
|
||||
"price1",
|
||||
"price2",
|
||||
"price3",
|
||||
"price4",
|
||||
"price5",
|
||||
"price6",
|
||||
"price7",
|
||||
"price8",
|
||||
"price9",
|
||||
"price10",
|
||||
"price11",
|
||||
]
|
||||
data_filled["price"] = data_filled.apply(lambda x: x.loc["price" + str(int(x.loc["brand"]))], axis=1)
|
||||
data_filled["avg_price"] = data_filled[price_cols].sum(axis=1).apply(lambda x: x / len(price_cols))
|
||||
data_filled["price_ratio"] = data_filled["price"] / data_filled["avg_price"]
|
||||
data_filled.drop(price_cols, axis=1, inplace=True)
|
||||
|
||||
# Fill missing values
|
||||
data_filled = data_filled.groupby(["store", "brand"]).apply(
|
||||
lambda x: x.fillna(method="ffill").fillna(method="bfill")
|
||||
)
|
||||
|
||||
# Create datetime features
|
||||
data_filled["week_start"] = data_filled["week"].apply(
|
||||
lambda x: bs.FIRST_WEEK_START + datetime.timedelta(days=(x - 1) * 7)
|
||||
)
|
||||
data_filled["year"] = data_filled["week_start"].apply(lambda x: x.year)
|
||||
data_filled["month"] = data_filled["week_start"].apply(lambda x: x.month)
|
||||
data_filled["week_of_month"] = data_filled["week_start"].apply(lambda x: week_of_month(x))
|
||||
data_filled["day"] = data_filled["week_start"].apply(lambda x: x.day)
|
||||
data_filled.drop("week_start", axis=1, inplace=True)
|
||||
|
||||
# Create other features (lagged features, moving averages, etc.)
|
||||
features = data_filled.groupby(["store", "brand"]).apply(
|
||||
lambda x: combine_features(x, ["move"], lags, window_size, used_columns)
|
||||
)
|
||||
|
||||
return features
|
|
@@ -0,0 +1,201 @@
|
|||
# coding: utf-8
|
||||
|
||||
# Create input features for the boosted decision tree model.
|
||||
|
||||
import os
|
||||
import sys
|
||||
import math
|
||||
import datetime
|
||||
import pandas as pd
|
||||
|
||||
from sklearn.pipeline import Pipeline
|
||||
from common.features.lag import LagFeaturizer
|
||||
from common.features.rolling_window import RollingWindowFeaturizer
|
||||
from common.features.stats import PopularityFeaturizer
|
||||
from common.features.temporal import TemporalFeaturizer
|
||||
|
||||
# Append TSPerf path to sys.path
|
||||
tsperf_dir = os.getcwd()
|
||||
if tsperf_dir not in sys.path:
|
||||
sys.path.append(tsperf_dir)
|
||||
|
||||
# Import TSPerf components
|
||||
from utils import df_from_cartesian_product
|
||||
import retail_sales.OrangeJuice_Pt_3Weeks_Weekly.common.benchmark_settings as bs
|
||||
|
||||
pd.set_option("display.max_columns", None)
|
||||
|
||||
|
||||
def oj_preprocess(df, aux_df, week_list, store_list, brand_list, train_df=None):
|
||||
|
||||
df["move"] = df["logmove"].apply(lambda x: round(math.exp(x)))
|
||||
df = df[["store", "brand", "week", "move"]].copy()
|
||||
|
||||
# Create a dataframe to hold all necessary data
|
||||
d = {"store": store_list, "brand": brand_list, "week": week_list}
|
||||
data_grid = df_from_cartesian_product(d)
|
||||
data_filled = pd.merge(data_grid, df, how="left", on=["store", "brand", "week"])
|
||||
|
||||
# Get future price, deal, and advertisement info
|
||||
data_filled = pd.merge(data_filled, aux_df, how="left", on=["store", "brand", "week"])
|
||||
|
||||
# Fill missing values
|
||||
if train_df is not None:
|
||||
data_filled = pd.concat([train_df, data_filled])  # pd.concat expects a sequence of frames
|
||||
forecast_creation_time = train_df["week_start"].max()
|
||||
|
||||
data_filled = data_filled.groupby(["store", "brand"]).apply(
|
||||
lambda x: x.fillna(method="ffill").fillna(method="bfill")
|
||||
)
|
||||
|
||||
data_filled["week_start"] = data_filled["week"].apply(
|
||||
lambda x: bs.FIRST_WEEK_START + datetime.timedelta(days=(x - 1) * 7)
|
||||
)
|
||||
|
||||
if train_df is not None:
|
||||
data_filled = data_filled.loc[data_filled["week_start"] > forecast_creation_time].copy()
|
||||
|
||||
return data_filled
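`oj_preprocess` relies on `df_from_cartesian_product` imported from `utils`, whose implementation is not shown in this diff. A sketch under the assumption that it builds one row per combination of the input lists:

```python
import itertools
import pandas as pd

def df_from_cartesian_product(d):
    # Assumed behavior of the utils helper: one row per combination of
    # the input lists; the real implementation may differ.
    rows = list(itertools.product(*d.values()))
    return pd.DataFrame(rows, columns=list(d.keys()))

grid = df_from_cartesian_product({"store": [2, 5], "brand": [1, 2, 3]})
print(len(grid))  # 6 rows: 2 stores x 3 brands
```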
|
||||
|
||||
|
||||
def make_features(
|
||||
pred_round, train_dir, lags, window_size, offset, used_columns, store_list, brand_list,
|
||||
):
|
||||
"""Create a dataframe of the input features.
|
||||
|
||||
Args:
|
||||
pred_round (Integer): Prediction round
|
||||
train_dir (String): Path of the training data directory
|
||||
lags (Numpy Array): Numpy array including all the lags
|
||||
window_size (Integer): Maximum step for computing the moving average
|
||||
offset (Integer): Length of training data skipped in the retraining
|
||||
used_columns (List): A list of names of columns used in model training
|
||||
(including target variable)
|
||||
store_list (Numpy Array): List of all the store IDs
|
||||
brand_list (Numpy Array): List of all the brand IDs
|
||||
|
||||
Returns:
|
||||
features (Dataframe): Dataframe including all the input features and
|
||||
target variable
|
||||
"""
|
||||
# Load training data
|
||||
train_df = pd.read_csv(os.path.join(train_dir, "train_round_" + str(pred_round + 1) + ".csv"))
|
||||
aux_df = pd.read_csv(os.path.join(train_dir, "aux_round_" + str(pred_round + 1) + ".csv"))
|
||||
week_list = range(bs.TRAIN_START_WEEK + offset, bs.TEST_END_WEEK_LIST[pred_round] + 1)
|
||||
|
||||
train_df_preprocessed = oj_preprocess(train_df, aux_df, week_list, store_list, brand_list)
|
||||
|
||||
df_config = {
|
||||
"time_col_name": "week_start",
|
||||
"ts_id_col_names": ["brand", "store"],
|
||||
"target_col_name": "move",
|
||||
"frequency": "W",
|
||||
"time_format": "%Y-%m-%d",
|
||||
}
|
||||
|
||||
temporal_featurizer = TemporalFeaturizer(df_config=df_config, feature_list=["month_of_year", "week_of_month"])
|
||||
|
||||
popularity_featurizer = PopularityFeaturizer(
|
||||
df_config=df_config,
|
||||
id_col_name="brand",
|
||||
data_format="wide",
|
||||
feature_col_name="price",
|
||||
wide_col_names=[
|
||||
"price1",
|
||||
"price2",
|
||||
"price3",
|
||||
"price4",
|
||||
"price5",
|
||||
"price6",
|
||||
"price7",
|
||||
"price8",
|
||||
"price9",
|
||||
"price10",
|
||||
"price11",
|
||||
],
|
||||
output_col_name="price_ratio",
|
||||
return_feature_col=True,
|
||||
)
|
||||
|
||||
lag_featurizer = LagFeaturizer(df_config=df_config, input_col_names="move", lags=lags, future_value_available=True,)
|
||||
moving_average_featurizer = RollingWindowFeaturizer(
|
||||
df_config=df_config,
|
||||
input_col_names="move",
|
||||
window_size=window_size,
|
||||
window_args={"min_periods": 1, "center": False},
|
||||
future_value_available=True,
|
||||
rolling_gap=2,
|
||||
)
|
||||
|
||||
feature_engineering_pipeline = Pipeline(
|
||||
[
|
||||
("temporal", temporal_featurizer),
|
||||
("popularity", popularity_featurizer),
|
||||
("lag", lag_featurizer),
|
||||
("moving_average", moving_average_featurizer),
|
||||
]
|
||||
)
|
||||
|
||||
features = feature_engineering_pipeline.transform(train_df_preprocessed)
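The featurizers above are scikit-learn-style transformers chained in a `Pipeline`; because they are stateless, `transform` can be called without fitting. A minimal sketch of the same pattern with a toy transformer (the class name and column are invented for illustration):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class AddConstantFeature(BaseEstimator, TransformerMixin):
    # Toy stateless featurizer standing in for the repository's
    # featurizer classes; it just appends a constant column.
    def __init__(self, value=1):
        self.value = value

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X["const"] = self.value
        return X

pipe = Pipeline([("const", AddConstantFeature(7))])
out = pipe.transform(pd.DataFrame({"move": [1, 2]}))
print(out["const"].tolist())  # [7, 7]
```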
|
||||
|
||||
# Temporary code for result verification
|
||||
features.rename(
|
||||
mapper={
|
||||
"move_lag_2": "move_lag2",
|
||||
"move_lag_3": "move_lag3",
|
||||
"move_lag_4": "move_lag4",
|
||||
"move_lag_5": "move_lag5",
|
||||
"move_lag_6": "move_lag6",
|
||||
"move_lag_7": "move_lag7",
|
||||
"move_lag_8": "move_lag8",
|
||||
"move_lag_9": "move_lag9",
|
||||
"move_lag_10": "move_lag10",
|
||||
"move_lag_11": "move_lag11",
|
||||
"move_lag_12": "move_lag12",
|
||||
"move_lag_13": "move_lag13",
|
||||
"move_lag_14": "move_lag14",
|
||||
"move_lag_15": "move_lag15",
|
||||
"move_lag_16": "move_lag16",
|
||||
"move_lag_17": "move_lag17",
|
||||
"move_lag_18": "move_lag18",
|
||||
"move_lag_19": "move_lag19",
|
||||
"month_of_year": "month",
|
||||
},
|
||||
axis=1,
|
||||
inplace=True,
|
||||
)
|
||||
features = features[
|
||||
[
|
||||
"store",
|
||||
"brand",
|
||||
"week",
|
||||
"week_of_month",
|
||||
"month",
|
||||
"deal",
|
||||
"feat",
|
||||
"move",
|
||||
"price",
|
||||
"price_ratio",
|
||||
"move_lag2",
|
||||
"move_lag3",
|
||||
"move_lag4",
|
||||
"move_lag5",
|
||||
"move_lag6",
|
||||
"move_lag7",
|
||||
"move_lag8",
|
||||
"move_lag9",
|
||||
"move_lag10",
|
||||
"move_lag11",
|
||||
"move_lag12",
|
||||
"move_lag13",
|
||||
"move_lag14",
|
||||
"move_lag15",
|
||||
"move_lag16",
|
||||
"move_lag17",
|
||||
"move_lag18",
|
||||
"move_lag19",
|
||||
"move_mean",
|
||||
]
|
||||
]
|
||||
|
||||
return features
|
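The featurizers above are composed with a scikit-learn `Pipeline`, so each one follows the fit/transform transformer contract. A minimal sketch of that pattern with a toy stand-in transformer (`SimpleLagFeaturizer` is hypothetical, not part of the repo):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class SimpleLagFeaturizer(BaseEstimator, TransformerMixin):
    """Toy stand-in for LagFeaturizer: adds shifted copies of one column."""

    def __init__(self, col, lags):
        self.col = col
        self.lags = lags

    def fit(self, X, y=None):
        # Stateless featurizer: nothing to learn.
        return self

    def transform(self, X):
        X = X.copy()
        for lag in self.lags:
            X[f"{self.col}_lag_{lag}"] = X[self.col].shift(lag)
        return X


df = pd.DataFrame({"move": [10, 12, 11, 13, 15]})
pipe = Pipeline([("lag", SimpleLagFeaturizer("move", [1, 2]))])
out = pipe.fit_transform(df)
# out has columns move, move_lag_1, move_lag_2
```

Because every featurizer exposes `fit`/`transform`, the real pipeline can chain temporal, popularity, lag, and rolling-window steps the same way.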
@@ -0,0 +1,137 @@
# coding: utf-8

# Train and score a boosted decision tree model using [LightGBM Python package](https://github.com/Microsoft/LightGBM) from Microsoft,
# which is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms.

import os
import sys
import argparse
import numpy as np
import pandas as pd
import lightgbm as lgb

import warnings

warnings.filterwarnings("ignore")

# Append TSPerf path to sys.path
tsperf_dir = os.getcwd()
if tsperf_dir not in sys.path:
    sys.path.append(tsperf_dir)

from make_features import make_features
import retail_sales.OrangeJuice_Pt_3Weeks_Weekly.common.benchmark_settings as bs


def make_predictions(df, model):
    """Predict sales with the trained GBM model.

    Args:
        df (Dataframe): Dataframe including all needed features
        model (Model): Trained GBM model

    Returns:
        Dataframe including the predicted sales of every store-brand
    """
    predictions = pd.DataFrame({"move": model.predict(df.drop("move", axis=1))})
    predictions["move"] = predictions["move"].apply(lambda x: round(x))
    return pd.concat([df[["brand", "store", "week"]].reset_index(drop=True), predictions], axis=1)


if __name__ == "__main__":
    # Parse input arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--seed", type=int, dest="seed", default=1, help="Random seed of GBM model")
    parser.add_argument("--num-leaves", type=int, dest="num_leaves", default=124, help="# of leaves of the tree")
    parser.add_argument(
        "--min-data-in-leaf", type=int, dest="min_data_in_leaf", default=340, help="minimum # of samples in each leaf"
    )
    parser.add_argument("--learning-rate", type=float, dest="learning_rate", default=0.1, help="learning rate")
    parser.add_argument(
        "--feature-fraction",
        type=float,
        dest="feature_fraction",
        default=0.65,
        help="ratio of features used in each iteration",
    )
    parser.add_argument(
        "--bagging-fraction",
        type=float,
        dest="bagging_fraction",
        default=0.87,
        help="ratio of samples used in each iteration",
    )
    parser.add_argument("--bagging-freq", type=int, dest="bagging_freq", default=19, help="bagging frequency")
    parser.add_argument("--max-rounds", type=int, dest="max_rounds", default=940, help="# of boosting iterations")
    parser.add_argument("--max-lag", type=int, dest="max_lag", default=19, help="max lag of unit sales")
    parser.add_argument(
        "--window-size", type=int, dest="window_size", default=40, help="window size of moving average of unit sales"
    )
    args = parser.parse_args()
    print(args)

    # Data paths
    DATA_DIR = os.path.join(tsperf_dir, "retail_sales", "OrangeJuice_Pt_3Weeks_Weekly", "data")
    SUBMISSION_DIR = os.path.join(tsperf_dir, "retail_sales", "OrangeJuice_Pt_3Weeks_Weekly", "submissions", "LightGBM")
    TRAIN_DIR = os.path.join(DATA_DIR, "train")

    # Parameters of GBM model
    params = {
        "objective": "mape",
        "num_leaves": args.num_leaves,
        "min_data_in_leaf": args.min_data_in_leaf,
        "learning_rate": args.learning_rate,
        "feature_fraction": args.feature_fraction,
        "bagging_fraction": args.bagging_fraction,
        "bagging_freq": args.bagging_freq,
        "num_rounds": args.max_rounds,
        "early_stopping_rounds": 125,
        "num_threads": 4,
        "seed": args.seed,
    }

    # Lags and categorical features
    lags = np.arange(2, args.max_lag + 1)
    used_columns = ["store", "brand", "week", "week_of_month", "month", "deal", "feat", "move", "price", "price_ratio"]
    categ_fea = ["store", "brand", "deal"]

    # Get unique stores and brands
    train_df = pd.read_csv(os.path.join(TRAIN_DIR, "train_round_1.csv"))
    store_list = train_df["store"].unique()
    brand_list = train_df["brand"].unique()

    # Train and predict for all forecast rounds
    pred_all = []
    metric_all = []
    for r in range(bs.NUM_ROUNDS):
        print("---- Round " + str(r + 1) + " ----")
        # Create features
        features = make_features(r, TRAIN_DIR, lags, args.window_size, 0, used_columns, store_list, brand_list)
        train_fea = features[features.week <= bs.TRAIN_END_WEEK_LIST[r]].reset_index(drop=True)

        # Drop rows with NaN values
        train_fea.dropna(inplace=True)

        # Create training set
        dtrain = lgb.Dataset(train_fea.drop("move", axis=1, inplace=False), label=train_fea["move"])
        if r % 3 == 0:
            # Train GBM model
            print("Training model...")
            bst = lgb.train(params, dtrain, valid_sets=[dtrain], categorical_feature=categ_fea, verbose_eval=False)

        # Generate forecasts
        print("Making predictions...")
        test_fea = features[features.week >= bs.TEST_START_WEEK_LIST[r]].reset_index(drop=True)
        pred = make_predictions(test_fea, bst).sort_values(by=["store", "brand", "week"]).reset_index(drop=True)
        # Additional columns required by the submission format
        pred["round"] = r + 1
        pred["weeks_ahead"] = pred["week"] - bs.TRAIN_END_WEEK_LIST[r]
        # Keep the predictions
        pred_all.append(pred)

    # Generate submission
    submission = pd.concat(pred_all, axis=0)
    submission.rename(columns={"move": "prediction"}, inplace=True)
    submission = submission[["round", "store", "brand", "week", "weeks_ahead", "prediction"]]
    filename = "submission_seed_" + str(args.seed) + ".csv"
    submission.to_csv(os.path.join(SUBMISSION_DIR, filename), index=False)
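The script above trains with LightGBM's built-in `"mape"` objective. For reference, a minimal sketch of the MAPE metric itself (the `mape` helper below is illustrative, not taken from the repo):

```python
import numpy as np


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (illustrative helper)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Average of per-point absolute relative errors, scaled to percent.
    return np.mean(np.abs((actual - predicted) / actual)) * 100


result = mape([100, 200], [110, 190])  # ≈ 7.5
```

MAPE is scale-free, which suits sales series whose volumes differ widely across store-brand pairs, but it is undefined when an actual value is zero.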
@@ -0,0 +1,241 @@
# coding: utf-8

# Perform cross validation of a boosted decision tree model on the training data of the 1st forecast round.

import os
import sys
import math
import argparse
import datetime
import itertools
import numpy as np
import pandas as pd
import lightgbm as lgb
from azureml.core import Run
from sklearn.model_selection import train_test_split
from utils import week_of_month, df_from_cartesian_product


def lagged_features(df, lags):
    """Create lagged features based on time series data.

    Args:
        df (Dataframe): Input time series data sorted by time
        lags (List): Lag lengths

    Returns:
        fea (Dataframe): Lagged features
    """
    df_list = []
    for lag in lags:
        df_shifted = df.shift(lag)
        df_shifted.columns = [x + "_lag" + str(lag) for x in df_shifted.columns]
        df_list.append(df_shifted)
    fea = pd.concat(df_list, axis=1)
    return fea
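A quick check of what `lagged_features` produces on toy data (the definition is restated so the snippet runs standalone):

```python
import pandas as pd


def lagged_features(df, lags):
    """Create lagged copies of every column (same logic as the function above)."""
    df_list = []
    for lag in lags:
        df_shifted = df.shift(lag)
        df_shifted.columns = [x + "_lag" + str(lag) for x in df_shifted.columns]
        df_list.append(df_shifted)
    return pd.concat(df_list, axis=1)


df = pd.DataFrame({"move": [10, 12, 11, 13]})
fea = lagged_features(df, [1, 2])
# fea has columns move_lag1 and move_lag2; early rows are NaN where no history exists
```

Rows at the start of each series are NaN because there is no history to shift in; the training code later drops them with `dropna`.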


def moving_averages(df, start_step, window_size=None):
    """Compute averages of every feature over moving time windows.

    Args:
        df (Dataframe): Input features as a dataframe
        start_step (Integer): Starting time step of rolling mean
        window_size (Integer): Window size of rolling mean

    Returns:
        fea (Dataframe): Dataframe consisting of the moving averages
    """
    if window_size is None:  # Use a large window to compute average over all historical data
        window_size = df.shape[0]
    fea = df.shift(start_step).rolling(min_periods=1, center=False, window=window_size).mean()
    fea.columns = fea.columns + "_mean"
    return fea
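The core of `moving_averages` is a shift followed by a rolling mean, so each row averages only values at least `start_step` steps in the past. A standalone toy check of that construction:

```python
import pandas as pd

# Shift by start_step=2 first so the window only sees past values,
# then average over a window of 3 with min_periods=1.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
ma = s.shift(2).rolling(window=3, min_periods=1, center=False).mean()
# ma: [NaN, NaN, 1.0, 1.5, 2.0]
```

`min_periods=1` lets early windows that are only partially filled still produce a value instead of NaN.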


def combine_features(df, lag_fea, lags, window_size, used_columns):
    """Combine different features for a certain store-brand.

    Args:
        df (Dataframe): Time series data of a certain store-brand
        lag_fea (List): A list of column names for creating lagged features
        lags (Numpy Array): Numpy array including all the lags
        window_size (Integer): Window size of rolling mean
        used_columns (List): A list of names of columns used in model training (including target variable)

    Returns:
        fea_all (Dataframe): Dataframe including all features for the specific store-brand
    """
    lagged_fea = lagged_features(df[lag_fea], lags)
    moving_avg = moving_averages(df[lag_fea], 2, window_size)
    fea_all = pd.concat([df[used_columns], lagged_fea, moving_avg], axis=1)
    return fea_all


def make_predictions(df, model):
    """Predict sales with the trained GBM model.

    Args:
        df (Dataframe): Dataframe including all needed features
        model (Model): Trained GBM model

    Returns:
        Dataframe including the predicted sales of a certain store-brand
    """
    predictions = pd.DataFrame({"move": model.predict(df.drop("move", axis=1))})
    predictions["move"] = predictions["move"].apply(lambda x: round(x))
    return pd.concat([df[["brand", "store", "week"]].reset_index(drop=True), predictions], axis=1)


if __name__ == "__main__":
    # Parse input arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-folder", type=str, dest="data_folder", default=".", help="data folder mounting point")
    parser.add_argument("--num-leaves", type=int, dest="num_leaves", default=64, help="# of leaves of the tree")
    parser.add_argument(
        "--min-data-in-leaf", type=int, dest="min_data_in_leaf", default=50, help="minimum # of samples in each leaf"
    )
    parser.add_argument("--learning-rate", type=float, dest="learning_rate", default=0.001, help="learning rate")
    parser.add_argument(
        "--feature-fraction",
        type=float,
        dest="feature_fraction",
        default=1.0,
        help="ratio of features used in each iteration",
    )
    parser.add_argument(
        "--bagging-fraction",
        type=float,
        dest="bagging_fraction",
        default=1.0,
        help="ratio of samples used in each iteration",
    )
    parser.add_argument("--bagging-freq", type=int, dest="bagging_freq", default=1, help="bagging frequency")
    parser.add_argument("--max-rounds", type=int, dest="max_rounds", default=400, help="# of boosting iterations")
    parser.add_argument("--max-lag", type=int, dest="max_lag", default=10, help="max lag of unit sales")
    parser.add_argument(
        "--window-size", type=int, dest="window_size", default=10, help="window size of moving average of unit sales"
    )
    args = parser.parse_args()
    args.feature_fraction = round(args.feature_fraction, 2)
    args.bagging_fraction = round(args.bagging_fraction, 2)
    print(args)

    # Start an Azure ML run
    run = Run.get_context()

    # Data paths
    DATA_DIR = args.data_folder
    TRAIN_DIR = os.path.join(DATA_DIR, "train")

    # Data and forecast problem parameters
    TRAIN_START_WEEK = 40
    TRAIN_END_WEEK_LIST = list(range(135, 159, 2))
    TEST_START_WEEK_LIST = list(range(137, 161, 2))
    TEST_END_WEEK_LIST = list(range(138, 162, 2))
    # The start datetime of the first week in the record
    FIRST_WEEK_START = pd.to_datetime("1989-09-14 00:00:00")

    # Parameters of GBM model
    params = {
        "objective": "mape",
        "num_leaves": args.num_leaves,
        "min_data_in_leaf": args.min_data_in_leaf,
        "learning_rate": args.learning_rate,
        "feature_fraction": args.feature_fraction,
        "bagging_fraction": args.bagging_fraction,
        "bagging_freq": args.bagging_freq,
        "num_rounds": args.max_rounds,
        "early_stopping_rounds": 125,
        "num_threads": 16,
    }

    # Lags and used column names
    lags = np.arange(2, args.max_lag + 1)
    used_columns = ["store", "brand", "week", "week_of_month", "month", "deal", "feat", "move", "price", "price_ratio"]
    categ_fea = ["store", "brand", "deal"]

    # Train and validate the model using only the first round data
    r = 0
    print("---- Round " + str(r + 1) + " ----")
    # Load training data
    train_df = pd.read_csv(os.path.join(TRAIN_DIR, "train_round_" + str(r + 1) + ".csv"))
    train_df["move"] = train_df["logmove"].apply(lambda x: round(math.exp(x)))
    train_df = train_df[["store", "brand", "week", "move"]]

    # Create a dataframe to hold all necessary data
    store_list = train_df["store"].unique()
    brand_list = train_df["brand"].unique()
    week_list = range(TRAIN_START_WEEK, TEST_END_WEEK_LIST[r] + 1)
    d = {"store": store_list, "brand": brand_list, "week": week_list}
    data_grid = df_from_cartesian_product(d)
    data_filled = pd.merge(data_grid, train_df, how="left", on=["store", "brand", "week"])

    # Get future price, deal, and advertisement info
    aux_df = pd.read_csv(os.path.join(TRAIN_DIR, "aux_round_" + str(r + 1) + ".csv"))
    data_filled = pd.merge(data_filled, aux_df, how="left", on=["store", "brand", "week"])

    # Create relative price feature
    price_cols = [
        "price1",
        "price2",
        "price3",
        "price4",
        "price5",
        "price6",
        "price7",
        "price8",
        "price9",
        "price10",
        "price11",
    ]
    data_filled["price"] = data_filled.apply(lambda x: x.loc["price" + str(int(x.loc["brand"]))], axis=1)
    data_filled["avg_price"] = data_filled[price_cols].sum(axis=1).apply(lambda x: x / len(price_cols))
    data_filled["price_ratio"] = data_filled["price"] / data_filled["avg_price"]
    data_filled.drop(price_cols, axis=1, inplace=True)

    # Fill missing values
    data_filled = data_filled.groupby(["store", "brand"]).apply(
        lambda x: x.fillna(method="ffill").fillna(method="bfill")
    )

    # Create datetime features
    data_filled["week_start"] = data_filled["week"].apply(
        lambda x: FIRST_WEEK_START + datetime.timedelta(days=(x - 1) * 7)
    )
    data_filled["year"] = data_filled["week_start"].apply(lambda x: x.year)
    data_filled["month"] = data_filled["week_start"].apply(lambda x: x.month)
    data_filled["week_of_month"] = data_filled["week_start"].apply(lambda x: week_of_month(x))
    data_filled["day"] = data_filled["week_start"].apply(lambda x: x.day)
    data_filled.drop("week_start", axis=1, inplace=True)

    # Create other features (lagged features, moving averages, etc.)
    features = data_filled.groupby(["store", "brand"]).apply(
        lambda x: combine_features(x, ["move"], lags, args.window_size, used_columns)
    )
    train_fea = features[features.week <= TRAIN_END_WEEK_LIST[r]].reset_index(drop=True)

    # Drop rows with NaN values
    train_fea.dropna(inplace=True)

    # Model training and validation
    # Create a training/validation split
    train_fea, valid_fea, train_label, valid_label = train_test_split(
        train_fea.drop("move", axis=1, inplace=False), train_fea["move"], test_size=0.05, random_state=1
    )
    dtrain = lgb.Dataset(train_fea, train_label)
    dvalid = lgb.Dataset(valid_fea, valid_label)
    # A dictionary to record training results
    evals_result = {}
    # Train GBM model
    bst = lgb.train(
        params, dtrain, valid_sets=[dtrain, dvalid], categorical_feature=categ_fea, evals_result=evals_result
    )
    # Get final training loss & validation loss
    train_loss = evals_result["training"]["mape"][-1]
    valid_loss = evals_result["valid_1"]["mape"][-1]
    print("Final training loss is {}".format(train_loss))
    print("Final validation loss is {}".format(valid_loss))

    # Log the validation loss/MAPE
    run.log("MAPE", float(valid_loss) * 100)
@@ -1,9 +1,10 @@
# coding: utf-8

# Utility functions for building the boosted decision tree model.

import pandas as pd


def week_of_month(dt):
    """Get the week of the month for the specified date.

@@ -12,15 +13,17 @@ def week_of_month(dt):

    Returns:
        wom (Integer): Week of the month of the input date
    """
    from math import ceil

    first_day = dt.replace(day=1)
    dom = dt.day
    adjusted_dom = dom + first_day.weekday()
    wom = int(ceil(adjusted_dom / 7.0))
    return wom
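A small sanity check of `week_of_month` on known dates (definition restated so the snippet is self-contained):

```python
from math import ceil
from datetime import date


def week_of_month(dt):
    """Week of the month, counting partial first weeks (same logic as above)."""
    first_day = dt.replace(day=1)
    adjusted_dom = dt.day + first_day.weekday()
    return int(ceil(adjusted_dom / 7.0))


w1 = week_of_month(date(2020, 1, 1))   # Jan 1, 2020 is a Wednesday -> week 1
w5 = week_of_month(date(2020, 1, 31))  # end of the month -> week 5
```

Adding `first_day.weekday()` shifts the day-of-month so that weeks are aligned to the calendar (Monday-start) rather than to the 1st of the month.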


def df_from_cartesian_product(dict_in):
    """Generate a Pandas dataframe from Cartesian product of lists.

    Args:

@@ -31,7 +34,8 @@ def df_from_cartesian_product(dict_in):
    """
    from collections import OrderedDict
    from itertools import product

    od = OrderedDict(sorted(dict_in.items()))
    cart = list(product(*od.values()))
    df = pd.DataFrame(cart, columns=od.keys())
    return df
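A standalone check of `df_from_cartesian_product` (definition restated): the keys are sorted, so the resulting column order is alphabetical regardless of the input dict's order.

```python
from collections import OrderedDict
from itertools import product
import pandas as pd


def df_from_cartesian_product(dict_in):
    """One row per combination of the input lists (same logic as above)."""
    od = OrderedDict(sorted(dict_in.items()))
    cart = list(product(*od.values()))
    return pd.DataFrame(cart, columns=od.keys())


d = {"store": [1, 2], "brand": ["a", "b"], "week": [40]}
grid = df_from_cartesian_product(d)
# 2 stores x 2 brands x 1 week = 4 rows; columns: brand, store, week
```

The cross-validation script uses this grid to materialize every store-brand-week combination before left-joining the observed sales onto it, so gaps in the raw data become explicit NaN rows.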