Merge branch 'master' of https://github.com/Microsoft/acceleratoRs
This commit is contained in:
Commit e32719a791
@ -19,7 +19,7 @@ Each of accelerators shared in this repo is structured following the project tem

* `Docs` - Related documentation, references, and any generated reports are normally put in this directory.

* An accelerator should be able to run interactively as an R notebook in RStudio.
* An accelerator should be able to run interactively in an IDE that supports R markdown such as [R Tools for Visual Studio (RTVS)](https://docs.microsoft.com/en-us/visualstudio/rtvs/rmarkdown) or RStudio.
* A Makefile is provided by default to generate documents in other formats; alternatively, `rmarkdown::render` can be used for the same purpose.

# Contributing

@ -0,0 +1,44 @@

# R markdown sources and the output targets derived from them.
RMD=$(wildcard *_*.Rmd)

RCD=$(RMD:.Rmd=.R)
HTM=$(RMD:.Rmd=.html)
PDF=$(RMD:.Rmd=.pdf)
ODT=$(RMD:.Rmd=.odt)
DOC=$(RMD:.Rmd=.docx)
MDN=$(RMD:.Rmd=.md)
IPY=$(RMD:.Rmd=.ipynb)

# Extract the pure R code from an R markdown file.
%.R: %.Rmd
	Rscript -e 'knitr::purl("$*.Rmd")'

%.md: %.Rmd
	Rscript -e 'rmarkdown::render("$*.Rmd", "rmarkdown::md_document")'

%.html: %.Rmd
	Rscript -e 'rmarkdown::render("$*.Rmd", "rmarkdown::html_document")'

.PRECIOUS: %.pdf
%.pdf: %.Rmd
	Rscript -e 'rmarkdown::render("$*.Rmd", "rmarkdown::pdf_document")'

# Open the generated PDF in a viewer.
%.view: %.pdf
	evince $^ &

# Convert to a Jupyter notebook via notedown, then clean up the result.
%.ipynb: %.Rmd
	notedown $^ --nomagic > $@
	sh support/fix_ipynb.sh $@

%.docx: %.Rmd
	Rscript -e 'rmarkdown::render("$*.Rmd", "rmarkdown::word_document")'

%.odt: %.Rmd
	Rscript -e 'rmarkdown::render("$*.Rmd", "rmarkdown::odt_document")'

clean:
	rm -f *.docx *.R *.odt *.pdf *.html *.md *.ipynb

realclean: clean
	rm -f *~
	rm -rf _book _site _html data models
	rm -rf app_education_files

@ -0,0 +1,35 @@

# Prerequisites

*Place the prerequisites for running the code here.*

* R >= 3.3.1
* rmarkdown >= 1.3
* AzureSMR >= 0.2.6
* AzureDSVM >= 0.2.0
* keras >= 2.0.6
* ggplot2 >= 2.2.1
* magrittr >= 1.5
* dplyr >= 0.7.1.9000
* readr >= 0.2.2

# Use of template

The analytics code, embedded with step-by-step instructions, is written in R markdown and can be run interactively within the code chunks of the markdown file.

The Makefile in the folder can be used to produce reports in various formats from the R markdown script. Supported output formats include

* R - pure R code,
* md - markdown,
* html - HTML,
* pdf - PDF,
* ipynb - Jupyter notebook,
* docx - Microsoft Word document, and
* odt - OpenDocument document.

To generate output in one of the above formats, simply run

```
make <filename>.<supported format>
```
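
For example, assuming the folder contains a script named `lstm.Rmd` (a hypothetical file name), `make lstm.html` renders it to an HTML report and `make lstm.R` extracts the pure R code.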

The generated files can be removed with `make clean` or `make realclean`.

@ -0,0 +1,419 @@

---
title: "Solar power forecasting with Long Short-Term Memory (LSTM)"
author: "Le Zhang, Data Scientist, Cloud and AI, Microsoft"
date: '`r Sys.Date()`'
output:
  html_notebook: default
---

This accelerator is a reproduction of CNTK tutorial 106 B - using LSTM
for time series forecasting - in R. The original tutorial can be found [here](https://github.com/Microsoft/CNTK/blob/master/Tutorials/CNTK_106B_LSTM_Timeseries_with_IOT_Data.ipynb).

The accelerator mainly demonstrates how one can use the `keras` R interface
together with the CNTK backend to train an LSTM model for solar power forecasting
on an Azure Data Science Virtual Machine (DSVM).

## 1 Introduction

### 1.1 Context

[Solar power forecasting](https://en.wikipedia.org/wiki/Solar_power_forecasting)
is a challenging and important problem. Analyzing historical time-series data of
solar power generation may help predict the total amount of energy produced by
solar panels.

More discussion of solar power forecasting can be found on the Wikipedia page. The
model illustrated in this accelerator is a simplified one, intended merely to demonstrate
how an R-based LSTM model can be trained on an Azure DSVM.

### 1.2 Overall introduction

An overall introduction to the modeling techniques, training framework, and cloud
computing resource management can be found in the accompanying markdown file.

## 2 Step-by-step tutorial

### 2.1 Set up

Load the following R packages for this tutorial.

```{r}
library(keras)
library(magrittr)
library(dplyr)
library(readr)
library(ggplot2)
```

### 2.2 Data pre-processing

#### 2.2.1 Data downloading

The original data set is hosted [here](https://guschmueds.blob.core.windows.net/datasets/solar.csv).

For reproducibility, the data is downloaded onto the local system.

```{r}
data_url <- "https://guschmueds.blob.core.windows.net/datasets/solar.csv"

data_dir <- tempdir()
data_file <- tempfile(tmpdir=data_dir, fileext=".csv")

# Download the data.

download.file(url=data_url,
              destfile=data_file)
```
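
As a quick sanity check (an addition for illustration, not part of the original tutorial), one can confirm that the download succeeded before reading the file:

```{r}
# Illustrative check: the file should exist and be roughly 1.1 MB,
# per the data sheet in the Data folder.
stopifnot(file.exists(data_file))
file.size(data_file)
```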

```{r}
# Read the data into memory.

df_panel <- read_csv(data_file)
```

#### 2.2.2 Data understanding

The original data set is in the form of

|Time | solar.current | solar.total|
|------------------|----------|-----------|
|2013-12-01 7:00|6.30|1.69|
|2013-12-01 7:30|44.30|11.36|
|2013-12-01 8:00|208.00|67.50|
|...|...|...|
|2016-12-01 12:00|1815.00|5330.00|

The first column is the time stamp at which the solar panel reading was recorded;
readings are taken once every half hour. The second and third columns are the current
power at the time of reading and the cumulative total reading so far on that day.

The data can be explored interactively with the following code.

```{r}
# Take a glimpse of the data.

glimpse(df_panel)

ggplot(df_panel, aes(x=solar.current)) +
  geom_histogram()
```
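
A further illustrative check (not in the original tutorial) is the time span covered by the readings, which should run from 2013-12-01 to 2016-12-01:

```{r}
# Illustrative check: the recorded period of the data set.
range(df_panel$time)
```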

#### 2.2.3 Data re-formatting

The objective is to predict the maximum value of the total power reading on a day,
using a sequence of historical solar power readings.

Since the number of solar panel power readings may differ from day to day, a fixed
length of 14 is used for each day. That is, on a daily basis, a univariate
time series of 14 elements (14 readings of solar panel power) is
formed as input data, in order to predict the maximum value of total power
generation on that day.

Following this principle, the data of a day is re-formatted as

|Time series | Predicted target|
|-------------------|-----------|
|1.7, 11.4|10300|
|1.7, 11.4, 67.5|10300|
|1.7, 11.4, 67.5, 250.5|10300|
|1.7, 11.4, 67.5, 250.5, 573.5|10300|
|...|...|

For training purposes, the time stamp is not necessary, so the re-formed data are
aggregated as a set of sequences.

The following code accomplishes the processing task, which also includes
sub-tasks such as normalization, computing daily maxima, and grouping.

1. Normalize the data, as LSTM does not perform well on unscaled data.

```{r}
# Functions for 0-1 normalization and its inverse.

normalizeData <- function(data) {
  (data - min(data)) / (max(data) - min(data))
}

denormalizeData <- function(data, max, min) {
  data * (max - min) + min
}

df_panel_norm <-
  mutate(df_panel, solar.current=normalizeData(solar.current)) %>%
  mutate(solar.total=normalizeData(solar.total)) %T>%
  print()

# Save max and min values for later reference, to recover the original scale
# when necessary.

normal_ref <- list(current_max=max(df_panel$solar.current),
                   current_min=min(df_panel$solar.current),
                   total_max=max(df_panel$solar.total),
                   total_min=min(df_panel$solar.total))
```
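
A quick illustration (added here, not part of the original) of how the two helpers behave:

```{r}
# Illustrative check: normalizeData maps onto [0, 1], and denormalizeData
# inverts it given the saved max and min.
v <- c(0, 5, 10)
normalizeData(v)                                    # 0.0 0.5 1.0
denormalizeData(normalizeData(v), max(v), min(v))   # 0 5 10
```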

2. Group the data by day.

```{r}
df_panel_group <-
  mutate(df_panel_norm, date = as.Date(time)) %>%
  group_by(date) %>%
  arrange(date) %T>%
  print()
```

3. Append the columns "solar.current.max" and "solar.total.max".

```{r}
# Compute the max values of current and total power generation for each day.

df_panel_current_max <-
  summarise(df_panel_group, solar.current.max = max(solar.current)) %T>%
  print()

df_panel_total_max <-
  summarise(df_panel_group, solar.total.max = max(solar.total)) %T>%
  print()

# Append the max values of power generation.

df_panel_max <-
  df_panel_current_max %>%
  mutate(solar.total.max=df_panel_total_max$solar.total.max) %>%
  mutate(day_id=row_number())

df_panel_group$solar.current.max <- df_panel_max$solar.current.max[match(df_panel_group$date, df_panel_max$date)]
df_panel_group$solar.total.max <- df_panel_max$solar.total.max[match(df_panel_group$date, df_panel_max$date)]

# Also carry the day index over; it is needed later for sequence generation
# and data splitting (this line is missing from the original listing).
df_panel_group$day_id <- df_panel_max$day_id[match(df_panel_group$date, df_panel_max$date)]

df_panel_all <-
  df_panel_group %T>%
  print()
```

4. Generate the time series sequences for each day.

NOTE: **According to the original [CNTK tutorial](https://github.com/Microsoft/CNTK/blob/master/Tutorials/CNTK_106B_LSTM_Timeseries_with_IOT_Data.ipynb), days with fewer than 8 readings are
omitted from the data, and days with more than 14 readings are truncated to exactly 14.**

```{r}
# Find the days that have more than 8 readings.

day_more_than_8 <-
  summarise(df_panel_all, group_size = n()) %>%
  filter(group_size > 8) %>%
  select(date)

# Keep those days with more than 8 readings, and truncate the number of
# readings to be at most 14.

df_panel_seq <-
  df_panel_all[which(as.Date(df_panel_all$date) %in% as.Date(day_more_than_8$date)), ] %>%
  filter(row_number() <= 14) %>%
  mutate(ndata = n()) %T>%
  print()
```
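
As an illustrative check (not in the original tutorial), the number of retained readings per day can be inspected after filtering and truncation:

```{r}
# Illustrative check: every retained day should have between 9 and 14 readings.
range(df_panel_seq$ndata)
```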

According to the data format, for each day the first sequence is composed of
the initial two readings, and each subsequent sequence is generated by appending
the power reading at the next time step. The process iterates until all the readings
on that day form the last sequence.

The function to generate the sequences is as follows.

```{r}
genSequence <- function(data) {
  if (!"day_id" %in% names(data))
    stop("Input data frame does not have Day ID (day_id) column!")

  # Since a day has at most 14 readings and the first sequence already
  # contains 2 readings, each day yields at most 13 sequences.
  # NOTE: unlike the approach in the official CNTK tutorial, here the meter
  # readings are padded with 0s, because the keras interface does not take a
  # list as input.

  date <- as.character(0)
  x <- array(0, dim=c(14 * n_groups(data), 14, 1))
  y <- array(0, dim=c(14 * n_groups(data), 1))

  index <- 1

  cat("Generating data ...")

  for (j in unique(data$day_id)) {
    readings <- select(filter(data, day_id == j),
                       solar.total,
                       solar.total.max,
                       date)

    readings_date <- readings$date
    readings_x <- as.vector(readings$solar.total)
    readings_y <- as.vector(readings$solar.total.max)

    reading_date <- unique(readings_date)
    reading_y <- unique(readings_y)

    for (i in 2:nrow(readings)) {
      x[index, 1:i, 1] <- readings_x[1:i]
      y[index, 1] <- reading_y
      date[index] <- as.character(reading_date)

      # day_id is different from the group index, so a separate iterator
      # is used.

      index <- index + 1
    }
  }

  return(list(x=array(x[1:(index - 1), 1:14, 1], dim=c(index - 1, 14, 1)),
              y=y[1:(index - 1)],
              date=date[1:(index - 1)]))
}
```

#### 2.2.4 Data splitting

The whole data set is split into training, validation, and test sets,
sampled in the following pattern:

|Day1|Day2|...|DayN-1|DayN|DayN+1|DayN+2|...|Day2N-1|Day2N|
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
|Train|Train|...|Val|Test|Train|Train|...|Val|Test|

To follow the original tutorial, in every 10 days of the original data set
the training, validation, and test data are sampled as 8 sequential days,
1 day, and 1 day, respectively.

```{r}
df_panel_seq_sample <-
  mutate(df_panel_seq, sample_index = day_id %% 10) %T>%
  print()

df_train <- filter(df_panel_seq_sample, sample_index <= 8 & sample_index > 0)
df_val <- filter(df_panel_seq_sample, sample_index == 9)
df_test <- filter(df_panel_seq_sample, sample_index == 0)
```

The data sets are then processed with the `genSequence` function to convert the
time series data into the required format.

```{r}
seq_train <- genSequence(df_train)
seq_val <- genSequence(df_val)
seq_test <- genSequence(df_test)

x_train <- seq_train$x
y_train <- seq_train$y

x_val <- seq_val$x
y_val <- seq_val$y

x_test <- seq_test$x
y_test <- seq_test$y
date_test <- as.Date(seq_test$date)
```
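
As an illustrative check (added, not part of the original tutorial), the generated arrays can be inspected to confirm the expected shape of (number of sequences) x 14 time steps x 1 feature:

```{r}
# Illustrative check: x_train should be a 3-D array with 14 time steps and
# 1 feature, and y_train should hold one target per sequence.
dim(x_train)
length(y_train)
```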

### 2.3 Model definition and creation

The overall structure of the LSTM neural network is shown below.

![](../Docs/Figs/lstm.png)

There are 14 LSTM cells, each taking as input a solar power reading from a
sequence. To mitigate overfitting, a dropout layer with a 0.2 dropout rate is added. The final layer is
a single neuron densely connected to the dropout layer, and its output is the predicted solar power value.

#### 2.3.1 Model definition

In Keras, one common type of neural network model is built by stacking basic layers;
this type of model starts with the `keras_model_sequential()` function. Following
the model description, the R code to define the model is

```{r}
# The neural network topology is the same as that in the original CNTK tutorial.

model <-
  keras_model_sequential() %>%
  layer_lstm(units=14,
             input_shape=c(14, 1)) %>%
  layer_dropout(rate=0.2) %>%
  layer_dense(units=1)
```

The defined model is then compiled, with the loss function (mean squared error)
and the optimization method (Adam) specified.

```{r}
model %>% compile(loss='mse', optimizer='adam')
```
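
If an additional, more interpretable metric is desired during training, `compile` also accepts a `metrics` argument; a possible variant (not used in the original tutorial):

```{r eval=FALSE}
# Optional variant (for illustration only): also track mean absolute error.
model %>% compile(loss='mse', optimizer='adam', metrics='mae')
```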

After compilation, basic information about the model can be displayed with `summary`.

```{r}
summary(model)
```

#### 2.3.2 Model training

After model definition and data pre-processing, the model is trained on the
training set. The number of epochs and the batch size can be varied to
fine-tune the model performance.

```{r}
# Larger epoch counts and batch sizes lead to longer training times.

epoch_size <- 200
batch_size <- 1

# The validation set is used to monitor the model during training.

model %>% fit(x_train,
              y_train,
              validation_data=list(x_val, y_val),
              batch_size=batch_size,
              epochs=epoch_size)
```
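
As a sketch of a possible refinement (not part of the original tutorial), a keras callback can stop training early once the validation loss stops improving, instead of always running the full 200 epochs:

```{r eval=FALSE}
# Optional sketch: stop when validation loss has not improved for 10 epochs.
model %>% fit(x_train,
              y_train,
              validation_data=list(x_val, y_val),
              batch_size=batch_size,
              epochs=epoch_size,
              callbacks=list(callback_early_stopping(monitor="val_loss",
                                                     patience=10)))
```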

#### 2.3.3 Model scoring

After training, the model can be scored on the test set with the loss metric.

```{r}
# Evaluation on the test data.

score <-
  evaluate(model, x_test, y_test) %T>%
  print()
```
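
Since the loss is the MSE on the normalized scale, an illustrative addition (not in the original tutorial) is to express the error in the original power units by undoing the 0-1 normalization:

```{r eval=FALSE}
# Illustrative addition: RMSE back on the original scale of solar.total.
rmse_norm <- sqrt(unlist(score)[[1]])
rmse_norm * (normal_ref$total_max - normal_ref$total_min)
```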

#### 2.3.4 Result visualization

```{r eval=FALSE}
# Use the model for prediction.

y_pred <- predict(model,
                  x_test)

# Recover the original scale of the data.

y_pred <- denormalizeData(y_pred, normal_ref$total_max, normal_ref$total_min)
y_test <- denormalizeData(y_test, normal_ref$total_max, normal_ref$total_min)

# Plot the comparison results.

df_plot <- data.frame(
  date=date_test,
  index=1:length(y_test),
  true=y_test,
  pred=y_pred)

ggplot(df_plot, aes(x=date)) +
  geom_line(aes(y=true, color="True")) +
  geom_line(aes(y=pred, color="Pred")) +
  theme_bw() +
  ggtitle("Solar power forecasting") +
  xlab("Date") +
  ylab("Max of total solar power")
```

The result comparing predicted and ground-truth power values is shown below.
![](../Docs/Figs/result.png)

The plot shows that the predictions align well with the true values.
There are several ways to improve the model, such as

* increasing the number of epochs,
* further preprocessing the training data to smooth out missing values, and
* using a more complex network topology (see the sketch below).
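
As one possible instance of the last suggestion, here is a sketch (an illustration, not from the original tutorial) that stacks a second LSTM layer; `return_sequences=TRUE` makes the first layer emit its full output sequence for the second layer to consume:

```{r eval=FALSE}
# A deeper variant of the network: two stacked LSTM layers with dropout.
model_deep <-
  keras_model_sequential() %>%
  layer_lstm(units=14,
             input_shape=c(14, 1),
             return_sequences=TRUE) %>%
  layer_dropout(rate=0.2) %>%
  layer_lstm(units=14) %>%
  layer_dropout(rate=0.2) %>%
  layer_dense(units=1)
```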
@ -0,0 +1,264 @@

---
title: "Solar power forecasting with Long Short-Term Memory (LSTM)"
author: "Le Zhang, Data Scientist, Cloud and AI, Microsoft"
date: '`r Sys.Date()`'
output:
  html_notebook: default
---

This accelerator is a reproduction of CNTK tutorial 106 B - using LSTM
for time series forecasting - in R. The original tutorial can be found [here](https://github.com/Microsoft/CNTK/blob/master/Tutorials/CNTK_106B_LSTM_Timeseries_with_IOT_Data.ipynb).

The accelerator mainly demonstrates how one can use the `keras` R interface
together with the Cognitive Toolkit backend to train an LSTM model for solar power forecasting
on an Azure Data Science Virtual Machine (DSVM).

## 1 Introduction

### 1.1 Context

[Solar power forecasting](https://en.wikipedia.org/wiki/Solar_power_forecasting)
is a challenging and important problem. Analyzing historical time-series data of
solar power generation may help predict the total amount of energy produced by
solar panels.

More discussion of solar power forecasting can be found on the Wikipedia page. The
model illustrated in this accelerator is a simplified one, intended merely to demonstrate
how an R-based LSTM model can be trained with the Cognitive Toolkit backend on an Azure DSVM.

### 1.2 LSTM

LSTM is a type of recurrent neural network, distinguished by its capability to
model long-term dependencies. It has been applied in practice in many fields, such
as natural language processing (NLP), action recognition, and time series prediction.

While a comprehensive discussion of LSTM is not the focus of this accelerator,
more information can be found in [Chris Olah's blog](http://colah.github.io/posts/2015-08-Understanding-LSTMs/).

### 1.3 Cognitive Toolkit and Keras

#### 1.3.1 Cognitive Toolkit

[Microsoft Cognitive Toolkit (previously known as CNTK)](https://www.microsoft.com/en-us/cognitive-toolkit/) is a free, easy-to-use,
open-source, commercial-grade toolkit that trains deep learning algorithms
to learn like the human brain.

Its features include

* highly optimized built-in components that handle multi-dimensional data
from different language environments, support various types of deep learning
algorithms, allow user-defined core components on the GPU, etc.,
* efficient resource usage that enables parallelism across multiple GPUs/machines,
* easy expression of neural networks with full APIs for Python, C++, and BrainScript, and
* training support on Azure.

#### 1.3.2 Keras

[Keras](https://keras.io/) is a high-level neural networks API that is capable of running on various
backends such as Cognitive Toolkit, TensorFlow, and Theano. It makes experimenting with
deep learning neural networks, from idea to result, easier than ever before.

#### 1.3.3 Cognitive Toolkit + Keras in R

Since version 2.0, Cognitive Toolkit supports Keras.

Cognitive Toolkit does not yet support R directly. However, by using the [Keras R interface](https://rstudio.github.io/keras/), one can train neural network
models through the Keras API with the Cognitive Toolkit backend.
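
A minimal sketch (assuming the conventions of the keras R package) of how to request the Cognitive Toolkit backend explicitly:

```{r eval=FALSE}
# Request the CNTK backend before any model-building function is first used.
library(keras)
use_backend("cntk")
backend()   # should report that the CNTK backend is in use
```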

## 2 Cloud resource deployment

The Azure cloud platform offers a variety of resources for elastically running
scalable analytical jobs. In particular, VMs and VM clusters equipped with
high-performance computing engines make it convenient for researchers and
developers to prototype and validate models.

The following sections demonstrate how to train an LSTM model on a DSVM with
Cognitive Toolkit and the Keras R interface.

NOTE: **The script demonstrating Cognitive Toolkit + Keras can also be run
in a local environment, but one then needs to manually download and install
Cognitive Toolkit, Keras, the Keras R package, the CUDA Toolkit (if a GPU device is available and GPU acceleration
is wanted), and their dependencies.**

### 2.1 Data Science Virtual Machine (DSVM)

The [Azure DSVM](https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-provision-vm) is a curated VM image that comes pre-installed with a
rich set of commonly used data science and AI development tools, such as R/Python
environments, Cognitive Toolkit, SQL Server, etc.

The DSVM is a desirable workspace in which to experiment, prototype, and productize
data analytics and AI solutions. The elasticity of the offering also keeps costs
under control, making it more economical than on-premise servers.

### 2.2 Configuration and setup

Both Cognitive Toolkit and Keras are pre-installed on the DSVM. However, at the time
of writing, the keras R package and its dependencies are not pre-installed, so
preliminary installation and configuration are required.
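
A minimal sketch of what such an installation might look like (an assumption for illustration; at the time of writing the keras R package was distributed from GitHub, and the Cognitive Toolkit backend itself must be installed separately):

```{r eval=FALSE}
# Install the keras R package from GitHub, then its Python-side dependencies.
install.packages("devtools")
devtools::install_github("rstudio/keras")
library(keras)
install_keras()
```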

### 2.3 Resource deployment with `AzureDSVM`

[`AzureDSVM`](https://github.com/Azure/AzureDSVM) is an R package that allows R users to interact directly with an Azure
account to administer DSVM instances.

To fire up a DSVM, one just needs to specify information such as the DSVM name, user
name, operating system, VM size, etc. For example, the following script fires up
an Ubuntu DSVM of size D1_v2, located in Southeast Asia. NC-series VMs, which are
equipped with GPU devices, are available in certain regions such as East US,
West Europe, etc. Compared to D-series VMs, NC-series VMs have higher pricing rates, so
there is a trade-off in choosing an appropriate machine for the training work.

NOTE: **Keras and the Keras R interface promise seamless utilization of a GPU device
that is properly configured in the VM for accelerating deep learning model training.**

```{r}
# Load the packages.

library(AzureSMR)
library(AzureDSVM)
```

```{r}
# Credentials for authentication against the Azure account are preserved in a
# JSON-formatted file, named "config.json", which is located at ~/.azuresmr.

# The credentials needed for authentication include the client ID, tenant ID,
# authentication key, password, and public key.

settingsfile <- getOption("AzureSMR.config")
config <- read.AzureSMR.config()
```

```{r}
# Authentication with the credential information.

asc <- createAzureContext()

with(config,
     setAzureContext(asc, tenantID=tenantID, clientID=clientID, authKey=authKey)
)
azureAuthenticate(asc)
```

```{r}
# Location and resource group name.

dsvm_location <- "southeastasia"
dsvm_rg <- paste0("rg", paste(sample(letters, 3), collapse=""))

# VM size, operating system, and VM name.

dsvm_size <- "Standard_D1_v2"
dsvm_os <- "Ubuntu"
dsvm_name <- paste0("dsvm",
                    paste(sample(letters, 3), collapse=""))

# VM user name, authentication method (password in this case), and login password.

dsvm_username <- "dsvmuser"
dsvm_authen <- "Password"
dsvm_password <- config$PASSWORD
```

```{r eval=FALSE}
# Deploy the DSVM.

deployDSVM(asc,
           resource.group=dsvm_rg,
           location=dsvm_location,
           hostname=dsvm_name,
           username=dsvm_username,
           size=dsvm_size,
           os=dsvm_os,
           authen=dsvm_authen,
           password=dsvm_password,
           mode="Sync")
```

A freshly deployed DSVM does not have the Keras R interface installed and
configured. A post-deployment installation and configuration of the package can
be achieved by adding an extension to the deployed DSVM, which runs a
shell script located at a remote URL.

```{r}
# URL of the shell script and the command to run the script.

dsvm_fileurl <- "https://raw.githubusercontent.com/yueguoguo/Azure-R-Interface/master/demos/demo-5/script.sh"
dsvm_command <- "sudo sh script.sh"
```

```{r eval=FALSE}
# Add the extension to the DSVM.

addExtensionDSVM(asc,
                 location=dsvm_location,
                 resource.group=dsvm_rg,
                 hostname=dsvm_name,
                 os=dsvm_os,
                 fileurl=dsvm_fileurl,
                 command=dsvm_command)
```

### 2.4 Remote access to the DSVM

After a successful deployment and extension, the DSVM can be remotely accessed by

1. RStudio Server - `http://<dsvm_name>.<location>.cloudapp.azure.com:8787`
2. Jupyter Notebook - `https://<dsvm_name>.<location>.cloudapp.azure.com:8000`
3. X2Go client.

NOTE: **It was found that the Keras R interface does not work well in RStudio Server,
owing to an SSL certificate issue. This may be related to the "http" protocol; Jupyter
Notebook, which is served over "https", works well.**

Ideally, typing the following in an R session on the remote DSVM

```{r eval=FALSE}
library(keras)

backend()
```

will show the message "Using CNTK backend...", which means the interface can
detect and load the Cognitive Toolkit backend. If the DSVM is an NC-series one,
the GPU device will be detected and used.

### 2.5 Model building

After all the set-up, the model can be created using the Keras R interface
functions. As the model building follows the original CNTK tutorial, the
introduction and description are not replicated here.

The script for the whole step-by-step tutorial is available [here](../Code/lstm.R).

### 2.6 Run the script on the DSVM

The script can be run on the deployed DSVM in various ways.

1. Jupyter Notebook - access the Jupyter Hub hosted on the DSVM via
`https://<dsvm_name>.<location>.cloudapp.azure.com:8000` and create an R-kernel
notebook to run the script.
2. X2Go client - create a new X2Go session for a remote desktop on the machine.
The script can be copied onto the DSVM via SSH or any SSH-based file
transfer software, and then run in the RStudio desktop version.

NOTE: **It was found that if the script is run in the R console or with Rscript on
the command line, the GPU device is not activated for acceleration, while running
the script in the RStudio IDE does not have this problem.**

## 3 Closing

After the experiment, it is recommended to either stop and deallocate, or
delete, the computing resources if they are no longer needed.

```{r eval=FALSE}
# Stop and deallocate the DSVM.

operateDSVM(asc, dsvm_rg, dsvm_name, "Stop")
```

```{r eval=FALSE}
# Delete the resource group.

azureDeleteResourceGroup(asc, dsvm_rg)
```
@ -0,0 +1,9 @@

# List of data sets

| Data Set Name | Link to the Full Data Set | Full Data Set Size (MB) | Link to Report |
| ---:| ---: | ---: | ---: |
| Data Set 1 | [link](https://guschmueds.blob.core.windows.net/datasets/solar.csv) | 1.1 | N/A |

# Description of data sets

* Data Set 1 - *Solar panel readings from 2013-12-01 to 2016-12-01,
sampled every half hour.*
Binary file not shown. After: Width | Height | Size: 9.4 KiB
Binary file not shown. After: Width | Height | Size: 12 KiB

@ -0,0 +1,4 @@

# Documents

*This folder contains documents such as blogs, installation instructions, etc. It is also the default directory where the generated reports from R markdown are placed.*

@ -0,0 +1,62 @@

# Data Science Accelerator - *Solar power forecasting with Cognitive Toolkit in R*

## Overview

This repo reproduces [CNTK tutorial 106
B](https://github.com/Microsoft/CNTK/blob/master/Tutorials/CNTK_106B_LSTM_Timeseries_with_IOT_Data.ipynb)
- deep learning time series forecasting with Long Short-Term Memory
(LSTM) - in R, by using the Keras R interface with Microsoft Cognitive
Toolkit in an Azure Data Science Virtual Machine (DSVM).

An Azure account can be created for free by visiting [Microsoft
Azure](https://azure.microsoft.com/free). This will then allow you to
deploy an [Ubuntu Data Science Virtual
Machine](https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-virtual-machine-overview)
through the [Azure Portal](https://ms.portal.azure.com). You can then
connect to the server's [RStudio
Server](https://www.rstudio.com/products/rstudio/#Server) instance
through a local web browser via ```http://<ip address>:8787```.

The repository contains three parts:

- **Data** Solar panel readings collected from Internet-of-Things (IoT)
devices are used.
- **Code** Two R markdown files are available - the first, titled
[SolarPanelForecastingTutorial](https://github.com/Microsoft/acceleratoRs/blob/master/SolarPanelForecasting/Code/SolarPanelForecastingTutorial.Rmd), provides a general introduction to
the accelerator and code for setting up an experimental environment
on an Azure DSVM; the second, titled [SolarPanelForecastingCode](https://github.com/Microsoft/acceleratoRs/blob/master/SolarPanelForecasting/Code/SolarPanelForecastingCode.Rmd),
wraps the code and a step-by-step tutorial on building an LSTM model for
forecasting from end to end.
- **Docs** Blogs and decks will be added soon.

## Business domain

The accelerator presents a tutorial on forecasting solar panel power
readings by using an LSTM-based neural network model trained on
historical data. Solar power forecasting is a critical problem, and a
model with desirable estimation accuracy potentially benefits many
domain-specific businesses such as energy trading, management, etc.

## Data science problem

The problem is to predict the maximum value of total power generation in
a day from the solar panel, given the sequential readings of solar
power generation at the current and past sampling moments.

## Data understanding

The data set used in the accelerator was collected from IoT devices
incorporated in solar panels. The data is available at this
[URL](https://guschmueds.blob.core.windows.net/datasets/solar.csv).

## Modeling

The model used in this accelerator is based on LSTM, which is capable of
modeling long-term dependencies in time series data. By properly
processing the original data into sequences of power readings, a deep
neural network formed by LSTM cells and dropout layers can capture the
patterns in the time series so as to predict the output.

## Solution architecture

The experiment is conducted on an Ubuntu DSVM.