This commit is contained in:
Zhou Fang 2017-08-02 15:55:57 +08:00
Parent 55b6f35104 55a20194ec
Commit e32719a791
12 changed files with 2103 additions and 1 deletions

View File

@ -19,7 +19,7 @@ Each of accelerators shared in this repo is structured following the project tem
* `Docs` - Related documentation, references, and any generated
reports are put in this directory.
* An accelerator should be able to run interactively as an R notebooks in RStudio.
* An accelerator should be able to run interactively in an IDE that supports R markdown such as [R Tools for Visual Studio (RTVS)](https://docs.microsoft.com/en-us/visualstudio/rtvs/rmarkdown) or RStudio.
* A Makefile is provided by default to generate documents in other formats; alternatively, rmarkdown::render can be used for the same purpose.
# Contributing

View File

@ -0,0 +1,44 @@
# Source R markdown files and the corresponding derived output file names.
RMD=$(wildcard *_*.Rmd)
RCD=$(RMD:.Rmd=.R)
HTM=$(RMD:.Rmd=.html)
PDF=$(RMD:.Rmd=.pdf)
ODT=$(RMD:.Rmd=.odt)
DOC=$(RMD:.Rmd=.docx)
MDN=$(RMD:.Rmd=.md)
IPY=$(RMD:.Rmd=.ipynb)

# Extract the pure R code from an R markdown file.
%.R: %.Rmd
	Rscript -e 'knitr::purl("$*.Rmd")'

# Render an R markdown file to the requested output format.
%.md: %.Rmd
	Rscript -e 'rmarkdown::render("$*.Rmd", "rmarkdown::md_document")'
%.html: %.Rmd
	Rscript -e 'rmarkdown::render("$*.Rmd", "rmarkdown::html_document")'

.PRECIOUS: %.pdf
%.pdf: %.Rmd
	Rscript -e 'rmarkdown::render("$*.Rmd", "rmarkdown::pdf_document")'

# Open the generated PDF in a viewer.
%.view: %.pdf
	evince $^ &

# Convert an R markdown file to a Jupyter notebook via notedown.
%.ipynb: %.Rmd
	notedown $^ --nomagic > $@
	sh support/fix_ipynb.sh $@

%.docx: %.Rmd
	Rscript -e 'rmarkdown::render("$*.Rmd", "rmarkdown::word_document")'
%.odt: %.Rmd
	Rscript -e 'rmarkdown::render("$*.Rmd", "rmarkdown::odt_document")'

# Remove generated outputs.
clean:
	rm -f *.docx *.R *.odt *.pdf *.html *.md *.ipynb

realclean: clean
	rm -f *~
	rm -rf _book _site _html data models
	rm -rf app_education_files

View File

@ -0,0 +1,35 @@
# Prerequisites
*The following packages are required to run the code:*
* R >= 3.3.1
* rmarkdown >= 1.3
* AzureSMR >= 0.2.6
* AzureDSVM >= 0.2.0
* keras >= 2.0.6
* ggplot2 >= 2.2.1
* magrittr >= 1.5
* dplyr >= 0.7.1.9000
* readr >= 0.2.2
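A quick way to confirm these prerequisites are met is to check the installed package versions from R. The snippet below is a minimal sketch; the version strings simply mirror the list above.

```r
# Check that each required package is installed and meets the minimum version above.
pkgs <- c(rmarkdown="1.3", AzureSMR="0.2.6", AzureDSVM="0.2.0", keras="2.0.6",
          ggplot2="2.2.1", magrittr="1.5", dplyr="0.7.1.9000", readr="0.2.2")
for (p in names(pkgs)) {
  ok <- requireNamespace(p, quietly=TRUE) && packageVersion(p) >= pkgs[[p]]
  cat(sprintf("%-10s %s\n", p, if (ok) "OK" else "missing or too old"))
}
```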
# Use of template
The code for the analytics, embedded with step-by-step instructions, is written in R markdown and can be run interactively within the code chunks of the markdown file.
The Makefile in this folder can be used to produce reports in various formats from the R markdown script. Supported output formats include
* R - pure R codes,
* md - markdown,
* html - html,
* pdf - pdf,
* ipynb - Jupyter notebook,
* docx - Microsoft Word document, and
* odt - OpenDocument document.
To generate output in one of the above formats, simply run
```
make <filename>.<supported format>
```
The generated files can be removed with `make clean` or `make realclean`.
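Alternatively, as noted in the repository README, `rmarkdown::render` can be called directly instead of make. A minimal sketch, assuming a hypothetical file named `example.Rmd` in this folder:

```r
library(rmarkdown)

# Render the hypothetical example.Rmd into a specific output format; the result
# is written next to the source file.
render("example.Rmd", output_format = "html_document")
render("example.Rmd", output_format = "pdf_document")
```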

View File

@ -0,0 +1,419 @@
---
title: "Solar power forecasting with Long Short-Term Memory (LSTM)"
author: "Le Zhang, Data Scientist, Cloud and AI, Microsoft"
date: '`r Sys.Date()`'
output:
html_notebook: default
---
This accelerator is a reproduction of CNTK tutorial 106 B - using LSTM
for time series forecasting in R. The original tutorial can be found [here](https://github.com/Microsoft/CNTK/blob/master/Tutorials/CNTK_106B_LSTM_Timeseries_with_IOT_Data.ipynb).
The accelerator mainly demonstrates how one can use the `keras` R interface
together with the CNTK backend to train an LSTM model for solar power forecasting
on an Azure Data Science Virtual Machine (DSVM).
## 1 Introduction
### 1.1 Context.
[Solar power forecasting](https://en.wikipedia.org/wiki/Solar_power_forecasting)
is a challenging and important problem. Analyzing historical time-series data of
solar power generation may help predict the total amount of energy produced by
solar panels.
More discussion of solar power forecasting can be found on the Wikipedia page. The
model illustrated in this accelerator is a simplified one that merely demonstrates
how an R-based LSTM model can be trained on an Azure DSVM.
### 1.2 Overall introduction
An overall introduction to the modeling techniques, training framework, and cloud
computing resource management can be found in the accompanying markdown file.
## 2 Step by step tutorial
### 2.1 Set up
Load the following R packages for this tutorial.
```{r}
library(keras)
library(magrittr)
library(dplyr)
library(readr)
library(ggplot2)
```
### 2.2 Data pre-processing
#### 2.2.1 Data downloading.
The original data set is preserved [here](https://guschmueds.blob.core.windows.net/datasets/solar.csv).
For convenience of reproduction, the data is downloaded onto the local system.
```{r}
data_url <- "https://guschmueds.blob.core.windows.net/datasets/solar.csv"
data_dir <- tempdir()
data_file <- tempfile(tmpdir=data_dir, fileext=".csv")
# download data.
download.file(url=data_url,
destfile=data_file)
```
```{r}
# Read the data into memory.
df_panel <- read_csv(data_file)
```
#### 2.2.2 Data understanding
The original data set is in the form of
|Time | solar.current | solar.total|
|------------------|----------|-----------|
|2013-12-01 7:00|6.30|1.69|
|2013-12-01 7:30|44.30|11.36|
|2013-12-01 8:00|208.00|67.50|
|...|...|...|
|2016-12-01 12:00|1815.00|5330.00|
The first column is the time stamp at which the solar panel reading was recorded. Readings
are taken once every half hour. The second and third columns are the current
power at the time of the reading and the cumulative total reading so far on that day.
The data can be explored interactively with the following code.
```{r}
# Take a glimpse of the data.
glimpse(df_panel)
ggplot(df_panel, aes(x=solar.current)) +
geom_histogram()
```
#### 2.2.3 Data re-formatting.
The objective is to predict the maximum
value of the total power reading on a day, using a sequence of historical solar power readings.
Since the number of solar panel power readings may differ from day to day, a fixed
length of 14 is used for each day. That is, on a daily basis, a univariate
time series of 14 elements (14 readings of solar panel power) is
formed as input data in order to predict the maximum value of total power
generation on that day.
Following this principle, the data of a day is then re-formatted as
|Time series | Predicted target|
|-------------------|-----------|
|1.7, 11.4|10300|
|1.7, 11.4, 67.5|10300|
|1.7, 11.4, 67.5, 250.5|10300|
|1.7, 11.4, 67.5, 250.5, 573.5|10300|
|...|...|
For training purposes the time stamp is not needed, so the reformatted data are
aggregated as a set of sequences.
The following code accomplishes the processing task, which also includes
sub-tasks for normalization, computing daily maxima and minima, grouping, etc.
1. Normalize the data, as LSTM does not perform well on unscaled data.
```{r}
# Functions for 0-1 normalization.
normalizeData <- function(data) {
(data - min(data)) / (max(data) - min(data))
}
denormalizeData <- function(data, max, min) {
data * (max - min) + min
}
df_panel_norm <-
mutate(df_panel, solar.current=normalizeData(solar.current)) %>%
mutate(solar.total=normalizeData(solar.total)) %T>%
print()
# Save max and min values for later reference, to reconcile original data
# when necessary.
normal_ref <- list(current_max=max(df_panel$solar.current),
current_min=min(df_panel$solar.current),
total_max=max(df_panel$solar.total),
total_min=min(df_panel$solar.total))
```
2. Group the data by day.
```{r}
df_panel_group <-
mutate(df_panel_norm, date = as.Date(time)) %>%
group_by(date) %>%
arrange(date) %T>%
print()
```
3. Append the columns "solar.current.max" and "solar.total.max".
```{r}
# Compute the max values of current and total power generation for each day.
df_panel_current_max <-
summarise(df_panel_group, solar.current.max = max(solar.current)) %T>%
print()
df_panel_total_max <-
summarise(df_panel_group, solar.total.max = max(solar.total)) %T>%
print()
# Append the max values of power generation.
df_panel_max <-
df_panel_current_max %>%
mutate(solar.total.max=df_panel_total_max$solar.total.max) %>%
mutate(day_id=row_number())
df_panel_group$solar.current.max <- df_panel_max$solar.current.max[match(df_panel_group$date, df_panel_max$date)]
df_panel_group$solar.total.max <- df_panel_max$solar.total.max[match(df_panel_group$date, df_panel_max$date)]
df_panel_all <-
df_panel_group %T>%
print()
```
4. Generate the time series sequences for each day.
NOTE: **according to the original [CNTK tutorial](https://github.com/Microsoft/CNTK/blob/master/Tutorials/CNTK_106B_LSTM_Timeseries_with_IOT_Data.ipynb), days with fewer than 8 readings are
omitted from the data, and days with more than 14 readings are truncated to exactly 14.**
```{r}
# Find the days that have more than 8 readings.
day_more_than_8 <-
summarise(df_panel_all, group_size = n()) %>%
filter(group_size > 8) %>%
select(date)
# Get those days with more than 8 readings, and truncate the number of readings
# to be equal or less than 14.
df_panel_seq <-
df_panel_all[which(as.Date(df_panel_all$date) %in% as.Date(day_more_than_8$date)), ] %>%
filter(row_number() <= 14) %>%
mutate(ndata = n()) %T>%
print()
```
According to the data format, for each day the first sequence is composed of
the initial two readings, and the next is generated by appending the
power reading at the next time step. The process iterates until all the readings
on that day form the last sequence.
The function to generate the sequences is as follows.
```{r}
genSequence <- function(data) {
if (!"day_id" %in% names(data))
stop("Input data frame does not have Day ID (day_id) column!")
# 14 is the maximum sequence length, so each day contributes at most 13
# sequences, presuming each sequence starts from 2 initial readings.
# NOTE: unlike the approach in the official CNTK tutorial, the meter readings here
# are padded with 0s, because the keras R interface does not take a list as input.
date <- as.character(0)
x <- array(0, dim=c(14 * n_groups(data), 14, 1))
y <- array(0, dim=c(14 * n_groups(data), 1))
index <- 1
cat("Generating data ...")
for (j in unique(data$day_id)) {
readings <- select(filter(data, day_id == j),
solar.total,
solar.total.max,
date)
readings_date <- readings$date
readings_x <- as.vector(readings$solar.total)
readings_y <- as.vector(readings$solar.total.max)
reading_date <- unique(readings_date)
reading_y <- unique(readings_y)
for (i in 2:nrow(readings)) {
x[index, 1:i, 1] <- readings_x[1:i]
y[index, 1] <- reading_y
date[index] <- as.character(reading_date)
# day_id is different from the group index, so a separate iterator is used.
index <- index + 1
}
}
return(list(x=array(x[1:(index - 1), 1:14, 1], dim=c(index - 1, 14, 1)),
y=y[1:(index - 1)],
date=date[1:(index - 1)]))
}
```
#### 2.2.4 Data splitting
The whole data set is split into training, validation, and testing sets, which
are sampled according to the following scheme:
|Day1|Day2|...|DayN-1|DayN|DayN+1|DayN+2|...|Day2N-1|Day2N|
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
|Train|Train|...|Val|Test|Train|Train|...|Val|Test|
Following the original tutorial, in every 10-day window of the original data
set, 8 sequential days are used for training, 1 day for validation, and 1 day
for testing.
```{r}
df_panel_seq_sample <-
mutate(df_panel_seq, sample_index = day_id %% 10) %T>%
print()
df_train <- filter(df_panel_seq_sample, sample_index <= 8 & sample_index > 0)
df_val <- filter(df_panel_seq_sample, sample_index == 9)
df_test <- filter(df_panel_seq_sample, sample_index == 0)
```
The data sets are then processed with the `genSequence` function to convert the
time series data into the required format.
```{r}
seq_train <- genSequence(df_train)
seq_val <- genSequence(df_val)
seq_test <- genSequence(df_test)
x_train <- seq_train$x
y_train <- seq_train$y
x_val <- seq_val$x
y_val <- seq_val$y
x_test <- seq_test$x
y_test <- seq_test$y
date_test <- as.Date(seq_test$date)
```
### 2.3 Model definition and creation
The overall structure of the LSTM neural network is shown below.
![](../Docs/Figs/lstm.png)
There are 14 LSTM cells, each taking one solar power reading from the
series as input. To reduce overfitting, a dropout layer with a dropout rate of 0.2 is added. The final layer is
a single neuron densely connected to the dropout layer, and its output is the predicted solar power value.
#### 2.3.1 Model definition.
In Keras, one common type of neural network model is built by stacking basic layers; such a model
is created with the `keras_model_sequential()` function. Following
the model description above, the R code to define the model is
```{r}
# The neural network topology is the same as that in the original CNTK tutorial.
model <-
keras_model_sequential() %>%
layer_lstm(units=14,
input_shape=c(14, 1)) %>%
layer_dropout(rate=0.2) %>%
layer_dense(units=1)
```
The defined model is then compiled, specifying the loss function (mean squared error)
and the optimization method (Adam).
```{r}
model %>% compile(loss='mse', optimizer='adam')
```
After compilation, basic information about the model can be displayed with `summary`.
```{r}
summary(model)
```
#### 2.3.2 Model training
After model definition and data pre-processing, the model is trained on the
training set. The number of epochs and the batch size can be varied to
fine-tune model performance.
```{r}
# Larger numbers of epochs and larger batch sizes lead to longer training times.
epoch_size <- 200
batch_size <- 1
# The validation set is used to monitor the model during training.
model %>% fit(x_train,
y_train,
validation_data=list(x_val, y_val),
batch_size=batch_size,
epochs=epoch_size)
```
#### 2.3.3 Model scoring
After training, the model can be evaluated on the test set with the loss metric.
```{r}
# evaluation on the test data.
score <-
evaluate(model, x_test, y_test) %T>%
print()
```
#### 2.3.4 Result visualization
```{r eval=FALSE}
# Use the model for prediction.
y_pred <- predict(model,
x_test)
# Reconcile the original data.
y_pred <- denormalizeData(y_pred, normal_ref$total_max, normal_ref$total_min)
y_test <- denormalizeData(y_test, normal_ref$total_max, normal_ref$total_min)
# Plot the comparison results.
df_plot <- data.frame(
date=date_test,
index=1:length(y_test),
true=y_test,
pred=y_pred)
ggplot(df_plot, aes(x=date)) +
  geom_line(aes(y=true, color="True")) +
  geom_line(aes(y=pred, color="Pred")) +
theme_bw() +
ggtitle("Solar power forecasting") +
xlab("Date") +
ylab("Max of total solar power")
```
The comparison between the predicted and ground-truth power values is shown below.
![](../Docs/Figs/result.png)
The plot shows that the predictions align well with the true values.
There are several ways the model could be improved, such as
* increasing the number of epochs,
* further preprocessing the training data to smooth out missing values, and
* using a more complex network topology (see the sketch below).
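As an illustration of the last point, a slightly deeper topology could look like the following. This is a hypothetical sketch rather than part of the original tutorial, and the layer sizes are illustrative only.
```{r eval=FALSE}
# A hypothetical stacked-LSTM variant; not tuned, shown only to illustrate a
# more complex topology than the single-layer model above.
model_deep <-
  keras_model_sequential() %>%
  layer_lstm(units=32, return_sequences=TRUE, input_shape=c(14, 1)) %>%
  layer_dropout(rate=0.2) %>%
  layer_lstm(units=14) %>%
  layer_dropout(rate=0.2) %>%
  layer_dense(units=1)
model_deep %>% compile(loss='mse', optimizer='adam')
```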

File diff suppressed because one or more lines are too long

View File

@ -0,0 +1,264 @@
---
title: "Solar power forecasting with Long Short-Term Memory (LSTM)"
author: "Le Zhang, Data Scientist, Cloud and AI, Microsoft"
date: '`r Sys.Date()`'
output:
html_notebook: default
---
This accelerator is a reproduction of CNTK tutorial 106 B - using LSTM
for time series forecasting in R. The original tutorial can be found [here](https://github.com/Microsoft/CNTK/blob/master/Tutorials/CNTK_106B_LSTM_Timeseries_with_IOT_Data.ipynb).
The accelerator mainly demonstrates how one can use the `keras` R interface
together with the Cognitive Toolkit backend to train an LSTM model for solar power forecasting
on an Azure Data Science Virtual Machine (DSVM).
## 1 Introduction
### 1.1 Context.
[Solar power forecasting](https://en.wikipedia.org/wiki/Solar_power_forecasting)
is a challenging and important problem. Analyzing historical time-series data of
solar power generation may help predict the total amount of energy produced by
solar panels.
More discussion of solar power forecasting can be found on the Wikipedia page. The
model illustrated in this accelerator is a simplified one that merely demonstrates
how an R-based LSTM model can be trained with the Cognitive Toolkit backend on an Azure DSVM.
### 1.2 LSTM
LSTM is a type of recurrent neural network characterized by its capability to
model long-term dependencies. It has been applied in practice in many fields such
as natural language processing (NLP), action recognition, time series prediction,
etc.
While a comprehensive discussion of LSTM is not the focus of this accelerator,
more information can be found in [Chris Olah's blog](http://colah.github.io/posts/2015-08-Understanding-LSTMs/).
### 1.3 Cognitive Toolkit and Keras
#### 1.3.1 Cognitive Toolkit
[Microsoft Cognitive Toolkit (previously known as CNTK)](https://www.microsoft.com/en-us/cognitive-toolkit/) is a free, easy-to-use,
open-source, commercial-grade toolkit that trains deep learning algorithms
to learn like the human brain.
Its features include
* highly optimized built-in components that handle multi-dimensional data
from different language environments, support various types of deep learning
algorithms, and allow user-defined core components to be added on the GPU;
* efficient resource usage that enables parallelism across multiple GPUs/machines;
* easy expression of neural networks with full APIs in Python, C++, and BrainScript; and
* training support on Azure.
#### 1.3.2 Keras
[Keras](https://keras.io/) is a high-level neural network API that is capable of running on various
backends such as Cognitive Toolkit, TensorFlow, and Theano. It makes experimenting with
deep neural networks, from idea to result, easier than ever before.
#### 1.3.3 Cognitive Toolkit + Keras in R
Since version 2.0, Cognitive Toolkit has supported Keras.
Cognitive Toolkit does not yet have a native R API. However, by using the [Keras R interface](https://rstudio.github.io/keras/), one can train neural network
models through the Keras API with the Cognitive Toolkit backend.
## 2 Cloud resource deployment
The Azure cloud platform offers a variety of resources for elastically running
scalable analytical jobs. In particular, VMs or VM clusters equipped with
high-performance computing engines make it convenient for researchers and
developers to prototype and validate models.
The following sections demonstrate how to train an LSTM model on a DSVM with
Cognitive Toolkit and the Keras R interface.
NOTE: **the script demonstrating Cognitive Toolkit + Keras can also be run
in a local environment, but one needs to manually download and install
Cognitive Toolkit, Keras, the keras R package, the CUDA Toolkit (if a GPU device is available and GPU acceleration
is wanted), and their dependencies.**
### 2.1 Data Science Virtual Machine (DSVM)
[Azure DSVM](https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-provision-vm) is a curated VM that is pre-installed with a
rich set of commonly used data science and AI development tools, such as R/Python
environments, Cognitive Toolkit, SQL Server, etc.
The DSVM is a desirable workplace in which to experiment, prototype, and productize data
analytics and AI solutions. The elasticity of the offering also ensures cost
effectiveness, making it more economical than on-premise
servers.
### 2.2 Configuration and setup
Both Cognitive Toolkit and Keras are pre-installed on the DSVM. However, the
keras R package and its dependencies are not yet available there. Preliminary
installation and configuration are therefore required, as sketched below.
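A minimal sketch of what such a manual installation might look like is given below. This is an assumption-laden illustration rather than the exact provisioning used by the accelerator (the actual setup is performed by the DSVM extension script in section 2.3); it assumes internet access on the VM and that Cognitive Toolkit's Python environment is already present.
```{r eval=FALSE}
# Hypothetical manual setup of the keras R interface with the CNTK backend.
install.packages("devtools")
devtools::install_github("rstudio/keras")

# Ask Keras to use the Cognitive Toolkit backend before it is first loaded.
Sys.setenv(KERAS_BACKEND = "cntk")

library(keras)
backend()   # should report that the CNTK backend is in use
```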
### 2.3 Resource deployment with `AzureDSVM`
[`AzureDSVM`](https://github.com/Azure/AzureDSVM) is an R package that allows R users to interact directly with an Azure
account to administer DSVM instances.
To fire up a DSVM, one just needs to specify information such as the DSVM name, user
name, operating system, VM size, etc. For example, the following script fires up
an Ubuntu DSVM of size D1_v2 located in Southeast Asia. NC-series VMs, which are
equipped with GPU devices, are available in certain regions such as East US,
West Europe, etc. Compared to D-series VMs, NC-series VMs have a higher pricing rate, so
there are trade-offs in choosing an appropriate machine for the training work.
NOTE: **Keras and the Keras R interface promise seamless utilization of a GPU device
that is properly configured in the VM, accelerating deep learning model training.**
```{r}
# load the packages
library(AzureSMR)
library(AzureDSVM)
```
```{r}
# Credentials for authentication against the Azure account are kept in a JSON
# formatted file named "config.json", which is located at ~/.azuresmr.
# The credentials needed for authentication include the Client ID, Tenant ID,
# authentication key, password, and public key.
settingsfile <- getOption("AzureSMR.config")
config <- read.AzureSMR.config()
```
```{r}
# Authentication with the credential information.
asc <- createAzureContext()
with(config,
setAzureContext(asc, tenantID=tenantID, clientID=clientID, authKey=authKey)
)
azureAuthenticate(asc)
```
```{r}
# location and resource group name.
dsvm_location <- "southeastasia"
dsvm_rg <- paste0("rg", paste(sample(letters, 3), collapse=""))
# VM size, operating system, and VM name.
dsvm_size <- "Standard_D1_v2"
dsvm_os <- "Ubuntu"
dsvm_name <- paste0("dsvm",
paste(sample(letters, 3), collapse=""))
# VM user name, authentication method (password in this case), and login password.
dsvm_username <- "dsvmuser"
dsvm_authen <- "Password"
dsvm_password <- config$PASSWORD
```
```{r eval=FALSE}
# deploy the DSVM.
deployDSVM(asc,
resource.group=dsvm_rg,
location=dsvm_location,
hostname=dsvm_name,
username=dsvm_username,
size=dsvm_size,
os=dsvm_os,
authen=dsvm_authen,
password=dsvm_password,
mode="Sync")
```
Out of the box, a DSVM does not have the keras R interface installed and
configured. Post-deployment installation and configuration of the package can
be achieved by adding an extension to the deployed DSVM, which basically runs a
shell script hosted at a remote location.
```{r}
# URL of the shell script and the command to run the script.
dsvm_fileurl <- "https://raw.githubusercontent.com/yueguoguo/Azure-R-Interface/master/demos/demo-5/script.sh"
dsvm_command <- "sudo sh script.sh"
```
```{r eval=FALSE}
# Add extension to the DSVM.
addExtensionDSVM(asc,
location=dsvm_location,
resource.group=dsvm_rg,
hostname=dsvm_name,
os=dsvm_os,
fileurl=dsvm_fileurl,
command=dsvm_command)
```
### 2.4 Remote access to the DSVM.
After a successful deployment and extension, the DSVM can be remotely accessed
by
1. RStudio Server - `http://<dsvm_name>.<location>.cloudapp.azure.com:8787`
2. Jupyter Notebook - `https://<dsvm_name>.<location>.cloudapp.azure.com:8000`
3. X2Go client.
NOTE: **it was found that the keras R interface does not work well in RStudio Server
owing to an SSL certificate issue, which may be related to the "http" protocol. Jupyter
Notebook, which is served over "https", works well.**
Ideally, in the R session on the remote DSVM, typing the following
```{r eval=FALSE}
library(keras)
backend()
```
will show the message "Using CNTK backend...", which means the interface can
detect and load the Cognitive Toolkit backend. If the DSVM is an NC-series machine, the GPU
device will be detected and used.
### 2.5 Model building
After all the setup, the model can be created by using Keras R interface
functions. As the model building follows the original CNTK tutorial, the introductory
text and descriptions are not replicated here.
The script for the whole step-by-step tutorial is available [here](../Code/lstm.R).
### 2.6 Run script on the DSVM
The script can be run on the deployed DSVM in various ways.
1. Jupyter Notebook - access the Jupyter Hub hosted by the DSVM via `https://<dsvm_name>.<location>.cloudapp.azure.com:8000`. Create an R-kernel
notebook to run the script.
2. X2Go client - create a new X2Go session for a remote desktop of the machine.
The script can be copied onto the DSVM via SSH or any SSH-based file
transfer software, and then run in the RStudio desktop version.
NOTE: **it was found that if the script is run in the R console or with Rscript on the
command line, the GPU device is not activated for acceleration, whereas running
the script in the RStudio IDE does not have this problem.**
## 3 Closing
After the experiment, it is recommended to either stop and deallocate the
computing resource, or destroy it, if it is no longer needed.
```{r eval=FALSE}
# Stop and deallocate the DSVM.
operateDSVM(asc, dsvm_rg, dsvm_name, "Stop")
```
```{r eval=FALSE}
# Delete the resource group.
azureDeleteResourceGroup(asc, dsvm_rg)
```

File diff suppressed because one or more lines are too long

View File

@ -0,0 +1,9 @@
# List of data sets
| Data Set Name | Link to the Full Data Set | Full Data Set Size (MB) | Link to Report |
| ---:| ---: | ---: | ---: |
| Data Set 1 | [link](https://guschmueds.blob.core.windows.net/datasets/solar.csv) | 1.1 | N/A|
# Description of data sets
* Data Set 1 *Solar panel readings from 2013-12-01 to 2016-12-01,
sampled every half hour during the day.*
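For a quick look at the data, it can be read directly from the URL above. A minimal sketch, assuming the `readr` package is installed (the variable names are illustrative only):

```r
library(readr)

# Read the solar panel readings straight from blob storage and inspect the columns.
solar_url <- "https://guschmueds.blob.core.windows.net/datasets/solar.csv"
df_solar <- read_csv(solar_url)
str(df_solar)
```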

Binary data
SolarPanelForecasting/Docs/Figs/lstm.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 9.4 KiB

Binary data
SolarPanelForecasting/Docs/Figs/result.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 12 KiB

View File

@ -0,0 +1,4 @@
# Documents
*This folder contains documents such as blogs, installation instructions, etc. It is also the default directory where the reports generated from R markdown are placed.*

View File

@ -0,0 +1,62 @@
# Data Science Accelerator - *Solar power forecasting with Cognitive Toolkit in R*
## Overview
This repo reproduces [CNTK tutorial 106
B](https://github.com/Microsoft/CNTK/blob/master/Tutorials/CNTK_106B_LSTM_Timeseries_with_IOT_Data.ipynb)
- Deep Learning time series forecasting with Long Short-Term Memory
(LSTM) in R, by using the Keras R interface with Microsoft Cognitive
Toolkit in an Azure Data Science Virtual Machine (DSVM).
An Azure account can be created for free by visiting [Microsoft
Azure](https://azure.microsoft.com/free). This will then allow you to
deploy a [Ubuntu Data Science Virtual
Machine](https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-virtual-machine-overview)
through the [Azure Portal](https://ms.portal.azure.com). You can then
connect to the server's [RStudio
Server](https://www.rstudio.com/products/rstudio/#Server) instance
through a local web browser via ```http://<ip address>:8787```.
The repository contains three parts
- **Data** Solar panel readings collected from Internet-of-Things (IoT)
devices are used.
- **Code** Two R markdown files are available - the first one, titled
[SolarPanelForecastingTutorial](https://github.com/Microsoft/acceleratoRs/blob/master/SolarPanelForecasting/Code/SolarPanelForecastingTutorial.Rmd), provides a general introduction to
the accelerator and code for setting up an experimental environment
on an Azure DSVM; the second one, titled [SolarPanelForecastingCode](https://github.com/Microsoft/acceleratoRs/blob/master/SolarPanelForecasting/Code/SolarPanelForecastingCode.Rmd),
wraps the code and step-by-step tutorial on building an LSTM model for
forecasting from end to end.
- **Docs** Blogs and decks will be added soon.
## Business domain
The accelerator presents a tutorial on forecasting solar panel power
readings by using an LSTM-based neural network model trained on
historical data. Solar power forecasting is a critical problem, and a
model with good estimation accuracy potentially benefits many
domain-specific businesses such as energy trading and management.
## Data science problem
The problem is to predict the maximum value of the solar panel's total power
generation in a day, taking as input the sequential readings of solar
power generation at the current and past sampling moments.
## Data understanding
The data set used in the accelerator was collected from IoT devices
incorporated in solar panels. The data is available at the
[URL](https://guschmueds.blob.core.windows.net/datasets/solar.csv).
## Modeling
The model used in this accelerator is based on LSTM, which is capable of
modeling long-term dependencies in time series data. By properly
processing the original data into sequences of power readings, a deep
neural network formed from LSTM cells and dropout layers can capture the
patterns in the time series so as to predict the output.
## Solution architecture
The experiment is conducted on an Ubuntu DSVM.