This commit is contained in:
yueguoguo 2017-09-11 11:58:13 +08:00
Parent ae697a97ce
Commit 067c0f0356
23 changed files: 2249 additions and 182 deletions

View file

@ -260,13 +260,13 @@ After balancing the training set, a model can be created for prediction. For com
Three algorithms, support vector machine with radial basis function kernel, random forest, and extreme gradient boosting (xgboost), are used for model building.
```{r, echo=TRUE, message=FALSE, warning=FALSE}
# initialize training control.
-tc <- trainControl(method="boot",
-                   number=3,
-                   repeats=3,
-                   search="grid",
-                   classProbs=TRUE,
-                   savePredictions="final",
-                   summaryFunction=twoClassSummary)
+tc <- trainControl(method="repeatedcv",
+                   number=3,
+                   repeats=1,
+                   search="random",
+                   summaryFunction=twoClassSummary,
+                   classProbs=TRUE,
+                   savePredictions=TRUE)
# SVM model.
@ -274,16 +274,16 @@ time_svm <- system.time(
model_svm <- train(Attrition ~ .,
                   df_train,
                   method="svmRadial",
-                  trainControl=tc)
+                  trControl=tc)
)
# random forest model
time_rf <- system.time(
model_rf <- train(Attrition ~ .,
-                  df_train,
-                  method="rf",
-                  trainControl=tc)
+                  data=df_train,
+                  method="rf",
+                  trControl=tc)
)
# xgboost model.
@ -292,7 +292,7 @@ time_xgb <- system.time(
model_xgb <- train(Attrition ~ .,
                   df_train,
                   method="xgbLinear",
-                  trainControl=tc)
+                  trControl=tc)
)
```
2. Ensemble of models.
@ -642,7 +642,7 @@ SVM with RBF kernel is used as an illustration.
model_svm <- train(Attrition ~ .,
                   df_txt_train,
                   method="svmRadial",
-                  trainControl=tc)
+                  trControl=tc)
```
```{r}
# model evaluation

File differences are hidden because one or more lines are too long

View file

@ -0,0 +1,591 @@
---
title: "Operationalization of Employee Attrition Prediction on Azure Cloud"
author: "Le Zhang, Data Scientist, Microsoft"
date: "August 19, 2017"
output: html_document
---
## Introduction
It is preferable to host AI applications on the cloud, for the obvious benefits
of elasticity, agility, and flexibility in training models and deploying services.
The tutorial in this markdown demonstrates how to operationalize the
[Employee Attrition Prediction](https://github.com/Microsoft/acceleratoRs/tree/master/EmployeeAttritionPrediction)
accelerator on Azure cloud and then deploy the model as well as the analytical
functions as web-based services.
## Data exploration and model training - Azure Data Science Virtual Machine
### Introduction
[Azure Data Science Virtual Machine (DSVM)](https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-provision-vm)
is a curated virtual machine image configured with a comprehensive set of
commonly used data analytical tools and software. A DSVM is a desirable workplace
for data scientists to quickly experiment with and prototype a data analytical idea.
The R packages [AzureSMR](https://github.com/Microsoft/AzureSMR) and [AzureDSVM](https://github.com/Azure/AzureDSVM)
simplify the use and operation of DSVMs. One can use functions from these
packages to easily create, stop, and destroy DSVMs in an Azure resource group. To
get started, simply complete the initial setup with an Azure subscription, as instructed
[here](http://htmlpreview.github.io/?https://github.com/Microsoft/AzureSMR/blob/master/inst/doc/Authentication.html).
### Set up a DSVM for employee attrition prediction
#### Pre-requisites
For this tutorial, an Ubuntu Linux DSVM is spun up for the experiment. Since
the analysis is performed on a relatively small data set, a medium-size VM is
sufficient. In this case, a Standard D2 v2 VM is used. It costs roughly 0.158 USD
per hour (more details about pricing can be found [here](https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/)).
The DSVM can be deployed with the Azure portal, the Azure Command-Line
Interface, or the AzureDSVM R package from within an R session.
The following is the code for deploying a Linux DSVM of Standard D2 v2 size.
```{r, eval=FALSE}
# Load the R packages for resource management.
library(AzureSMR)
library(AzureDSVM)
```
To use the `AzureSMR` and `AzureDSVM` packages for operating Azure resources,
it is required to create and set up an Azure Active Directory App which is
authorized to consume Azure REST APIs. Details can be found in the AzureSMR package [vignette](https://github.com/Microsoft/AzureSMR/blob/master/vignettes/Authentication.Rmd).
After the proper setup, credentials such as the client ID, tenant ID, and secret key
can be obtained.
It is suggested to put the credentials for authentication into a config.json file
located in the "~/.azuresmr" directory. The `read.AzureSMR.config` function then reads
the config JSON file into an R object. The credentials are used to set an
Azure Active Context, which is then used for authentication.
```{r, eval=FALSE}
settingsfile <- getOption("AzureSMR.config")
config <- read.AzureSMR.config()
asc <- createAzureContext()
setAzureContext(asc,
tenantID=config$tenantID,
clientID=config$clientID,
authKey=config$authKey)
```
Authentication.
```{r, eval=FALSE}
azureAuthenticate(asc)
```
#### Deployment of DSVM
Specifications for deploying the DSVM are given as inputs of the deployment
function from `AzureDSVM`.
In this case, a resource group in Southeast Asia is created, and a Ubuntu DSVM
with Standard D2 v2 size is created.
```{r, eval=FALSE}
dsvm_location <- "southeastasia"
dsvm_rg <- paste0("rg", paste(sample(letters, 3), collapse=""))
dsvm_size <- "Standard_D2_v2"
dsvm_os <- "Ubuntu"
dsvm_name <- paste0("dsvm",
paste(sample(letters, 3), collapse=""))
dsvm_authen <- "Password"
dsvm_password <- "Not$ecure123"
dsvm_username <- "dsvmuser"
```
After that, the resource group can be created.
```{r, eval=FALSE}
# create resource group.
azureCreateResourceGroup(asc,
location=dsvm_location,
resourceGroup=dsvm_rg)
```
In the resource group, the DSVM with the above specifications is created.
```{r, eval=FALSE}
# deploy a DSVM.
deployDSVM(asc,
resource.group=dsvm_rg,
location=dsvm_location,
hostname=dsvm_name,
username=dsvm_username,
size=dsvm_size,
os=dsvm_os,
authen=dsvm_authen,
password=dsvm_password,
mode="Sync")
```
#### Adding extension to DSVM
Some R packages (e.g., `caretEnsemble`) used in the accelerator are not
pre-installed on a freshly deployed Linux DSVM. These packages can be installed
post-deployment with [Azure VM Extensions](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/extensions-features), which are also available in `AzureDSVM`.
Basically, the Azure Extensions function runs a remotely located script on the
target VM. In this case, the script, named `script.sh`, is a Linux shell script
that installs the R packages that are needed but missing on the DSVM.
The following R code adds the extension to the deployed DSVM.
```{r, eval=FALSE}
# add extension to the deployed DSVM.
# NOTE extension is installed as root.
dsvm_command <- "sudo sh script.sh"
dsvm_fileurl <- "https://raw.githubusercontent.com/Microsoft/acceleratoRs/master/EmployeeAttritionPrediction/Code/script.sh"
addExtensionDSVM(asc,
location=dsvm_location,
resource.group=dsvm_rg,
hostname=dsvm_name,
os=dsvm_os,
fileurl=dsvm_fileurl,
command=dsvm_command)
```
Once the experiment with the accelerator is finished, deallocate the DSVM by
stopping it so that there is no charge for the machine,
```{r, eval=FALSE}
# Stop the DSVM if it is not needed.
operateDSVM(asc,
resource.group=dsvm_rg,
hostname=dsvm_name,
operation="Stop")
```
or destroy the whole resource group if the instances are not needed.
```{r, eval=FALSE}
# Resource group can be removed if the resources are no longer needed.
azureDeleteResourceGroup(asc, resourceGroup=dsvm_rg)
```
#### Remote access to DSVM
The DSVM can be accessed via several approaches:
* Remote desktop. An [X2Go](https://wiki.x2go.org/doku.php) server is
pre-configured on a DSVM, so one can use an X2Go client to log onto the machine
and use it as a remote desktop.
* RStudio Server. RStudio Server is installed and configured, but not started,
on a Linux DSVM. Starting RStudio Server is embedded in the DSVM extension, so
after running the extension code above, one can access the VM via RStudio Server
("http://<dsvm_name>.<dsvm_location>.cloudapp.azure.com:8787"). The user name
and password used in creating the DSVM can be used for log-in.
* Jupyter notebook. Similar to RStudio Server, R users can also work on the DSVM
within a Jupyter notebook environment. The remote JupyterHub can be accessed at
"https://<dsvm_name>.<dsvm_location>.cloudapp.azure.com:8000". To
enable an R environment, select the R kernel when creating a new notebook.
The accelerator is provided in both `.md` and `.ipynb` formats for convenient
use in the RStudio and Jupyter notebook environments, respectively.
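For reference, these service URLs follow directly from the host name and location chosen at deployment time. The sketch below composes them in shell; `dsvmabc` and `southeastasia` are hypothetical stand-ins for the values generated earlier.
```shell
# Compose the DSVM access URLs from the deployment parameters.
# "dsvmabc" and "southeastasia" are hypothetical example values.
dsvm_name="dsvmabc"
dsvm_location="southeastasia"
fqdn="${dsvm_name}.${dsvm_location}.cloudapp.azure.com"

echo "RStudio Server: http://${fqdn}:8787"
echo "JupyterHub:     https://${fqdn}:8000"
```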
## Service deployment
This section shows how to consume the data analytics in the accelerator as
web-based Shiny applications.
### Deployment of R application
It is usually desirable to deploy R analytics as applications. This allows
data scientists and end users who do not use R to consume the pre-trained model
or analytical results. For instance, the model created in the employee attrition
accelerator can be consumed by end users for either statistical analysis on raw
data or real-time attrition prediction.
#### Ways of deployment
There are various ways of deploying R analytics.
* Deployment as an API. Deploying an API benefits downstream developers, who can
consume the data analytics in other applications. It is flexible and efficient.
R packages such as `AzureML` and `mrsdeploy` allow deployment of R code as an
Azure Machine Learning Studio web service and as a web service hosted on a machine
where Microsoft R Server is installed and configured, respectively. Other
packages such as `plumber` also allow publishing R code on a local host as
a web service.
* Deployment as a GUI application. [R Shiny](https://shiny.rstudio.com/) is the most popular framework
for publishing R code as a GUI-based application. The application can also be
made publicly accessible by hosting it on Shiny Server (the open-source edition
is free; Shiny Server Pro is not). The Shiny framework provides a rich set of
functions to define the UI and server logic for static, responsive, and
graphical interactions with the application.
* Deployment as a container. [Docker](https://www.docker.com/) containers have become increasingly popular along
with the proliferation of microservice architectures. The benefit of running
containers as services is that the different services can be easily modularized
and maintained. For a data analytical or artificial intelligence solution,
models for different purposes can be trained and deployed into different
containers wherever needed.
The following sub-sections show how to create Shiny applications
for the accelerator and then containerize them.
#### Shiny + Docker container
An R Shiny application can be run on either a local host or a server where
Shiny Server is installed.
There is also a [Shiny Server Docker image](https://hub.docker.com/r/rocker/shiny/) available, which makes it easy
to containerize Shiny applications. The Dockerfile for the Shiny Server image is
based on the `r-base` image and is shown as follows.
```
FROM r-base:latest
MAINTAINER Winston Chang "winston@rstudio.com"
# Install dependencies and Download and install shiny server
RUN apt-get update && apt-get install -y -t unstable \
sudo \
gdebi-core \
pandoc \
pandoc-citeproc \
libcurl4-gnutls-dev \
libcairo2-dev/unstable \
libxt-dev && \
wget --no-verbose https://s3.amazonaws.com/rstudio-shiny-server-os-build/ubuntu-12.04/x86_64/VERSION -O "version.txt" && \
VERSION=$(cat version.txt) && \
wget --no-verbose "https://s3.amazonaws.com/rstudio-shiny-server-os-build/ubuntu-12.04/x86_64/shiny-server-$VERSION-amd64.deb" -O ss-latest.deb && \
gdebi -n ss-latest.deb && \
rm -f version.txt ss-latest.deb && \
R -e "install.packages(c('shiny', 'rmarkdown'), repos='https://cran.rstudio.com/')" && \
cp -R /usr/local/lib/R/site-library/shiny/examples/* /srv/shiny-server/ && \
rm -rf /var/lib/apt/lists/*
EXPOSE 3838
COPY shiny-server.sh /usr/bin/shiny-server.sh
CMD ["/usr/bin/shiny-server.sh"]
```
A Docker image can be built from the Dockerfile with
```
docker build -t <image_name> <path_to_the_dockerfile>
```
and run with
```
docker run --rm -p 3838:3838 <image_name>
```
The Shiny application can then be accessed in a web browser at "http://localhost:3838" (if it is run on the local machine) or "http://<ip_address_of_shiny_server>:3838".
### Container orchestration
When more than one application or service is needed in the whole
pipeline, orchestration of multiple containers becomes useful.
There are multiple ways of orchestrating containers; the three most
representative approaches are [Kubernetes](https://kubernetes.io/), [Docker Swarm](https://docs.docker.com/engine/swarm/), and [DC/OS](https://dcos.io/).
A comparison of these orchestration methods is beyond the scope of this
tutorial. The following sections show how to deploy multiple
Shiny applications on a Kubernetes cluster.
#### Azure Container Service
[Azure Container Service](https://azure.microsoft.com/en-us/services/container-service/) is a cloud-based service on Azure which simplifies the configuration
for orchestrating containers with various orchestration methods such as
Kubernetes, Docker Swarm, and DC/OS. Azure Container Service offers optimized
configurations of these orchestration tools and technologies for Azure. When
deploying an orchestration cluster, the VM size, number of hosts, etc., can be
set to balance scalability, load capacity, and cost efficiency.
#### Deployment of multiple Shiny applications with Azure Container Service
The following illustrates how to deploy two Shiny applications, derived from
the employee attrition prediction accelerator, with Azure Container Service.
While a real-world application may require a more sophisticated architecture,
the demonstration here merely exhibits how to set up the environment.
The two Shiny applications are for (simple) data exploration and model creation,
respectively. The two applications are built on top of two individual images.
Both obtain data from an Azure Storage blob, where the data is persistently
preserved. This reflects the real-world scenario where R-user data scientists and
data analysts work within the same infrastructure while their tasks remain
loosely de-coupled.
The whole architecture is depicted as follows.
##### Step 1 - Create Docker images
Both of the images are created based on the rocker/shiny image.
* Data exploration image
```
FROM r-base:latest
MAINTAINER Le Zhang "zhle@microsoft.com"
RUN apt-get update && apt-get install -y -t unstable \
sudo \
gdebi-core \
pandoc \
pandoc-citeproc \
libcurl4-gnutls-dev \
libcairo2-dev/unstable \
libxt-dev \
libssl-dev
# Download and install shiny server
RUN wget --no-verbose https://s3.amazonaws.com/rstudio-shiny-server-os-build/ubuntu-12.04/x86_64/VERSION -O "version.txt" && \
VERSION=$(cat version.txt) && \
wget --no-verbose "https://s3.amazonaws.com/rstudio-shiny-server-os-build/ubuntu-12.04/x86_64/shiny-server-$VERSION-amd64.deb" -O ss-latest.deb && \
gdebi -n ss-latest.deb && \
rm -f version.txt ss-latest.deb
RUN R -e "install.packages(c('shiny', 'ggplot2', 'dplyr', 'magrittr', 'markdown'), repos='http://cran.rstudio.com/')"
COPY shiny-server.conf /etc/shiny-server/shiny-server.conf
COPY /myapp /srv/shiny-server/
EXPOSE 3838
COPY shiny-server.sh /usr/bin/shiny-server.sh
RUN chmod +x /usr/bin/shiny-server.sh
CMD ["/usr/bin/shiny-server.sh"]
```
* Model creation image
```
FROM r-base:latest
MAINTAINER Le Zhang "zhle@microsoft.com"
RUN apt-get update && apt-get install -y -t unstable \
sudo \
gdebi-core \
pandoc \
pandoc-citeproc \
libcurl4-gnutls-dev \
libcairo2-dev/unstable \
libxt-dev \
libssl-dev
# Download and install shiny server
RUN wget --no-verbose https://s3.amazonaws.com/rstudio-shiny-server-os-build/ubuntu-12.04/x86_64/VERSION -O "version.txt" && \
VERSION=$(cat version.txt) && \
wget --no-verbose "https://s3.amazonaws.com/rstudio-shiny-server-os-build/ubuntu-12.04/x86_64/shiny-server-$VERSION-amd64.deb" -O ss-latest.deb && \
gdebi -n ss-latest.deb && \
rm -f version.txt ss-latest.deb
RUN R -e "install.packages(c('shiny', 'ggplot2', 'dplyr', 'magrittr', 'caret', 'caretEnsemble', 'kernlab', 'randomForest', 'xgboost', 'DT'), repos='http://cran.rstudio.com/')"
COPY shiny-server.conf /etc/shiny-server/shiny-server.conf
COPY /myapp /srv/shiny-server/
EXPOSE 3838
COPY shiny-server.sh /usr/bin/shiny-server.sh
RUN chmod +x /usr/bin/shiny-server.sh
# Download pre-trained model
RUN wget --no-verbose https://zhledata.blob.core.windows.net/employee/model.RData -O "/srv/shiny-server/model.RData"
CMD ["/usr/bin/shiny-server.sh"]
```
All of the layers are the same as those in the original rocker/shiny image, except for the installation of additional R packages and their required
runtime libraries (e.g., caretEnsemble, xgboost, etc.).
The Docker images can be built in the same way as the rocker/shiny image. After the images
are built, they can be pushed to a public repository such as [Dockerhub](https://hub.docker.com/) or a private repository on [Azure Container Registry](https://azure.microsoft.com/en-us/services/container-registry/).
The following shows how to do that with Dockerhub.
1. Build the image.
```
docker build -t <name_of_image> <path_to_dockerfile>
```
2. Tag the image.
```
docker tag <name_of_image> <dockerhub_account_name>/<name_of_repo>
```
3. Login with Dockerhub.
```
docker login
```
4. Push image onto Dockerhub repository.
```
docker push <dockerhub_account_name>/<name_of_repo>
```
In this case, both images are pushed to Dockerhub.
##### Step 2 - Create Azure Container Service
Creation of an Azure Container Service can be achieved with either the Azure
portal or the Azure Command-Line Interface (CLI).
The following shows how to create a Kubernetes-type orchestrator in a specified
resource group with the Azure CLI (instructions for installing the Azure CLI can be found [here](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest)).
1. Login with Azure subscription.
```
az login
```
2. Create a resource group where the Azure Container Service cluster resides.
```
az group create --name=<resource_group> --location=<location>
```
3. Create an Azure Container Service with a Kubernetes orchestrator. The
cluster is made of one master node and two agent nodes. The name of the
cluster, the DNS prefix, and the authentication key can also be specified as
needed.
```
az acs create --orchestrator-type=kubernetes --resource-group <resource_group> --name=<cluster_name> --dns-prefix=<dns_prefix> --ssh-key-value ~/.ssh/id_rsa.pub --admin-username=<user_name> --master-count=1 --agent-count=2 --agent-vm-size=<vm_size>
```
##### Step 3 - Deploy Shiny applications on the Azure Container Service
The status of the Azure Container Service deployment can be checked in the Azure
portal. Once it has completed successfully, the resources will be listed in the
resource group.
In this tutorial, two Shiny applications are hosted on the cluster. For
simplicity, the two applications do not depend on each other,
so they are deployed independently and exposed as individual services.
The deployment is done with the [Kubernetes command line tool](https://kubernetes.io/docs/tasks/tools/install-kubectl/) (kubectl), which can be installed on the local machine.
kubectl must be configured properly in order to communicate with the remote
Kubernetes cluster. This can be done by copying the `config` file located in
`~/.kube` on the master node of the Kubernetes cluster to `~/.kube/` on the local
machine.
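As a sketch, the copy can be done with `scp`; the angle-bracket values are placeholders for the admin user name and the master node's fully qualified domain name, which can be read from the Azure portal. For ACS clusters, the Azure CLI also provides a shortcut that fetches the credentials.
```
# Copy the Kubernetes config from the cluster master to the local machine.
scp <user_name>@<master_fqdn>:~/.kube/config ~/.kube/config

# Alternatively, let the Azure CLI fetch the credentials (ACS clusters):
az acs kubernetes get-credentials --resource-group=<resource_group> --name=<cluster_name>
```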
Each of the two applications can be deployed individually as follows.
```
kubectl run <name_of_deployment> --image <dockerhub_account_name>/<name_of_repo> \
--port=3838 --replicas=3
```
The deployment can be exposed as web-based service by the following command:
```
kubectl expose deployments <name_of_deployment> --port=3838 --type=LoadBalancer
```
Status of the deployment and service exposure can be monitored by
```
kubectl get deployments
```
and
```
kubectl get services
```
respectively.
The deployment and exposure of the services can be put together into a yaml file
for operational convenience.
```
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: <name_of_model_app>
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: <name_of_model_app>
    spec:
      containers:
      - name: <name_of_model_app>
        image: <dockerhub_account_name>/<name_of_model_app>
        ports:
        - containerPort: 3838
        resources:
          requests:
            cpu: 250m
          limits:
            cpu: 500m
---
apiVersion: v1
kind: Service
metadata:
  name: <name_of_model_app>
spec:
  type: LoadBalancer
  ports:
  - port: 3838
  selector:
    app: <name_of_model_app>
---
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: <name_of_data_app>
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: <name_of_data_app>
    spec:
      containers:
      - name: <name_of_data_app>
        image: <dockerhub_account_name>/<name_of_data_app>
        ports:
        - containerPort: 3030
        resources:
          requests:
            cpu: 250m
          limits:
            cpu: 500m
---
apiVersion: v1
kind: Service
metadata:
  name: <name_of_data_app>
spec:
  type: LoadBalancer
  ports:
  - port: 3030
  selector:
    app: <name_of_data_app>
```
The deployment and service can then be created simply by
```
kubectl create -f <path_to_the_yaml_file>
```
##### Step 4 - Test the deployed Shiny applications
Once the deployment is finished, the public IP address and port number of each
exposed service can be checked with `kubectl get service --watch`. While the
deployment is in progress, the external IP addresses of the exposed services show
"<pending>". It usually takes a while to finish, depending on the size of the
image and the capability of the cluster.
The deployed Shiny application services can be accessed from a web browser via
the public IP address with the corresponding port number.
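As a small sketch of reading the address off that output, the external IP and port of a service can be extracted with standard shell tools. The table below is hypothetical example output, not real cluster state:
```shell
# Hypothetical `kubectl get services` output, captured for illustration.
services='NAME      TYPE           CLUSTER-IP   EXTERNAL-IP    PORT(S)          AGE
hrdata    LoadBalancer   10.0.12.34   52.187.10.20   3030:31234/TCP   5m
hrmodel   LoadBalancer   10.0.56.78   52.187.10.21   3838:31567/TCP   5m'

# Extract the external IP and service port of the hrdata service.
ip=$(echo "$services" | awk '$1 == "hrdata" {print $4}')
port=$(echo "$services" | awk '$1 == "hrdata" {split($5, p, ":"); print p[1]}')

echo "http://${ip}:${port}"   # -> http://52.187.10.20:3030
```
In practice, replace the captured string with the live output of `kubectl get services`.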
The following snapshots show the deployed Shiny apps.
Readers can find the Dockerfiles as well as the Shiny R code in the directories.
Images built from them are pre-published on Dockerhub - `yueguoguo/hrdata`
and `yueguoguo/hrmodel`, corresponding to the data exploration application and
the model creation application, respectively. These images are ready for testing
on a deployed Kubernetes-based Azure Container Service cluster.

File differences are hidden because one or more lines are too long

View file

@ -0,0 +1,65 @@
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: hrmodel
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: hrmodel
    spec:
      containers:
      - name: hrmodel
        image: yueguoguo/hrmodel
        ports:
        - containerPort: 3838
        resources:
          requests:
            cpu: 250m
          limits:
            cpu: 500m
---
apiVersion: v1
kind: Service
metadata:
  name: hrmodel
spec:
  type: LoadBalancer
  ports:
  - port: 3838
  selector:
    app: hrmodel
---
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: hrdata
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: hrdata
    spec:
      containers:
      - name: hrdata
        image: yueguoguo/hrdata
        ports:
        - containerPort: 3030
        resources:
          requests:
            cpu: 250m
          limits:
            cpu: 500m
---
apiVersion: v1
kind: Service
metadata:
  name: hrdata
spec:
  type: LoadBalancer
  ports:
  - port: 3030
  selector:
    app: hrdata

View file

@ -0,0 +1,34 @@
FROM r-base:latest
MAINTAINER Le Zhang "zhle@microsoft.com"
RUN apt-get update && apt-get install -y -t unstable \
sudo \
gdebi-core \
pandoc \
pandoc-citeproc \
libcurl4-gnutls-dev \
libcairo2-dev/unstable \
libxt-dev \
libssl-dev
# Download and install shiny server
RUN wget --no-verbose https://s3.amazonaws.com/rstudio-shiny-server-os-build/ubuntu-12.04/x86_64/VERSION -O "version.txt" && \
VERSION=$(cat version.txt) && \
wget --no-verbose "https://s3.amazonaws.com/rstudio-shiny-server-os-build/ubuntu-12.04/x86_64/shiny-server-$VERSION-amd64.deb" -O ss-latest.deb && \
gdebi -n ss-latest.deb && \
rm -f version.txt ss-latest.deb
RUN R -e "install.packages(c('shiny', 'ggplot2', 'dplyr', 'magrittr', 'markdown', 'DT', 'scales'), repos='http://cran.rstudio.com/')"
COPY shiny-server.conf /etc/shiny-server/shiny-server.conf
COPY /myapp /srv/shiny-server/
EXPOSE 3030
COPY shiny-server.sh /usr/bin/shiny-server.sh
RUN chmod +x /usr/bin/shiny-server.sh
CMD ["/usr/bin/shiny-server.sh"]

View file

@ -0,0 +1,30 @@
---
title: "about"
author: "Le Zhang"
date: "August 24, 2017"
output: html_document
---
### Employee Attrition Prediction
This is a demonstration of a case study of employee attrition prediction.
A data science and machine learning development process often consists of multiple
steps. Containerizing each step helps modularize the whole process and thus
makes DevOps easier.
For simplicity, the demo process is composed of just two steps:
data exploration and model creation.
This web-based app shows how to do simple graphical data exploration on
the HR data set.
#### R accelerator
The end-to-end tutorial of the R based template for data processing, model
training, etc. (we call it "acceleratoR") can be found [here](https://github.com/Microsoft/acceleratoRs/blob/master/EmployeeAttritionPrediction).
#### Operationalization
Operationalization of the case on Azure cloud (i.e., data exploration, model creation,
model management, model deployment, etc.) with [Azure Data Science VM](https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-provision-vm),
[Azure Storage](https://azure.microsoft.com/en-us/services/storage/), [Azure Container Service](https://azure.microsoft.com/en-us/services/container-service/), etc., can be found [here](https://github.com/Microsoft/acceleratoRs/blob/master/EmployeeAttritionPrediction).

View file

@ -0,0 +1,32 @@
# ------------------------------------------------------------------------------
# R packages needed for the analytics.
# ------------------------------------------------------------------------------
library(shiny)
library(dplyr)
library(magrittr)
library(ggplot2)
library(markdown)
library(scales)
# ------------------------------------------------------------------------------
# Global variables.
# ------------------------------------------------------------------------------
data_url <- "https://zhledata.blob.core.windows.net/employee/DataSet1.csv"
# ------------------------------------------------------------------------------
# Functions.
# ------------------------------------------------------------------------------
# Load HR demographic data.
loadData <- function() {
df <- read.csv(data_url)
return(df)
}
# Load HR data and pre-trained model.
df_hr <- loadData()

View file

@ -0,0 +1,103 @@
source("global.R")
# The actual shiny server function.
shinyServer(function(input, output) {
# Plot a table of the HR data.
output$hrtable <- DT::renderDataTable({
DT::datatable(df_hr[, input$show_vars, drop=FALSE])
})
# Downloadable csv of selected dataset.
output$downloadData <- downloadHandler(
filename = function() {
paste(input$dataset, ".csv", sep = "")
},
content = function(file) {
write.csv(df_hr, file, row.names = FALSE)
}
)
# Plot some general summary statistics for those who are predicted attrition.
output$plot3 <- renderPlot({
if (identical(input$att_vars, "Yes")) {
df_hr %<>% filter(as.character(Attrition) == "Yes")
} else if (identical(input$att_vars, "No")) {
df_hr %<>% filter(as.character(Attrition) == "No")
} else if (identical(input$att_vars, c("Yes", "No"))) {
df_hr
} else {
df_hr <- df_hr[0, ]
}
df_hr <- filter(df_hr, JobRole %in% input$disc_vars)
ggplot(df_hr, aes(JobRole, fill=Attrition)) +
geom_bar(aes(y=(..count..)/sum(..count..)),
position="dodge",
alpha=0.6) +
scale_y_continuous(labels=percent) +
xlab(input$disc_vars) +
ylab("Percentage") +
theme_bw() +
ggtitle(paste("Count for", input$disc_vars))
})
output$plot <- renderPlot({
if (identical(input$att_vars, "Yes")) {
df_hr %<>% filter(as.character(Attrition) == "Yes")
} else if (identical(input$att_vars, "No")) {
df_hr %<>% filter(as.character(Attrition) == "No")
} else if (identical(input$att_vars, c("Yes", "No"))) {
df_hr
} else {
df_hr <- df_hr[0, ]
}
df_hr_final <- select(df_hr, one_of("Attrition", input$plot_vars))
ggplot(df_hr_final,
aes_string(input$plot_vars,
color="Attrition",
fill="Attrition")) +
geom_density(alpha=0.2) +
theme_bw() +
xlab(input$plot_vars) +
ylab("Density") +
ggtitle(paste("Estimated density for", input$plot_vars))
})
# Monthly income, service year, etc.
output$plot2 <- renderPlot({
if (identical(input$att_vars, "Yes")) {
df_hr %<>% filter(as.character(Attrition) == "Yes")
} else if (identical(input$att_vars, "No")) {
df_hr %<>% filter(as.character(Attrition) == "No")
} else if (identical(input$att_vars, c("Yes", "No"))) {
df_hr
} else {
df_hr <- df_hr[0, ]
}
df_hr <- filter(df_hr,
YearsAtCompany >= input$years_service[1] &
YearsAtCompany <= input$years_service[2] &
JobLevel < input$job_level &
JobRole %in% input$job_roles)
ggplot(df_hr,
aes(x=factor(JobRole), y=MonthlyIncome, color=factor(Attrition))) +
geom_boxplot() +
xlab("Job Role") +
ylab("Monthly income") +
scale_fill_discrete(guide=guide_legend(title="Attrition")) +
theme_bw() +
theme(text=element_text(size=13), legend.position="top")
})
})

View file

@ -0,0 +1,109 @@
source("global.R")
navbarPage(
"HR Analytics - data exploration",
tabPanel(
"About",
fluidRow(
column(3, includeMarkdown("about.md")),
column(
6,
img(class="img-polaroid",
src=paste0("https://careers.microsoft.com/content/images/services/HomePage_Hero1_Tim.jpg"))
)
)
),
tabPanel(
"Data",
sidebarLayout(
sidebarPanel(
# Variables to select for displayed demographic data.
checkboxGroupInput(
"show_vars",
"Columns in HR data set to show:",
names(df_hr),
selected=names(df_hr)
),
# Button
downloadButton("downloadData", "Download")
),
mainPanel(
tabsetPanel(
id="dataset",
tabPanel("HR Demographic data", DT::dataTableOutput("hrtable"))
)
)
)
),
tabPanel(
"Plot",
h4("Select employees of attrition or non-attrition to visualize."),
checkboxGroupInput(
"att_vars",
"Attrition or not:",
c("Yes", "No"),
selected=c("Yes", "No")),
fluidRow(
column(
4,
h4("Count of discrete variable."),
plotOutput("plot3"),
checkboxGroupInput(
"disc_vars",
"Job roles:",
unique(df_hr$JobRole),
selected=unique(df_hr$JobRole)[1:5])
),
column(
4,
h4("Distribution of continuous variable."),
plotOutput("plot"),
selectInput(
"plot_vars",
"Variable to visualize:",
names(select_if(df_hr, is.integer)),
selected=names(select_if(df_hr, is.integer)))
),
column(
4,
h4("Comparison on certain factors."),
plotOutput("plot2"),
# Years of service.
sliderInput(
"years_service",
"Years of service:",
min=1,
max=40,
value=c(2, 5)),
# Job level.
sliderInput(
"job_level",
"Job level:",
min=1,
max=5,
value=3
),
checkboxGroupInput(
"job_roles",
"Job roles:",
unique(df_hr$JobRole),
selected=unique(df_hr$JobRole)[1:5])
)
)
)
)

View file

@ -0,0 +1,26 @@
# Define the user we should use when spawning R Shiny processes
run_as shiny;
# Show errors in the browser instead of sanitizing them (useful when
# debugging inside the Docker container).
sanitize_errors off;
# Define a top-level server which will listen on a port
server {
# Instruct this server to listen on port 3030
listen 3030;
# Define the location available at the base URL
location / {
# Run this location in 'site_dir' mode, which hosts the entire directory
# tree at '/srv/shiny-server'
site_dir /srv/shiny-server;
# Define where we should put the log files for this location
log_dir /var/log/shiny-server;
# Should we list the contents of a (non-Shiny-App) directory when the user
# visits the corresponding URL?
directory_index on;
}
}

View file

@ -0,0 +1,7 @@
#!/bin/sh
# Make sure the directory for individual app logs exists
mkdir -p /var/log/shiny-server
chown shiny.shiny /var/log/shiny-server
exec shiny-server >> /var/log/shiny-server.log 2>&1

View file

@ -0,0 +1,37 @@
FROM r-base:latest
MAINTAINER Le Zhang "zhle@microsoft.com"
RUN apt-get update && apt-get install -y -t unstable \
sudo \
gdebi-core \
pandoc \
pandoc-citeproc \
libcurl4-gnutls-dev \
libcairo2-dev/unstable \
libxt-dev \
libssl-dev \
libxml2-dev
# Download and install shiny server
RUN wget --no-verbose https://s3.amazonaws.com/rstudio-shiny-server-os-build/ubuntu-12.04/x86_64/VERSION -O "version.txt" && \
VERSION=$(cat version.txt) && \
wget --no-verbose "https://s3.amazonaws.com/rstudio-shiny-server-os-build/ubuntu-12.04/x86_64/shiny-server-$VERSION-amd64.deb" -O ss-latest.deb && \
gdebi -n ss-latest.deb && \
rm -f version.txt ss-latest.deb
RUN R -e "install.packages(c('shiny', 'ggplot2', 'dplyr', 'magrittr', 'caret', 'caretEnsemble', 'kernlab', 'randomForest', 'xgboost', 'DT', 'DMwR', 'markdown', 'mlbench', 'devtools', 'XML', 'gridSVG', 'pROC', 'plotROC', 'scales'), repos='http://cran.rstudio.com/')"
RUN R -e "library(devtools);devtools::install_github('sachsmc/plotROC')"
COPY shiny-server.conf /etc/shiny-server/shiny-server.conf
COPY /myapp /srv/shiny-server/
EXPOSE 3838
COPY shiny-server.sh /usr/bin/shiny-server.sh
RUN chmod +x /usr/bin/shiny-server.sh
CMD ["/usr/bin/shiny-server.sh"]


@@ -0,0 +1,32 @@
---
title: "about"
author: "Le Zhang"
date: "August 24, 2017"
output: html_document
---
### Employee Attrition Prediction
This is a demonstration of a case study on employee attrition prediction.
A data science and machine learning development process often consists of
multiple steps. Containerizing each step helps modularize the whole process,
which in turn makes DevOps easier.
For simplicity, the demo process is composed of just two steps: data
exploration and model creation.
This web-based app shows how to create a model on the data. The candidate
training algorithms include Support Vector Machine (SVM), Random Forest, and
Extreme Gradient Boosting (XGBoost). For illustration purposes, only a few
high-level parameters can be set.
#### R accelerator
The end-to-end tutorial of the R based template for data processing, model
training, etc. (we call it "acceleratoR") can be found [here](https://github.com/Microsoft/acceleratoRs/blob/master/EmployeeAttritionPrediction).
#### Operationalization
Operationalization of the case on Azure cloud (i.e., data exploration, model
creation, model management, model deployment, etc.) with [Azure Data Science VM](https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-provision-vm),
[Azure Storage](https://azure.microsoft.com/en-us/services/storage/), [Azure Container Service](https://azure.microsoft.com/en-us/services/container-service/), etc., can be found [here](https://github.com/Microsoft/acceleratoRs/blob/master/EmployeeAttritionPrediction).
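The three candidate algorithms described above map to caret method strings. A minimal sketch of training them side by side, assuming the `caret` package is loaded and `df_train` is a prepared data frame with an `Attrition` label (both names are assumptions for illustration):

```r
library(caret)

# Map of UI algorithm names to caret method identifiers.
methods <- c("SVM"           = "svmRadial",
             "Random Forest" = "rf",
             "XGBoost"       = "xgbLinear")

# Train one model per algorithm with a shared train control;
# df_train is assumed to hold pre-processed HR data.
models <- lapply(methods, function(m) {
  train(Attrition ~ .,
        data      = df_train,
        method    = m,
        trControl = trainControl(method = "cv", number = 3))
})
```

The resulting list of models can then be compared with `resamples(models)`.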


@@ -0,0 +1,142 @@
# ------------------------------------------------------------------------------
# R packages needed for the analytics.
# ------------------------------------------------------------------------------
library(caret)
library(caretEnsemble)
library(DMwR)
library(dplyr)
library(ggplot2)
library(markdown)
library(magrittr)
library(mlbench)
library(pROC)
library(plotROC)
library(shiny)
# ------------------------------------------------------------------------------
# Global variables.
# ------------------------------------------------------------------------------
data_url <- "https://zhledata.blob.core.windows.net/employee/DataSet1.csv"
# ------------------------------------------------------------------------------
# Functions.
# ------------------------------------------------------------------------------
# Load HR demographic data.
loadData <- function() {
df <- read.csv(data_url)
return(df)
}
# Process data - the same data processing steps apply on the data.
processData <- function(data) {
# 1. Remove zero-variance variables.
pred_no_var <- c("EmployeeCount", "StandardHours")
data %<>% select(-one_of(pred_no_var))
# 2. Convert Integer to Factor type of data.
int_2_ftr_vars <- c("Education",
"EnvironmentSatisfaction",
"JobInvolvement",
"JobLevel",
"JobSatisfaction",
"NumCompaniesWorked",
"PerformanceRating",
"RelationshipSatisfaction",
"StockOptionLevel")
data[, int_2_ftr_vars] <- lapply((data[, int_2_ftr_vars]), as.factor)
# 3. Keep the most salient variables.
least_important_vars <- c("Department", "Gender", "PerformanceRating")
data %<>% select(-one_of(least_important_vars))
return(data)
}
# Data split.
splitData <- function(data, ratio) {
if (!("Attrition" %in% names(data)))
stop("No label found in data set.")
train_index <-
createDataPartition(data$Attrition,
times=1,
p=ratio / 100) %>%
unlist()
data_train <- data[train_index, ]
data_test <- data[-train_index, ]
data_split <- list(train=data_train, test=data_test)
return(data_split)
}
# Model training.
trainModel <- function(data,
smote_over,
smote_under,
method="boot",
number=3,
repeats=3,
search="grid",
algorithm="rf") {
# If the training set is imbalanced, SMOTE will be applied.
data %<>% as.data.frame()
data <- SMOTE(Attrition ~ .,
data,
perc.over=smote_over,
perc.under=smote_under)
# Train control.
tc <- trainControl(method=method,
number=number,
repeats=repeats,
search=search,
classProbs=TRUE,
savePredictions="final",
summaryFunction=twoClassSummary)
# Model training.
model <- train(Attrition ~ .,
data,
method=algorithm,
trControl=tc)
return(model)
}
# Function for predicting attrition based on demographic data.
inference <- function(model, data) {
if ("Attrition" %in% names(data)) {
data %<>% select(-Attrition)
}
labels <- predict(model, newdata=data, type="prob")
return(labels)
}
# Load and pre-process HR data.
df_hr <-
loadData() %>%
processData()
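A minimal end-to-end sketch of the helper functions above, assuming the packages loaded in global.R are installed and `df_hr` has been built as shown:

```r
# Split the pre-processed HR data 70/30 into training and testing sets.
splits <- splitData(df_hr, ratio = 70)

# Balance the training set with SMOTE and fit a random forest.
model <- trainModel(splits$train,
                    smote_over  = 300,
                    smote_under = 150,
                    algorithm   = "rf")

# Class probabilities for the hold-out set; the "Yes" column is the
# predicted attrition probability.
probs <- inference(model, splits$test)
head(probs)
```

This mirrors the flow the Shiny server follows when the Train button is clicked.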


@@ -0,0 +1,127 @@
source("global.R")
# The actual shiny server function.
shinyServer(function(input, output) {
# Training and testing data.
dataSplit <- reactive({
df <- splitData(df_hr, input$ratio)
df
})
# Train a reactive model.
modelTrained <- eventReactive(input$goButton, {
df <- dataSplit()
df_train <- df$train
df_test <- df$test
if (input$algorithm == "SVM") {
method <- "svmRadial"
} else if (input$algorithm == "Random Forest") {
method <- "rf"
} else {
method <- "xgbLinear"
}
model <- trainModel(data=df_train,
smote_over=input$smoteOver,
smote_under=input$smoteDown,
method="boot",
number=input$number,
repeats=input$repeats,
search="grid",
algorithm=method)
model
})
# Print summary of data set.
output$summary <- renderPrint({
df <- dataSplit()
# str(df$train)
table(df$train$Attrition)
})
# Print table of training data set.
output$dataTrain <- DT::renderDataTable({
df <- dataSplit()
DT::datatable(df$train)
})
# Plot some general summary statistics for those who are predicted attrition.
output$plot <- renderPlot({
df <- dataSplit()
df_test <- df$test
# Train a model
model <- modelTrained()
# Use the model for inference on testing data.
results <- inference(model, data=df_test)
results <- mutate(results, label=df_test$Attrition)
# Plot the ROC curve.
basic_plot <-
ggplot(results,
aes(m=Yes, d=factor(label, levels=c("No", "Yes")))) +
geom_roc(n.cuts=0)
basic_plot +
style_roc(theme=theme_grey) +
theme(axis.text=element_text(colour="blue")) +
# annotate("text",
# x=.75,
# y=.25,
# label=paste("AUC =", round(calc_auc(basic_plot)$AUC, 2))) +
ggtitle("Plot of ROC curve") +
scale_x_continuous("1 - Specificity", breaks = seq(0, 1, by = .1))
})
output$auc <- renderPrint({
df <- dataSplit()
df_test <- df$test
# Train a model
model <- modelTrained()
# Use the model for inference on testing data.
results <- inference(model, data=df_test)
results <- mutate(results, label=df_test$Attrition)
basic_plot <-
ggplot(results,
aes(m=Yes, d=factor(label, levels=c("No", "Yes")))) +
geom_roc(n.cuts=0)
sprintf("AUC of the ROC curve is %f", round(calc_auc(basic_plot)$AUC, 2))
})
# # Export the trained model.
#
# output$downloadModel <- downloadHandler(
# filename = function() {
# paste(input$algorithm, "_model", ".rds", sep="")
# },
#
# content = function(file) {
# saveRDS(model, file)
# }
# )
})


@@ -0,0 +1,109 @@
source("global.R")
navbarPage(
"HR Analytics - model creation",
tabPanel(
"About",
fluidRow(
column(3, includeMarkdown("about.md")),
column(6, img(
class="img-polaroid",
src=paste0("https://careers.microsoft.com/content/images/services/HomePage_Hero1_Tim.jpg"))
)
)
),
tabPanel(
"Model",
sidebarLayout(
sidebarPanel(
# Split ratio for training/testing data.
sliderInput(inputId="ratio",
label="Split ratio (%) for training data.",
min=0,
max=100,
value=70),
# SMOTE upsampling percentage.
p("SMOTE is used for balancing the data set."),
numericInput(inputId="smoteOver",
label="Upsampling percentage in SMOTE for minority class.",
value=300),
# SMOTE downsampling percentage.
numericInput(inputId="smoteDown",
label="Downsampling percentage in SMOTE for majority class.",
value=150),
# Repeats in train control.
p("High-level control for cross-validation in training the model."),
numericInput(inputId="repeats",
label="Number of repeats for a k-fold cross-validation.",
min=1,
max=3,
value=1),
# Number of cross-validations in train control.
numericInput(inputId="number",
label="Number of folds in cross-validation.",
min=2,
max=5,
value=3),
# Algorithm for use.
selectInput(inputId="algorithm",
label="Machine learning algorithm to use for training a model:",
choices=c("SVM", "Random Forest", "XGBoost")),
# Train model
p("Click the button to train a model with the above settings (this may
take some time depending on the algorithm used for training). After
training, a ROC curve evaluating model performance on the testing
data set is plotted."),
actionButton("goButton", "Train")
# # Export model
#
# p("Export the trained model"),
#
# downloadButton("downloadModel",
# "Download")
),
mainPanel(
# Summary of the training data set.
p("It should be noted that the data set is not balanced, which may
negatively impact model training if no balancing technique is
applied."),
verbatimTextOutput("summary"),
# Print table of training data.
tabsetPanel(
id="dataset",
tabPanel("HR Demographic data for training",
DT::dataTableOutput("dataTrain"))
),
# Plot the model validation results.
plotOutput("plot"),
verbatimTextOutput("auc")
)
)
)
)
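With global.R, server.R, and ui.R in a single app directory, the dashboard can be run locally before containerizing it. A sketch, where the directory name `myapp` follows the Dockerfile's COPY step:

```r
library(shiny)

# Serve the app on all interfaces, on the same port the
# shiny-server.conf for this app listens on.
runApp("myapp", port = 3838, host = "0.0.0.0")
```

Running it this way makes it quicker to iterate on the UI than rebuilding the Docker image for each change.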


@@ -0,0 +1,26 @@
# Define the user we should use when spawning R Shiny processes
run_as shiny;
# Show actual error messages in the browser; useful when debugging in Docker.
sanitize_errors off;
# Define a top-level server which will listen on a port
server {
# Instruct this server to listen on port 3838.
listen 3838;
# Define the location available at the base URL
location / {
# Run this location in 'site_dir' mode, which hosts the entire directory
# tree at '/srv/shiny-server'
site_dir /srv/shiny-server;
# Define where we should put the log files for this location
log_dir /var/log/shiny-server;
# Should we list the contents of a (non-Shiny-App) directory when the user
# visits the corresponding URL?
directory_index on;
}
}


@@ -0,0 +1,7 @@
#!/bin/sh
# Make sure the directory for individual app logs exists
mkdir -p /var/log/shiny-server
chown shiny:shiny /var/log/shiny-server
exec shiny-server >> /var/log/shiny-server.log 2>&1


@@ -0,0 +1,21 @@
#!/bin/bash
# install R libraries.
sudo mkdir /etc/skel/R
sudo mkdir /etc/skel/R/lib
sudo Rscript -e 'library(devtools);library(withr);withr::with_libpaths(new="/etc/skel/R/lib/", install(c("DMwR", "caretEnsemble", "pROC", "jiebaR")));withr::with_libpaths(new="/etc/skel/R/lib/", install_url("https://github.com/yueguoguo/Azure-R-Interface/raw/master/utils/msLanguageR_0.1.0.tar.gz"))'
# Copy /etc/skel to home directory of all users.
USR=$(ls /home | grep user)
for u in ${USR}; do
DBASE="/home/$u/"
cp -rf /etc/skel/R ${DBASE}/
done
# Start the Rstudio Server
rstudio-server start

Binary file added: EmployeeAttritionPrediction/Docs/Misc/pics/about.png (1.1 MiB, not shown)

Binary file added: EmployeeAttritionPrediction/Docs/Misc/pics/datavisual.png (98 KiB, not shown)

Binary file added: EmployeeAttritionPrediction/Docs/Misc/pics/model.png (62 KiB, not shown)