Operationalization

This commit is contained in:
Parent: ae697a97ce
Commit: 067c0f0356
@@ -260,13 +260,13 @@ After balancing the training set, a model can be created for prediction. For com

 Three algorithms, support vector machine with radial basis function kernel, random forest, and extreme gradient boosting (xgboost), are used for model building.

 ```{r, echo=TRUE, message=FALSE, warning=FALSE}

 # initialize training control.

-tc <- trainControl(method="boot",
-                   number=3,
-                   repeats=3,
-                   search="grid",
-                   classProbs=TRUE,
-                   savePredictions="final",
-                   summaryFunction=twoClassSummary)
+tc <- trainControl(method="repeatedcv",
+                   number=3,
+                   repeats=1,
+                   search="random",
+                   summaryFunction=twoClassSummary,
+                   classProbs=TRUE,
+                   savePredictions=TRUE)

 # SVM model.
@@ -274,16 +274,16 @@ time_svm <- system.time(
 model_svm <- train(Attrition ~ .,
                    df_train,
                    method="svmRadial",
-                   trainControl=tc)
+                   trControl=tc)
 )

 # random forest model

 time_rf <- system.time(
 model_rf <- train(Attrition ~ .,
-                  df_train,
-                  method="rf",
-                  trainControl=tc)
+                  data=df_train,
+                  method="rf",
+                  trControl=tc)
 )

 # xgboost model.
@@ -292,7 +292,7 @@ time_xgb <- system.time(
 model_xgb <- train(Attrition ~ .,
                    df_train,
                    method="xgbLinear",
-                   trainControl=tc)
+                   trControl=tc)
 )
 ```

 2. Ensemble of models.
@@ -642,7 +642,7 @@ SVM with RBF kernel is used as an illustration.
 model_svm <- train(Attrition ~ .,
                    df_txt_train,
                    method="svmRadial",
-                   trainControl=tc)
+                   trControl=tc)
 ```

 ```{r}
 # model evaluation
File differences are hidden because one or more lines are too long
@@ -0,0 +1,591 @@
---
title: "Operationalization of Employee Attrition Prediction on Azure Cloud"
author: "Le Zhang, Data Scientist, Microsoft"
date: "August 19, 2017"
output: html_document
---

## Introduction

It is preferable to create AI applications hosted on the cloud, for the obvious
benefits of elasticity, agility, and flexibility in training models and deploying
services.

The tutorial in this markdown will demonstrate how to operationalize the
[Employee Attrition Prediction](https://github.com/Microsoft/acceleratoRs/tree/master/EmployeeAttritionPrediction)
accelerator on the Azure cloud and then deploy the model as well as the analytical
functions as web-based services.
## Data exploration and model training - Azure Data Science Virtual Machine

### Introduction

[Azure Data Science Virtual Machine (DSVM)](https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-provision-vm)
is a curated virtual machine image that is configured with a comprehensive set of
commonly used data analytical tools and software. The DSVM is a desirable workplace
for data scientists to quickly experiment with and prototype a data analytical idea.

The R packages [AzureSMR](https://github.com/Microsoft/AzureSMR) and [AzureDSVM](https://github.com/Azure/AzureDSVM)
simplify the use and operation of DSVMs. One can use functions of these
packages to easily create, stop, and destroy DSVMs in an Azure resource group. To
get started, simply complete the initial setup with an Azure subscription, as instructed
[here](http://htmlpreview.github.io/?https://github.com/Microsoft/AzureSMR/blob/master/inst/doc/Authentication.html).
### Set up a DSVM for employee attrition prediction

#### Pre-requisites

For this tutorial, an Ubuntu Linux DSVM is spun up for the experiment. Since
the analysis is performed on a relatively small data set, a medium-size VM is
sufficient. In this case, a Standard D2 v2 VM is used. It costs roughly 0.158 USD
per hour (more details about pricing can be found [here](https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/)).

The DSVM can be deployed by using the Azure portal, the Azure Command-Line
Interface, or the AzureDSVM R package from within an R session.

The following is the code for deploying a Linux DSVM of Standard D2 v2 size.

```{r, eval=FALSE}
# Load the R packages for resource management.

library(AzureSMR)
library(AzureDSVM)
```
To use the `AzureSMR` and `AzureDSVM` packages for operating Azure resources,
it is required to create and set up an Azure Active Directory App which is
authorized to consume Azure REST APIs. Details can be found in the AzureSMR package [vignette](https://github.com/Microsoft/AzureSMR/blob/master/vignettes/Authentication.Rmd).

After the proper setup, credentials such as the client ID, tenant ID, and secret key
can be obtained. It is suggested to put the credentials for authentication into a
config.json file located in the "~/.azuresmr" directory. The `read.AzureSMR.config`
function then reads the config JSON file into an R object. The credentials are used
to set an Azure context which is then used for authentication.
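A minimal config.json might look as follows; the values are placeholders to be replaced with your own credentials, and the field names match those consumed by `setAzureContext` below.

```json
{
    "tenantID": "<your_tenant_id>",
    "clientID": "<your_client_id>",
    "authKey": "<your_secret_key>"
}
```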
```{r, eval=FALSE}
settingsfile <- getOption("AzureSMR.config")
config <- read.AzureSMR.config()

asc <- createAzureContext()

setAzureContext(asc,
                tenantID=config$tenantID,
                clientID=config$clientID,
                authKey=config$authKey)
```

Authentication.

```{r, eval=FALSE}
azureAuthenticate(asc)
```
#### Deployment of DSVM

Specifications for deploying the DSVM are given as inputs to the deployment
function from `AzureDSVM`.

In this case, a resource group in Southeast Asia is created, and an Ubuntu DSVM
of Standard D2 v2 size is created in it.

```{r, eval=FALSE}
dsvm_location <- "southeastasia"
dsvm_rg       <- paste0("rg", paste(sample(letters, 3), collapse=""))

dsvm_size     <- "Standard_D2_v2"
dsvm_os       <- "Ubuntu"
dsvm_name     <- paste0("dsvm",
                        paste(sample(letters, 3), collapse=""))
dsvm_authen   <- "Password"
dsvm_password <- "Not$ecure123"
dsvm_username <- "dsvmuser"
```

After that, the resource group can be created.

```{r, eval=FALSE}
# create resource group.

azureCreateResourceGroup(asc,
                         location=dsvm_location,
                         resourceGroup=dsvm_rg)
```
In the resource group, the DSVM with the above specifications is created.

```{r, eval=FALSE}
# deploy a DSVM.

deployDSVM(asc,
           resource.group=dsvm_rg,
           location=dsvm_location,
           hostname=dsvm_name,
           username=dsvm_username,
           size=dsvm_size,
           os=dsvm_os,
           authen=dsvm_authen,
           password=dsvm_password,
           mode="Sync")
```
#### Adding an extension to the DSVM

Some R packages (e.g., `caretEnsemble`) used in the accelerator are not
pre-installed on a freshly deployed Linux DSVM. These packages can be installed
post-deployment with [Azure VM Extensions](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/extensions-features), which is also available in `AzureDSVM`.

Basically, the Azure extension function runs a remotely located script on the
target VM. In this case, the script, named `script.sh`, is a Linux shell script
that installs the R packages that are needed but missing on the DSVM.

The following R code adds the extension to the deployed DSVM.

```{r, eval=FALSE}
# add extension to the deployed DSVM.
# NOTE extension is installed as root.

dsvm_command <- "sudo sh script.sh"
dsvm_fileurl <- "https://raw.githubusercontent.com/Microsoft/acceleratoRs/master/EmployeeAttritionPrediction/Code/script.sh"

addExtensionDSVM(asc,
                 location=dsvm_location,
                 resource.group=dsvm_rg,
                 hostname=dsvm_name,
                 os=dsvm_os,
                 fileurl=dsvm_fileurl,
                 command=dsvm_command)
```
Once the experiment with the accelerator is finished, deallocate the DSVM by
stopping it so that there is no further charge for the machine,

```{r, eval=FALSE}
# Stop the DSVM if it is not needed.

operateDSVM(asc,
            resource.group=dsvm_rg,
            hostname=dsvm_name,
            operation="Stop")
```

or destroy the whole resource group if the instances are no longer needed.

```{r, eval=FALSE}
# Resource group can be removed if the resources are no longer needed.

azureDeleteResourceGroup(asc, resourceGroup=dsvm_rg)
```
#### Remote access to the DSVM

The DSVM can be accessed via several approaches:

* Remote desktop. An [X2Go](https://wiki.x2go.org/doku.php) server is
pre-configured on a DSVM, so one can use an X2Go client to log onto the machine
and use it as a remote desktop.
* RStudio Server. RStudio Server is installed and configured, but not started,
on a Linux DSVM. Starting RStudio Server is embedded in the DSVM extension, so
after running the extension code above, one can access the VM via RStudio Server
("http://<dsvm_name>.<dsvm_location>.cloudapp.azure.com:8787"). The user name
and password used in creating the DSVM can be used to log in.
* Jupyter notebook. Similar to RStudio Server, R users can also work on the DSVM
within a Jupyter notebook environment. The remote JupyterHub can be accessed via
the address "https://<dsvm_name>.<dsvm_location>.cloudapp.azure.com:8000". To
enable an R environment, select the R kernel when creating a new notebook.

The accelerator is provided in both `.md` and `.ipynb` formats for convenient
use in the RStudio and Jupyter notebook environments, respectively.
## Service deployment

This section shows how to consume the data analytics in the accelerator in web-based
Shiny applications.

### Deployment of an R application

It is usually desirable to deploy R analytics as applications. This allows
non-R-user data scientists to consume the pre-trained model or analytical results. For
instance, the model created in the employee attrition accelerator can be
consumed by end users for either statistical analysis on raw data or real-time
attrition prediction.

#### Ways of deployment

There are various ways of deploying R analytics.

* Deployment as an API. Deployment as an API benefits downstream developers who
consume the data analytics in other applications. It is flexible and efficient.
R packages such as `AzureML` and `mrsdeploy` allow deployment of R code onto
an Azure Machine Learning Studio web service and a web service hosted on a machine
where Microsoft R Server is installed and configured, respectively. Other
packages such as `plumber` also allow publishing R code on a local host as
a web service.
* Deployment as a GUI application. [R Shiny](https://shiny.rstudio.com/) is the most popular framework
for publishing R code as a GUI-based application. The application can also be
made publicly accessible if it is hosted on Shiny Server (not free). The Shiny
framework provides a rich set of functions to define the UI and server logic for
static, responsive, and graphical interactions with the application.
* Deployment as a container. [Docker](https://www.docker.com/) containers have become increasingly popular along
with the proliferation of the microservice architecture. The benefit of running
containers as services is that different services can be easily modularized and
maintained. For a data analytical or artificial intelligence solution,
models for different purposes can be trained and deployed into different
containers wherever needed.

The following sub-section will talk about how to create Shiny applications
for the accelerator and then containerize them.
#### Shiny + Docker container

An R Shiny application can be run on either a local host or a server where Shiny Server is
installed.

There is also a [Shiny Server Docker image](https://hub.docker.com/r/rocker/shiny/) available, which makes it easy
to containerize Shiny applications. The Dockerfile for the Shiny Server image is
built on top of the `r-base` image and is shown as follows.

```
FROM r-base:latest

MAINTAINER Winston Chang "winston@rstudio.com"

# Install dependencies and download and install shiny server
RUN apt-get update && apt-get install -y -t unstable \
    sudo \
    gdebi-core \
    pandoc \
    pandoc-citeproc \
    libcurl4-gnutls-dev \
    libcairo2-dev/unstable \
    libxt-dev && \
    wget --no-verbose https://s3.amazonaws.com/rstudio-shiny-server-os-build/ubuntu-12.04/x86_64/VERSION -O "version.txt" && \
    VERSION=$(cat version.txt) && \
    wget --no-verbose "https://s3.amazonaws.com/rstudio-shiny-server-os-build/ubuntu-12.04/x86_64/shiny-server-$VERSION-amd64.deb" -O ss-latest.deb && \
    gdebi -n ss-latest.deb && \
    rm -f version.txt ss-latest.deb && \
    R -e "install.packages(c('shiny', 'rmarkdown'), repos='https://cran.rstudio.com/')" && \
    cp -R /usr/local/lib/R/site-library/shiny/examples/* /srv/shiny-server/ && \
    rm -rf /var/lib/apt/lists/*

EXPOSE 3838

COPY shiny-server.sh /usr/bin/shiny-server.sh

CMD ["/usr/bin/shiny-server.sh"]
```
A Docker image can be built from the Dockerfile with

```
docker build -t <image_name> <path_to_the_dockerfile>
```

and run with

```
docker run --rm -p 3838:3838 <image_name>
```

The Shiny application can then be accessed in a web browser via the address "http://localhost:3838" (if it is run on a local host machine) or "http://<ip_address_of_shiny_server>:3838".
### Container orchestration

When there is more than one application or service needed in the whole
pipeline, orchestration of multiple containers becomes useful.

There are multiple ways of orchestrating containers, and the three most
representative approaches are [Kubernetes](https://kubernetes.io/), [Docker Swarm](https://docs.docker.com/engine/swarm/), and [DC/OS](https://dcos.io/).

A comparison between these orchestration methods is beyond the scope of this
tutorial. The following sections show how to deploy multiple Shiny applications
on a Kubernetes cluster.

#### Azure Container Service

[Azure Container Service](https://azure.microsoft.com/en-us/services/container-service/) is a cloud-based service on Azure which simplifies the configuration
for orchestrating containers with various orchestration methods such as
Kubernetes, Docker Swarm, and DC/OS. Azure Container Service offers optimized
configurations of these orchestration tools and technologies for Azure. When
deploying the orchestration cluster, one can set the VM size, number
of hosts, etc., to tune scalability, load capacity, and cost efficiency.

#### Deployment of multiple Shiny applications with Azure Container Service

The following illustrates how to deploy two Shiny applications derived from
the employee attrition prediction accelerator with Azure Container Service.

While there may be more sophisticated architectures in real-world applications,
the demonstration here merely exhibits a how-to on setting up the environment.

The two Shiny applications are for (simple) data exploration and model creation,
respectively. The two applications are built on top of two individual images.
Both obtain data from an Azure Storage blob, where data is persistently
preserved. This reflects the real-world scenario where R-user data scientists and
data analysts work within the same infrastructure but their tasks are
loosely coupled.

The whole architecture is depicted as follows.
##### Step 1 - Create Docker images

Both of the images are created based on the rocker/shiny image.

* Data exploration image

```
FROM r-base:latest

MAINTAINER Le Zhang "zhle@microsoft.com"

RUN apt-get update && apt-get install -y -t unstable \
    sudo \
    gdebi-core \
    pandoc \
    pandoc-citeproc \
    libcurl4-gnutls-dev \
    libcairo2-dev/unstable \
    libxt-dev \
    libssl-dev

# Download and install shiny server

RUN wget --no-verbose https://s3.amazonaws.com/rstudio-shiny-server-os-build/ubuntu-12.04/x86_64/VERSION -O "version.txt" && \
    VERSION=$(cat version.txt) && \
    wget --no-verbose "https://s3.amazonaws.com/rstudio-shiny-server-os-build/ubuntu-12.04/x86_64/shiny-server-$VERSION-amd64.deb" -O ss-latest.deb && \
    gdebi -n ss-latest.deb && \
    rm -f version.txt ss-latest.deb

RUN R -e "install.packages(c('shiny', 'ggplot2', 'dplyr', 'magrittr', 'markdown'), repos='http://cran.rstudio.com/')"

COPY shiny-server.conf /etc/shiny-server/shiny-server.conf
COPY /myapp /srv/shiny-server/

EXPOSE 3838

COPY shiny-server.sh /usr/bin/shiny-server.sh

RUN chmod +x /usr/bin/shiny-server.sh

CMD ["/usr/bin/shiny-server.sh"]
```
* Model creation image

```
FROM r-base:latest

MAINTAINER Le Zhang "zhle@microsoft.com"

RUN apt-get update && apt-get install -y -t unstable \
    sudo \
    gdebi-core \
    pandoc \
    pandoc-citeproc \
    libcurl4-gnutls-dev \
    libcairo2-dev/unstable \
    libxt-dev \
    libssl-dev

# Download and install shiny server

RUN wget --no-verbose https://s3.amazonaws.com/rstudio-shiny-server-os-build/ubuntu-12.04/x86_64/VERSION -O "version.txt" && \
    VERSION=$(cat version.txt) && \
    wget --no-verbose "https://s3.amazonaws.com/rstudio-shiny-server-os-build/ubuntu-12.04/x86_64/shiny-server-$VERSION-amd64.deb" -O ss-latest.deb && \
    gdebi -n ss-latest.deb && \
    rm -f version.txt ss-latest.deb

RUN R -e "install.packages(c('shiny', 'ggplot2', 'dplyr', 'magrittr', 'caret', 'caretEnsemble', 'kernlab', 'randomForest', 'xgboost', 'DT'), repos='http://cran.rstudio.com/')"

COPY shiny-server.conf /etc/shiny-server/shiny-server.conf
COPY /myapp /srv/shiny-server/

EXPOSE 3838

COPY shiny-server.sh /usr/bin/shiny-server.sh

RUN chmod +x /usr/bin/shiny-server.sh

# Download pre-trained model

RUN wget --no-verbose https://zhledata.blob.core.windows.net/employee/model.RData -O "/srv/shiny-server/model.RData"

CMD ["/usr/bin/shiny-server.sh"]
```

All of the layers are the same as those in the original rocker/shiny image, except for the installation of additional R packages and their required
run-time libraries (e.g., caretEnsemble, xgboost, etc.).
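Both Dockerfiles copy a `shiny-server.conf` into the image. A minimal configuration, assuming the default Shiny Server layout, might look like the following; the `listen` port must match the port `EXPOSE`d by the image (3838 for the model creation image, 3030 for the data exploration image).

```
run_as shiny;

server {
  # Must agree with the EXPOSE instruction in the Dockerfile.
  listen 3838;

  location / {
    # Serve the app copied to /srv/shiny-server in the Dockerfile.
    site_dir /srv/shiny-server;
    log_dir /var/log/shiny-server;
    directory_index on;
  }
}
```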
Docker images can be built similarly to the rocker/shiny image. After the images
are built, they can be pushed to a public repository such as [Dockerhub](https://hub.docker.com/) or a private repository on [Azure Container Registry](https://azure.microsoft.com/en-us/services/container-registry/).

The following shows how to do that with Dockerhub.

1. Build the image.
```
docker build -t <name_of_image> <path_to_dockerfile>
```
2. Tag the image.
```
docker tag <name_of_image> <dockerhub_account_name>/<name_of_repo>
```
3. Log in to Dockerhub.
```
docker login
```
4. Push the image to the Dockerhub repository.
```
docker push <dockerhub_account_name>/<name_of_repo>
```

In this case, both of the two images are pushed to Dockerhub.
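As a quick sanity check before deploying to a cluster, the pushed images can be pulled and run locally. The sketch below assumes the pre-published demo images named at the end of this tutorial, with the ports taken from their Dockerfiles (3838 for the model app, 3030 for the data app).

```
# Run the model creation app locally; it listens on port 3838.
docker run --rm -p 3838:3838 yueguoguo/hrmodel

# In another terminal, run the data exploration app on port 3030.
docker run --rm -p 3030:3030 yueguoguo/hrdata
```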
##### Step 2 - Create an Azure Container Service

Creation of an Azure Container Service can be achieved with either the Azure portal or
the Azure Command-Line Interface (CLI).

The following shows how to create a Kubernetes-type orchestrator in a specified
resource group with the Azure CLI (installation of the Azure CLI is covered [here](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest)).

1. Log in with an Azure subscription.
```
az login
```
2. Create a resource group where the Azure Container Service cluster resides.
```
az group create --name=<resource_group> --location=<location>
```
3. Create an Azure Container Service with the Kubernetes orchestrator. The
cluster is made of one master node and two agent nodes. The name of the
cluster, the DNS prefix, and the authentication key can also be specified as
needed.
```
az acs create --orchestrator-type=kubernetes --resource-group <resource_group> --name=<cluster_name> --dns-prefix=<dns_prefix> --ssh-key-value ~/.ssh/id_rsa.pub --admin-username=<user_name> --master-count=1 --agent-count=2 --agent-vm-size=<vm_size>
```
##### Step 3 - Deploy Shiny applications on the Azure Container Service

The status of the Azure Container Service deployment can be checked in the Azure portal.
Once it is successfully done, the resources will be listed in the resource
group.

In this tutorial, two Shiny applications are hosted on the cluster. For
simplicity, these two applications have no dependency on each other,
so they are deployed independently and exposed as individual services.

The deployment is done with the [Kubernetes command line tool](https://kubernetes.io/docs/tasks/tools/install-kubectl/), which can be installed on the local machine.

kubectl should be configured properly in order to communicate with the remote
Kubernetes cluster. This can be done by copying the `config` file located at
`~/.kube` on the master node of the Kubernetes cluster to `~/.kube/` on the local
machine.
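The copy can be done with `scp`, for example. In the sketch below, `<user_name>` and `<master_dns_name>` are placeholders for the admin user and the public address of the master node chosen at cluster creation time.

```
# Copy the Kubernetes config from the cluster master node to the local
# machine so that kubectl can talk to the remote cluster.
scp <user_name>@<master_dns_name>:~/.kube/config ~/.kube/config

# Confirm connectivity to the cluster.
kubectl get nodes
```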
Each of the two applications can be deployed individually as follows.
```
kubectl run <name_of_deployment> --image <dockerhub_account_name>/<name_of_repo> --port=3838 --replicas=3
```
The deployment can be exposed as a web-based service by the following command:
```
kubectl expose deployments <name_of_deployment> --port=3838 --type=LoadBalancer
```
The status of the deployment and service exposure can be monitored by
```
kubectl get deployments
```
and
```
kubectl get services
```
respectively.
The deployment and exposure of the services can be put together into a yaml file for
convenience of operation.
```
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: <name_of_model_app>
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: <name_of_model_app>
    spec:
      containers:
      - name: <name_of_model_app>
        image: <dockerhub_account_name>/<name_of_model_app>
        ports:
        - containerPort: 3838
        resources:
          requests:
            cpu: 250m
          limits:
            cpu: 500m
---
apiVersion: v1
kind: Service
metadata:
  name: <name_of_model_app>
spec:
  type: LoadBalancer
  ports:
  - port: 3838
  selector:
    app: <name_of_model_app>
---
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: <name_of_data_app>
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: <name_of_data_app>
    spec:
      containers:
      - name: <name_of_data_app>
        image: <dockerhub_account_name>/<name_of_data_app>
        ports:
        - containerPort: 3030
        resources:
          requests:
            cpu: 250m
          limits:
            cpu: 500m
---
apiVersion: v1
kind: Service
metadata:
  name: <name_of_data_app>
spec:
  type: LoadBalancer
  ports:
  - port: 3030
  selector:
    app: <name_of_data_app>
```
The deployments and services can then be created simply by
```
kubectl create -f <path_to_the_yaml_file>
```
##### Step 4 - Test the deployed Shiny applications

Once the deployment is finished, the public IP address and port number of each
exposed service can be checked with `kubectl get service --watch`. During the
deployment process, the external IP addresses of the exposed services will show
"<pending>". It usually takes a while to finish, depending on the size of the
image and the capability of the cluster.

The deployed Shiny application services can be accessed from a web browser via the
public IP address with the corresponding port number.

The following snapshots show the deployed Shiny apps.

Readers can find the Dockerfiles as well as the Shiny R code in the corresponding directories.
Images built from them are pre-published on Dockerhub as `yueguoguo/hrdata`
and `yueguoguo/hrmodel`, corresponding to the data exploration application and
the model creation application, respectively. These images are ready for testing
on a deployed Kubernetes-based Azure Container Service cluster.
File differences are hidden because one or more lines are too long
@@ -0,0 +1,65 @@
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: hrmodel
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: hrmodel
    spec:
      containers:
      - name: hrmodel
        image: yueguoguo/hrmodel
        ports:
        - containerPort: 3838
        resources:
          requests:
            cpu: 250m
          limits:
            cpu: 500m
---
apiVersion: v1
kind: Service
metadata:
  name: hrmodel
spec:
  type: LoadBalancer
  ports:
  - port: 3838
  selector:
    app: hrmodel
---
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: hrdata
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: hrdata
    spec:
      containers:
      - name: hrdata
        image: yueguoguo/hrdata
        ports:
        - containerPort: 3030
        resources:
          requests:
            cpu: 250m
          limits:
            cpu: 500m
---
apiVersion: v1
kind: Service
metadata:
  name: hrdata
spec:
  type: LoadBalancer
  ports:
  - port: 3030
  selector:
    app: hrdata
@@ -0,0 +1,34 @@
FROM r-base:latest

MAINTAINER Le Zhang "zhle@microsoft.com"

RUN apt-get update && apt-get install -y -t unstable \
    sudo \
    gdebi-core \
    pandoc \
    pandoc-citeproc \
    libcurl4-gnutls-dev \
    libcairo2-dev/unstable \
    libxt-dev \
    libssl-dev

# Download and install shiny server

RUN wget --no-verbose https://s3.amazonaws.com/rstudio-shiny-server-os-build/ubuntu-12.04/x86_64/VERSION -O "version.txt" && \
    VERSION=$(cat version.txt) && \
    wget --no-verbose "https://s3.amazonaws.com/rstudio-shiny-server-os-build/ubuntu-12.04/x86_64/shiny-server-$VERSION-amd64.deb" -O ss-latest.deb && \
    gdebi -n ss-latest.deb && \
    rm -f version.txt ss-latest.deb

RUN R -e "install.packages(c('shiny', 'ggplot2', 'dplyr', 'magrittr', 'markdown', 'DT', 'scales'), repos='http://cran.rstudio.com/')"

COPY shiny-server.conf /etc/shiny-server/shiny-server.conf
COPY /myapp /srv/shiny-server/

EXPOSE 3030

COPY shiny-server.sh /usr/bin/shiny-server.sh

RUN chmod +x /usr/bin/shiny-server.sh

CMD ["/usr/bin/shiny-server.sh"]
@@ -0,0 +1,30 @@
---
title: "about"
author: "Le Zhang"
date: "August 24, 2017"
output: html_document
---

### Employee Attrition Prediction

This is a demonstration of a case study on employee attrition prediction.
The data science and machine learning development process often consists of multiple
steps. Containerizing each of the steps helps modularize the whole process and
thus makes DevOps easier.

For simplicity, the demo process is composed of just two steps:
data exploration and model creation.

This web-based app shows how to do simple graphical data exploration on
the HR data set.

#### R accelerator

The end-to-end tutorial of the R-based template for data processing, model
training, etc. (we call it an "acceleratoR") can be found [here](https://github.com/Microsoft/acceleratoRs/blob/master/EmployeeAttritionPrediction).

#### Operationalization

Operationalization of the case on the Azure cloud (i.e., data exploration, model creation,
model management, model deployment, etc.) with [Azure Data Science VM](https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-provision-vm),
[Azure Storage](https://azure.microsoft.com/en-us/services/storage/), [Azure Container Service](https://azure.microsoft.com/en-us/services/container-service/), etc., can be found [here](https://github.com/Microsoft/acceleratoRs/blob/master/EmployeeAttritionPrediction).
@@ -0,0 +1,32 @@
# ------------------------------------------------------------------------------
# R packages needed for the analytics.
# ------------------------------------------------------------------------------

library(shiny)
library(dplyr)
library(magrittr)
library(ggplot2)
library(markdown)
library(scales)

# ------------------------------------------------------------------------------
# Global variables.
# ------------------------------------------------------------------------------

data_url <- "https://zhledata.blob.core.windows.net/employee/DataSet1.csv"

# ------------------------------------------------------------------------------
# Functions.
# ------------------------------------------------------------------------------

# Load HR demographic data.

loadData <- function() {
  df <- read.csv(data_url)

  return(df)
}

# Load HR data and pre-trained model.

df_hr <- loadData()
@@ -0,0 +1,103 @@
source("global.R")

# The actual shiny server function.

shinyServer(function(input, output) {

  # Plot a table of the HR data.

  output$hrtable <- DT::renderDataTable({
    DT::datatable(df_hr[, input$show_vars, drop=FALSE])
  })

  # Downloadable csv of selected dataset.

  output$downloadData <- downloadHandler(
    filename = function() {
      paste(input$dataset, ".csv", sep = "")
    },
    content = function(file) {
      write.csv(datasetInput(), file, row.names = FALSE)
    }
  )

  # Plot some general summary statistics for those who are predicted attrition.

  output$plot3 <- renderPlot({
    if (identical(input$att_vars, "Yes")) {
      df_hr %<>% filter(as.character(Attrition) == "Yes")
    } else if (identical(input$att_vars, "No")) {
      df_hr %<>% filter(as.character(Attrition) == "No")
    } else if (identical(input$att_vars, c("Yes", "No"))) {
      df_hr
    } else {
      df_hr <- df_hr[0, ]
    }

    df_hr <- filter(df_hr, JobRole %in% input$disc_vars)

    ggplot(df_hr, aes(JobRole, fill=Attrition)) +
      geom_bar(aes(y=(..count..)/sum(..count..)),
               position="dodge",
               alpha=0.6) +
      scale_y_continuous(labels=percent) +
      xlab(input$disc_vars) +
      ylab("Percentage") +
      theme_bw() +
      ggtitle(paste("Count for", input$disc_vars))
  })

  output$plot <- renderPlot({
    if (identical(input$att_vars, "Yes")) {
      df_hr %<>% filter(as.character(Attrition) == "Yes")
    } else if (identical(input$att_vars, "No")) {
      df_hr %<>% filter(as.character(Attrition) == "No")
    } else if (identical(input$att_vars, c("Yes", "No"))) {
      df_hr
    } else {
      df_hr <- df_hr[0, ]
    }

    df_hr_final <- select(df_hr, one_of("Attrition", input$plot_vars))

    ggplot(df_hr_final,
           aes_string(input$plot_vars,
                      color="Attrition",
                      fill="Attrition")) +
|
||||
geom_density(alpha=0.2) +
|
||||
theme_bw() +
|
||||
xlab(input$plot_vars) +
|
||||
ylab("Density") +
|
||||
ggtitle(paste("Estimated density for", input$plot_vars))
|
||||
})
|
||||
|
||||
# Monthly income, service year, etc.
|
||||
|
||||
output$plot2 <- renderPlot({
|
||||
if (identical(input$att_vars, "Yes")) {
|
||||
df_hr %<>% filter(as.character(Attrition) == "Yes")
|
||||
} else if (identical(input$att_vars, "No")) {
|
||||
df_hr %<>% filter(as.character(Attrition) == "No")
|
||||
} else if (identical(input$att_vars, c("Yes", "No"))) {
|
||||
df_hr
|
||||
} else {
|
||||
df_hr <- df_hr[0, ]
|
||||
}
|
||||
|
||||
df_hr <- filter(df_hr,
|
||||
YearsAtCompany >= input$years_service[1] &
|
||||
YearsAtCompany <= input$years_service[2] &
|
||||
JobLevel < input$job_level &
|
||||
JobRole %in% input$job_roles)
|
||||
|
||||
ggplot(df_hr,
|
||||
aes(x=factor(JobRole), y=MonthlyIncome, color=factor(Attrition))) +
|
||||
geom_boxplot() +
|
||||
xlab("Job Role") +
|
||||
ylab("Monthly income") +
|
||||
scale_fill_discrete(guide=guide_legend(title="Attrition")) +
|
||||
theme_bw() +
|
||||
theme(text=element_text(size=13), legend.position="top")
|
||||
})
|
||||
})
|
|
@ -0,0 +1,109 @@
source("global.R")

navbarPage(
  "HR Analytics - data exploration",
  tabPanel(
    "About",
    fluidRow(
      column(3, includeMarkdown("about.md")),
      column(
        6,
        img(class="img-polaroid",
            src="https://careers.microsoft.com/content/images/services/HomePage_Hero1_Tim.jpg")
      )
    )
  ),
  tabPanel(
    "Data",
    sidebarLayout(
      sidebarPanel(
        # Variables to select for displayed demographic data.

        checkboxGroupInput(
          "show_vars",
          "Columns in HR data set to show:",
          names(df_hr),
          selected=names(df_hr)
        ),

        # Download button.
        downloadButton("hrData", "Download")
      ),

      mainPanel(
        tabsetPanel(
          id="dataset",
          tabPanel("HR Demographic data", DT::dataTableOutput("hrtable"))
        )
      )
    )
  ),
  tabPanel(
    "Plot",

    h4("Select employees of attrition or non-attrition to visualize."),

    checkboxGroupInput(
      "att_vars",
      "Attrition or not:",
      c("Yes", "No"),
      selected=c("Yes", "No")),

    fluidRow(
      column(
        4,
        h4("Count of discrete variable."),
        plotOutput("plot3"),

        checkboxGroupInput(
          "disc_vars",
          "Job roles:",
          unique(df_hr$JobRole),
          selected=unique(df_hr$JobRole)[1:5])
      ),

      column(
        4,
        h4("Distribution of continuous variable."),
        plotOutput("plot"),

        selectInput(
          "plot_vars",
          "Variable to visualize:",
          names(select_if(df_hr, is.integer)),
          selected=names(select_if(df_hr, is.integer)))
      ),

      column(
        4,
        h4("Comparison on certain factors."),
        plotOutput("plot2"),

        # Years of service.

        sliderInput(
          "years_service",
          "Years of service:",
          min=1,
          max=40,
          value=c(2, 5)),

        # Job level.

        sliderInput(
          "job_level",
          "Job level:",
          min=1,
          max=5,
          value=3
        ),

        checkboxGroupInput(
          "job_roles",
          "Job roles:",
          unique(df_hr$JobRole),
          selected=unique(df_hr$JobRole)[1:5])
      )
    )
  )
)

@ -0,0 +1,26 @@
# Define the user we should use when spawning R Shiny processes.
run_as shiny;

# Show full error messages (useful when debugging inside the container).
sanitize_errors off;

# Define a top-level server which will listen on a port.
server {
  # Instruct this server to listen on port 3030; the container must expose
  # the same port.
  listen 3030;

  # Define the location available at the base URL.
  location / {

    # Run this location in 'site_dir' mode, which hosts the entire directory
    # tree at '/srv/shiny-server'.
    site_dir /srv/shiny-server;

    # Define where we should put the log files for this location.
    log_dir /var/log/shiny-server;

    # Should we list the contents of a (non-Shiny-App) directory when the user
    # visits the corresponding URL?
    directory_index on;
  }
}

@ -0,0 +1,7 @@
#!/bin/sh

# Make sure the directory for individual app logs exists.
mkdir -p /var/log/shiny-server
chown shiny:shiny /var/log/shiny-server

exec shiny-server >> /var/log/shiny-server.log 2>&1

@ -0,0 +1,37 @@
FROM r-base:latest

LABEL maintainer="Le Zhang <zhle@microsoft.com>"

RUN apt-get update && apt-get install -y -t unstable \
    sudo \
    gdebi-core \
    pandoc \
    pandoc-citeproc \
    libcurl4-gnutls-dev \
    libcairo2-dev/unstable \
    libxt-dev \
    libssl-dev \
    libxml2-dev

# Download and install Shiny Server.

RUN wget --no-verbose https://s3.amazonaws.com/rstudio-shiny-server-os-build/ubuntu-12.04/x86_64/VERSION -O "version.txt" && \
    VERSION=$(cat version.txt) && \
    wget --no-verbose "https://s3.amazonaws.com/rstudio-shiny-server-os-build/ubuntu-12.04/x86_64/shiny-server-$VERSION-amd64.deb" -O ss-latest.deb && \
    gdebi -n ss-latest.deb && \
    rm -f version.txt ss-latest.deb

# Install the R packages needed by the app.

RUN R -e "install.packages(c('shiny', 'ggplot2', 'dplyr', 'magrittr', 'caret', 'caretEnsemble', 'kernlab', 'randomForest', 'xgboost', 'DT', 'DMwR', 'markdown', 'mlbench', 'devtools', 'XML', 'gridSVG', 'pROC', 'plotROC', 'scales'), repos='https://cran.rstudio.com/')"

RUN R -e "library(devtools); devtools::install_github('sachsmc/plotROC')"

COPY shiny-server.conf /etc/shiny-server/shiny-server.conf
COPY myapp /srv/shiny-server/

EXPOSE 3838

COPY shiny-server.sh /usr/bin/shiny-server.sh

RUN chmod +x /usr/bin/shiny-server.sh

CMD ["/usr/bin/shiny-server.sh"]

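The image defined by the Dockerfile above can be built and smoke-tested locally before deploying it to Azure Container Service. A minimal dry-run sketch follows; the image and container names are illustrative assumptions, not names from the repo, and the real commands are left commented out so the sketch runs without Docker present:

```shell
#!/bin/sh
# Dry-run sketch: build the Shiny app image and run it, publishing the port
# that Shiny Server listens on (3838, matching EXPOSE above).
IMAGE="attrition-shiny-demo"   # hypothetical image name
PORT=3838

# The actual commands would be:
#   docker build -t "$IMAGE" .
#   docker run -d -p "$PORT:$PORT" --name attrition-app "$IMAGE"

# Print the commands instead of executing them.
echo "docker build -t $IMAGE ."
echo "docker run -d -p $PORT:$PORT --name attrition-app $IMAGE"
```

Once the container is up, the app would be reachable at `http://localhost:3838` in a browser.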
@ -0,0 +1,32 @@
---
title: "about"
author: "Le Zhang"
date: "August 24, 2017"
output: html_document
---

### Employee Attrition Prediction

This is a demonstration of a case study on employee attrition prediction.
A data science and machine learning development process often consists of
multiple steps. Containerizing each of the steps helps modularize the whole
process and thus makes DevOps easier.

For simplicity, the demo process is composed of just two steps: data
exploration and model creation.

This web-based app shows how to create a model on the data. The candidate
training algorithms include Support Vector Machine (SVM), Random Forest, and
Extreme Gradient Boosting (XGBoost). For illustration purposes, only a few
high-level parameters can be set.

#### R accelerator

The end-to-end tutorial of the R-based template for data processing, model
training, etc. (we call it "acceleratoR") can be found [here](https://github.com/Microsoft/acceleratoRs/blob/master/EmployeeAttritionPrediction).

#### Operationalization

Operationalization of the case on Azure cloud (i.e., data exploration, model
creation, model management, model deployment, etc.) with [Azure Data Science VM](https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-provision-vm),
[Azure Storage](https://azure.microsoft.com/en-us/services/storage/), [Azure Container Service](https://azure.microsoft.com/en-us/services/container-service/), etc., can be found [here](https://github.com/Microsoft/acceleratoRs/blob/master/EmployeeAttritionPrediction).

@ -0,0 +1,142 @@
# ------------------------------------------------------------------------------
# R packages needed for the analytics.
# ------------------------------------------------------------------------------

library(caret)
library(caretEnsemble)
library(DMwR)
library(dplyr)
library(ggplot2)
library(markdown)
library(magrittr)
library(mlbench)
library(pROC)
library(plotROC)
library(shiny)

# ------------------------------------------------------------------------------
# Global variables.
# ------------------------------------------------------------------------------

data_url <- "https://zhledata.blob.core.windows.net/employee/DataSet1.csv"

# ------------------------------------------------------------------------------
# Functions.
# ------------------------------------------------------------------------------

# Load HR demographic data.

loadData <- function() {
  df <- read.csv(data_url)

  return(df)
}

# Process data - the same data processing steps apply on the data.

processData <- function(data) {

  # 1. Remove zero-variance variables.

  pred_no_var <- c("EmployeeCount", "StandardHours")
  data %<>% select(-one_of(pred_no_var))

  # 2. Convert integer variables to factor type.

  int_2_ftr_vars <- c("Education",
                      "EnvironmentSatisfaction",
                      "JobInvolvement",
                      "JobLevel",
                      "JobSatisfaction",
                      "NumCompaniesWorked",
                      "PerformanceRating",
                      "RelationshipSatisfaction",
                      "StockOptionLevel")
  data[, int_2_ftr_vars] <- lapply(data[, int_2_ftr_vars], as.factor)

  # 3. Keep the most salient variables.

  least_important_vars <- c("Department", "Gender", "PerformanceRating")
  data %<>% select(-one_of(least_important_vars))

  return(data)
}

# Data split.

splitData <- function(data, ratio) {
  if (!("Attrition" %in% names(data)))
    stop("No label found in data set.")

  train_index <-
    createDataPartition(data$Attrition,
                        times=1,
                        p=ratio / 100) %>%
    unlist()

  data_train <- data[train_index, ]
  data_test <- data[-train_index, ]

  data_split <- list(train=data_train, test=data_test)

  return(data_split)
}

# Model training.

trainModel <- function(data,
                       smote_over,
                       smote_under,
                       method="boot",
                       number=3,
                       repeats=3,
                       search="grid",
                       algorithm="rf") {

  # The training set is imbalanced, so SMOTE is applied to balance it.

  data %<>% as.data.frame()

  data <- SMOTE(Attrition ~ .,
                data,
                perc.over=smote_over,
                perc.under=smote_under)

  # Train control.

  tc <- trainControl(method=method,
                     number=number,
                     repeats=repeats,
                     search=search,
                     classProbs=TRUE,
                     savePredictions="final",
                     summaryFunction=twoClassSummary)

  # Model training.

  model <- train(Attrition ~ .,
                 data,
                 method=algorithm,
                 trControl=tc)

  return(model)
}

# Function for predicting attrition based on demographic data.

inference <- function(model, data) {
  if ("Attrition" %in% names(data)) {
    data %<>% select(-Attrition)
  }

  labels <- predict(model, newdata=data, type="prob")

  return(labels)
}

# Load and pre-process HR data.

df_hr <-
  loadData() %>%
  processData()

@ -0,0 +1,127 @@
source("global.R")

# The actual shiny server function.

shinyServer(function(input, output) {

  # Training and testing data.

  dataSplit <- reactive({
    df <- splitData(df_hr, input$ratio)

    df
  })

  # Train a reactive model.

  modelTrained <- eventReactive(input$goButton, {
    df <- dataSplit()
    df_train <- df$train

    if (input$algorithm == "SVM") {
      method <- "svmRadial"
    } else if (input$algorithm == "Random Forest") {
      method <- "rf"
    } else {
      method <- "xgbLinear"
    }

    model <- trainModel(data=df_train,
                        smote_over=input$smoteOver,
                        smote_under=input$smoteDown,
                        method="repeatedcv",
                        number=input$number,
                        repeats=input$repeats,
                        search="grid",
                        algorithm=method)

    model
  })

  # Print summary of data set.

  output$summary <- renderPrint({
    df <- dataSplit()

    # str(df$train)

    table(df$train$Attrition)
  })

  # Print table of training data set.

  output$dataTrain <- DT::renderDataTable({
    df <- dataSplit()

    DT::datatable(df$train)
  })

  # Plot the ROC curve of the trained model on the testing data.

  output$plot <- renderPlot({

    df <- dataSplit()
    df_test <- df$test

    # Get the trained model.

    model <- modelTrained()

    # Use the model for inference on testing data.

    results <- inference(model, data=df_test)
    results <- mutate(results, label=df_test$Attrition)

    # Plot the ROC curve.

    basic_plot <-
      ggplot(results,
             aes(m=Yes, d=factor(label, levels=c("No", "Yes")))) +
      geom_roc(n.cuts=0)

    basic_plot +
      style_roc(theme=theme_grey) +
      theme(axis.text=element_text(colour="blue")) +
      # annotate("text",
      #          x=.75,
      #          y=.25,
      #          label=paste("AUC =", round(calc_auc(basic_plot)$AUC, 2))) +
      ggtitle("Plot of ROC curve") +
      scale_x_continuous("1 - Specificity", breaks=seq(0, 1, by=.1))
  })

  # Print the AUC of the trained model on the testing data.

  output$auc <- renderPrint({

    df <- dataSplit()
    df_test <- df$test

    # Get the trained model.

    model <- modelTrained()

    # Use the model for inference on testing data.

    results <- inference(model, data=df_test)
    results <- mutate(results, label=df_test$Attrition)

    basic_plot <-
      ggplot(results,
             aes(m=Yes, d=factor(label, levels=c("No", "Yes")))) +
      geom_roc(n.cuts=0)

    sprintf("AUC of the ROC curve is %f", round(calc_auc(basic_plot)$AUC, 2))
  })

  # # Export the trained model.
  #
  # output$downloadModel <- downloadHandler(
  #   filename = function() {
  #     paste(input$algorithm, "_model", ".rds", sep="")
  #   },
  #
  #   content = function(file) {
  #     saveRDS(model, file)
  #   }
  # )
})

@ -0,0 +1,109 @@
source("global.R")

navbarPage(
  "HR Analytics - model creation",
  tabPanel(
    "About",
    fluidRow(
      column(3, includeMarkdown("about.md")),
      column(6, img(
        class="img-polaroid",
        src="https://careers.microsoft.com/content/images/services/HomePage_Hero1_Tim.jpg")
      )
    )
  ),

  tabPanel(
    "Model",
    sidebarLayout(
      sidebarPanel(
        # Split ratio for training/testing data.

        sliderInput(inputId="ratio",
                    label="Split ratio (%) for training data.",
                    min=0,
                    max=100,
                    value=70),

        # SMOTE upsampling percentage.

        p("SMOTE is used for balancing the data set."),

        numericInput(inputId="smoteOver",
                     label="Upsampling percentage in SMOTE for minority class.",
                     value=300),

        # SMOTE downsampling percentage.

        numericInput(inputId="smoteDown",
                     label="Downsampling percentage in SMOTE for majority class.",
                     value=150),

        # Repeats in train control.

        p("High-level control for cross-validation in training the model."),

        numericInput(inputId="repeats",
                     label="Number of repeats for a k-fold cross-validation.",
                     min=1,
                     max=3,
                     value=1),

        # Number of folds in cross-validation.

        numericInput(inputId="number",
                     label="Number of folds in cross-validation.",
                     min=2,
                     max=5,
                     value=3),

        # Algorithm for use.

        selectInput(inputId="algorithm",
                    label="Machine learning algorithm to use for training a model:",
                    choices=c("SVM", "Random Forest", "XGBoost")),

        # Train model.

        p("Click the button to train a model with the above settings (it may
          take some time depending on the algorithm used for training). After
          the training process, a ROC curve which evaluates model performance
          on the testing data set is plotted."),

        actionButton("goButton", "Train")

        # # Export model
        #
        # p("Export the trained model"),
        #
        # downloadButton("downloadModel",
        #                "Download")
      ),

      mainPanel(

        # Summary of the training data set.

        p("It should be noted that the data set is not balanced, which may
          negatively impact model training if no balancing technique is
          applied."),

        verbatimTextOutput("summary"),

        # Print table of training data.

        tabsetPanel(
          id="dataset",
          tabPanel("HR Demographic data for training",
                   DT::dataTableOutput("dataTrain"))
        ),

        # Plot the model validation results.

        plotOutput("plot"),

        verbatimTextOutput("auc")
      )
    )
  )
)

@ -0,0 +1,26 @@
# Define the user we should use when spawning R Shiny processes.
run_as shiny;

# Show full error messages (useful when debugging inside the container).
sanitize_errors off;

# Define a top-level server which will listen on a port.
server {
  # Instruct this server to listen on port 3838; the container must expose
  # the same port.
  listen 3838;

  # Define the location available at the base URL.
  location / {

    # Run this location in 'site_dir' mode, which hosts the entire directory
    # tree at '/srv/shiny-server'.
    site_dir /srv/shiny-server;

    # Define where we should put the log files for this location.
    log_dir /var/log/shiny-server;

    # Should we list the contents of a (non-Shiny-App) directory when the user
    # visits the corresponding URL?
    directory_index on;
  }
}

@ -0,0 +1,7 @@
#!/bin/sh

# Make sure the directory for individual app logs exists.
mkdir -p /var/log/shiny-server
chown shiny:shiny /var/log/shiny-server

exec shiny-server >> /var/log/shiny-server.log 2>&1

@ -0,0 +1,21 @@
#!/bin/bash

# Install R libraries into the skeleton directory for new users.

sudo mkdir -p /etc/skel/R
sudo mkdir -p /etc/skel/R/lib
sudo Rscript -e 'library(devtools);library(withr);withr::with_libpaths(new="/etc/skel/R/lib/", install(c("DMwR", "caretEnsemble", "pROC", "jiebaR")));withr::with_libpaths(new="/etc/skel/R/lib/", install_url("https://github.com/yueguoguo/Azure-R-Interface/raw/master/utils/msLanguageR_0.1.0.tar.gz"))'

# Copy /etc/skel to the home directory of all users.

USR=$(ls /home | grep user)

for u in ${USR}; do
  DBASE="/home/$u/"

  cp -rf /etc/skel/R ${DBASE}/
done

# Start the RStudio Server.

rstudio-server start

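On a fresh Data Science VM, the set-up script above would typically be copied to the machine and executed once with root privileges. A dry-run sketch follows; the host address, user name, and script file name are placeholder assumptions, and the real commands are commented out so the sketch runs anywhere:

```shell
#!/bin/sh
# Dry-run sketch: copy the DSVM set-up script to the VM and execute it.
# SCRIPT, USER, and HOST are hypothetical placeholders.
SCRIPT="setup.sh"
USER="dsvmuser"
HOST="my-dsvm.southeastasia.cloudapp.azure.com"

# The actual commands would be:
#   scp "$SCRIPT" "$USER@$HOST:~/"
#   ssh "$USER@$HOST" "chmod +x ~/$SCRIPT && sudo ~/$SCRIPT"

# Print the commands instead of executing them.
echo "scp $SCRIPT $USER@$HOST:~/"
echo "ssh $USER@$HOST 'chmod +x ~/$SCRIPT && sudo ~/$SCRIPT'"
```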
Binary file not shown. (added: 1.1 MiB image)
Binary file not shown. (added: 98 KiB image)
Binary file not shown. (added: 62 KiB image)