This commit is contained in:
hong-revo 2017-03-17 13:25:17 +11:00
Parent d6824c0787 0291861e21
Commit 7b9aecb04c
35 changed files with 3526 additions and 544 deletions

.gitignore vendored
View file

@ -31,3 +31,4 @@ vignettes/*.pdf
# Temporary files created by R markdown
*.utf8.md
*.knit.md
.Rproj.user

View file

@ -11,7 +11,7 @@ documentclass: ctexart
## 1 Introduction
Voluntary employee attrition may negatively affect a company in various aspects, i.e., induce labor cost, lose morality of employees, leak IP/talents to competitors, etc. Identifying individual employee with inclination of leaving company is therefore pivotal to save the potential loss. Conventional practices rely on qualitative assessment on factors that may reflect the prospensity of an employee to leave company. For example, studies found that staff churn is correlated with both demographic information as well as behavioral activities, satisfaction, etc. Data-driven techniques which are based on statistical learning methods exhibit more accurate prediction on employee attrition, as by nature they mathematically model the correlation between factors and attrition outcome and maximize the probability of predicting the correct group of people with a properly trained machine learning model.
Voluntary employee attrition may negatively affect a company in various ways, e.g., increased labor costs, lower employee morale, and leakage of IP and talent to competitors. Identifying individual employees who are inclined to leave is therefore pivotal to avoiding the potential loss. Conventional practices rely on qualitative assessment of factors that may reflect the propensity of an employee to leave the company. For example, studies have found that staff churn is correlated with both demographic information and behavioral factors such as activities and satisfaction. Data-driven techniques based on statistical learning methods give more accurate predictions of employee attrition, as by nature they mathematically model the correlation between the factors and the attrition outcome and maximize the probability of predicting the correct group of people with a properly trained machine learning model.
In a data-driven employee attrition prediction model, two types of data are normally taken into consideration.
@ -39,7 +39,7 @@ library(pROC)
# natural language processing
library(languageR)
library(msLanguageR)
library(tm)
library(jiebaR)
@ -64,11 +64,10 @@ DATA2 <- "../Data/DataSet2.csv"
### 2.1 Demographic and organizational data
The experiments will be conducted on a data set of employees. The data set is publically available and can be found at [here](https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/).
The experiments will be conducted on a data set of employees. The data set is publicly available and can be found [here](https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/).
#### 2.1.1 Data exploration
The data is loaded from a remote blob in cloud storage.
```{r}
df <- read_csv(DATA1)
```
@ -102,11 +101,13 @@ ggplot(df, aes(JobRole, fill=Attrition)) +
2. Monthly income, job level, and years of service may affect the decision to leave for employees in different departments. For example, junior staff with lower pay are more likely to leave than those who are paid more.
```{r}
ggplot(filter(df, (YearsAtCompany >= 2) & (YearsAtCompany <= 5) & (JobLevel < 3)),
aes(x=factor(Department), y=MonthlyIncome, color=factor(Attrition))) +
aes(x=factor(JobRole), y=MonthlyIncome, color=factor(Attrition))) +
geom_boxplot() +
xlab("Department") +
ylab("Monthly income") +
scale_fill_discrete(guide=guide_legend(title="Attrition"))
  scale_color_discrete(guide=guide_legend(title="Attrition")) +
theme_bw() +
theme(text=element_text(size=13), legend.position="top")
```
3. Promotion is a commonly adopted HR strategy for employee retention. It can be observed in the following plot that, for a given department, e.g., Research & Development, employees at a higher job level are more likely to leave when more years have passed since their last promotion.
```{r}
@ -134,20 +135,12 @@ df %<>% select(-one_of(pred_no_var))
```
Integer predictors that are in fact nominal are converted to categorical type.
```{r}
# convert certain interger variable to factor variable.
# convert certain integer variable to factor variable.
int_2_ftr_vars <- c("Education", "EnvironmentSatisfaction", "JobInvolvement", "JobLevel", "JobSatisfaction", "NumCompaniesWorked", "PerformanceRating", "RelationshipSatisfaction", "StockOptionLevel")
df[, int_2_ftr_vars] <- lapply((df[, int_2_ftr_vars]), as.factor)
```
The remaining integer variables are converted to numeric type.
```{r}
# convert remaining integer variables to be numeric.
names(select_if(df, is.integer))
df %<>% mutate_if(is.integer, as.numeric)
```
The variables of character type are converted to categorical type.
```{r}
@ -174,7 +167,7 @@ is.factor(df$Attrition)
It is possible that not all variables are correlated with the label; feature selection is therefore performed to keep only the most relevant ones.
As the data set is a blend of both numerical and discrete variables, certain correlation analysis (e.g., Pearson correlation) is not applicable. One alternative is to train a model and then rank the variable importance so as to select the most salien ones.
As the data set is a blend of numerical and discrete variables, some correlation analyses (e.g., Pearson correlation) are not applicable. One alternative is to train a model and then rank the variables by importance so as to select the most salient ones.
The following shows how to achieve variable importance ranking with a random forest model.
```{r, echo=TRUE, message=FALSE, warning=FALSE}
@ -243,7 +236,7 @@ Active employees (864) are more than terminated employees (166). There are sever
1. Resampling the data - either upsampling the minority class or downsampling the majority class.
2. Use cost sensitive learning method.
In this case the first method is used. SMOTE is a commonly adopted method fo synthetically upsampling minority class in an imbalanced data set. Package `DMwR` provides methods that apply SMOTE methods on training data set.
In this case the first method is used. SMOTE is a commonly adopted method for synthetically upsampling the minority class in an imbalanced data set. The `DMwR` package provides functions that apply SMOTE to the training data set.
```{r}
# note: DMwR::SMOTE does not handle tbl_df objects well; convert to a plain data frame first.
@ -321,7 +314,13 @@ model_stack <- caretStack(
model_list,
metric="ROC",
method="glm",
trControl=tc
trControl=trainControl(
method="boot",
number=10,
savePredictions="final",
classProbs=TRUE,
summaryFunction=twoClassSummary
)
)
```
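Not from the original document: a quick way to sanity-check the stacked meta-model is to score a held-out split and look at the predicted class probabilities. The sketch below is illustrative; the object name `df_test` is a placeholder for whatever test split is actually used.

```{r, eval=FALSE}
# illustrative only: class probabilities from the stacked ensemble on a held-out split.
prob_stack <- predict(model_stack, newdata=dplyr::select(df_test, -Attrition), type="prob")
head(prob_stack)
```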
@ -440,7 +439,7 @@ df <-
head(df$Feedback, 10)
```
The text can be pre-processed with `tm` package. Normally to process text for quantitative analysis, the original non-structural data in text format needes to be transformed into vector.
The text can be pre-processed with the `tm` package. Normally, to process text for quantitative analysis, the original unstructured text data needs to be transformed into a vector representation.
For the convenience of the text-to-vector transformation, the original review comment data is wrapped into a corpus.
```{r}
@ -450,7 +449,7 @@ corp_text <- Corpus(VectorSource(df$Feedback))
corp_text
```
`tm_map` function in `tm` package helps perform transmation on the corpus.
The `tm_map` function in the `tm` package helps perform transformations on the corpus.
```{r}
# the transformation functions can be checked with
@ -601,7 +600,7 @@ df_dtf <-
#### 2.2.4 Sentiment analysis on review comments
Sentiment analysis on text data by machine learning techniques is discussed in details [Pang's paper](http://www.cs.cornell.edu/home/llee/papers/sentiment.pdf). Basically, the given text data that is lablled with different sentiment is firstly tokenized into segmented terms. Term frequencies, or combined with inverse document term frequencies, are then generated as feature vectors for the text.
Sentiment analysis on text data by machine learning techniques is discussed in detail in [Pang's paper](http://www.cs.cornell.edu/home/llee/papers/sentiment.pdf). Basically, the given text data, labelled with different sentiments, is first tokenized into segmented terms. Term frequencies, possibly combined with inverse document frequencies, are then generated as feature vectors for the text.
Sometimes multi-grams and part-of-speech tags are also included in the feature vectors. [Pang's studies](http://www.cs.cornell.edu/home/llee/papers/sentiment.pdf) conclude that unigram features excel over other hybrid methods in terms of model accuracy.
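As a side note (not in the original analysis), `tm` can also produce TF-IDF weights directly when the document-term matrix is built. The sketch below assumes the corpus object `corp_text` created earlier.

```{r, eval=FALSE}
# document-term matrix with TF-IDF weighting instead of raw term frequency.
dtm_txt_tfidf <- DocumentTermMatrix(
  corp_text,
  control=list(wordLengths=c(1, Inf), weighting=weightTfIdf)
)
```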
@ -645,7 +644,16 @@ confusionMatrix(prediction,
positive="Yes")
```
Sentiment analysis on text data can also be done with Text Analytics API of Microsoft Cognitive Services. Package `languageR` wraps functions that call the API for generating sentiment scores.
Sentiment analysis on text data can also be done with the Text Analytics API of Microsoft Cognitive Services. The `msLanguageR` package wraps functions that call the API to generate sentiment scores.
`msLanguageR` can be installed from its GitHub repository.
```{r}
# Install devtools
if(!require("devtools")) install.packages("devtools")
devtools::install_github("yueguoguo/Azure-R-Interface/utils/msLanguageR")
library(msLanguageR)
```
```{r, eval=FALSE}
senti_score <- cognitiveSentiAnalysis(text=df[-train_index, ]$Feedback, apiKey="your_api_key")
@ -655,7 +663,7 @@ confusionMatrix(df_senti$Attrition,
reference=df[-train_index, ]$Attrition,
positive="Yes")
```
Note this method is not applicable to languages other than the supported ones. For instance, for analyzing Chinese, text data needs to be translated into English firstly. This can be done with Bing Translation API, which is available in `languageR` package as `cognitiveTranslation`.
Note this method is not applicable to languages other than the supported ones. For instance, for analyzing Chinese, the text data first needs to be translated into English. This can be done with the Bing Translation API, which is available in the `msLanguageR` package as `cognitiveTranslation`.
```{r, eval=FALSE}
text_translated <- lapply(df_text$text, cognitiveTranslation,
lanFrom="zh-CHS",
@ -667,4 +675,4 @@ text_translated
### Conclusion
This document introduces a data-driven approach to employee attrition prediction with sentiment analysis. Techniques of data analysis, model building, and natural language processing are demonstrated on sample data. The walk-through may help corporate HR departments or related organizations plan ahead and avoid potential losses in recruiting and training.

View file

@ -17,7 +17,7 @@
"\n",
"## 1 Introduction\n",
"\n",
"Voluntary employee attrition may negatively affect a company in various aspects, i.e., induce labor cost, lose morality of employees, leak IP/talents to competitors, etc. Identifying individual employee with inclination of leaving company is therefore pivotal to save the potential loss. Conventional practices rely on qualitative assessment on factors that may reflect the prospensity of an employee to leave company. For example, studies found that staff churn is correlated with both demographic information as well as behavioral activities, satisfaction, etc. Data-driven techniques which are based on statistical learning methods exhibit more accurate prediction on employee attrition, as by nature they mathematically model the correlation between factors and attrition outcome and maximize the probability of predicting the correct group of people with a properly trained machine learning model.\n",
"Voluntary employee attrition may negatively affect a company in various aspects, i.e., induce labor cost, lose morality of employees, leak IP/talents to competitors, etc. Identifying individual employee with inclination of leaving company is therefore pivotal to save the potential loss. Conventional practices rely on qualitative assessment on factors that may reflect the propensity of an employee to leave company. For example, studies found that staff churn is correlated with both demographic information as well as behavioral activities, satisfaction, etc. Data-driven techniques which are based on statistical learning methods exhibit more accurate prediction on employee attrition, as by nature they mathematically model the correlation between factors and attrition outcome and maximize the probability of predicting the correct group of people with a properly trained machine learning model.\n",
"\n",
"In the data-driven employee attrition prediction model, normally two types of data are taken into consideration. \n",
"\n",
@ -59,7 +59,7 @@
"\n",
"# natural language processing\n",
"\n",
"library(languageR)\n",
"library(msLanguageR)\n",
"library(tm)\n",
"library(jiebaR)\n",
"\n",
@ -94,11 +94,9 @@
"source": [
"### 2.1 Demographic and organizational data\n",
"\n",
"The experiments will be conducted on a data set of employees. The data set is publically available and can be found at [here](https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/).\n",
"The experiments will be conducted on a data set of employees. The data set is publicly available and can be found at [here](https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/).\n",
"\n",
"#### 2.1.1 Data exploration\n",
"\n",
"The data is loaded from remote blob of cloud storage."
"#### 2.1.1 Data exploration"
]
},
{
@ -186,11 +184,13 @@
"outputs": [],
"source": [
"ggplot(filter(df, (YearsAtCompany >= 2) & (YearsAtCompany <= 5) & (JobLevel < 3)),\n",
" aes(x=factor(Department), y=MonthlyIncome, color=factor(Attrition))) +\n",
" aes(x=factor(JobRole), y=MonthlyIncome, color=factor(Attrition))) +\n",
" geom_boxplot() +\n",
" xlab(\"Department\") +\n",
" ylab(\"Monthly income\") +\n",
" scale_fill_discrete(guide=guide_legend(title=\"Attrition\"))"
" scale_fill_discrete(guide=guide_legend(title=\"Attrition\")) +\n",
" theme_bw() +\n",
" theme(text=element_text(size=13), legend.position=\"top\")"
]
},
{
@ -259,33 +259,13 @@
"metadata": {},
"outputs": [],
"source": [
"# convert certain interger variable to factor variable.\n",
"# convert certain integer variable to factor variable.\n",
"\n",
"int_2_ftr_vars <- c(\"Education\", \"EnvironmentSatisfaction\", \"JobInvolvement\", \"JobLevel\", \"JobSatisfaction\", \"NumCompaniesWorked\", \"PerformanceRating\", \"RelationshipSatisfaction\", \"StockOptionLevel\")\n",
"\n",
"df[, int_2_ftr_vars] <- lapply((df[, int_2_ftr_vars]), as.factor)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The rest of integer type variables are converted to numeric."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# convert remaining integer variables to be numeric.\n",
"\n",
"names(select_if(df, is.integer))\n",
"\n",
"df %<>% mutate_if(is.integer, as.numeric)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -348,7 +328,7 @@
"\n",
"It is possible that not all variables are correlated with the label, feature selection is therefore performed to filter out the most relevant ones. \n",
"\n",
"As the data set is a blend of both numerical and discrete variables, certain correlation analysis (e.g., Pearson correlation) is not applicable. One alternative is to train a model and then rank the variable importance so as to select the most salien ones.\n",
"As the data set is a blend of both numerical and discrete variables, certain correlation analysis (e.g., Pearson correlation) is not applicable. One alternative is to train a model and then rank the variable importance so as to select the most salient ones.\n",
"\n",
"The following shows how to achieve variable importance ranking with a random forest model."
]
@ -476,7 +456,7 @@
"1. Resampling the data - either upsampling the minority class or downsampling the majority class.\n",
"2. Use cost sensitive learning method.\n",
"\n",
"In this case the first method is used. SMOTE is a commonly adopted method fo synthetically upsampling minority class in an imbalanced data set. Package `DMwR` provides methods that apply SMOTE methods on training data set."
"In this case the first method is used. SMOTE is a commonly adopted method for synthetically upsampling minority class in an imbalanced data set. Package `DMwR` provides methods that apply SMOTE methods on training data set."
]
},
{
@ -612,7 +592,13 @@
" model_list,\n",
" metric=\"ROC\",\n",
" method=\"glm\",\n",
" trControl=tc\n",
" trControl=trainControl(\n",
" method=\"boot\",\n",
" number=10,\n",
" savePredictions=\"final\",\n",
" classProbs=TRUE,\n",
" summaryFunction=twoClassSummary\n",
" )\n",
")"
]
},
@ -799,7 +785,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The text can be pre-processed with `tm` package. Normally to process text for quantitative analysis, the original non-structural data in text format needes to be transformed into vector. \n",
"The text can be pre-processed with `tm` package. Normally to process text for quantitative analysis, the original non-structural data in text format needs to be transformed into vector. \n",
"\n",
"For the convenient of text-to-vector transformation, the original review comment data is wrapped into a corpus format."
]
@ -821,7 +807,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"`tm_map` function in `tm` package helps perform transmation on the corpus."
"`tm_map` function in `tm` package helps perform translation on the corpus."
]
},
{
@ -1134,7 +1120,7 @@
"source": [
"#### 2.2.4 Sentiment analysis on review comments\n",
"\n",
"Sentiment analysis on text data by machine learning techniques is discussed in details [Pang's paper](http://www.cs.cornell.edu/home/llee/papers/sentiment.pdf). Basically, the given text data that is lablled with different sentiment is firstly tokenized into segmented terms. Term frequencies, or combined with inverse document term frequencies, are then generated as feature vectors for the text. \n",
"Sentiment analysis on text data by machine learning techniques is discussed in details [Pang's paper](http://www.cs.cornell.edu/home/llee/papers/sentiment.pdf). Basically, the given text data that is labelled with different sentiment is firstly tokenized into segmented terms. Term frequencies, or combined with inverse document term frequencies, are then generated as feature vectors for the text. \n",
"\n",
"Sometimes multi-gram and part-of-speech tag are also included in the feature vectors. [Pang's studies](http://www.cs.cornell.edu/home/llee/papers/sentiment.pdf) conclude that the performance of unigram features excel over other hybrid methods in terms of model accuracy.\n",
"\n",
@ -1220,7 +1206,21 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Sentiment analysis on text data can also be done with Text Analytics API of Microsoft Cognitive Services. Package `languageR` wraps functions that call the API for generating sentiment scores."
"Sentiment analysis on text data can also be done with Text Analytics API of Microsoft Cognitive Services. Package `msLanguageR` wraps functions that call the API for generating sentiment scores.\n",
"\n",
"`msLanguageR` can be installed from GitHub repository."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Install devtools\n",
"if(!require(\"devtools\")) install.packages(\"devtools\")\n",
"devtools::install_github(\"yueguoguo/Azure-R-Interface/utils/msLanguageR\")\n",
"library(msLanguageR)"
]
},
{
@ -1248,7 +1248,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Note this method is not applicable to languages other than the supported ones. For instance, for analyzing Chinese, text data needs to be translated into English firstly. This can be done with Bing Translation API, which is available in `languageR` package as `cognitiveTranslation`."
"Note this method is not applicable to languages other than the supported ones. For instance, for analyzing Chinese, text data needs to be translated into English firstly. This can be done with Bing Translation API, which is available in `msLanguageR` package as `cognitiveTranslation`."
]
},
{

File differences are hidden because one or more lines are too long.

View file

@ -0,0 +1,13 @@
Version: 1.0
RestoreWorkspace: Default
SaveWorkspace: Default
AlwaysSaveHistory: Default
EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8
RnwWeave: Sweave
LaTeX: pdfLaTeX

View file

@ -0,0 +1,12 @@
# Employee attrition with sentiment analysis
This presentation, given at the FOSSAsia Meetup (Mar 9, 2017), introduces an R accelerator for employee attrition prediction with sentiment analysis. The presentation covers
* a step-by-step tutorial on how to solve the employee attrition prediction problem with data science and machine learning techniques.
* R techniques for dealing with imbalanced data sets, creating model ensembles, and performing sentiment analysis on text.
## Prerequisites
* Fundamental knowledge of data science and machine learning.
* R >= 3.3.1 (the CRAN packages used are listed in the installation sketch below).
* RStudio IDE >= v0.98
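The tutorial and presentation load a number of CRAN packages. A minimal installation sketch is given below; the package list is inferred from the `library()` calls in the accompanying documents and may need adjusting for your environment.

```r
# install the main CRAN packages used by the accelerator (list is indicative).
install.packages(c("dplyr", "magrittr", "stringr", "readr",
                   "caret", "caretEnsemble", "DMwR", "pROC",
                   "tm", "jiebaR", "scales", "ggplot2"))
```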


View file

@ -0,0 +1,733 @@
Employee Attrition Prediction with R Accelerator
========================================================
author: Le Zhang, Data Scientist at Microsoft
date: `r Sys.Date()`
width: 1800
height: 1000
Agenda
========================================================
```{r, echo=FALSE}
# data wrangling
library(dplyr)
library(magrittr)
library(stringr)
library(stringi)
library(readr)
# machine learning and advanced analytics
library(DMwR)
library(caret)
library(caretEnsemble)
library(pROC)
library(e1071)
library(rattle)
# natural language processing
library(tm)
# tools
library(httr)
library(XML)
library(jsonlite)
# data visualization
library(scales)
library(ggplot2)
library(ggmap)
# data
data(iris)
DATA1 <- "https://raw.githubusercontent.com/Microsoft/acceleratoRs/master/EmployeeAttritionPrediction/Data/DataSet1.csv"
DATA2 <- "https://raw.githubusercontent.com/Microsoft/acceleratoRs/master/EmployeeAttritionPrediction/Data/DataSet2.csv"
df1 <- read_csv(DATA1)
df2 <- read_csv(DATA2)
# model
load("./models.RData")
```
- Introduction.
- How to predict employee attrition?
- Walk-through of an R "accelerator".
Introduction
========================================================
- Microsoft Algorithms and Data Science (ADS).
```{r, echo=FALSE, fig.align="center", fig.width=15, fig.height=8}
location <- c("Seatle", "Sunnyvale", "Boston", "London", "Singapore", "Melbourne")
ll_location <- geocode(location)
location_x <- ll_location$lon
location_y <- ll_location$lat
df <- data.frame(
location=location,
location_x=location_x,
location_y=location_y,
adsap=as.factor(c(rep(FALSE, 4), rep(TRUE, 2)))
)
mp <- NULL
mapWorld <- borders("world", colour="gray50", fill="gray50")
mp <-
ggplot(data=df, aes(x=location_x, y=location_y, label=location)) +
mapWorld +
geom_label(color=df$adsap, size=10) +
theme_bw() +
scale_x_continuous(breaks=NULL, labels=NULL) +
scale_y_continuous(breaks=NULL, labels=NULL) +
xlab("") +
ylab("")
mp
```
- ADS Asia Pacific.
- Data science solution templates to resolve real-world problems.
- Scalable tools & algorithms for high performance advanced analytics.
Data science and machine learning
========================================================
- Data science & Machine learning
- A review on iris data.
```{r, echo=FALSE, fig.height=8, fig.width=10}
ggplot(data=iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species)) +
geom_jitter(size=5) +
theme_bw() +
theme(text=element_text(size=30))
rattle::fancyRpartPlot(model_iris, main="Decision tree model built on iris data.")
```
- Is that all?
Real-world data science project
========================================================
- Team Data Science Process (TDSP) https://github.com/Azure/Microsoft-TDSP.
- Collaborative.
- Iterative.
![](./demo-figure/tdsp.png)
Use case - employee attrition prediction
========================================================
![](./demo-figure/employee_sentiment.png)
- Consequences of employee attrition.
- Loss of human resources and cost on new hires.
- Potential loss of company intellectual properties.
- Problem formalization: **to identify employees with an inclination to leave.**
Data collection, exploration, and preparation
========================================================
![](./demo-figure/reasons_for_leave.png)
- Static data: does not change, or changes in a deterministic manner.
- Dynamic data: changes randomly over time.
Data collection, exploration, and preparation (Cont'd)
========================================================
- Data source
- HR department.
- IT department.
- Direct reports.
- Social media network.
Data collection, exploration, and preparation (Cont'd)
========================================================
- Data exploration.
- Statistical analysis.
- Visualization!
Data collection, exploration, and preparation (Cont'd)
========================================================
```{r, echo=FALSE, fig.align="center", fig.height=5, fig.width=28}
# how job level impacts attrition.
ggplot(df1, aes(JobLevel, fill=Attrition)) +
geom_bar(aes(y=(..count..)/sum(..count..)), position="dodge") +
scale_y_continuous(labels=percent) +
xlab("Job level") +
ylab("Percentage") +
ggtitle("Percentage of attrition for each job level.") +
theme_bw() +
theme(text=element_text(size=30))
# distribution of monthly income by job level and attrition status.
ggplot(df1, aes(x=factor(JobLevel), y=MonthlyIncome, color=factor(Attrition))) +
geom_boxplot() +
theme_bw() +
xlab("Job level") +
ylab("Monthly income") +
  scale_color_discrete(guide=guide_legend(title="Attrition")) +
ggtitle("Distribution of monthly income for employees with different employment status.") +
theme(text=element_text(size=30))
# collective effect of the two factors.
ggplot(df1, aes(x=MonthlyIncome, y=JobLevel, color=Attrition)) +
geom_point(size=5) +
theme_bw() +
xlab("Monthly income") +
ylab("Job level") +
ggtitle("Separation of groups by role and income.") +
theme(text=element_text(size=30))
```
Data collection, exploration, and preparation (Cont'd)
========================================================
<center>![](./demo-figure/framework.png)</center>
**To predict employee attrition in the next M + 1 months, by analyzing employee data of last N months.**
Feature Extraction
========================================================
- Feature engineering
- Statistics
        - max, min, standard deviation, etc. (see the sketch below)
- Time series characterization.
- Trend analysis.
- Anomaly detection.
- Time series model (ARIMA, etc.)
- Text mining
- Feature selection.
- Correlation analysis.
- Model based feature selection.
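A minimal sketch of the "statistics" idea above, using a purely hypothetical monthly activity log `df_activity` (columns `employee_id` and `hours`); it is illustrative and not part of the accelerator's data.

```{r, eval=FALSE}
# illustrative only: per-employee summary statistics as candidate features.
df_activity %>%
  group_by(employee_id) %>%
  summarise(hours_max=max(hours),
            hours_min=min(hours),
            hours_sd=sd(hours))
```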
Model creation and validation
========================================================
- Supervised classification problem.
- Algorithm selection
- Logistic regression, Support vector machine, Decision tree, etc.
- Ensemble
- Bagging.
- Boosting.
- Stacking.
- Model creation
Model creation and validation
========================================================
- Cross validation.
- Confusion matrix.
- Precision.
- Recall.
- F Score.
    - Area-Under-Curve (see the ROC sketch below).
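A short, hedged sketch of the ROC/AUC measure with the `pROC` package loaded above, assuming the random forest model `model_rf` and the test split `df1_test` used later in this deck.

```{r, eval=FALSE}
# illustrative only: ROC curve and AUC for the random forest model.
prob_rf <- predict(model_rf, newdata=dplyr::select(df1_test, -Attrition), type="prob")
roc_rf <- roc(response=df1_test$Attrition, predictor=prob_rf[, "Yes"])
auc(roc_rf)
plot(roc_rf)
```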
Employee attrition prediction - R accelerator
========================================================
- What is R "accelerator"
- Lightweight end-to-end solution template.
- Follows Microsoft Team Data Science Process (TDSP) format, in a simplified version.
- Easy for prototyping, presenting, and documenting.
- Github repo https://github.com/Microsoft/acceleratoRs
Step 0 Setup
========================================================
R session for the employee attrition prediction accelerator.
```{r, echo=FALSE}
print(sessionInfo(), locale=FALSE)
```
Step 1 Data exploration
========================================================
- Employee attrition data: https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/
```{r}
dim(df1)
```
- Review comments data: http://www.glassdoor.com
```{r}
head(df2$Feedback, 3)
```
Step 2 Data preprocessing
========================================================
- Handle NAs.
- Remove non-variants.
- Normalization (NA handling and normalization are sketched below).
- Data type conversion.
```{r}
# get predictors that have no variation.
pred_no_var <- names(df1[, nearZeroVar(df1)]) %T>% print()
```
```{r}
# remove the zero variation predictor columns.
df1 %<>% select(-one_of(pred_no_var))
```
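NA handling and normalization are listed above but not shown; the following is a hedged sketch with caret's `preProcess` (an assumed approach, not taken from the original deck).

```{r, eval=FALSE}
# illustrative only: median-impute NAs and centre/scale the numeric predictors.
pp <- preProcess(dplyr::select(df1, -Attrition),
                 method=c("medianImpute", "center", "scale"))
df1_imputed <- predict(pp, dplyr::select(df1, -Attrition))
```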
Step 2 Data preprocessing (Cont'd)
========================================================
```{r}
# convert certain integer variables to factor variables.
int_2_ftr_vars <- c("Education", "EnvironmentSatisfaction", "JobInvolvement", "JobLevel", "JobSatisfaction", "NumCompaniesWorked", "PerformanceRating", "RelationshipSatisfaction", "StockOptionLevel")
df1[, int_2_ftr_vars] <- lapply((df1[, int_2_ftr_vars]), as.factor)
```
```{r}
df1 %<>% mutate_if(is.character, as.factor)
```
Step 2 Feature extraction
========================================================
- Feature selection with a trained model.
```{r, eval=FALSE}
control <- trainControl(method="repeatedcv", number=3, repeats=1)
# train the model
model <- train(dplyr::select(df1, -Attrition),
df1$Attrition,
data=df1,
method="rf",
preProcess="scale",
trControl=control)
```
Step 2 Feature extraction (Cont'd)
========================================================
```{r, fig.align="center", fig.height=8, fig.width=15}
# estimate variable importance
imp <- varImp(model, scale=FALSE)
plot(imp, cex=3)
```
Step 2 Feature extraction (Cont'd)
========================================================
```{r}
# select the top-ranking variables.
imp_list <- rownames(imp$importance)[order(imp$importance$Overall, decreasing=TRUE)]
# drop the low ranking variables. Here the last 3 variables are dropped.
top_var <-
imp_list[1:(ncol(df1) - 3)] %>%
as.character() %T>%
print()
```
Step 3 Resampling
========================================================
- Split the data set into training and testing sets.
```{r}
train_index <-
createDataPartition(df1$Attrition,
times=1,
p=.7) %>%
unlist()
df1_train <- df1[train_index, ]
df1_test <- df1[-train_index, ]
```
```{r}
table(df1_train$Attrition)
```
- Training set is not balanced!
Step 3 Resampling (Cont'd)
========================================================
To handle imbalanced data sets
- Cost-sensitive learning.
- Resampling of data.
- Synthetic Minority Over-sampling TechniquE (SMOTE)
- Upsampling minority class synthetically.
- Downsampling majority class.
***
```{r}
df1_train %<>% as.data.frame()
df1_train <- SMOTE(Attrition ~ .,
df1_train,
perc.over=300,
perc.under=150)
```
```{r}
table(df1_train$Attrition)
```
Step 4 Model building
========================================================
- Select algorithm for model creation.
- Tune model parameters.
```{r, eval=FALSE}
# initialize training control.
tc <- trainControl(method="repeatedcv",
number=3,
repeats=3,
search="grid",
classProbs=TRUE,
savePredictions="final",
summaryFunction=twoClassSummary)
```
Step 4 Model building (Cont'd)
========================================================
Let's try several machine learning algorithms.
- Support vector machine.
```{r, eval=FALSE}
# SVM model.
time_svm <- system.time(
model_svm <- train(Attrition ~ .,
df1_train,
method="svmRadial",
                     trControl=tc)
)
```
Step 4 Model building (Cont'd)
========================================================
- Random forest.
```{r, eval=FALSE}
# random forest model
time_rf <- system.time(
model_rf <- train(Attrition ~ .,
df1_train,
method="rf",
                    trControl=tc)
)
```
Step 4 Model building (Cont'd)
========================================================
- Extreme gradient boosting (XGBoost).
```{r, eval=FALSE}
# xgboost model.
time_xgb <- system.time(
model_xgb <- train(Attrition ~ .,
df1_train,
method="xgbLinear",
                     trControl=tc)
)
```
Step 4 Model building (Cont'd)
========================================================
An ensemble may be better?
- Ensemble of models.
- Ensemble methods - bagging, boosting, and stacking.
![](./demo-figure/stacking.png)
***
```{r, eval=FALSE}
# ensemble of the three models.
time_ensemble <- system.time(
model_list <- caretList(Attrition ~ .,
data=df1_train,
trControl=tc,
methodList=c("svmRadial", "rf", "xgbLinear"))
)
```
```{r, eval=FALSE}
# stack of models. Use glm for meta model.
model_stack <- caretStack(
model_list,
metric="ROC",
method="glm",
trControl=trainControl(
method="boot",
number=10,
savePredictions="final",
classProbs=TRUE,
summaryFunction=twoClassSummary
)
)
```
Step 5 Model evaluating
========================================================
- Confusion matrix.
- Performance measure.
```{r}
predictions <- lapply(models,
predict,
newdata=select(df1_test, -Attrition))
```
```{r}
# confusion matrix evaluation results.
cm_metrics <- lapply(predictions,
confusionMatrix,
reference=df1_test$Attrition,
positive="Yes")
```
Step 5 Model evaluating (Cont'd)
========================================================
- Comparison of different models in terms of accuracy, recall, precision, and elapsed time.
```{r, echo=FALSE}
# accuracy
acc_metrics <-
lapply(cm_metrics, `[[`, "overall") %>%
lapply(`[`, 1) %>%
unlist()
# recall
rec_metrics <-
lapply(cm_metrics, `[[`, "byClass") %>%
lapply(`[`, 1) %>%
unlist()
# precision
pre_metrics <-
lapply(cm_metrics, `[[`, "byClass") %>%
lapply(`[`, 3) %>%
unlist()
```
```{r, echo=FALSE}
algo_list <- c("SVM RBF", "Random Forest", "Xgboost", "Stacking")
time_consumption <- c(time_svm[3], time_rf[3], time_xgb[3], time_ensemble[3])
specify_decimal <- function(x, k) format(round(x, k), nsmall=k)
df_comp <-
data.frame(Models=algo_list,
Accuracy=acc_metrics,
Recall=rec_metrics,
Precision=pre_metrics,
Elapsed=time_consumption) %>%
mutate(Accuracy=specify_decimal(Accuracy, 2),
Recall=specify_decimal(Recall, 2),
Precision=specify_decimal(Precision, 2))
df_comp
```
- Analysis
    - Diversity of the base models affects ensemble performance.
    - Data size also has an impact.
Step 6 Sentiment analysis - a glimpse of data
========================================================
```{r}
# getting the data.
head(df2$Feedback, 5)
```
Step 7 Sentiment analysis - feature extraction
========================================================
- General methods
- Initial transformation
- Removal of unnecessary elements (stopwords, numbers, punctuations, etc.).
- Stopwords: yes, no, you, I, etc.
- Translation or sentence/word alignment.
- Part-Of-Speech tagging.
- Bag-of-words model
    - n-Grams (see the bigram sketch below).
- Term frequency (TF) or Term frequency inverse-document frequency (TF-IDF).
- Model creation
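A brief sketch of the n-gram idea (bigrams) with the standard `tm`/`NLP` tokenizer idiom; it assumes the corpus `corp_text` built on the next slide and is illustrative rather than part of the original deck.

```{r, eval=FALSE}
# illustrative only: bigram term features via a custom tokenizer.
BigramTokenizer <- function(x)
  unlist(lapply(NLP::ngrams(NLP::words(x), 2), paste, collapse=" "),
         use.names=FALSE)

dtm_txt_bigram <- DocumentTermMatrix(corp_text,
                                     control=list(tokenize=BigramTokenizer))
```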
Step 7 Sentiment analysis - feature extraction (Cont'd)
========================================================
```{r}
# create a corpus based upon the text data.
corp_text <- Corpus(VectorSource(df2$Feedback))
```
```{r}
# transformation on the corpus.
corp_text %<>%
tm_map(removeNumbers) %>%
tm_map(content_transformer(tolower)) %>%
tm_map(removeWords, stopwords("english")) %>%
tm_map(removePunctuation) %>%
tm_map(stripWhitespace)
```
```{r}
dtm_txt_tf <-
DocumentTermMatrix(corp_text, control=list(wordLengths=c(1, Inf), weighting=weightTf))
```
```{r}
# remove sparse terms.
dtm_txt <-
removeSparseTerms(dtm_txt_tf, 0.99)
```
Step 7 Sentiment analysis - feature extraction (Cont'd)
========================================================
- Convert corpus to term frequency matrix.
```{r}
dtm_txt_sample <- removeSparseTerms(dtm_txt_tf, 0.85)
inspect(dtm_txt_sample[1:10, ])
```
```{r, echo=FALSE, include=FALSE}
df_txt <-
inspect(dtm_txt) %>%
as.data.frame()
```
Step 8 Sentiment analysis - model creation and validation
========================================================
- Create and validate a classification model.
```{r}
# form the data set
df_txt %<>% cbind(Attrition=df2$Attrition)
```
```{r}
# split data set into training and testing set.
train_index <-
createDataPartition(df_txt$Attrition,
times=1,
p=.7) %>%
unlist()
df_txt_train <- df_txt[train_index, ]
df_txt_test <- df_txt[-train_index, ]
```
Step 8 Sentiment analysis - model creation and validation (Cont'd)
========================================================
- SVM is used.
```{r, eval=FALSE}
# model building
model_sent <- train(Attrition ~ .,
df_txt_train,
method="svmRadial",
                    trControl=tc)
```
```{r}
prediction <- predict(model_sent, newdata=select(df_txt_test, -Attrition))
```
Step 8 Sentiment analysis - model creation and validation (Cont'd)
========================================================
```{r, echo=FALSE}
confusionMatrix(prediction,
reference=df_txt_test$Attrition,
positive="Yes")
```
Takeaways
========================================================
- DS & ML combined with domain knowledge.
- Feature engineering takes majority of time.
- Model creation and validation.
- Sentiment analysis on text data.
- All resources available on Github! https://github.com/Microsoft/acceleratoRs/tree/master/EmployeeAttritionPrediction/Docs/FOSSAsiaMeetup
References
========================================================
1. Terence R. Mitchell et al., "Why people stay: using job embeddedness to predict voluntary turnover".
2. Bo Pang and Lillian Lee, "Opinion mining and sentiment analysis".
3. Nitesh V. Chawla et al., "SMOTE: Synthetic Minority Over-sampling Technique".
Contact
========================================================
Le Zhang
zhle@microsoft.com

File differences are hidden because one or more lines are too long.

View file

@ -0,0 +1,749 @@
Employee Attrition Prediction with R Accelerator
========================================================
author: Le Zhang, Data Scientist at Microsoft
date: 2017-03-09
width: 1800
height: 1000
Agenda
========================================================
- Introduction.
- How to predict employee attrition?
- Walk-through of an R "accelerator".
Introduction
========================================================
- Microsoft Algorithms and Data Science (ADS).
<img src="demo-figure/unnamed-chunk-2-1.png" title="plot of chunk unnamed-chunk-2" alt="plot of chunk unnamed-chunk-2" style="display: block; margin: auto;" />
- ADS Asia Pacific.
- Data science solution templates to resolve real-world problems.
- Scalable tools & algorithms for high performance advanced analytics.
Data science and machine learning
========================================================
- Data science & Machine learning
- A review on iris data.
![plot of chunk unnamed-chunk-3](demo-figure/unnamed-chunk-3-1.png)![plot of chunk unnamed-chunk-3](demo-figure/unnamed-chunk-3-2.png)
- Is that all?
Real-world data science project
========================================================
- Team Data Science Process (TDSP) https://github.com/Azure/Microsoft-TDSP.
- Collaborative.
- Iterative.
![](./demo-figure/tdsp.png)
Use case - employee attrition prediction
========================================================
![](./demo-figure/employee_sentiment.png)
- Consequences of employee attrition.
- Loss of human resources and cost on new hires.
- Potential loss of company intellectual properties.
- Problem formalization: **to identify employees with an inclination to leave.**
Data collection, exploration, and preparation
========================================================
![](./demo-figure/reasons_for_leave.png)
- Static data: does not change, or changes in a deterministic manner.
- Dynamic data: changes randomly over time.
Data collection, exploration, and preparation (Cont'd)
========================================================
- Data source
- HR department.
- IT department.
- Direct reports.
- Social media network.
Data collection, exploration, and preparation (Cont'd)
========================================================
- Data exploration.
- Statistical analysis.
- Visualization!
Data collection, exploration, and preparation (Cont'd)
========================================================
<img src="demo-figure/unnamed-chunk-4-1.png" title="plot of chunk unnamed-chunk-4" alt="plot of chunk unnamed-chunk-4" style="display: block; margin: auto;" /><img src="demo-figure/unnamed-chunk-4-2.png" title="plot of chunk unnamed-chunk-4" alt="plot of chunk unnamed-chunk-4" style="display: block; margin: auto;" /><img src="demo-figure/unnamed-chunk-4-3.png" title="plot of chunk unnamed-chunk-4" alt="plot of chunk unnamed-chunk-4" style="display: block; margin: auto;" />
Data collection, exploration, and preparation (Cont'd)
========================================================
<center>![](./demo-figure/framework.png)</center>
**To predict employee attrition in the next M + 1 months, by analyzing employee data of last N months.**
Feature Extraction
========================================================
- Feature engineering
- Statistics
- max, min, standard deviation, etc.
- Time series characterization.
- Trend analysis.
- Anomaly detection.
- Time series model (ARIMA, etc.)
- Text mining
- Feature selection.
- Correlation analysis.
- Model based feature selection.
Model creation and validation
========================================================
- Supervised classification problem.
- Algorithm selection
- Logistic regression, Support vector machine, Decision tree, etc.
- Ensemble
- Bagging.
- Boosting.
- Stacking.
- Model creation
Model creation and validation
========================================================
- Cross validation.
- Confusion matrix.
- Precision.
- Recall.
- F Score.
- Area-Under-Curve.
Employee attrition prediction - R accelerator
========================================================
- What is R "accelerator"
- Lightweight end-to-end solution template.
- Follows Microsoft Team Data Science Process (TDSP) format, in a simplified version.
- Easy for prototyping, presenting, and documenting.
- Github repo https://github.com/Microsoft/acceleratoRs
Step 0 Setup
========================================================
R session for the employee attrition prediction accelerator.
```
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)
attached base packages:
[1] grid stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] maps_3.1.0 ggmap_2.6.1 scales_0.4.0
[4] jsonlite_1.0 XML_3.98-1.4 httr_1.2.1
[7] tm_0.6-2 NLP_0.1-9 rattle_4.1.0
[10] e1071_1.6-7 pROC_1.8 caretEnsemble_2.0.0
[13] caret_6.0-70 ggplot2_2.1.0 DMwR_0.4.1
[16] readr_0.2.2 stringi_1.1.2 stringr_1.1.0
[19] magrittr_1.5 dplyr_0.5.0 knitr_1.15
[22] RevoUtilsMath_10.0.0 RevoUtils_10.0.2 RevoMods_10.0.0
[25] MicrosoftML_1.0.0 mrsdeploy_1.0 RevoScaleR_9.0.1
[28] lattice_0.20-34 rpart_4.1-10
loaded via a namespace (and not attached):
[1] splines_3.3.2 foreach_1.4.3 gtools_3.5.0
[4] assertthat_0.1 TTR_0.23-1 highr_0.6
[7] sp_1.2-3 stats4_3.3.2 slam_0.1-35
[10] quantreg_5.26 chron_2.3-47 digest_0.6.10
[13] RColorBrewer_1.1-2 minqa_1.2.4 colorspace_1.2-6
[16] Matrix_1.2-7.1 plyr_1.8.4 SparseM_1.7
[19] gdata_2.17.0 jpeg_0.1-8 lme4_1.1-12
[22] MatrixModels_0.4-1 tibble_1.2 mgcv_1.8-15
[25] car_2.1-2 ROCR_1.0-7 pbapply_1.2-1
[28] nnet_7.3-12 proto_0.3-10 pbkrtest_0.4-6
[31] quantmod_0.4-5 RJSONIO_1.3-0 evaluate_0.10
[34] nlme_3.1-128 MASS_7.3-45 gplots_3.0.1
[37] xts_0.9-7 class_7.3-14 tools_3.3.2
[40] CompatibilityAPI_1.1.0 data.table_1.9.6 geosphere_1.5-5
[43] RgoogleMaps_1.2.0.7 rpart.plot_2.1.0 kernlab_0.9-24
[46] munsell_0.4.3 caTools_1.17.1 nloptr_1.0.4
[49] iterators_1.0.8 RGtk2_2.20.31 rjson_0.2.15
[52] labeling_0.3 bitops_1.0-6 gtable_0.2.0
[55] codetools_0.2-15 abind_1.4-3 DBI_0.5
[58] curl_1.2 reshape2_1.4.1 R6_2.1.3
[61] gridExtra_2.2.1 zoo_1.7-13 KernSmooth_2.23-15
[64] parallel_3.3.2 Rcpp_0.12.6 mapproj_1.2-4
[67] png_0.1-7
```
Step 1 Data exploration
========================================================
- Employee attrition data: https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/
```r
dim(df1)
```
```
[1] 1470 33
```
- Review comments data: http://www.glassdoor.com
```r
head(df2$Feedback, 3)
```
```
[1] "People are willing to share knowledge which is not the norm in this industry This is the first place I have worked where the people are great about that Also I get to work on cool projects that no one else in the world works on"
[2] "To repeat what I wrote before the people I work for and work with here are very smart and confident people"
[3] "The variety of different projects and the speed of completion I am truly very satisfied pleased there is an endless list of pros"
```
Step 2 Data preprocessing
========================================================
- Handle NAs.
- Remove non-variants.
- Normalization.
- Data type conversion.
```r
# get predictors that have no variation.
pred_no_var <- names(df1[, nearZeroVar(df1)]) %T>% print()
```
```
[1] "EmployeeCount" "StandardHours"
```
```r
# remove the zero variation predictor columns.
df1 %<>% select(-one_of(pred_no_var))
```
Step 2 Data preprocessing (Cont'd)
========================================================
```r
# convert certain integer variables to factor variables.
int_2_ftr_vars <- c("Education", "EnvironmentSatisfaction", "JobInvolvement", "JobLevel", "JobSatisfaction", "NumCompaniesWorked", "PerformanceRating", "RelationshipSatisfaction", "StockOptionLevel")
df1[, int_2_ftr_vars] <- lapply((df1[, int_2_ftr_vars]), as.factor)
```
```r
df1 %<>% mutate_if(is.character, as.factor)
```
Step 2 Feature extraction
========================================================
- Feature selection with a trained model.
```r
control <- trainControl(method="repeatedcv", number=3, repeats=1)
# train the model
model <- train(dplyr::select(df1, -Attrition),
df1$Attrition,
data=df1,
method="rf",
preProcess="scale",
trControl=control)
```
Step 2 Feature extraction (Cont'd)
========================================================
```r
# estimate variable importance
imp <- varImp(model, scale=FALSE)
plot(imp, cex=3)
```
<img src="demo-figure/unnamed-chunk-13-1.png" title="plot of chunk unnamed-chunk-13" alt="plot of chunk unnamed-chunk-13" style="display: block; margin: auto;" />
Step 2 Feature extraction (Cont'd)
========================================================
```r
# select the top-ranking variables.
imp_list <- rownames(imp$importance)[order(imp$importance$Overall, decreasing=TRUE)]
# drop the low ranking variables. Here the last 3 variables are dropped.
top_var <-
imp_list[1:(ncol(df1) - 3)] %>%
as.character() %T>%
print()
```
```
[1] "MonthlyIncome" "OverTime"
[3] "NumCompaniesWorked" "Age"
[5] "DailyRate" "TotalWorkingYears"
[7] "JobRole" "DistanceFromHome"
[9] "MonthlyRate" "HourlyRate"
[11] "YearsAtCompany" "PercentSalaryHike"
[13] "EnvironmentSatisfaction" "StockOptionLevel"
[15] "EducationField" "TrainingTimesLastYear"
[17] "YearsWithCurrManager" "JobSatisfaction"
[19] "WorkLifeBalance" "YearsSinceLastPromotion"
[21] "JobLevel" "JobInvolvement"
[23] "RelationshipSatisfaction" "Education"
[25] "YearsInCurrentRole" "MaritalStatus"
[27] "BusinessTravel" "Department"
```
Step 3 Resampling
========================================================
- Split the data set into training and testing sets.
```r
train_index <-
createDataPartition(df1$Attrition,
times=1,
p=.7) %>%
unlist()
df1_train <- df1[train_index, ]
df1_test <- df1[-train_index, ]
```
```r
table(df1_train$Attrition)
```
```
No Yes
864 166
```
- Training set is not balanced!
Step 3 Resampling (Cont'd)
========================================================
To handle imbalanced data sets
- Cost-sensitive learning.
- Resampling of data.
- Synthetic Minority Over-sampling TechniquE (SMOTE)
- Upsampling minority class synthetically.
- Downsampling majority class.
***
```r
df1_train %<>% as.data.frame()
df1_train <- SMOTE(Attrition ~ .,
df1_train,
perc.over=300,
perc.under=150)
```
```r
table(df1_train$Attrition)
```
```
No Yes
747 664
```
Step 4 Model building
========================================================
- Select algorithm for model creation.
- Tune model parameters.
```r
# initialize training control.
tc <- trainControl(method="repeatedcv",
number=3,
repeats=3,
search="grid",
classProbs=TRUE,
savePredictions="final",
summaryFunction=twoClassSummary)
```
Step 4 Model building (Cont'd)
========================================================
Let's try several machine learning algorithms.
- Support vector machine.
```r
# SVM model.
time_svm <- system.time(
model_svm <- train(Attrition ~ .,
df1_train,
method="svmRadial",
                     trControl=tc)
)
```
Step 4 Model building (Cont'd)
========================================================
- Random forest.
```r
# random forest model
time_rf <- system.time(
model_rf <- train(Attrition ~ .,
df1_train,
method="rf",
                    trControl=tc)
)
```
Step 4 Model building (Cont'd)
========================================================
- Extreme gradient boosting (XGBoost).
```r
# xgboost model.
time_xgb <- system.time(
model_xgb <- train(Attrition ~ .,
df1_train,
method="xgbLinear",
                     trControl=tc)
)
```
Step 4 Model building (Cont'd)
========================================================
An ensemble may be better?
- Ensemble of models.
- Ensemble methods - bagging, boosting, and stacking.
![](./demo-figure/stacking.png)
***
```r
# ensemble of the three models.
time_ensemble <- system.time(
model_list <- caretList(Attrition ~ .,
data=df1_train,
trControl=tc,
methodList=c("svmRadial", "rf", "xgbLinear"))
)
```
```r
# stack of models. Use glm for meta model.
model_stack <- caretStack(
model_list,
metric="ROC",
method="glm",
trControl=trainControl(
method="boot",
number=10,
savePredictions="final",
classProbs=TRUE,
summaryFunction=twoClassSummary
)
)
```
Step 5 Model evaluating
========================================================
- Confusion matrix.
- Performance measure.
```r
predictions <- lapply(models,
predict,
newdata=select(df1_test, -Attrition))
```
```r
# confusion matrix evaluation results.
cm_metrics <- lapply(predictions,
confusionMatrix,
reference=df1_test$Attrition,
positive="Yes")
```
Step 5 Model evaluating (Cont'd)
========================================================
- Comparison of different models in terms of accuracy, recall, precision, and elapsed time.
```
Models Accuracy Recall Precision Elapsed
1 SVM RBF 0.86 0.76 0.55 27.69
2 Random Forest 0.92 0.80 0.71 220.19
3 Xgboost 0.92 0.82 0.73 290.05
4 Stacking 0.92 0.85 0.71 84.36
```
- Analysis
    - Diversity of the base models affects ensemble performance.
    - Data size also has an impact.
Step 6 Sentiment analysis - a glimpse of data
========================================================
```r
# getting the data.
head(df2$Feedback, 5)
```
```
[1] "People are willing to share knowledge which is not the norm in this industry This is the first place I have worked where the people are great about that Also I get to work on cool projects that no one else in the world works on"
[2] "To repeat what I wrote before the people I work for and work with here are very smart and confident people"
[3] "The variety of different projects and the speed of completion I am truly very satisfied pleased there is an endless list of pros"
[4] "As youve probably heard Google has great benefits insurance and flexible hours If you want to you can work from home"
[5] "Great insurance benefits free food equipment is available whenever you want it to be and there is always whatever you need monitor etc"
```
Step 7 Sentiment analysis - feature extraction
========================================================
- General methods
- Initial transformation
- Removal of unnecessary elements (stopwords, numbers, punctuations, etc.).
- Stopwords: yes, no, you, I, etc.
- Translation or sentence/word alignment.
- Part-Of-Speech tagging.
- Bag-of-words model
- n-Grams.
- Term frequency (TF) or Term frequency inverse-document frequency (TF-IDF).
- Model creation
Step 7 Sentiment analysis - feature extraction (Cont'd)
========================================================
```r
# create a corpus based upon the text data.
corp_text <- Corpus(VectorSource(df2$Feedback))
```
```r
# transformation on the corpus.
corp_text %<>%
tm_map(removeNumbers) %>%
tm_map(content_transformer(tolower)) %>%
tm_map(removeWords, stopwords("english")) %>%
tm_map(removePunctuation) %>%
tm_map(stripWhitespace)
```
```r
dtm_txt_tf <-
DocumentTermMatrix(corp_text, control=list(wordLengths=c(1, Inf), weighting=weightTf))
```
```r
# remove sparse terms.
dtm_txt <-
removeSparseTerms(dtm_txt_tf, 0.99)
```
Step 7 Sentiment analysis - feature extraction (Cont'd)
========================================================
- Convert corpus to term frequency matrix.
```r
dtm_txt_sample <- removeSparseTerms(dtm_txt_tf, 0.85)
inspect(dtm_txt_sample[1:10, ])
```
```
<<DocumentTermMatrix (documents: 10, terms: 6)>>
Non-/sparse entries: 19/41
Sparsity : 68%
Maximal term length: 7
Weighting : term frequency (tf)
Terms
Docs company google great people smart work
1 0 0 1 2 0 1
2 0 0 0 2 1 2
3 0 0 0 0 0 0
4 0 1 1 0 0 1
5 0 0 1 0 0 0
6 0 0 0 2 1 1
7 0 0 0 0 0 0
8 0 0 1 1 0 3
9 0 0 1 0 0 0
10 0 0 0 1 0 1
```
Step 8 Sentiment analysis - model creation and validation
========================================================
- Create and validate a classification model.
```r
# form the data set
df_txt %<>% cbind(Attrition=df2$Attrition)
```
```r
# split data set into training and testing set.
train_index <-
createDataPartition(df_txt$Attrition,
times=1,
p=.7) %>%
unlist()
df_txt_train <- df_txt[train_index, ]
df_txt_test <- df_txt[-train_index, ]
```
Step 8 Sentiment analysis - model creation and validation (Cont'd)
========================================================
- SVM is used.
```r
# model building
model_sent <- train(Attrition ~ .,
df_txt_train,
method="svmRadial",
                    trControl=tc)
```
```r
prediction <- predict(model_sent, newdata=select(df_txt_test, -Attrition))
```
Step 8 Sentiment analysis - model creation and validation (Cont'd)
========================================================
```
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 87 13
Yes 3 47
Accuracy : 0.8933
95% CI : (0.8326, 0.9378)
No Information Rate : 0.6
P-Value [Acc > NIR] : 1.336e-15
Kappa : 0.7714
Mcnemar's Test P-Value : 0.02445
Sensitivity : 0.7833
Specificity : 0.9667
Pos Pred Value : 0.9400
Neg Pred Value : 0.8700
Prevalence : 0.4000
Detection Rate : 0.3133
Detection Prevalence : 0.3333
Balanced Accuracy : 0.8750
'Positive' Class : Yes
```
Takeaways
========================================================
- DS & ML combined with domain knowledge.
- Feature engineering takes majority of time.
- Model creation and validation.
- Sentiment analysis on text data.
- All resources available on Github! https://github.com/Microsoft/acceleratoRs/tree/master/EmployeeAttritionPrediction/Docs/FOSSAsiaMeetup
References
========================================================
1. Terence R. Mitchell et al., "Why people stay: using job embeddedness to predict voluntary turnover".
2. Bo Pang and Lillian Lee, "Opinion mining and sentiment analysis".
3. Nitesh V. Chawla et al., "SMOTE: Synthetic Minority Over-sampling Technique".
Contact
========================================================
Le Zhang
zhle@microsoft.com

Binary data: EmployeeAttritionPrediction/Docs/FOSSAsiaMeetup/models.RData (new file; binary file not shown)

Binary data: EmployeeAttritionPrediction/Docs/RevoBlog/blog_02mar2017.docx (new file; binary file not shown)