---
title: "Data Science Design Pattern for Student Drop Out"
author: "Microsoft"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Data Science Design Pattern for Student Drop Out}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, echo=FALSE}
knitr::opts_chunk$set(fig.width = 6,
                      fig.height = 4,
                      fig.align = "center",
                      dev = "png")
```
# Introduction
Welcome to the Data Science Design Pattern for Student Drop Out. This pattern provides a starting point for the data scientist exploring a new dataset. By no means is it the end point of the data science journey. The pattern is under regular revision and improvement and is provided as is.
We now introduce a generic pattern for building multiple binary classification models using R.
# Pre-configuration
We load the R packages required for modelling.
```{r, message=FALSE, warning=FALSE, error=FALSE}
########################################################################
# R SETUP
# Load required packages from local library into R.
library(magrittr) # Data pipelines: %>% %T>% %<>%.
library(stringi) # String operator: %s+%.
library(rattle) # Evaluate using riskchart().
library(unbalanced) # Resampling using ubSMOTE.
library(rpart) # Model: decision tree.
library(rpart.plot) # Draw fancyRpartPlot().
library(randomForest) # Model: random forest.
library(ada) # Model: ada boosting.
library(party) # Model: ctree and cforest.
library(e1071) # Model: support vector machine.
library(nnet) # Model: neural network.
library(Matrix) # Construct sparse matrices for xgboost.
library(caret) # Tune model hyper-parameters.
library(xgboost) # Model: extreme gradient boosting.
library(Ckmeans.1d.dp) # Plot feature importance using xgb.plot.importance.
library(DiagrammeR) # Plot xgboost tree using xgb.plot.tree.
library(ROCR) # Use prediction() for evaluation.
library(pROC) # Use auc() for evaluation.
library(ggplot2) # Visually evaluate performance.
```
# Step 4.4: Re-load Dataset
In the Data template we loaded the studentDropIndia dataset, processed it, and saved it to file. Here we re-load the dataset and review its contents. In addition, we define some support functions for evaluation.
```{r, message=FALSE, warning=FALSE, error=FALSE}
########################################################################
# DATA INGESTION

# Identify the dataset.

dsname <- "studentDropIndia"

# We define some support functions that we often find useful.

evaluateModel <- function(data, observed, predicted)
{
  # Calculate the confusion matrix.

  confusion <- table(data[[observed]], data[[predicted]],
                     dnn=c("Observed", "Predicted"))
  confusion %>% print()

  # Calculate the performance metrics.

  tp <- confusion[rownames(confusion) == 1, colnames(confusion) == 1]
  fn <- confusion[rownames(confusion) == 1, colnames(confusion) == 0]
  fp <- confusion[rownames(confusion) == 0, colnames(confusion) == 1]
  tn <- confusion[rownames(confusion) == 0, colnames(confusion) == 0]

  accuracy  <- (tp + tn) / (tp + fn + fp + tn)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  fscore    <- 2 * (precision * recall) / (precision + recall)

  # Construct the vector of performance metrics.

  metrics <- c("Accuracy" = accuracy,
               "Precision" = precision,
               "Recall" = recall,
               "F-Score" = fscore)

  # Return the vector of performance metrics.

  return(metrics)
}

rocChart <- function(pr, target)
{
  # Calculate the true positive and the false positive rates.

  rates <- pr %>%
    prediction(target) %>%
    performance("tpr", "fpr")

  # Calculate the AUC.

  auc <- pr %>%
    prediction(target) %>%
    performance("auc") %>%
    attr("y.values") %>%
    extract2(1)

  # Construct the plot.

  pl <- data.frame(tpr=attr(rates, "y.values")[[1]],
                   fpr=attr(rates, "x.values")[[1]]) %>%
    ggplot(aes(fpr, tpr)) +
    geom_line() +
    annotate("text", x=0.875, y=0.125, vjust=0,
             label=paste("AUC =", round(100*auc, 2)),
             family="xkcd") +
    xlab("False Positive Rate (1-Specificity)") +
    ylab("True Positive Rate (Sensitivity)")

  # Return the plot object.

  return(pl)
}

# Identify the dataset to load.

fpath  <- "data"
dsdate <- "_" %s+% "20161215"

# Filename of the saved dataset.

dsrdata <-
  file.path(fpath, dsname %s+% dsdate %s+% ".RData") %T>%
  print()
```
```{r, message=FALSE, warning=FALSE, error=FALSE}
# Load the R objects from file and list them.
load(dsrdata) %>% print()
```
```{r, message=FALSE, warning=FALSE, error=FALSE}
# Review the metadata.
dsname
dspath
dsdate
nobs
vars
target
id
ignore
omit
```
# Step 4.5: Prepare - Formula to Describe the Goal
We continue from the Data module, which covered Steps 1, 2, and 3 and the beginning of Step 4 of the data mining process.
The next step is to describe the model to be built by writing a formula that captures our intent. The formula describes the model as predicting the target variable from the other (suitable) variables available in the dataset. The notation names the target (continue_drop), followed by a tilde (~), followed by a period (.) to represent all other variables (in our case, the variables listed in vars).
```{r, message=FALSE, warning=FALSE, error=FALSE}
########################################################################
# PREPARE FOR MODELLING
# Formula for modelling.
form <- ds[vars] %>% formula() %T>% print()
```
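However the generated formula is printed, it is equivalent to the usual hand-written shorthand shown below (illustration only, not evaluated): the period expands to all other columns of the data passed to the model builder.
```{r, eval=FALSE}
# Equivalent hand-written form (illustration only): the `.` stands for every
# column of the modelling data other than the target.
form.dot <- continue_drop ~ .
```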
A common methodology for model building is to randomly partition the available data into a training dataset and a testing dataset. We sometimes also introduce a third dataset, the validation dataset, used during the building of the model, but for now we will use just the two.
First we (optionally) initialise the random number sequence with a randomly selected seed, and report the seed so that the experiments presented here can be repeated if required. For consistency in this module we use the particular seed 123.
Next we partition the dataset into two subsets. The first is a 70% random sample for building the model (the training dataset) and the second is the remainder, used to evaluate the performance of the model (the testing dataset).
```{r, message=FALSE, warning=FALSE, error=FALSE}
# Initialise random numbers for repeatable results.

seed <- 123
set.seed(seed)

# Partition the full dataset into two.

train <-
  sample(nobs, 0.70*nobs) %T>%
  {length(.) %>% print()}

head(train)

test <-
  seq_len(nobs) %>%
  setdiff(train) %T>%
  {length(.) %>% print()}

head(test)
```
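As a quick sanity check (illustrative only), the two index sets should be disjoint and together cover every observation:
```{r, message=FALSE, warning=FALSE, error=FALSE}
# Sanity check: train and test are disjoint and together partition all rows.
length(intersect(train, test)) == 0
length(train) + length(test) == nobs
```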
# Step 5: Resampling - Rebalancing the Proportion of Minority over Majority (Optional)
Since the minority class (students dropping out) makes up only around 5% of the whole dataset, we apply SMOTE to the training dataset using the ubSMOTE() function from the R package unbalanced. This raises the drop-out proportion to around 23% of the training data. Using the SMOTE-resampled training data as the modelling input can greatly improve model performance, especially for algorithms that are not well suited to unbalanced data.
```{r, message=FALSE, warning=FALSE, error=FALSE}
# Rebalance the training dataset.

traindata   <- as.data.frame(ds[train, inputs])
traintarget <- as.factor(as.numeric(as.data.frame(ds[train, target])[[1]])-1)

smote <- ubSMOTE(X=traindata, Y=traintarget,
                 perc.over=200, perc.under=500,
                 k=3, verbose=TRUE)

trainsmote <- cbind(smote$X, smote$Y)
names(trainsmote)[names(trainsmote) == "smote$Y"] <- "continue_drop"
traindata  <- trainsmote

# Check the dropping-out proportion.

table(traindata$continue_drop)/nrow(traindata)
```
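The 23% figure quoted above follows from the SMOTE parameters: perc.over=200 generates two synthetic drop-out cases for each original drop-out case (tripling the minority class), while perc.under=500 keeps five majority cases for each synthetic case generated. A rough back-of-the-envelope check, assuming the original drop-out rate of roughly 5%:
```{r, message=FALSE, warning=FALSE, error=FALSE}
# Back-of-the-envelope check of the post-SMOTE class balance, assuming roughly
# 5% of the training cases are drop-outs (illustrative only).
n.min    <- 0.05 * length(train)          # approximate original minority count
minority <- n.min * (1 + 200/100)         # originals plus synthetic cases
majority <- n.min * (200/100) * (500/100) # majority cases kept by perc.under
minority / (minority + majority)          # roughly 0.23
```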
# Step 6.1: Build - Decision Tree Model
Commonly used classification model builders include the rpart() decision tree, randomForest() random forest, ada() stochastic boosting, and so on. We now build an rpart() decision tree as a baseline model. Note that, for the purpose of demonstration, the models from here on are all built on the original (unbalanced) training dataset.
```{r, message=FALSE, warning=FALSE, error=FALSE}
# Train model: rpart
ctrl <- rpart.control(maxdepth=3)
system.time(m.rp <- rpart(form, ds[train, vars], control=ctrl))
m.rp
```
```{r, message=FALSE, warning=FALSE, error=FALSE}
# Record the type of the model for later use.
mtype <- "rpart"
```
We can also draw the model.
```{r, message=FALSE, warning=FALSE, error=FALSE}
fancyRpartPlot(m.rp)
```
# Step 6.2: Evaluate - Decision Tree Model
As we have noted though, performing any evaluation on the training dataset provides a biased estimate of the actual performance. We must instead evaluate the performance of our models on a previously unseen dataset (at least unseen by the algorithm building the model).
So we now evaluate the model performance on the testing dataset.
```{r, message=FALSE, warning=FALSE, error=FALSE}
# Score model
predictions <- predict(m.rp, ds[test, vars], type="prob")
threshold <- 0.5
rpart_probability <- predictions[, 2]
rpart_prediction <- ifelse(rpart_probability > threshold, 1, 0)
pred <- cbind(ds[test, vars], rpart_prediction, rpart_probability)
head(pred)
```
```{r, message=FALSE, warning=FALSE, error=FALSE}
# Evaluate model
pred$continue_drop <- as.numeric(pred$continue_drop)-1
metrics.rp <- evaluateModel(data=pred,
                            observed="continue_drop",
                            predicted="rpart_prediction")
metrics.rp
rocChart(pr=pred$rpart_probability, target=pred$continue_drop)
```
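The 0.5 cut-off above is simply a default. As an illustrative sketch (not part of the original pattern), we could re-apply evaluateModel() at a different, arbitrary threshold to trade precision against recall:
```{r, eval=FALSE, message=FALSE, warning=FALSE, error=FALSE}
# Illustration only: re-score the decision tree at a lower, arbitrary threshold.
pred$rpart_prediction <- ifelse(pred$rpart_probability > 0.3, 1, 0)
evaluateModel(data=pred,
              observed="continue_drop",
              predicted="rpart_prediction")
```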
# Step 6.3: Compare - Multiple Models using Experiment
We can repeat the modelling multiple times, randomly selecting different datasets for training, to get an estimate of the expected performance and of the variation we see in that performance. The helper function experi() can assist us here. It is available from http://onepager.togaware.com/experi.R and we show some of its code below.
```{r, message=FALSE, warning=FALSE, error=FALSE}
# Show the function experi().

experi <- function(form, ds, dsname, target, modeller, details="",
                   n=100, control=NULL,
                   keep=FALSE, # Keep the last model built.
                   prob="prob",
                   class="class",
                   log="experi.log")
{
  suppressPackageStartupMessages(require(pROC))

  user <- Sys.getenv("LOGNAME")
  node <- Sys.info()[["nodename"]]

  wsrpart.model <- modeller=="wsrpart"

  numclass <- length(levels(ds[,target]))

  start.time <- proc.time()

  seeds <- cors <- strs <- aucs <- accs <- NULL

  for (i in seq_len(n))
  {
    loop.time <- proc.time()

    seeds <- c(seeds, seed <- sample(1:1000000, 1))
    set.seed(seed)

    ....

    result[-c(1:7)] <- round(result[-c(1:7)], 2)

    row.names(result) <- NULL

    if (keep)
    {
      if (numclass==2)
      {
        attr(result, "pr") <- pr
        attr(result, "test") <- test
      }
      attr(result, "model") <- model
    }
  }
  return(result)
}
```
Let's run the experiments using the algorithms rpart (Therneau and Atkinson, 2014), randomForest (Breiman et al., 2012), ada (Culp et al., 2012), and ctree() from party (Hothorn et al., 2013). In this way we can conveniently build these models and compare their performance.
```{r, message=FALSE, warning=FALSE, error=FALSE}
# # Source experi.R
#
# source("http://onepager.togaware.com/experi.R")
#
# # Set the times of loops
#
# n <- 10
#
# # Run experiments
#
# ex.rp <- experi(form, ds[vars], dsname, target, "rpart", "1", n=n, keep=TRUE)
# ex.rf <- experi(form, ds[vars], dsname, target, "randomForest", "500", n=n, keep=TRUE, control=list(na.action=na.omit))
# ex.ad <- experi(form, ds[vars], dsname, target, "ada", "50", n=n, keep=TRUE)
# ex.ct <- experi(form, ds[vars], dsname, target, "ctree", "1", n=n, keep=TRUE)
#
# # Compare results
#
# results <- rbind(ex.rp, ex.rf, ex.ad, ex.ct)
# rownames(results) <- results$modeller
# results$modeller <- NULL
# results
```
# Step 7.1: Other Models - Support Vector Machine Model
Beyond the commonly used binary classification models above, we could also try some more advanced models, for instance svm() (support vector machine), nnet() (neural network), and xgboost() (extreme gradient boosting). We first build an svm() support vector machine model here.
```{r, message=FALSE, warning=FALSE, error=FALSE}
# Tune hyper-parameters.

system.time({
  m.svm.cv <- tune.svm(form,
                       data=ds[train, vars],
                       gamma=2^(-1:1),
                       cost=2^(2:4),
                       type="C-classification",
                       probability=TRUE,
                       scale=FALSE)
})

print(m.svm.cv$best.performance)
```
```{r, message=FALSE, warning=FALSE, error=FALSE}
# Train model: svm.

system.time({
  m.svm <- svm(form,
               data=ds[train, vars],
               gamma=as.numeric(m.svm.cv$best.parameters[1]),
               cost=as.numeric(m.svm.cv$best.parameters[2]),
               type="C-classification",
               probability=TRUE,
               scale=FALSE)
})

# Check the model information.

m.svm
```
Then we score the model on the testing dataset and evaluate its performance.
```{r, message=FALSE, warning=FALSE, error=FALSE}
# Score model
predictions <- predict(m.svm, ds[test, vars], probability=TRUE)
threshold <- 0.5
svm_probability <- attr(predictions, 'probabilities')[, 2]
svm_prediction <- ifelse(svm_probability > threshold, 1, 0)
pred <- cbind(ds[test, vars], svm_prediction, svm_probability)
head(pred)
```
```{r, message=FALSE, warning=FALSE, error=FALSE}
# Evaluate model
pred$continue_drop <- as.numeric(pred$continue_drop)-1
metrics.svm <- evaluateModel(data=pred,
                             observed="continue_drop",
                             predicted="svm_prediction")
metrics.svm
rocChart(pr=pred$svm_probability, target=pred$continue_drop)
```
# Step 7.2: Other Models - Neural Network Model
Next we build a nnet() neural network model.
```{r, message=FALSE, warning=FALSE, error=FALSE}
# Tune hyper-parameters.

system.time({
  m.nnet.cv <- tune.nnet(form,
                         data=ds[train, vars],
                         size=c(2, 4, 6, 8, 10),
                         decay=5*10^(-5:-1),
                         rang=0.1,
                         maxit=200)
})

print(m.nnet.cv$best.performance)
```
```{r, message=FALSE, warning=FALSE, error=FALSE}
# Train model: nnet.

system.time({
  m.nnet <- nnet(formula=form,
                 data=ds[train, vars],
                 size=as.numeric(m.nnet.cv$best.parameters[1]),
                 decay=as.numeric(m.nnet.cv$best.parameters[2]),
                 rang=0.1,
                 maxit=200)
})

# Check the model information.

m.nnet
```
Then we score the model on the testing dataset and evaluate its performance.
```{r, message=FALSE, warning=FALSE, error=FALSE}
# Score model
predictions <- predict(m.nnet, ds[test, vars], type="raw")
threshold <- 0.5
nnet_probability <- predictions
nnet_prediction <- ifelse(nnet_probability > threshold, 1, 0)
pred <- cbind(ds[test, vars], nnet_prediction, nnet_probability)
head(pred)
```
```{r, message=FALSE, warning=FALSE, error=FALSE}
# Evaluate model
pred$continue_drop <- as.numeric(pred$continue_drop)-1
metrics.nnet <- evaluateModel(data=pred,
                              observed="continue_drop",
                              predicted="nnet_prediction")
metrics.nnet
rocChart(pr=pred$nnet_probability, target=pred$continue_drop)
```
# Step 7.3: Other Models - Extreme Gradient Boosting Model
Finally, we build an xgboost() extreme gradient boosting model as a special example, since it performs well when dealing with unbalanced data. In our case the proportion of student drop-outs is around 5% in the original training dataset, and we use that original (unbalanced) data as the input to demonstrate how well xgboost() copes with class imbalance.
```{r, message=FALSE, warning=FALSE, error=FALSE}
# Re-structure the training data set
traindata <- ds[train, inputs]
traindata[, c(1:ncol(traindata))] <- sapply(traindata[, c(1:ncol(traindata))], as.numeric)
ntrain <- as.matrix(traindata[ , c(1:ncol(traindata))])
dtrain <- list()
dtrain$data <- Matrix(ntrain, sparse=TRUE)
dtrain$label <- as.numeric(as.data.frame(ds[train, target])[[1]]) - 1
dtrain %>% str()
```
```{r, message=FALSE, warning=FALSE, error=FALSE}
# Tune hyper-parameters
cv.ctrl <- trainControl(method="cv",
                        number=5,
                        verboseIter=TRUE,
                        returnData=FALSE,
                        returnResamp="all",              # Save losses across all models.
                        classProbs=TRUE,                 # Needed for the AUC to be computed.
                        summaryFunction=twoClassSummary,
                        allowParallel=TRUE)

grid.xgb <- expand.grid(nrounds=2,
                        max_depth=2^(1:3),
                        eta=1*10^(-2:0),
                        min_child_weight=1,
                        colsample_bytree=1,
                        subsample=1,
                        gamma=0)

set.seed(45)

m.xgb.cv <- train(x=ntrain,
                  y=as.data.frame(ds[train, target])[[1]],
                  method="xgbTree",
                  trControl=cv.ctrl,
                  tuneGrid=grid.xgb,
                  verbose=TRUE,
                  metric="ROC", # twoClassSummary reports the AUC as "ROC".
                  nthread=2)
m.xgb.cv
```
```{r, message=FALSE, warning=FALSE, error=FALSE}
# Train model: xgboost
system.time({
  m.xgb <- xgboost(data=dtrain$data,
                   label=dtrain$label,
                   nround=m.xgb.cv$bestTune[[1]],
                   max.depth=m.xgb.cv$bestTune[[2]],
                   eta=m.xgb.cv$bestTune[[3]],
                   min_child_weight=1,
                   colsample_bytree=1,
                   subsample=1,
                   gamma=0,
                   nthread=2,
                   objective="binary:logistic")
})
m.xgb
```
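As an aside, and not part of the pipeline saved below, xgboost also exposes a scale_pos_weight parameter that is often set to the majority/minority ratio to counter class imbalance directly. A minimal sketch, reusing dtrain and the tuned parameters from above:
```{r, eval=FALSE, message=FALSE, warning=FALSE, error=FALSE}
# Illustration only: weight the positive (drop-out) class by the class ratio.
spw <- sum(dtrain$label == 0) / sum(dtrain$label == 1)

m.xgb.weighted <- xgboost(data=dtrain$data,
                          label=dtrain$label,
                          nround=m.xgb.cv$bestTune[[1]],
                          max.depth=m.xgb.cv$bestTune[[2]],
                          eta=m.xgb.cv$bestTune[[3]],
                          scale_pos_weight=spw,
                          nthread=2,
                          objective="binary:logistic")
```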
```{r, message=FALSE, warning=FALSE, error=FALSE}
# Calculate feature importance
importance <- xgb.importance(feature_names=dtrain$data@Dimnames[[2]],
                             model=m.xgb)
print(importance)
```
```{r, message=FALSE, warning=FALSE, error=FALSE}
# Visualize feature importance
xgb.plot.importance(importance)
```
```{r, message=FALSE, warning=FALSE, error=FALSE}
# Plot a boosted tree model
xgb.plot.tree(dtrain$data@Dimnames[[2]], model=m.xgb)
```
Now we score the model on the testing dataset and evaluate its performance.
```{r, message=FALSE, warning=FALSE, error=FALSE}
# Re-structure the testing data set
testdata <- ds[test, inputs]
testdata[, c(1:ncol(testdata))] <- sapply(testdata[, c(1:ncol(testdata))], as.numeric)

ntest <- as.matrix(testdata[, c(1:ncol(testdata))])

dtest       <- list()
dtest$data  <- Matrix(ntest, sparse=TRUE)
dtest$label <- as.numeric(as.data.frame(ds[test, target])[[1]]) - 1
dtest %>% str()
```
```{r, message=FALSE, warning=FALSE, error=FALSE}
# Score model
predictions <- predict(m.xgb, dtest$data)
threshold <- 0.5
xgboost_probability <- predictions
xgboost_prediction <- ifelse(xgboost_probability > threshold, 1, 0)
pred <- cbind(testdata, dtest$label, xgboost_prediction, xgboost_probability)
names(pred)[names(pred) == "dtest$label"] <- target
head(pred)
```
```{r, message=FALSE, warning=FALSE, error=FALSE}
# Evaluate model
metrics.xgb <- evaluateModel(data=pred,
                             observed="continue_drop",
                             predicted="xgboost_prediction")
metrics.xgb
rocChart(pr=pred$xgboost_probability, target=pred$continue_drop)
```
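Having evaluated each model with the same helper, we can also place the metrics side by side for a quick comparison (a sketch only; it assumes the metrics.rp, metrics.svm, metrics.nnet, and metrics.xgb vectors computed above are still in the session):
```{r, eval=FALSE, message=FALSE, warning=FALSE, error=FALSE}
# Illustration only: side-by-side view of the metrics computed above.
rbind(rpart=metrics.rp,
      svm=metrics.svm,
      nnet=metrics.nnet,
      xgboost=metrics.xgb) %>%
  round(4)
```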
# Step 8: Finish Up - Save Model
We save the model, together with the dataset and other variables, into a binary R file. Here we use the xgboost model as an example.
```{r, message=FALSE, warning=FALSE, error=FALSE}
model <- m.xgb
mtype <- 'xgboost'
pr <- xgboost_probability
cl <- xgboost_prediction
dname <- "models"
if (! file.exists(dname)) dir.create(dname)
time.stamp <- format(Sys.time(), "%Y%m%d_%H%M%S")
fstem <- paste(dsname, mtype, time.stamp, sep="_")
(fname <- file.path(dname, sprintf("%s.RData", fstem)))
```
```{r, message=FALSE, warning=FALSE, error=FALSE}
save(ds, dsname, vars, target, ignore,
form, nobs, seed, train, test, model, mtype, pr, cl,
file=fname)
```
We can then load this later and replicate the process.
```{r, message=FALSE, warning=FALSE, error=FALSE}
(load(fname))
```
Note that by using generic variable names we can load different model files and perform common operations on them without changing the names within a script. However, do note that each time we load such a saved model file we overwrite any other variables of the same name.
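For example, a later script might work with the reloaded objects through those generic names alone (a sketch only; the exact scoring call for new data still depends on mtype):
```{r, eval=FALSE, message=FALSE, warning=FALSE, error=FALSE}
# Illustration only: inspect the reloaded objects via their generic names.
mtype                                     # which model builder produced `model`
head(pr)                                  # saved test-set probabilities
head(cl)                                  # saved test-set class predictions
table(Observed=as.data.frame(ds[test, target])[[1]],
      Predicted=cl)
```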