Merge branch 'master' of github.com:Microsoft/acceleratoRs
This commit is contained in:
Commit
91ec3bea03
|
@ -0,0 +1,297 @@
|
|||
---
|
||||
title: "Deploy a Credit Risk Model as a Web Service"
|
||||
author: "Fang Zhou, Data Scientist, Microsoft"
|
||||
date: "`r Sys.Date()`"
|
||||
output: html_document
|
||||
---
|
||||
|
||||
```{r setup, include=FALSE, purl=FALSE}
|
||||
knitr::opts_chunk$set(echo = TRUE,
|
||||
fig.width = 8,
|
||||
fig.height = 5,
|
||||
fig.align='center',
|
||||
dev = "png")
|
||||
```
|
||||
|
||||
## 1 Introduction
|
||||
|
||||
The `mrsdeploy` package, delivered with Microsoft R Client and R Server, provides functions for:
|
||||
|
||||
**1** Establishing a remote session in an R console application for the purpose of executing code on the remote server.
|
||||
|
||||
**2** Publishing and managing an R web service that is backed by the R code block or script you provide.
|
||||
|
||||
Each feature can be used independently, but the greatest value is achieved when you can leverage both.
|
||||
|
||||
This document walks you through how to deploy a credit risk model as a web service using the `mrsdeploy` package.
|
||||
|
||||
It starts by building a model locally, then publishes the model as a web service, shares the service with other authenticated users for consumption, and finally manages and updates the service.
|
||||
|
||||
## 2 Automated Credit Risk Model Deployment
|
||||
|
||||
### 2.1 Setup
|
||||
|
||||
We load the required R packages.
|
||||
|
||||
```{r, message=FALSE, warning=FALSE, error=FALSE}
|
||||
## Setup
|
||||
|
||||
# Load the required packages into the R session.
|
||||
|
||||
library(rattle) # Use normVarNames().
|
||||
library(dplyr) # Wrangling: tbl_df(), group_by(), print(), glimpse().
|
||||
library(magrittr) # Pipe operator %>% %<>% %T>% equals().
|
||||
library(scales) # Include commas in numbers.
|
||||
library(MicrosoftML) # Build models using Microsoft ML algorithms.
|
||||
library(mrsdeploy) # Publish an R model as a web service.
|
||||
```
|
||||
|
||||
Next, the dataset `processedSimu` is ingested for demonstration. This dataset was created by the data preprocessing steps in the data science accelerator for credit risk prediction.
|
||||
|
||||
```{r, message=FALSE, warning=FALSE, error=FALSE}
|
||||
## Data Ingestion
|
||||
|
||||
# Identify the source location of the dataset.
|
||||
|
||||
#DATA <- "../../Data/"
|
||||
#txn_fname <- file.path(DATA, "Raw/processedSimu.csv")
|
||||
|
||||
wd <- getwd()
|
||||
|
||||
dpath <- "../Data"
|
||||
data_fname <- file.path(wd, dpath, "processedSimu.csv")
|
||||
|
||||
# Ingest the dataset.
|
||||
|
||||
data <- read.csv(file=data_fname) %T>%
|
||||
{dim(.) %>% comma() %>% cat("\n")}
|
||||
|
||||
# A glimpse into the data.
|
||||
|
||||
glimpse(data)
|
||||
```
|
||||
|
||||
### 2.2 Model Locally
|
||||
|
||||
Now, let's start building an R model-based web service.
|
||||
|
||||
First, we train a fast tree model on the dataset `processedSimu` using the function `rxFastTrees()` from the `MicrosoftML` package. The model can predict whether an account will default, or its probability of default, given transaction statistics plus demographic and bank account information as inputs.
|
||||
|
||||
```{r, message=FALSE, warning=FALSE, error=FALSE}
|
||||
## Variable roles.
|
||||
|
||||
# Target variable
|
||||
|
||||
target <- "bad_flag"
|
||||
|
||||
# Note any identifier.
|
||||
|
||||
id <- c("account_id") %T>% print()
|
||||
|
||||
# Note the available variables as model inputs.
|
||||
|
||||
vars <- setdiff(names(data), c(target, id))
|
||||
```
|
||||
|
||||
```{r, message=FALSE, warning=FALSE, error=FALSE}
|
||||
# Split Data
|
||||
|
||||
set.seed(42)
|
||||
|
||||
data <- data[order(runif(nrow(data))), ]
|
||||
|
||||
train <- sample(nrow(data), 0.70 * nrow(data))
|
||||
test <- setdiff(seq_len(nrow(data)), train)
|
||||
```
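The index-based split above can be sanity-checked on a toy data frame. The sketch below is a minimal base R illustration and not part of the accelerator: `toy` is a made-up data frame, and the check only confirms that the `train` and `test` index sets are disjoint and together cover every row. Because `sample()` already draws a random subset, the preceding `order(runif(...))` shuffle is not strictly required for a random split.

```r
# Toy stand-in for the credit data (hypothetical values).
set.seed(42)
toy <- data.frame(account_id = sprintf("a_%03d", 1:10),
                  bad_flag   = rep(c(0, 1), 5))

# Same index-based 70/30 split as in the chunk above.
train <- sample(nrow(toy), 0.70 * nrow(toy))
test  <- setdiff(seq_len(nrow(toy)), train)

# The two index sets should partition the rows.
stopifnot(length(intersect(train, test)) == 0,
          setequal(c(train, test), seq_len(nrow(toy))))

c(train = length(train), test = length(test))
```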
|
||||
|
||||
```{r, message=FALSE, warning=FALSE, error=FALSE}
|
||||
# Prepare the formula
|
||||
|
||||
top_vars <- c("amount_6", "pur_6", "avg_pur_amt_6", "avg_interval_pur_6", "credit_limit", "age", "income", "sex", "education", "marital_status")
|
||||
|
||||
form <- as.formula(paste(target, paste(top_vars, collapse="+"), sep="~"))
|
||||
form
|
||||
```
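Base R also provides `reformulate()` for building a formula from character vectors, which avoids pasting strings by hand. The sketch below is an optional alternative, not what the accelerator uses; it builds the formula both ways with the variable names from the chunk above and checks that the two results contain the same response and terms.

```r
target   <- "bad_flag"
top_vars <- c("amount_6", "pur_6", "avg_pur_amt_6", "avg_interval_pur_6",
              "credit_limit", "age", "income", "sex", "education",
              "marital_status")

# String-pasting approach used above.
form_paste <- as.formula(paste(target, paste(top_vars, collapse = "+"), sep = "~"))

# Alternative construction with stats::reformulate().
form_reform <- reformulate(termlabels = top_vars, response = target)

# Both formulas should name the same response and the same terms.
stopifnot(identical(all.vars(form_paste), all.vars(form_reform)),
          identical(attr(terms(form_paste), "term.labels"),
                    attr(terms(form_reform), "term.labels")))

form_reform
```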
|
||||
|
||||
```{r, message=FALSE, warning=FALSE, error=FALSE}
|
||||
# Train model: rxFastTrees
|
||||
|
||||
model_rxtrees <- rxFastTrees(formula=form,
|
||||
data=data[train, c(target, vars)],
|
||||
type="binary",
|
||||
numTrees=100,
|
||||
numLeaves=20,
|
||||
learningRate=0.2,
|
||||
minSplit=10,
|
||||
unbalancedSets=FALSE,
|
||||
verbose=0)
|
||||
|
||||
model_rxtrees
|
||||
```
|
||||
|
||||
```{r, message=FALSE, warning=FALSE, error=FALSE}
|
||||
# Produce a prediction function that can use the model
|
||||
|
||||
creditRiskPrediction <- function(account_id, amount_6, pur_6, avg_pur_amt_6, avg_interval_pur_6,
|
||||
credit_limit, marital_status, sex, education, income, age)
|
||||
{
|
||||
newdata <- data.frame(account_id=account_id,
|
||||
amount_6=amount_6,
|
||||
pur_6=pur_6,
|
||||
avg_pur_amt_6=avg_pur_amt_6,
|
||||
avg_interval_pur_6=avg_interval_pur_6,
|
||||
credit_limit=credit_limit,
|
||||
marital_status=marital_status,
|
||||
sex=sex,
|
||||
education=education,
|
||||
income=income,
|
||||
age=age)
|
||||
|
||||
pred <- rxPredict(modelObject=model_rxtrees, data=newdata)[, c(1, 3)]
|
||||
pred <- cbind(newdata$account_id, pred)
|
||||
names(pred) <- c("account_id", "scored_label", "scored_prob")
|
||||
pred
|
||||
}
|
||||
|
||||
# Test function locally by printing results
|
||||
|
||||
pred <- creditRiskPrediction(account_id="a_1055521029582310",
|
||||
amount_6=173.22,
|
||||
pur_6=1,
|
||||
avg_pur_amt_6=173.22,
|
||||
avg_interval_pur_6=0,
|
||||
credit_limit=5.26,
|
||||
marital_status="married",
|
||||
sex="male",
|
||||
education="undergraduate",
|
||||
income=12.36,
|
||||
age=38)
|
||||
|
||||
print(pred)
|
||||
```
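The wrapper pattern above (scalar inputs in, one-row data frame out) can be tried without `MicrosoftML` installed. The sketch below is only an illustration under stated assumptions: it swaps in a base R `glm()` logistic regression for `rxFastTrees()`, trains on simulated data rather than `processedSimu`, and uses just two predictors. The names `toy`, `toy_model` and `toyRiskPrediction()` are hypothetical.

```r
# Simulated training data (hypothetical; not the processedSimu dataset).
set.seed(42)
n   <- 200
toy <- data.frame(amount_6     = runif(n, 0, 500),
                  credit_limit = runif(n, 1, 10))
toy$bad_flag <- rbinom(n, 1, plogis(-2 + 0.005 * toy$amount_6))

# Stand-in model: logistic regression instead of rxFastTrees().
toy_model <- glm(bad_flag ~ amount_6 + credit_limit,
                 data = toy, family = binomial())

# Scoring function: scalar inputs in, one-row data frame out.
toyRiskPrediction <- function(account_id, amount_6, credit_limit) {
  newdata <- data.frame(account_id   = account_id,
                        amount_6     = amount_6,
                        credit_limit = credit_limit,
                        stringsAsFactors = FALSE)
  prob <- predict(toy_model, newdata = newdata, type = "response")
  data.frame(account_id   = newdata$account_id,
             scored_label = as.integer(prob > 0.5),
             scored_prob  = unname(prob),
             stringsAsFactors = FALSE)
}

# Test the function locally, as done above for creditRiskPrediction().
pred <- toyRiskPrediction(account_id = "a_1055521029582310",
                          amount_6 = 173.22, credit_limit = 5.26)
print(pred)
```

Keeping the output column names (`account_id`, `scored_label`, `scored_prob`) identical to those of the real function makes it easier to swap the stand-in for the `rxFastTrees()` model later.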
|
||||
|
||||
### 2.2 Publish model as a web service
|
||||
|
||||
The second step is to publish the model as a web service by following the steps below.
|
||||
|
||||
Step 1: From your local R IDE, log into Microsoft R Server with your credentials using the appropriate authentication function from the `mrsdeploy` package (`remoteLogin()` or `remoteLoginAAD()`).
|
||||
|
||||
For simplicity, the code below uses the basic local admin account for authentication with the `remoteLogin()` function and sets `session = FALSE` so that no remote R session is started.
|
||||
|
||||
```{r, message=FALSE, warning=FALSE, error=FALSE}
|
||||
# Use `remoteLogin` to authenticate with R Server using
|
||||
# the local admin account. Use session = FALSE so that no
|
||||
# remote R session is started.
|
||||
|
||||
remoteLogin("http://localhost:12800",
|
||||
username="admin",
|
||||
password="P@ssw0rd",
|
||||
session=FALSE)
|
||||
```
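Hard-coding a password is convenient for a demo but is best avoided in shared scripts. One possible alternative, sketched below, reads the connection details from environment variables. The variable names `MRS_ENDPOINT`, `MRS_USER` and `MRS_PASSWORD` and the helper `mrs_login_args()` are assumptions made for this illustration and are not part of `mrsdeploy`; the login itself only runs when the package is installed and all three variables are set.

```r
# Hypothetical helper: collect remoteLogin() arguments from environment
# variables so that no password is stored in the script itself.
mrs_login_args <- function() {
  vals  <- Sys.getenv(c("MRS_ENDPOINT", "MRS_USER", "MRS_PASSWORD"))
  unset <- names(vals)[!nzchar(vals)]
  if (length(unset) > 0) {
    stop("Unset environment variable(s): ", paste(unset, collapse = ", "))
  }
  list(deployr_endpoint = vals[["MRS_ENDPOINT"]],
       username         = vals[["MRS_USER"]],
       password         = vals[["MRS_PASSWORD"]],
       session          = FALSE)
}

# Log in only when mrsdeploy is installed and the variables are set.
if (requireNamespace("mrsdeploy", quietly = TRUE) &&
    all(nzchar(Sys.getenv(c("MRS_ENDPOINT", "MRS_USER", "MRS_PASSWORD"))))) {
  do.call(mrsdeploy::remoteLogin, mrs_login_args())
}
```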
|
||||
|
||||
Now, you are successfully connected to the remote R Server.
|
||||
|
||||
Step 2: Publish the model as a web service to R Server using the `publishService()` function from the `mrsdeploy` package.
|
||||
|
||||
In this example, you publish a web service called "crpService" using the model `model_rxtrees` and the function `creditRiskPrediction()`. As input, the service takes a list of transaction statistics and demographic and bank account information, represented as numeric or character values. As output, it returns an R data frame containing the account id, the predicted default label, and the probability of default for the given account, as produced by the pre-defined credit risk prediction function.
|
||||
|
||||
When publishing, you must specify, among other parameters, a service name and version, the R code, the inputs, as well as the outputs that application developers will need to integrate in their applications.
|
||||
|
||||
```{r, message=FALSE, warning=FALSE, error=FALSE}
|
||||
# Publish a web service
|
||||
|
||||
api <- publishService(
|
||||
"crpService",
|
||||
code=creditRiskPrediction,
|
||||
model=model_rxtrees,
|
||||
inputs=list(account_id="character",
|
||||
amount_6="numeric",
|
||||
pur_6="numeric",
|
||||
avg_pur_amt_6="numeric",
|
||||
avg_interval_pur_6="numeric",
|
||||
credit_limit="numeric",
|
||||
marital_status="character",
|
||||
sex="character",
|
||||
education="character",
|
||||
income="numeric",
|
||||
age="numeric"),
|
||||
outputs=list(pred="data.frame"),
|
||||
v="v1.0.0")
|
||||
```
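In the call above, the names in `inputs` mirror the arguments of `creditRiskPrediction()`. With eleven arguments it is easy to mistype one, so a quick check before publishing can help. The sketch below is a base R illustration only: `scoreAccount()` is a shortened stand-in for the real function and `check_inputs()` is a hypothetical helper, not an `mrsdeploy` function.

```r
# Shortened stand-in for creditRiskPrediction() (hypothetical).
scoreAccount <- function(account_id, amount_6, age) {
  data.frame(account_id = account_id, amount_6 = amount_6, age = age)
}

# Input specification in the same style as the publishService() call.
inputs <- list(account_id = "character",
               amount_6   = "numeric",
               age        = "numeric")

# Hypothetical helper: report argument names that appear on only one side.
check_inputs <- function(fun, inputs) {
  args <- names(formals(fun))
  list(missing_from_inputs = setdiff(args, names(inputs)),
       unknown_in_inputs   = setdiff(names(inputs), args))
}

check_inputs(scoreAccount, inputs)
```

Both elements of the returned list are empty when the specification and the function signature agree.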
|
||||
|
||||
### 2.3 Test the service by consuming it in R
|
||||
|
||||
After publishing it, we can consume the service directly in R to verify that the results are as expected.
|
||||
|
||||
```{r, message=FALSE, warning=FALSE, error=TRUE}
|
||||
# Get service and assign service to the variable `api`.
|
||||
|
||||
api <- getService("crpService", "v1.0.0")
|
||||
|
||||
# Consume the service by calling the function `creditRiskPrediction()` that it contains.
|
||||
|
||||
result <- api$creditRiskPrediction(account_id="a_1055521029582310",
|
||||
amount_6=173.22,
|
||||
pur_6=1,
|
||||
avg_pur_amt_6=173.22,
|
||||
avg_interval_pur_6=0,
|
||||
credit_limit=5.26,
|
||||
marital_status="married",
|
||||
sex="male",
|
||||
education="undergraduate",
|
||||
income=12.36,
|
||||
age=38)
|
||||
|
||||
# Print response output named `pred`
|
||||
|
||||
print(result$output("pred"))
|
||||
```
|
||||
|
||||
### 2.4 Update the web service
|
||||
|
||||
In production, the web service can be managed and updated in a timely manner.
|
||||
|
||||
```{r, message=FALSE, warning=FALSE, error=TRUE}
|
||||
# Load the pre-trained optimal model obtained from the template of CreditRiskScale.
|
||||
|
||||
load(file="model_rxtrees.RData")
|
||||
|
||||
model_rxtrees
|
||||
|
||||
api <- updateService(name="crpService",
|
||||
v="v1.0.0",
|
||||
model=model_rxtrees,
|
||||
descr="Update the model hyper-parameters")
|
||||
|
||||
# Re-test the updated service by consuming it
|
||||
|
||||
result <- api$creditRiskPrediction(account_id="a_1055521029582310",
|
||||
amount_6=173.22,
|
||||
pur_6=1,
|
||||
avg_pur_amt_6=173.22,
|
||||
avg_interval_pur_6=0,
|
||||
credit_limit=5.26,
|
||||
marital_status="married",
|
||||
sex="male",
|
||||
education="undergraduate",
|
||||
income=12.36,
|
||||
age=38)
|
||||
|
||||
# Print response output named `pred`
|
||||
|
||||
print(result$output("pred"))
|
||||
```
|
||||
|
||||
### 2.5 Application Integration
|
||||
|
||||
Last but not least, we can get the Swagger JSON file that is needed for application integration.
|
||||
|
||||
```{r, message=FALSE, warning=FALSE, error=TRUE}
|
||||
# Get this service's `swagger.json` file that is needed for web application integration
|
||||
|
||||
swagger <- api$swagger(json = FALSE)
|
||||
|
||||
# Delete the service to make the script re-runnable
|
||||
|
||||
deleteService(name="crpService", v="v1.0.0")
|
||||
```
|
File diff suppressed because one or more lines are too long
|
@ -0,0 +1,518 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"---\n",
|
||||
"title: \"Deploy a Credit Risk Model as a Web Service\"\n",
|
||||
"author: \"Fang Zhou, Data Scientist, Microsoft\"\n",
|
||||
"date: \"`r Sys.Date()`\"\n",
|
||||
"output: html_document\n",
|
||||
"---"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"attributes": {
|
||||
"classes": [],
|
||||
"id": "",
|
||||
"include": "FALSE,",
|
||||
"purl": "FALSE"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"knitr::opts_chunk$set(echo = TRUE,\n",
|
||||
" fig.width = 8,\n",
|
||||
" fig.height = 5,\n",
|
||||
" fig.align='center',\n",
|
||||
" dev = \"png\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 1 Introduction\n",
|
||||
"\n",
|
||||
"The `mrsdeploy` package, delivered with Microsoft R Client and R Server, provides functions for:\n",
|
||||
"\n",
|
||||
"**1** Establishing a remote session in an R console application for the purpose of executing code on the remote server.\n",
|
||||
"\n",
|
||||
"**2** Publishing and managing an R web service that is backed by the R code block or script you provided. \n",
|
||||
"\n",
|
||||
"Each feature can be used independently, but the greatest value is achieved when you can leverage both.\n",
|
||||
"\n",
|
||||
"This document walks you through how to deploy a credit risk model as a web service using the `mrsdeploy` package.\n",
|
||||
"\n",
|
||||
"It starts by building a model locally, then publishes the model as a web service, shares the service with other authenticated users for consumption, and finally manages and updates the service. \n",
|
||||
"\n",
|
||||
"## 2 Automated Credit Risk Model Deployment\n",
|
||||
"\n",
|
||||
"### 2.1 Setup\n",
|
||||
"\n",
|
||||
"We load the required R packages."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"attributes": {
|
||||
"classes": [],
|
||||
"error": "FALSE",
|
||||
"id": "",
|
||||
"message": "FALSE,",
|
||||
"warning": "FALSE,"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"## Setup\n",
|
||||
"\n",
|
||||
"# Load the required packages into the R session.\n",
|
||||
"\n",
|
||||
"library(rattle) # Use normVarNames().\n",
|
||||
"library(dplyr) # Wrangling: tbl_df(), group_by(), print(), glimpse().\n",
|
||||
"library(magrittr) # Pipe operator %>% %<>% %T>% equals().\n",
|
||||
"library(scales) # Include commas in numbers.\n",
|
||||
"library(MicrosoftML) # Build models using Microsoft ML algorithms.\n",
|
||||
"library(mrsdeploy) # Publish an R model as a web service."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Then, the dataset processedSimu is ingested for demonstration. This dataset was created by the data preprocessing steps in the data science accelerator for credit risk prediction."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"attributes": {
|
||||
"classes": [],
|
||||
"error": "FALSE",
|
||||
"id": "",
|
||||
"message": "FALSE,",
|
||||
"warning": "FALSE,"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"## Data Ingestion\n",
|
||||
"\n",
|
||||
"# Identify the source location of the dataset.\n",
|
||||
"\n",
|
||||
"#DATA <- \"../../Data/\"\n",
|
||||
"#txn_fname <- file.path(DATA, \"Raw/processedSimu.csv\")\n",
|
||||
"\n",
|
||||
"wd <- getwd()\n",
|
||||
"\n",
|
||||
"dpath <- \"../Data\"\n",
|
||||
"data_fname <- file.path(wd, dpath, \"processedSimu.csv\")\n",
|
||||
"\n",
|
||||
"# Ingest the dataset.\n",
|
||||
"\n",
|
||||
"data <- read.csv(file=data_fname) %T>% \n",
|
||||
" {dim(.) %>% comma() %>% cat(\"\\n\")}\n",
|
||||
"\n",
|
||||
"# A glimpse into the data.\n",
|
||||
"\n",
|
||||
"glimpse(data)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### 2.2 Model Locally\n",
|
||||
"\n",
|
||||
"Now, let's start building an R model-based web service. \n",
|
||||
"\n",
|
||||
"First of all, we create a machine learning fast tree model on the dataset processedSimu by using the function `rxFastTrees()` from the `MicrosoftML` package. This model could be used to predict whether an account will default or to predict its probability of default, given some transaction statistics and demographic & bank account information as inputs."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"attributes": {
|
||||
"classes": [],
|
||||
"error": "FALSE",
|
||||
"id": "",
|
||||
"message": "FALSE,",
|
||||
"warning": "FALSE,"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"## Variable roles.\n",
|
||||
"\n",
|
||||
"# Target variable\n",
|
||||
"\n",
|
||||
"target <- \"bad_flag\"\n",
|
||||
"\n",
|
||||
"# Note any identifier.\n",
|
||||
"\n",
|
||||
"id <- c(\"account_id\") %T>% print() \n",
|
||||
"\n",
|
||||
"# Note the available variables as model inputs.\n",
|
||||
"\n",
|
||||
"vars <- setdiff(names(data), c(target, id))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"attributes": {
|
||||
"classes": [],
|
||||
"error": "FALSE",
|
||||
"id": "",
|
||||
"message": "FALSE,",
|
||||
"warning": "FALSE,"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Split Data\n",
|
||||
"\n",
|
||||
"set.seed(42)\n",
|
||||
"\n",
|
||||
"data <- data[order(runif(nrow(data))), ]\n",
|
||||
"\n",
|
||||
"train <- sample(nrow(data), 0.70 * nrow(data))\n",
|
||||
"test <- setdiff(seq_len(nrow(data)), train)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"attributes": {
|
||||
"classes": [],
|
||||
"error": "FALSE",
|
||||
"id": "",
|
||||
"message": "FALSE,",
|
||||
"warning": "FALSE,"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Prepare the formula\n",
|
||||
"\n",
|
||||
"top_vars <- c(\"amount_6\", \"pur_6\", \"avg_pur_amt_6\", \"avg_interval_pur_6\", \"credit_limit\", \"age\", \"income\", \"sex\", \"education\", \"marital_status\")\n",
|
||||
"\n",
|
||||
"form <- as.formula(paste(target, paste(top_vars, collapse=\"+\"), sep=\"~\"))\n",
|
||||
"form"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"attributes": {
|
||||
"classes": [],
|
||||
"error": "FALSE",
|
||||
"id": "",
|
||||
"message": "FALSE,",
|
||||
"warning": "FALSE,"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Train model: rxFastTrees\n",
|
||||
"\n",
|
||||
"model_rxtrees <- rxFastTrees(formula=form,\n",
|
||||
" data=data[train, c(target, vars)],\n",
|
||||
" type=\"binary\",\n",
|
||||
" numTrees=100,\n",
|
||||
" numLeaves=20,\n",
|
||||
" learningRate=0.2,\n",
|
||||
" minSplit=10,\n",
|
||||
" unbalancedSets=FALSE,\n",
|
||||
" verbose=0)\n",
|
||||
"\n",
|
||||
"model_rxtrees"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"attributes": {
|
||||
"classes": [],
|
||||
"error": "FALSE",
|
||||
"id": "",
|
||||
"message": "FALSE,",
|
||||
"warning": "FALSE,"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Produce a prediction function that can use the model\n",
|
||||
"\n",
|
||||
"creditRiskPrediction <- function(account_id, amount_6, pur_6, avg_pur_amt_6, avg_interval_pur_6, \n",
|
||||
" credit_limit, marital_status, sex, education, income, age)\n",
|
||||
"{ \n",
|
||||
" newdata <- data.frame(account_id=account_id,\n",
|
||||
" amount_6=amount_6, \n",
|
||||
" pur_6=pur_6, \n",
|
||||
" avg_pur_amt_6=avg_pur_amt_6, \n",
|
||||
" avg_interval_pur_6=avg_interval_pur_6, \n",
|
||||
" credit_limit=credit_limit, \n",
|
||||
" marital_status=marital_status, \n",
|
||||
" sex=sex, \n",
|
||||
" education=education, \n",
|
||||
" income=income, \n",
|
||||
" age=age)\n",
|
||||
" \n",
|
||||
" pred <- rxPredict(modelObject=model_rxtrees, data=newdata)[, c(1, 3)]\n",
|
||||
" pred <- cbind(newdata$account_id, pred)\n",
|
||||
" names(pred) <- c(\"account_id\", \"scored_label\", \"scored_prob\")\n",
|
||||
" pred \n",
|
||||
"}\n",
|
||||
"\n",
|
||||
"# Test function locally by printing results\n",
|
||||
"\n",
|
||||
"pred <- creditRiskPrediction(account_id=\"a_1055521029582310\",\n",
|
||||
" amount_6=173.22, \n",
|
||||
" pur_6=1, \n",
|
||||
" avg_pur_amt_6=173.22, \n",
|
||||
" avg_interval_pur_6=0, \n",
|
||||
" credit_limit=5.26, \n",
|
||||
" marital_status=\"married\", \n",
|
||||
" sex=\"male\", \n",
|
||||
" education=\"undergraduate\", \n",
|
||||
" income=12.36, \n",
|
||||
" age=38)\n",
|
||||
"\n",
|
||||
"print(pred)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### 2.2 Publish model as a web service\n",
|
||||
"\n",
|
||||
"The second step is to publish the model as a web service by following the steps below.\n",
|
||||
"\n",
|
||||
"Step 1: From your local R IDE, log into Microsoft R Server with your credentials using the appropriate authentication function from the `mrsdeploy` package (remoteLogin or remoteLoginAAD). \n",
|
||||
"\n",
|
||||
"For simplicity, the code below uses the basic local admin account for authentication with the `remoteLogin()` function and sets `session = FALSE` so that no remote R session is started."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"attributes": {
|
||||
"classes": [],
|
||||
"error": "FALSE",
|
||||
"id": "",
|
||||
"message": "FALSE,",
|
||||
"warning": "FALSE,"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Use `remoteLogin` to authenticate with R Server using \n",
|
||||
"# the local admin account. Use session = FALSE so that no \n",
|
||||
"# remote R session started\n",
|
||||
"\n",
|
||||
"remoteLogin(\"http://localhost:12800\", \n",
|
||||
" username=\"admin\", \n",
|
||||
" password=\"P@ssw0rd\",\n",
|
||||
" session=FALSE)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now, you are successfully connected to the remote R Server.\n",
|
||||
"\n",
|
||||
"Step 2: Publish the model as a web service to R Server using the `publishService()` function from the `mrsdeploy` package. \n",
|
||||
"\n",
|
||||
"In this example, you publish a web service called \"crpService\" using the model `model_rxtrees` and the function `creditRiskPrediction()`. As input, the service takes a list of transaction statistics and demographic and bank account information, represented as numeric or character values. As output, it returns an R data frame containing the account id, the predicted default label, and the probability of default for the given account, as produced by the pre-defined credit risk prediction function. \n",
|
||||
"\n",
|
||||
"When publishing, you must specify, among other parameters, a service name and version, the R code, the inputs, as well as the outputs that application developers will need to integrate in their applications."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"attributes": {
|
||||
"classes": [],
|
||||
"error": "FALSE",
|
||||
"id": "",
|
||||
"message": "FALSE,",
|
||||
"warning": "FALSE,"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Publish a web service\n",
|
||||
"\n",
|
||||
"api <- publishService(\n",
|
||||
" \"crpService\",\n",
|
||||
" code=creditRiskPrediction,\n",
|
||||
" model=model_rxtrees,\n",
|
||||
" inputs=list(account_id=\"character\",\n",
|
||||
" amount_6=\"numeric\", \n",
|
||||
" pur_6=\"numeric\", \n",
|
||||
" avg_pur_amt_6=\"numeric\", \n",
|
||||
" avg_interval_pur_6=\"numeric\", \n",
|
||||
" credit_limit=\"numeric\", \n",
|
||||
" marital_status=\"character\", \n",
|
||||
" sex=\"character\", \n",
|
||||
" education=\"character\", \n",
|
||||
" income=\"numeric\", \n",
|
||||
" age=\"numeric\"),\n",
|
||||
" outputs=list(pred=\"data.frame\"),\n",
|
||||
" v=\"v1.0.0\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### 2.3 Test the service by consuming it in R\n",
|
||||
"\n",
|
||||
"After publishing it, we can consume the service directly in R to verify that the results are as expected."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"attributes": {
|
||||
"classes": [],
|
||||
"error": "TRUE",
|
||||
"id": "",
|
||||
"message": "FALSE,",
|
||||
"warning": "FALSE,"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Get service and assign service to the variable `api`.\n",
|
||||
"\n",
|
||||
"api <- getService(\"crpService\", \"v1.0.0\")\n",
|
||||
"\n",
|
||||
"# Consume service by calling function, `creditRiskPrediction` contained in this service\n",
|
||||
"\n",
|
||||
"result <- api$creditRiskPrediction(account_id=\"a_1055521029582310\",\n",
|
||||
" amount_6=173.22, \n",
|
||||
" pur_6=1, \n",
|
||||
" avg_pur_amt_6=173.22, \n",
|
||||
" avg_interval_pur_6=0, \n",
|
||||
" credit_limit=5.26, \n",
|
||||
" marital_status=\"married\", \n",
|
||||
" sex=\"male\", \n",
|
||||
" education=\"undergraduate\", \n",
|
||||
" income=12.36, \n",
|
||||
" age=38)\n",
|
||||
"\n",
|
||||
"# Print response output named `pred`\n",
|
||||
"\n",
|
||||
"print(result$output(\"pred\")) "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### 2.4 Update the web service\n",
|
||||
"\n",
|
||||
"In production, the web service can be managed and updated in a timely manner."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"attributes": {
|
||||
"classes": [],
|
||||
"error": "TRUE",
|
||||
"id": "",
|
||||
"message": "FALSE,",
|
||||
"warning": "FALSE,"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Load the pre-trained optimal model obtained from the template of CreditRiskScale.\n",
|
||||
"\n",
|
||||
"load(file=\"model_rxtrees.RData\")\n",
|
||||
"\n",
|
||||
"model_rxtrees\n",
|
||||
"\n",
|
||||
"api <- updateService(name=\"crpService\", \n",
|
||||
" v=\"v1.0.0\",\n",
|
||||
" model=model_rxtrees,\n",
|
||||
" descr=\"Update the model hyper-parameters\")\n",
|
||||
"\n",
|
||||
"# Re-test the updated service by consuming it\n",
|
||||
"\n",
|
||||
"result <- api$creditRiskPrediction(account_id=\"a_1055521029582310\",\n",
|
||||
" amount_6=173.22, \n",
|
||||
" pur_6=1, \n",
|
||||
" avg_pur_amt_6=173.22, \n",
|
||||
" avg_interval_pur_6=0, \n",
|
||||
" credit_limit=5.26, \n",
|
||||
" marital_status=\"married\", \n",
|
||||
" sex=\"male\", \n",
|
||||
" education=\"undergraduate\", \n",
|
||||
" income=12.36, \n",
|
||||
" age=38)\n",
|
||||
"\n",
|
||||
"# Print response output named `pred`\n",
|
||||
"\n",
|
||||
"print(result$output(\"pred\")) "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### 2.5 Application Integration\n",
|
||||
"\n",
|
||||
"Last but not least, we can get the json file that is needed for application integration."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"attributes": {
|
||||
"classes": [],
|
||||
"error": "TRUE",
|
||||
"id": "",
|
||||
"message": "FALSE,",
|
||||
"warning": "FALSE,"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Get this service's `swagger.json` file that is needed for web application integration\n",
|
||||
"\n",
|
||||
"swagger <- api$swagger(json = FALSE)\n",
|
||||
"\n",
|
||||
"# Delete the service to make the script re-runnable\n",
|
||||
"\n",
|
||||
"deleteService(name=\"crpService\", v=\"v1.0.0\")"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 1
|
||||
}
|
|
@ -0,0 +1,354 @@
|
|||
---
|
||||
title: "Faster and Scalable Credit Risk Prediction"
|
||||
author: "Fang Zhou, Data Scientist, Microsoft"
|
||||
date: "`r Sys.Date()`"
|
||||
output: html_document
|
||||
---
|
||||
|
||||
```{r setup, include=FALSE, purl=FALSE}
|
||||
knitr::opts_chunk$set(echo = TRUE,
|
||||
fig.width = 8,
|
||||
fig.height = 5,
|
||||
fig.align='center',
|
||||
dev = "png")
|
||||
```
|
||||
|
||||
## 1 Introduction
|
||||
|
||||
Microsoft R is a collection of servers and tools that extend the capabilities of R, making it easier and faster to build and deploy R-based solutions. Microsoft R brings you the ability to do parallel and chunked data processing and modelling that relax the restrictions on dataset size imposed by in-memory open source R.
|
||||
|
||||
The `MicrosoftML` package brings new machine learning functionality with increased speed, performance and scalability, especially for handling a large corpus of text data or high-dimensional categorical data. The `MicrosoftML` package is installed with **Microsoft R Client**, **Microsoft R Server** and **SQL Server Machine Learning Services**.
|
||||
|
||||
This document walks you through how to build faster and more scalable credit risk models using the `MicrosoftML` package, which adds state-of-the-art machine learning algorithms and data transforms to Microsoft R Server.
|
||||
|
||||
## 2 Faster and Scalable Credit Risk Models
|
||||
|
||||
### 2.1 Setup
|
||||
|
||||
We load the required R packages.
|
||||
|
||||
```{r, message=FALSE, warning=FALSE, error=FALSE}
|
||||
## Setup
|
||||
|
||||
# Load the required packages into the R session.
|
||||
|
||||
library(rattle) # Use normVarNames().
|
||||
library(dplyr) # Wrangling: tbl_df(), group_by(), print(), glimpse().
|
||||
library(magrittr) # Pipe operator %>% %<>% %T>% equals().
|
||||
library(scales) # Include commas in numbers.
|
||||
library(RevoScaleR) # Enable out-of-memory computation in R.
|
||||
library(dplyrXdf) # Wrangling on xdf data format.
|
||||
library(MicrosoftML) # Build models using Microsoft ML algorithms.
|
||||
library(caret) # Calculate confusion matrix by using confusionMatrix().
|
||||
library(ROCR) # Provide functions for model performance evaluation.
|
||||
```
|
||||
|
||||
Next, the dataset `processedSimu` is ingested and converted to the `.xdf` data format. This dataset was created by the data preprocessing steps in the data science accelerator for credit risk prediction.
|
||||
|
||||
```{r, message=FALSE, warning=FALSE, error=FALSE}
|
||||
## Data Ingestion
|
||||
|
||||
# Identify the source location of the dataset.
|
||||
|
||||
#DATA <- "../../Data/"
|
||||
#data_fname <- file.path(DATA, "Raw/processedSimu.csv")
|
||||
|
||||
wd <- getwd()
|
||||
|
||||
dpath <- "../Data"
|
||||
data_fname <- file.path(wd, dpath, "processedSimu.csv")
|
||||
output_fname <- file.path(wd, dpath, "processedSimu.xdf")
|
||||
output <- RxXdfData(file=output_fname)
|
||||
|
||||
# Ingest the dataset.
|
||||
|
||||
data <- rxImport(inData=data_fname,
|
||||
outFile=output,
|
||||
stringsAsFactors=TRUE,
|
||||
overwrite=TRUE)
|
||||
|
||||
|
||||
# View data information.
|
||||
|
||||
rxGetVarInfo(data)
|
||||
```
|
||||
|
||||
### 2.2 Model Building
|
||||
|
||||
Now, let's start building credit risk models by leveraging different machine learning algorithms from the `MicrosoftML` package.
|
||||
|
||||
First, we create individual machine learning models on the dataset `processedSimu.xdf` using the functions `rxLogisticRegression()`, `rxFastForest()`, and `rxFastTrees()`.
|
||||
|
||||
From the credit risk prediction template, we know that gradient boosting is the most suitable algorithm for this example, considering the overall performance. Therefore, several `rxFastTrees()` models are trained, each with a different set of parameters.
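One way to keep several parameter sets organised is a named list that is looped over with `system.time()`. The sketch below is a minimal illustration, not the accelerator's code: it uses a placeholder `train_stub()` in place of `rxFastTrees()` so that it runs without Microsoft R, and the three configurations repeat the `numTrees` and `learningRate` values used for the boosted tree models in this document.

```r
# Parameter sets to compare (numTrees and learningRate as in this document).
configs <- list(
  trees1 = list(numTrees = 100, learningRate = 0.2),
  trees2 = list(numTrees = 500, learningRate = 0.2),
  trees3 = list(numTrees = 500, learningRate = 0.3)
)

# Placeholder trainer (hypothetical): stands in for rxFastTrees().
train_stub <- function(numTrees, learningRate) {
  list(numTrees = numTrees, learningRate = learningRate)
}

# Train each configuration and record the elapsed time.
results <- lapply(configs, function(cfg) {
  elapsed <- system.time(model <- do.call(train_stub, cfg))[["elapsed"]]
  list(model = model, elapsed = elapsed)
})

# Elapsed seconds per configuration.
sapply(results, function(r) r$elapsed)
```

Replacing `train_stub` with a closure that calls `rxFastTrees()` with the fixed arguments (`formula`, `data`, `type`, and so on) would give the same bookkeeping for the real models.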
|
||||
|
||||
```{r, message=FALSE, warning=FALSE, error=FALSE}
|
||||
## Variable roles.
|
||||
|
||||
# Target variable
|
||||
|
||||
target <- "bad_flag"
|
||||
|
||||
# Note any identifier.
|
||||
|
||||
id <- c("account_id") %T>% print()
|
||||
|
||||
# Note the available variables as model inputs.
|
||||
|
||||
vars <- setdiff(names(data), c(target, id))
|
||||
```
|
||||
|
||||
```{r, message=FALSE, warning=FALSE, error=FALSE}
|
||||
# Split Data
|
||||
|
||||
set.seed(42)
|
||||
|
||||
# Add training/testing flag to each observation.
|
||||
|
||||
data %<>%
|
||||
mutate(.train=factor(sample(1:2, .rxNumRows,
|
||||
replace=TRUE,
|
||||
prob=c(0.70, 0.30)),
|
||||
levels=1:2))
|
||||
|
||||
# Split dataset into training/test.
|
||||
|
||||
data_split <- rxSplit(data, splitByFactor=".train")
|
||||
```
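
The splitting step can be illustrated without an `.xdf` file. The following is a minimal base R sketch, assuming a small in-memory data frame stands in for the real data; the names `toy` and `toy_split` and the simulated values are hypothetical. It draws the same 1/2 flag with probabilities 0.70/0.30 and then splits on that flag.

```r
# Minimal sketch in base R (assumption: an in-memory data frame stands in
# for the .xdf data; `toy` and `toy_split` are hypothetical names).

set.seed(42)

toy <- data.frame(account_id=1:1000,
                  bad_flag=factor(sample(c("no", "yes"), 1000, replace=TRUE)))

# Flag each row as training (1) or testing (2) with probabilities 0.70/0.30.

toy$.train <- factor(sample(1:2, nrow(toy),
                            replace=TRUE,
                            prob=c(0.70, 0.30)),
                     levels=1:2)

# Split on the flag: element "1" is the training set, element "2" the test set.

toy_split <- split(toy, toy$.train)

sapply(toy_split, nrow)
```

Because the flag is drawn at random, the split is only approximately 70/30; fixing the seed makes it reproducible.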
|
||||
|
||||
```{r, message=FALSE, warning=FALSE, error=FALSE}
|
||||
# Prepare the formula
|
||||
|
||||
top_vars <- c("amount_6", "pur_6", "avg_pur_amt_6", "avg_interval_pur_6", "credit_limit", "age", "income", "sex", "education", "marital_status")
|
||||
|
||||
form <- as.formula(paste(target, paste(top_vars, collapse="+"), sep="~"))
|
||||
form
|
||||
```
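
To see what the formula construction produces, here is a minimal base R sketch with a shortened, illustrative variable list (the three predictors are taken from `top_vars`; the shortening is only for readability).

```r
# Minimal sketch: how the model formula is assembled from strings (base R only).

target <- "bad_flag"
top_vars <- c("amount_6", "credit_limit", "age")  # shortened list for illustration

# Join the predictors with "+" and attach the target on the left of "~".

rhs <- paste(top_vars, collapse="+")
form <- as.formula(paste(target, rhs, sep="~"))

# List the variables in the formula, response first.

all.vars(form)
```

Building the formula from a character vector keeps the variable selection in one place, so changing `top_vars` is enough to retrain every model on a different feature set.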
|
||||
|
||||
```{r, message=FALSE, warning=FALSE, error=FALSE}
|
||||
# Specify the local parallel compute context.
|
||||
|
||||
rxSetComputeContext("localpar")
|
||||
|
||||
# Train model: rxLogisticRegression
|
||||
|
||||
time_rxlogit <- system.time(
|
||||
|
||||
model_rxlogit <- rxLogisticRegression(
|
||||
formula=form,
|
||||
data=data_split[[1]],
|
||||
type="binary",
|
||||
l1Weight=1,
|
||||
verbose=0
|
||||
)
|
||||
)
|
||||
|
||||
# Train model: rxFastForest
|
||||
|
||||
time_rxforest <- system.time(
|
||||
|
||||
model_rxforest <- rxFastForest(
|
||||
formula=form,
|
||||
data=data_split[[1]],
|
||||
type="binary",
|
||||
numTrees=100,
|
||||
numLeaves=20,
|
||||
minSplit=10,
|
||||
verbose=0
|
||||
)
|
||||
)
|
||||
|
||||
# Train model: rxFastTrees
|
||||
|
||||
time_rxtrees1 <- system.time(
|
||||
|
||||
model_rxtrees1 <- rxFastTrees(
|
||||
formula=form,
|
||||
data=data_split[[1]],
|
||||
type="binary",
|
||||
numTrees=100,
|
||||
numLeaves=20,
|
||||
learningRate=0.2,
|
||||
minSplit=10,
|
||||
unbalancedSets=FALSE,
|
||||
verbose=0
|
||||
)
|
||||
)
|
||||
|
||||
time_rxtrees2 <- system.time(
|
||||
|
||||
model_rxtrees2 <- rxFastTrees(
|
||||
formula=form,
|
||||
data=data_split[[1]],
|
||||
type="binary",
|
||||
numTrees=500,
|
||||
numLeaves=20,
|
||||
learningRate=0.2,
|
||||
minSplit=10,
|
||||
unbalancedSets=FALSE,
|
||||
verbose=0
|
||||
)
|
||||
)
|
||||
|
||||
time_rxtrees3 <- system.time(
|
||||
|
||||
model_rxtrees3 <- rxFastTrees(
|
||||
formula=form,
|
||||
data=data_split[[1]],
|
||||
type="binary",
|
||||
numTrees=500,
|
||||
numLeaves=20,
|
||||
learningRate=0.3,
|
||||
minSplit=10,
|
||||
unbalancedSets=FALSE,
|
||||
verbose=0
|
||||
)
|
||||
)
|
||||
|
||||
time_rxtrees4 <- system.time(
|
||||
|
||||
model_rxtrees4 <- rxFastTrees(
|
||||
formula=form,
|
||||
data=data_split[[1]],
|
||||
type="binary",
|
||||
numTrees=500,
|
||||
numLeaves=20,
|
||||
learningRate=0.3,
|
||||
minSplit=10,
|
||||
unbalancedSets=TRUE,
|
||||
verbose=0
|
||||
)
|
||||
)
|
||||
```
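
Each model above is trained inside `system.time()`, so the fit and its timing are captured in a single call. A minimal sketch of that pattern, assuming base R's `lm()` and the built-in `cars` data as stand-ins for the `MicrosoftML` learners:

```r
# Minimal sketch of the timing pattern (assumption: lm() on the built-in
# `cars` data replaces the MicrosoftML learners used in this document).

timing <- system.time(
  fit <- lm(dist ~ speed, data=cars)
)

# The third element of the result is the elapsed (wall-clock) time in seconds.

names(timing)[3]

elapsed <- timing[["elapsed"]]
```

The assignment inside `system.time()` takes effect in the calling environment, so `fit` is available afterwards just like `model_rxlogit` and the other models.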
|
||||
|
||||
Next, we build an ensemble of fast tree models by using the function `rxEnsemble()`.
|
||||
|
||||
```{r, message=FALSE, warning=FALSE, error=FALSE}
|
||||
# Train an ensemble model.
|
||||
|
||||
time_ensemble <- system.time(
|
||||
|
||||
model_ensemble <- rxEnsemble(
|
||||
formula=form,
|
||||
data=data_split[[1]],
|
||||
type="binary",
|
||||
trainers=list(fastTrees(),
|
||||
fastTrees(numTrees=500),
|
||||
fastTrees(numTrees=500, learningRate=0.3),
|
||||
fastTrees(numTrees=500, learningRate=0.3, unbalancedSets=TRUE)),
|
||||
combineMethod="vote",
|
||||
replace=TRUE,
|
||||
verbose=0
|
||||
)
|
||||
)
|
||||
```
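
The idea behind `combineMethod="vote"` can be sketched in base R. This is not how `rxEnsemble()` is implemented internally; it only illustrates combining class predictions by majority vote, and the predictions and the helper `majority_vote()` are hypothetical.

```r
# Minimal sketch of majority voting (assumption: hypothetical class
# predictions from three models for five accounts).

votes <- data.frame(m1=c("yes", "no",  "no", "yes", "no"),
                    m2=c("yes", "yes", "no", "no",  "no"),
                    m3=c("no",  "yes", "no", "yes", "no"),
                    stringsAsFactors=FALSE)

# Hypothetical helper: return the most frequent label in a vector.

majority_vote <- function(x) names(which.max(table(x)))

# Apply the vote row by row, i.e. once per account.

ensemble_pred <- apply(votes, 1, majority_vote)

ensemble_pred
```

With an odd number of voters and two classes there are no ties; with an even number, `which.max()` would pick the label that sorts first.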
|
||||
|
||||
### 2.3 Model Evaluation
|
||||
|
||||
Finally, we evaluate the models built above and compare them on several metrics.
|
||||
|
||||
```{r, message=FALSE, warning=FALSE, error=FALSE}
|
||||
# Predict
|
||||
|
||||
models <- list(model_rxlogit, model_rxforest,
|
||||
model_rxtrees1, model_rxtrees2, model_rxtrees3, model_rxtrees4,
|
||||
model_ensemble)
|
||||
|
||||
# Predict class
|
||||
|
||||
predictions <- lapply(models,
|
||||
rxPredict,
|
||||
data=data_split[[2]]) %>%
|
||||
lapply('[[', 1)
|
||||
|
||||
levels(predictions[[7]]) <- c("no", "yes")
|
||||
|
||||
# Confusion matrix evaluation results.
|
||||
|
||||
cm_metrics <-lapply(predictions,
|
||||
confusionMatrix,
|
||||
reference=data_split[[2]][[target]],
|
||||
positive="yes")
|
||||
|
||||
# Accuracy
|
||||
|
||||
acc_metrics <-
|
||||
lapply(cm_metrics, `[[`, "overall") %>%
|
||||
lapply(`[`, 1) %>%
|
||||
unlist() %>%
|
||||
as.vector()
|
||||
|
||||
# Recall
|
||||
|
||||
rec_metrics <-
|
||||
lapply(cm_metrics, `[[`, "byClass") %>%
|
||||
lapply(`[`, 1) %>%
|
||||
unlist() %>%
|
||||
as.vector()
|
||||
|
||||
# Precision
|
||||
|
||||
pre_metrics <-
|
||||
lapply(cm_metrics, `[[`, "byClass") %>%
|
||||
lapply(`[`, 3) %>%
|
||||
unlist() %>%
|
||||
as.vector()
|
||||
|
||||
# Predict class probability
|
||||
|
||||
probs <- lapply(models[c(1, 2, 3, 4, 5, 6)],
|
||||
rxPredict,
|
||||
data=data_split[[2]]) %>%
|
||||
lapply('[[', 3)
|
||||
|
||||
# Create prediction object
|
||||
|
||||
preds <- lapply(probs,
|
||||
ROCR::prediction,
|
||||
labels=data_split[[2]][[target]])
|
||||
|
||||
# Auc
|
||||
|
||||
auc_metrics <- lapply(preds,
|
||||
ROCR::performance,
|
||||
"auc") %>%
|
||||
lapply(slot, "y.values") %>%
|
||||
lapply('[[', 1) %>%
|
||||
unlist()
|
||||
|
||||
auc_metrics <- c(auc_metrics, NaN)
|
||||
|
||||
algo_list <- c("rxLogisticRegression",
|
||||
"rxFastForest",
|
||||
"rxFastTrees",
|
||||
"rxFastTrees(500)",
|
||||
"rxFastTrees(500, 0.3)",
|
||||
"rxFastTrees(500, 0.3, ub)",
|
||||
"rxEnsemble")
|
||||
|
||||
time_consumption <- c(time_rxlogit[[3]], time_rxforest[[3]],
|
||||
time_rxtrees1[[3]], time_rxtrees2[[3]],
|
||||
time_rxtrees3[[3]], time_rxtrees4[[3]],
|
||||
time_ensemble[[3]])
|
||||
|
||||
df_comp <-
|
||||
data.frame(Models=algo_list,
|
||||
Accuracy=acc_metrics,
|
||||
Recall=rec_metrics,
|
||||
Precision=pre_metrics,
|
||||
AUC=auc_metrics,
|
||||
Time=time_consumption) %T>%
|
||||
print()
|
||||
```
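
The metrics extracted from `confusionMatrix()` can be checked by hand. In `caret`, the first `byClass` entry is the sensitivity (recall) and the third is the positive predictive value (precision), which is what the indices 1 and 3 above select. The following minimal base R sketch computes the same quantities from a small, hypothetical set of labels.

```r
# Minimal sketch: accuracy, recall and precision from a 2x2 table
# (assumption: hypothetical labels for ten accounts, "yes" is the positive class).

actual <- factor(c("yes", "yes", "yes", "no", "no", "no", "no", "yes", "no", "no"),
                 levels=c("no", "yes"))

predicted <- factor(c("yes", "no", "yes", "no", "yes", "no", "no", "yes", "yes", "no"),
                    levels=c("no", "yes"))

cm <- table(Predicted=predicted, Actual=actual)

tp <- cm["yes", "yes"]  # predicted yes, actually yes
fp <- cm["yes", "no"]   # predicted yes, actually no
fn <- cm["no", "yes"]   # predicted no, actually yes
tn <- cm["no", "no"]    # predicted no, actually no

accuracy  <- (tp + tn) / sum(cm)
recall    <- tp / (tp + fn)
precision <- tp / (tp + fp)

c(Accuracy=accuracy, Recall=recall, Precision=precision)
```

For credit risk data, where bad accounts are typically the minority, recall and precision on the positive class are usually more informative than accuracy alone.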
|
||||
|
||||
### 2.4 Save Models for Deployment
|
||||
|
||||
Last but not least, we save the model objects for later deployment. They can be stored in various formats (e.g., `.RData`, `SQLServerData`, etc.).
|
||||
|
||||
```{r, message=FALSE, warning=FALSE, error=FALSE}
|
||||
# Save model for deployment usage.
|
||||
|
||||
model_rxtrees <- model_rxtrees3
|
||||
|
||||
save(model_rxtrees, file="model_rxtrees.RData")
|
||||
```
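
A saved `.RData` file is restored with `load()`, which recreates the object under its original name. The following is a minimal sketch of the round trip, assuming an `lm()` fit on the built-in `cars` data as a stand-in for `model_rxtrees` and a temporary file path; `model_toy`, `model_file` and `restored` are hypothetical names.

```r
# Minimal sketch of saving and restoring a model (assumption: lm() on the
# built-in `cars` data replaces the rxFastTrees model).

model_toy <- lm(dist ~ speed, data=cars)

model_file <- file.path(tempdir(), "model_toy.RData")

save(model_toy, file=model_file)

# Restore into a separate environment, as a deployment script might do.

restored <- new.env()

loaded_names <- load(model_file, envir=restored)

# The restored object can be used for scoring right away.

predict(restored$model_toy, newdata=data.frame(speed=10))
```

`load()` returns the names of the objects it restored, which is a convenient check that the file contains the expected model.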
|
||||
|
File diff suppressed because one or more lines are too long
|
@ -0,0 +1,525 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"---\n",
|
||||
"title: \"Faster and Scalable Credit Risk Prediction\"\n",
|
||||
"author: \"Fang Zhou, Data Scientist, Microsoft\"\n",
|
||||
"date: \"`r Sys.Date()`\"\n",
|
||||
"output: html_document\n",
|
||||
"---"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"attributes": {
|
||||
"classes": [],
|
||||
"id": "",
|
||||
"include": "FALSE,",
|
||||
"purl": "FALSE"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"knitr::opts_chunk$set(echo = TRUE,\n",
|
||||
" fig.width = 8,\n",
|
||||
" fig.height = 5,\n",
|
||||
" fig.align='center',\n",
|
||||
" dev = \"png\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 1 Introduction\n",
|
||||
"\n",
|
||||
"Microsoft R is a collection of servers and tools that extend the capabilities of R, making it easier and faster to build and deploy R-based solutions. Microsoft R brings you the ability to do parallel and chunked data processing and modelling that relax the restrictions on dataset size imposed by in-memory open source R. \n",
|
||||
"\n",
|
||||
"The `MicrosoftML` package brings new machine learning functionality with increased speed, performance and scalability, especially for handling a large corpus of text data or high-dimensional categorical data. The `MicrosoftML` package is installed with **Microsoft R Client**, **Microsoft R Server** and with the **SQL Server Machine Learning Services**.\n",
|
||||
"\n",
|
||||
"This document walks you through building faster and more scalable credit risk models using the `MicrosoftML` package, which adds state-of-the-art machine learning algorithms and data transforms to Microsoft R Server.\n",
|
||||
"\n",
|
||||
"## 2 Faster and Scalable Credit Risk Models\n",
|
||||
"\n",
|
||||
"### 2.1 Setup\n",
|
||||
"\n",
|
||||
"We load the required R packages."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"attributes": {
|
||||
"classes": [],
|
||||
"error": "FALSE",
|
||||
"id": "",
|
||||
"message": "FALSE,",
|
||||
"warning": "FALSE,"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"## Setup\n",
|
||||
"\n",
|
||||
"# Load the required packages into the R session.\n",
|
||||
"\n",
|
||||
"library(rattle) # Use normVarNames().\n",
|
||||
"library(dplyr) # Wrangling: tbl_df(), group_by(), print(), glimpse().\n",
|
||||
"library(magrittr) # Pipe operator %>% %<>% %T>% equals().\n",
|
||||
"library(scales) # Include commas in numbers.\n",
|
||||
"library(RevoScaleR) # Enable out-of-memory computation in R.\n",
|
||||
"library(dplyrXdf) # Wrangling on xdf data format.\n",
|
||||
"library(MicrosoftML) # Build models using Microsoft ML algorithms.\n",
|
||||
"library(caret) # Calculate confusion matrix by using confusionMatrix().\n",
|
||||
"library(ROCR) # Provide functions for model performance evaluation."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Then, the dataset processedSimu is ingested and transformed into a `.xdf` data format. This dataset was created by the data preprocessing steps in the data science accelerator for credit risk prediction."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"attributes": {
|
||||
"classes": [],
|
||||
"error": "FALSE",
|
||||
"id": "",
|
||||
"message": "FALSE,",
|
||||
"warning": "FALSE,"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"## Data Ingestion\n",
|
||||
"\n",
|
||||
"# Identify the source location of the dataset.\n",
|
||||
"\n",
|
||||
"#DATA <- \"../../Data/\"\n",
|
||||
"#data_fname <- file.path(DATA, \"Raw/processedSimu.csv\")\n",
|
||||
"\n",
|
||||
"wd <- getwd()\n",
|
||||
"\n",
|
||||
"dpath <- \"../Data\"\n",
|
||||
"data_fname <- file.path(wd, dpath, \"processedSimu.csv\")\n",
|
||||
"output_fname <- file.path(wd, dpath, \"processedSimu.xdf\")\n",
|
||||
"output <- RxXdfData(file=output_fname)\n",
|
||||
"\n",
|
||||
"# Ingest the dataset.\n",
|
||||
"\n",
|
||||
"data <- rxImport(inData=data_fname, \n",
|
||||
" outFile=output,\n",
|
||||
" stringsAsFactors=TRUE,\n",
|
||||
" overwrite=TRUE)\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"# View data information.\n",
|
||||
"\n",
|
||||
"rxGetVarInfo(data)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### 2.2 Model Building\n",
|
||||
"\n",
|
||||
"Now we build credit risk models using several machine learning algorithms from the `MicrosoftML` package.\n",
|
||||
"\n",
|
||||
"First, we train individual models on the dataset `processedSimu.xdf` using the functions `rxLogisticRegression()`, `rxFastForest()`, and `rxFastTrees()`.\n",
|
||||
"\n",
|
||||
"From the credit risk prediction template, we know that gradient boosting gives the best overall performance for this example. We therefore train several `rxFastTrees()` models, each with a different set of parameters."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"attributes": {
|
||||
"classes": [],
|
||||
"error": "FALSE",
|
||||
"id": "",
|
||||
"message": "FALSE,",
|
||||
"warning": "FALSE,"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"## Variable roles.\n",
|
||||
"\n",
|
||||
"# Target variable\n",
|
||||
"\n",
|
||||
"target <- \"bad_flag\"\n",
|
||||
"\n",
|
||||
"# Note any identifier.\n",
|
||||
"\n",
|
||||
"id <- c(\"account_id\") %T>% print() \n",
|
||||
"\n",
|
||||
"# Note the available variables as model inputs.\n",
|
||||
"\n",
|
||||
"vars <- setdiff(names(data), c(target, id))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"attributes": {
|
||||
"classes": [],
|
||||
"error": "FALSE",
|
||||
"id": "",
|
||||
"message": "FALSE,",
|
||||
"warning": "FALSE,"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Split Data\n",
|
||||
"\n",
|
||||
"set.seed(42)\n",
|
||||
"\n",
|
||||
"# Add training/testing flag to each observation.\n",
|
||||
"\n",
|
||||
"data %<>%\n",
|
||||
" mutate(.train=factor(sample(1:2, .rxNumRows,\n",
|
||||
" replace=TRUE,\n",
|
||||
" prob=c(0.70, 0.30)),\n",
|
||||
" levels=1:2))\n",
|
||||
"\n",
|
||||
"# Split dataset into training/test.\n",
|
||||
"\n",
|
||||
"data_split <- rxSplit(data, splitByFactor=\".train\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"attributes": {
|
||||
"classes": [],
|
||||
"error": "FALSE",
|
||||
"id": "",
|
||||
"message": "FALSE,",
|
||||
"warning": "FALSE,"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Prepare the formula\n",
|
||||
"\n",
|
||||
"top_vars <- c(\"amount_6\", \"pur_6\", \"avg_pur_amt_6\", \"avg_interval_pur_6\", \"credit_limit\", \"age\", \"income\", \"sex\", \"education\", \"marital_status\")\n",
|
||||
"\n",
|
||||
"form <- as.formula(paste(target, paste(top_vars, collapse=\"+\"), sep=\"~\"))\n",
|
||||
"form"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"attributes": {
|
||||
"classes": [],
|
||||
"error": "FALSE",
|
||||
"id": "",
|
||||
"message": "FALSE,",
|
||||
"warning": "FALSE,"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Specify the local parallel compute context.\n",
|
||||
"\n",
|
||||
"rxSetComputeContext(\"localpar\")\n",
|
||||
"\n",
|
||||
"# Train model: rxLogisticRegression\n",
|
||||
"\n",
|
||||
"time_rxlogit <- system.time(\n",
|
||||
" \n",
|
||||
" model_rxlogit <- rxLogisticRegression(\n",
|
||||
" formula=form,\n",
|
||||
" data=data_split[[1]],\n",
|
||||
" type=\"binary\",\n",
|
||||
" l1Weight=1,\n",
|
||||
" verbose=0\n",
|
||||
" )\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"# Train model: rxFastForest\n",
|
||||
"\n",
|
||||
"time_rxforest <- system.time(\n",
|
||||
" \n",
|
||||
" model_rxforest <- rxFastForest(\n",
|
||||
" formula=form,\n",
|
||||
" data=data_split[[1]],\n",
|
||||
" type=\"binary\",\n",
|
||||
" numTrees=100,\n",
|
||||
" numLeaves=20,\n",
|
||||
" minSplit=10,\n",
|
||||
" verbose=0\n",
|
||||
" )\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"# Train model: rxFastTrees\n",
|
||||
"\n",
|
||||
"time_rxtrees1 <- system.time(\n",
|
||||
" \n",
|
||||
" model_rxtrees1 <- rxFastTrees(\n",
|
||||
" formula=form,\n",
|
||||
" data=data_split[[1]],\n",
|
||||
" type=\"binary\",\n",
|
||||
" numTrees=100,\n",
|
||||
" numLeaves=20,\n",
|
||||
" learningRate=0.2,\n",
|
||||
" minSplit=10,\n",
|
||||
" unbalancedSets=FALSE,\n",
|
||||
" verbose=0\n",
|
||||
" )\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"time_rxtrees2 <- system.time(\n",
|
||||
" \n",
|
||||
" model_rxtrees2 <- rxFastTrees(\n",
|
||||
" formula=form,\n",
|
||||
" data=data_split[[1]],\n",
|
||||
" type=\"binary\",\n",
|
||||
" numTrees=500,\n",
|
||||
" numLeaves=20,\n",
|
||||
" learningRate=0.2,\n",
|
||||
" minSplit=10,\n",
|
||||
" unbalancedSets=FALSE,\n",
|
||||
" verbose=0\n",
|
||||
" )\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"time_rxtrees3 <- system.time(\n",
|
||||
" \n",
|
||||
" model_rxtrees3 <- rxFastTrees(\n",
|
||||
" formula=form,\n",
|
||||
" data=data_split[[1]],\n",
|
||||
" type=\"binary\",\n",
|
||||
" numTrees=500,\n",
|
||||
" numLeaves=20,\n",
|
||||
" learningRate=0.3,\n",
|
||||
" minSplit=10,\n",
|
||||
" unbalancedSets=FALSE,\n",
|
||||
" verbose=0\n",
|
||||
" )\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"time_rxtrees4 <- system.time(\n",
|
||||
" \n",
|
||||
" model_rxtrees4 <- rxFastTrees(\n",
|
||||
" formula=form,\n",
|
||||
" data=data_split[[1]],\n",
|
||||
" type=\"binary\",\n",
|
||||
" numTrees=500,\n",
|
||||
" numLeaves=20,\n",
|
||||
" learningRate=0.3,\n",
|
||||
" minSplit=10,\n",
|
||||
" unbalancedSets=TRUE,\n",
|
||||
" verbose=0\n",
|
||||
" )\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Next, we build an ensemble of fast tree models by using the function `rxEnsemble()`."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"attributes": {
|
||||
"classes": [],
|
||||
"error": "FALSE",
|
||||
"id": "",
|
||||
"message": "FALSE,",
|
||||
"warning": "FALSE,"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Train an ensemble model.\n",
|
||||
"\n",
|
||||
"time_ensemble <- system.time(\n",
|
||||
" \n",
|
||||
" model_ensemble <- rxEnsemble(\n",
|
||||
" formula=form,\n",
|
||||
" data=data_split[[1]],\n",
|
||||
" type=\"binary\",\n",
|
||||
" trainers=list(fastTrees(), \n",
|
||||
" fastTrees(numTrees=500), \n",
|
||||
" fastTrees(numTrees=500, learningRate=0.3),\n",
|
||||
" fastTrees(numTrees=500, learningRate=0.3, unbalancedSets=TRUE)),\n",
|
||||
" combineMethod=\"vote\",\n",
|
||||
" replace=TRUE,\n",
|
||||
" verbose=0\n",
|
||||
" )\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### 2.3 Model Evaluation \n",
|
||||
"\n",
|
||||
"Finally, we evaluate the models built above and compare them on several metrics."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"attributes": {
|
||||
"classes": [],
|
||||
"error": "FALSE",
|
||||
"id": "",
|
||||
"message": "FALSE,",
|
||||
"warning": "FALSE,"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Predict\n",
|
||||
"\n",
|
||||
"models <- list(model_rxlogit, model_rxforest, \n",
|
||||
" model_rxtrees1, model_rxtrees2, model_rxtrees3, model_rxtrees4, \n",
|
||||
" model_ensemble)\n",
|
||||
"\n",
|
||||
"# Predict class\n",
|
||||
"\n",
|
||||
"predictions <- lapply(models, \n",
|
||||
" rxPredict, \n",
|
||||
" data=data_split[[2]]) %>%\n",
|
||||
" lapply('[[', 1)\n",
|
||||
"\n",
|
||||
"levels(predictions[[7]]) <- c(\"no\", \"yes\")\n",
|
||||
"\n",
|
||||
"# Confusion matrix evaluation results.\n",
|
||||
"\n",
|
||||
"cm_metrics <-lapply(predictions,\n",
|
||||
" confusionMatrix, \n",
|
||||
" reference=data_split[[2]][[target]],\n",
|
||||
" positive=\"yes\")\n",
|
||||
"\n",
|
||||
"# Accuracy\n",
|
||||
"\n",
|
||||
"acc_metrics <- \n",
|
||||
" lapply(cm_metrics, `[[`, \"overall\") %>%\n",
|
||||
" lapply(`[`, 1) %>%\n",
|
||||
" unlist() %>%\n",
|
||||
" as.vector()\n",
|
||||
"\n",
|
||||
"# Recall\n",
|
||||
"\n",
|
||||
"rec_metrics <- \n",
|
||||
" lapply(cm_metrics, `[[`, \"byClass\") %>%\n",
|
||||
" lapply(`[`, 1) %>%\n",
|
||||
" unlist() %>%\n",
|
||||
" as.vector()\n",
|
||||
" \n",
|
||||
"# Precision\n",
|
||||
"\n",
|
||||
"pre_metrics <- \n",
|
||||
" lapply(cm_metrics, `[[`, \"byClass\") %>%\n",
|
||||
" lapply(`[`, 3) %>%\n",
|
||||
" unlist() %>%\n",
|
||||
" as.vector()\n",
|
||||
"\n",
|
||||
"# Predict class probability\n",
|
||||
"\n",
|
||||
"probs <- lapply(models[c(1, 2, 3, 4, 5, 6)],\n",
|
||||
" rxPredict,\n",
|
||||
" data=data_split[[2]]) %>%\n",
|
||||
" lapply('[[', 3)\n",
|
||||
"\n",
|
||||
"# Create prediction object\n",
|
||||
"\n",
|
||||
"preds <- lapply(probs, \n",
|
||||
" ROCR::prediction,\n",
|
||||
" labels=data_split[[2]][[target]])\n",
|
||||
"\n",
|
||||
"# Auc\n",
|
||||
"\n",
|
||||
"auc_metrics <- lapply(preds, \n",
|
||||
" ROCR::performance,\n",
|
||||
" \"auc\") %>%\n",
|
||||
" lapply(slot, \"y.values\") %>%\n",
|
||||
" lapply('[[', 1) %>%\n",
|
||||
" unlist()\n",
|
||||
"\n",
|
||||
"auc_metrics <- c(auc_metrics, NaN)\n",
|
||||
"\n",
|
||||
"algo_list <- c(\"rxLogisticRegression\", \n",
|
||||
" \"rxFastForest\", \n",
|
||||
" \"rxFastTrees\", \n",
|
||||
" \"rxFastTrees(500)\", \n",
|
||||
" \"rxFastTrees(500, 0.3)\", \n",
|
||||
" \"rxFastTrees(500, 0.3, ub)\",\n",
|
||||
" \"rxEnsemble\")\n",
|
||||
"\n",
|
||||
"time_consumption <- c(time_rxlogit[[3]], time_rxforest[[3]], \n",
|
||||
" time_rxtrees1[[3]], time_rxtrees2[[3]], \n",
|
||||
" time_rxtrees3[[3]], time_rxtrees4[[3]],\n",
|
||||
" time_ensemble[[3]])\n",
|
||||
"\n",
|
||||
"df_comp <- \n",
|
||||
" data.frame(Models=algo_list, \n",
|
||||
" Accuracy=acc_metrics, \n",
|
||||
" Recall=rec_metrics, \n",
|
||||
" Precision=pre_metrics,\n",
|
||||
" AUC=auc_metrics,\n",
|
||||
" Time=time_consumption) %T>%\n",
|
||||
" print()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### 2.4 Save Models for Deployment\n",
|
||||
"\n",
|
||||
"Last but not least, we save the model objects for later deployment. They can be stored in various formats (e.g., `.RData`, `SQLServerData`, etc.)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"attributes": {
|
||||
"classes": [],
|
||||
"error": "FALSE",
|
||||
"id": "",
|
||||
"message": "FALSE,",
|
||||
"warning": "FALSE,"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Save model for deployment usage.\n",
|
||||
"\n",
|
||||
"model_rxtrees <- model_rxtrees3\n",
|
||||
"\n",
|
||||
"save(model_rxtrees, file=\"model_rxtrees.RData\")"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 1
|
||||
}
|
|
@ -9,8 +9,15 @@ Some other critical R packages for the analysis:
|
|||
* glmnet >= 2.0-5 Logistic regression model with L1 and L2 regularization.
|
||||
* xgboost >= 0.6-4 Extreme gradient boosting model.
|
||||
* randomForest >= 4.6-12 Random Forest model.
|
||||
* caret >= 6.0-73 Classification and regression training.
|
||||
* caretEnsemble >= 2.0.0 Ensemble of caret based models.
|
||||
|
||||
* RevoScaleR >= 9.1 Parallel and chunked data processing and modeling.
|
||||
* dplyrXdf >= 0.9.2 Out-of-Memory Data wrangling.
|
||||
* MicrosoftML >= 9.1 Microsoft machine learning models.
|
||||
|
||||
* mrsdeploy >= 9.1 R Server Operationalization.
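
The minimum versions listed above can be checked from an R session. The following is a minimal sketch, assuming only the packages and versions visible in this list; the helper `meets_version()` is hypothetical and not part of the template.

```r
# Minimum versions taken from the package list above.

required <- c(glmnet="2.0-5", xgboost="0.6-4", randomForest="4.6-12",
              caret="6.0-73", caretEnsemble="2.0.0", RevoScaleR="9.1",
              dplyrXdf="0.9.2", MicrosoftML="9.1", mrsdeploy="9.1")

# Hypothetical helper: TRUE if `pkg` is installed at `min_version` or later.

meets_version <- function(pkg, min_version) {
  requireNamespace(pkg, quietly=TRUE) &&
    packageVersion(pkg) >= package_version(min_version)
}

# One logical value per package, named by package.

status <- mapply(meets_version, names(required), required)

status
```

Packages reported as `FALSE` are either missing or older than the listed version.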
|
||||
|
||||
# Use of template
|
||||
|
||||
The analytics code, with embedded step-by-step instructions, is written in R markdown and can be run interactively from the code chunks of the markdown file.
|
||||
|
|
Binary file not shown.
|
@ -9,7 +9,7 @@ Many banks nowadays are driving innovation to enhance risk management. For examp
|
|||
The repository contains three parts
|
||||
|
||||
- **Data** This contains the provided sample data.
|
||||
- **Code** This contains the R development code. They are displayed in R markdown files which can yield files of various formats.
|
||||
- **Code** This contains the R development code. They are displayed in R markdown files which can yield files of various formats, like html, ipynb, etc.
|
||||
- **Docs** This contains the documents, like blog, installation instructions, etc.
|
||||
|
||||
## Business domain
|
||||
|
@ -36,4 +36,13 @@ In the data-driven credit risk prediction model, normally two types of data are
|
|||
|
||||
1. A traditional logistic regression model with L1 regularization is built as a baseline.
|
||||
2. Machine learning models, such as gradient boosting and random forest, or their ensembles, are fine-tuned and their performance is compared on several metrics.
|
||||
3. Innovative convolutionary hotspot method will be pursued in the near future.
|
||||
3. Innovative convolutionary hotspot method will be pursued in the near future.
|
||||
|
||||
## Scalability
|
||||
|
||||
**Faster and scalable credit risk models** are built using the state-of-the-art machine learning algorithms provided by the `MicrosoftML` package.
|
||||
|
||||
## Operationalization
|
||||
|
||||
An **R model based web service for credit risk prediction** is published and consumed by using the `mrsdeploy` package that ships with Microsoft R Client and R Server 9.1.
|
||||
|