Graham Williams 2017-06-23 08:26:12 +08:00
Parent b9b209e3ae
Commit e02efe40b5
1 changed file with 28 additions and 10 deletions

@@ -15,11 +15,23 @@ knitr::opts_chunk$set(echo = TRUE,
 ## 1 Introduction
-Microsoft R is a collection of servers and tools that extend the capabilities of R, making it easier and faster to build and deploy R-based solutions. Microsoft R brings you the ability to do parallel and chunked data processing and modelling that relax the restrictions on dataset size imposed by in-memory open source R.
+Microsoft R is a collection of servers and tools that extend the
+capabilities of R, making it easier and faster to build and deploy
+R-based solutions. Microsoft R brings you the ability to do parallel
+and chunked data processing and modelling that relaxes the
+restrictions on dataset size imposed by in-memory open source R.
-The `MicrosoftML` package brings new machine learning functionality with increased speed, performance and scalability, especially for handling a large corpus of text data or high-dimensional categorical data. The `MicrosoftML` package is installed with **Microsoft R Client**, **Microsoft R Server** and with the **SQL Server Machine Learning Services**.
+The `MicrosoftML` package brings new machine learning functionality
+with increased speed, performance and scalability, especially for
+handling a large corpus of text data or high-dimensional categorical
+data. The `MicrosoftML` package is installed with **Microsoft R
+Client**, **Microsoft R Server** and with the **SQL Server Machine
+Learning Services**.
-This document will walk through you how to build faster and scalable credit risk models, using the `MicrosoftML` package that adds state-of-the-art machine learning algorithms and data transforms to Microsoft R Server.
+This document will walk through you how to build faster and scalable
+credit risk models, using the `MicrosoftML` package that adds
+state-of-the-art machine learning algorithms and data transforms to
+Microsoft R Server.
 ## 2 Faster and Scalable Credit Risk Models
@@ -43,16 +55,15 @@ library(caret) # Calculate confusion matrix by using confusionMatrix().
 library(ROCR) # Provide functions for model performance evaluation.
 ```
-Then, the dataset processedSimu is ingested and transformed into a `.xdf` data format. This dataset was created by the data preprocessing steps in the data science accelerator for credit risk prediction.
+Then, the dataset processedSimu is ingested and transformed into a
+`.xdf` data format. This random dataset was created to simulate real
+world banking transaction data.
 ```{r, message=FALSE, warning=FALSE, error=FALSE}
 ## Data Ingestion
 # Identify the source location of the dataset.
-#DATA <- "../../Data/"
-#data_fname <- file.path(DATA, "Raw/processedSimu.csv")
 wd <- getwd()
 dpath <- "../Data"
@@ -75,11 +86,18 @@ rxGetVarInfo(data)
 ### 2.2 Model Building
-Now, let's get started to build credit risk models by leveraging different machine learning algorithms from the `MicrosoftML` package.
+Now, let's get started to build credit risk models by leveraging
+different machine learning algorithms from the `MicrosoftML` package.
-First of all, we create individual machine learning models on the dataset processedSimu.xdf by using the functions `rxLogisticRegression()`, `rxFastForest()`, `rxFastTrees()`.
+First of all, we create individual machine learning models on the
+dataset processedSimu.xdf by using the functions
+`rxLogisticRegression()`, `rxFastForest()`, `rxFastTrees()`.
-From the credit risk prediction template, we know that gradient boosting is the most suitable algorithm for this example, considering the overall performance. Therefore, the models implemented by the function `rxFastTrees()` with different sets of parameters are trained respectively.
+From the credit risk prediction template, we know that gradient
+boosting is the most suitable algorithm for this example, considering
+the overall performance. Therefore, the models implemented by the
+function `rxFastTrees()` with different sets of parameters are trained
+respectively.
 ```{r, message=FALSE, warning=FALSE, error=FALSE}
 ## Variable roles.