From e02efe40b5a1e4135aa913e951fe35dc032442b6 Mon Sep 17 00:00:00 2001
From: Graham Williams
Date: Fri, 23 Jun 2017 08:26:12 +0800
Subject: [PATCH] Review and minor changes

---
 CreditRiskPrediction/Code/CreditRiskScale.Rmd | 38 ++++++++++++++-----
 1 file changed, 28 insertions(+), 10 deletions(-)

diff --git a/CreditRiskPrediction/Code/CreditRiskScale.Rmd b/CreditRiskPrediction/Code/CreditRiskScale.Rmd
index 47fe16f..0b15974 100644
--- a/CreditRiskPrediction/Code/CreditRiskScale.Rmd
+++ b/CreditRiskPrediction/Code/CreditRiskScale.Rmd
@@ -15,11 +15,23 @@ knitr::opts_chunk$set(echo = TRUE,
 
 ## 1 Introduction
 
-Microsoft R is a collection of servers and tools that extend the capabilities of R, making it easier and faster to build and deploy R-based solutions. Microsoft R brings you the ability to do parallel and chunked data processing and modelling that relax the restrictions on dataset size imposed by in-memory open source R.
+Microsoft R is a collection of servers and tools that extend the
+capabilities of R, making it easier and faster to build and deploy
+R-based solutions. Microsoft R brings you the ability to do parallel
+and chunked data processing and modelling that relaxes the
+restrictions on dataset size imposed by in-memory open source R.
 
-The `MicrosoftML` package brings new machine learning functionality with increased speed, performance and scalability, especially for handling a large corpus of text data or high-dimensional categorical data. The `MicrosoftML` package is installed with **Microsoft R Client**, **Microsoft R Server** and with the **SQL Server Machine Learning Services**.
+The `MicrosoftML` package brings new machine learning functionality
+with increased speed, performance and scalability, especially for
+handling a large corpus of text data or high-dimensional categorical
+data. The `MicrosoftML` package is installed with **Microsoft R
+Client**, **Microsoft R Server** and with the **SQL Server Machine
+Learning Services**.
 
-This document will walk through you how to build faster and scalable credit risk models, using the `MicrosoftML` package that adds state-of-the-art machine learning algorithms and data transforms to Microsoft R Server.
+This document will walk through you how to build faster and scalable
+credit risk models, using the `MicrosoftML` package that adds
+state-of-the-art machine learning algorithms and data transforms to
+Microsoft R Server.
 
 ## 2 Faster and Scalable Credit Risk Models
 
@@ -43,16 +55,15 @@ library(caret) # Calculate confusion matrix by using confusionMatrix().
 library(ROCR) # Provide functions for model performance evaluation.
 ```
 
-Then, the dataset processedSimu is ingested and transformed into a `.xdf` data format. This dataset was created by the data preprocessing steps in the data science accelerator for credit risk prediction.
+Then, the dataset processedSimu is ingested and transformed into a
+`.xdf` data format. This random dataset was created to simulate real
+world banking transaction data.
 
 ```{r, message=FALSE, warning=FALSE, error=FALSE}
 ## Data Ingestion
 
 # Identify the source location of the dataset.
 
-#DATA <- "../../Data/"
-#data_fname <- file.path(DATA, "Raw/processedSimu.csv")
-
 wd <- getwd()
 
 dpath <- "../Data"
@@ -75,11 +86,18 @@ rxGetVarInfo(data)
 
 ### 2.2 Model Building
 
-Now, let's get started to build credit risk models by leveraging different machine learning algorithms from the `MicrosoftML` package.
+Now, let's get started to build credit risk models by leveraging
+different machine learning algorithms from the `MicrosoftML` package.
 
-First of all, we create individual machine learning models on the dataset processedSimu.xdf by using the functions `rxLogisticRegression()`, `rxFastForest()`, `rxFastTrees()`.
+First of all, we create individual machine learning models on the
+dataset processedSimu.xdf by using the functions
+`rxLogisticRegression()`, `rxFastForest()`, `rxFastTrees()`.
 
-From the credit risk prediction template, we know that gradient boosting is the most suitable algorithm for this example, considering the overall performance. Therefore, the models implemented by the function `rxFastTrees()` with different sets of parameters are trained respectively.
+From the credit risk prediction template, we know that gradient
+boosting is the most suitable algorithm for this example, considering
+the overall performance. Therefore, the models implemented by the
+function `rxFastTrees()` with different sets of parameters are trained
+respectively.
 
 ```{r, message=FALSE, warning=FALSE, error=FALSE}
 ## Variable roles.