Blanca Li 2020-03-09 17:48:19 +08:00
Parent a23708df05
Commit c8c4d9ffb1
43 changed files: 696 additions and 0 deletions

View file

@@ -0,0 +1,75 @@
---
title: 'Designer: Predict churn example'
titleSuffix: Azure Machine Learning
description: Follow this classification example to predict churn with Azure Machine Learning designer & boosted decision trees.
services: machine-learning
ms.service: machine-learning
ms.subservice: core
ms.topic: sample
author: likebupt
ms.author: keli19
ms.reviewer: sgilley
ms.date: 12/25/2019
---
# Use boosted decision tree to predict churn with Azure Machine Learning designer
**Designer (preview) sample 5**
Learn how to build a complex machine learning pipeline without writing a single line of code using the designer (preview).
This pipeline trains two **two-class boosted decision tree** classifiers to predict a common task for customer relationship management (CRM) systems: customer churn. The data values and labels are split across multiple data sources and scrambled to anonymize customer information. However, you can still use the designer to combine the datasets and train a model using the obscured values.
Because you're trying to answer the question "Which one?", this is called a classification problem. However, you can apply the same logic shown in this sample to tackle any type of machine learning problem, whether it's regression, classification, clustering, and so on.
Here's the completed graph for this pipeline:
![Pipeline graph](./media/how-to-designer-sample-classification-churn/pipeline-graph.png)
## Data
The data for this pipeline is from KDD Cup 2009. It has 50,000 rows and 230 feature columns. The task is to predict churn, appetency, and up-selling for customers who use these features. For more information about the data and the task, see the [KDD website](https://www.kdd.org/kdd-cup/view/kdd-cup-2009).
## Pipeline summary
This sample pipeline in the designer shows binary classifier prediction of churn, appetency, and up-selling, a common task for customer relationship management (CRM).
First, some simple data processing.
- The raw dataset has many missing values. Use the **Clean Missing Data** module to replace the missing values with 0.
![Clean the dataset](media/how-to-designer-sample-classification-churn/sample5-dataset-1225.png)
- The features and the corresponding churn are in different datasets. Use the **Add Columns** module to append the label columns to the feature columns. The first column, **Col1**, is the label column. From the visualization result, we can see that the dataset is unbalanced: there are far more negative (-1) examples than positive (+1) examples. We'll use the **SMOTE** module later to increase the number of underrepresented cases.
![Add the column dataset](./media/how-to-designer-sample-classification-churn/sample5-addcol-1225.png)
- Use the **Split Data** module to split the dataset into train and test sets.
- Then use the Boosted Decision Tree binary classifier with the default parameters to build the prediction models. Build one model per task, that is, one model each to predict up-selling, appetency, and churn.
- In the right part of the pipeline, we use the **SMOTE** module to increase the percentage of positive examples. The SMOTE percentage is set to 100 to double the positive examples. To learn more about how the SMOTE module works, see the [SMOTE module reference](algorithm-module-reference/smote.md). For a rough scripted view of these preprocessing steps, see the sketch after this list.
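For readers who prefer a scripted view, here's a minimal pandas/scikit-learn sketch of the same preprocessing steps. It's an illustration only: the file names are hypothetical, the 70/30 split fraction is an assumption, and `imbalanced-learn`'s SMOTE stands in for the designer's SMOTE module.

```Python
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE  # assumption: imbalanced-learn stands in for the designer's SMOTE module

# The features and labels arrive as separate tables; file names are hypothetical.
features = pd.read_csv("crm_features.csv")
labels = pd.read_csv("crm_labels.csv")

# Clean Missing Data: replace missing values with 0
features = features.fillna(0)

# Add Columns: append the label column (Col1) to the feature columns
data = pd.concat([labels[["Col1"]], features], axis=1)

# Split Data: hold out a test set before any oversampling (assumed 70/30 split)
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(columns="Col1"), data["Col1"], test_size=0.3, random_state=0)

# SMOTE: oversample the underrepresented positive (+1) class in the training set only.
# (The designer sets the SMOTE percentage to 100 to double the positives;
# imbalanced-learn expresses the target balance through sampling_strategy instead.)
X_train_balanced, y_train_balanced = SMOTE(random_state=0).fit_resample(X_train, y_train)
```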
## Results
Visualize the output of the **Evaluate Model** module to see the performance of the model on the test set.
![Evaluate the results](./media/how-to-designer-sample-classification-churn/sample5-evaluate-1225.png)
You can move the **Threshold** slider and see the metrics change for the binary classification task.
## Next steps
Explore the other samples available for the designer:
- [Sample 1 - Regression: Predict an automobile's price](how-to-designer-sample-regression-automobile-price-basic.md)
- [Sample 2 - Regression: Compare algorithms for automobile price prediction](how-to-designer-sample-regression-automobile-price-compare-algorithms.md)
- [Sample 3 - Classification with feature selection: Income Prediction](how-to-designer-sample-classification-predict-income.md)
- [Sample 4 - Classification: Predict credit risk (cost sensitive)](how-to-designer-sample-classification-credit-risk-cost-sensitive.md)
- [Sample 6 - Classification: Predict flight delays](how-to-designer-sample-classification-flight-delay.md)
- [Sample 7 - Text Classification: Wikipedia SP 500 Dataset](how-to-designer-sample-text-classification.md)

View file

@@ -0,0 +1,154 @@
---
title: 'Designer: Predict credit risk example'
titleSuffix: Azure Machine Learning
description: Build a classifier and use custom Python scripts to predict credit risk using Azure Machine Learning designer.
services: machine-learning
ms.service: machine-learning
ms.subservice: core
ms.topic: sample
author: likebupt
ms.author: keli19
ms.reviewer: peterlu
ms.date: 12/25/2019
---
# Build a classifier & use Python scripts to predict credit risk using Azure Machine Learning designer
**Designer (preview) sample 4**
This article shows you how to build a complex machine learning pipeline using the designer (preview). You'll learn how to implement custom logic using Python scripts and compare multiple models to choose the best option.
This sample trains a classifier to predict credit risk using credit application information such as credit history, age, and number of credit cards. However, you can apply the concepts in this article to tackle your own machine learning problems.
Here's the completed graph for this pipeline:
[![Graph of the pipeline](./media/how-to-designer-sample-classification-credit-risk-cost-sensitive/graph.png)](./media/how-to-designer-sample-classification-credit-risk-cost-sensitive/graph.png#lightbox)
## Data
This sample uses the German Credit Card dataset from the UC Irvine repository. It contains 1,000 samples with 20 features and one label. Each sample represents a person. The 20 features include numerical and categorical features. For more information about the dataset, see the [UCI website](https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29). The last column is the label, which denotes the credit risk and has only two possible values: high credit risk = 2, and low credit risk = 1.
## Pipeline summary
In this pipeline, you compare two different approaches for generating models to solve this problem:
- Training with the original dataset.
- Training with a replicated dataset.
With both approaches, you evaluate the models by using the test dataset with replication to ensure that results are aligned with the cost function. Test two classifiers with both approaches: **Two-Class Support Vector Machine** and **Two-Class Boosted Decision Tree**.
The cost of misclassifying a low-risk example as high is 1, and the cost of misclassifying a high-risk example as low is 5. We use an **Execute Python Script** module to account for this misclassification cost.
Here's the graph of the pipeline:
[![Graph of the pipeline](./media/how-to-designer-sample-classification-credit-risk-cost-sensitive/graph.png)](./media/how-to-designer-sample-classification-credit-risk-cost-sensitive/graph.png#lightbox)
## Data processing
Start by using the **Metadata Editor** module to replace the default column names with more meaningful names, obtained from the dataset description on the UCI site. Provide the new column names as comma-separated values in the **New column names** field of the **Metadata Editor**.
Next, generate the training and test sets used to develop the risk prediction model. Split the original dataset into training and test sets by using the **Split Data** module. Set the **Fraction of rows in the first output dataset** option to 0.7, so the training set receives 70% of the data and the test set receives the rest.
### Generate the new dataset
Because the cost of underestimating risk is high, set the cost of misclassification like this:
- For high-risk cases misclassified as low risk: 5
- For low-risk cases misclassified as high risk: 1
To reflect this cost function, generate a new dataset. In the new dataset, each high-risk example is replicated five times, but the number of low-risk examples doesn't change. Split the data into training and test datasets before replication to prevent the same row from being in both sets.
To replicate the high-risk data, put this Python code into an **Execute Python Script** module:
```Python
import pandas as pd

def azureml_main(dataframe1 = None, dataframe2 = None):
    # Column 20 holds the label: 1 = low credit risk, 2 = high credit risk
    df_label_1 = dataframe1[dataframe1.iloc[:, 20] == 1]
    df_label_2 = dataframe1[dataframe1.iloc[:, 20] == 2]

    # Keep the low-risk rows as-is and replicate the high-risk rows five times
    result = df_label_1.append([df_label_2] * 5, ignore_index=True)
    return result,
```
The **Execute Python Script** module replicates both the training and test datasets.
### Feature engineering
The **Two-Class Support Vector Machine** algorithm requires normalized data. So use the **Normalize Data** module to normalize the ranges of all numeric features with a `tanh` transformation. A `tanh` transformation converts all numeric features to values between 0 and 1 while preserving the overall distribution of values.
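To build intuition for what a tanh-style normalization does, here's a small numpy sketch. The standardization step and the shift into (0, 1) are assumptions for illustration; the exact formula used by the **Normalize Data** module isn't reproduced here.

```Python
import numpy as np

def tanh_normalize(x: np.ndarray) -> np.ndarray:
    # Illustrative only: standardize, then squash with tanh and shift into (0, 1).
    # The exact formula used by the Normalize Data module may differ.
    z = (x - x.mean()) / x.std()
    return 0.5 * (np.tanh(z) + 1.0)   # tanh maps to (-1, 1); shifting/scaling gives (0, 1)

values = np.array([10.0, 20.0, 35.0, 500.0])
print(tanh_normalize(values))          # large outliers are softly compressed toward 1
```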
The **Two-Class Support Vector Machine** module handles string features, converting them to categorical features and then to binary features with a value of zero or one. So you don't need to normalize these features.
## Models
Because you applied two classifiers, **Two-Class Support Vector Machine** (SVM) and **Two-Class Boosted Decision Tree**, and two datasets, you generate a total of four models:
- SVM trained with original data.
- SVM trained with replicated data.
- Boosted Decision Tree trained with original data.
- Boosted Decision Tree trained with replicated data.
This sample uses the standard data science workflow to create, train, and test the models:
1. Initialize the learning algorithms, using **Two-Class Support Vector Machine** and **Two-Class Boosted Decision Tree**.
1. Use **Train Model** to apply the algorithm to the data and create the actual model.
1. Use **Score Model** to produce scores by using the test examples.
The following diagram shows a portion of this pipeline, in which the original and replicated training sets are used to train two different SVM models. **Train Model** is connected to the training set, and **Score Model** is connected to the test set.
![Pipeline graph](./media/how-to-designer-sample-classification-credit-risk-cost-sensitive/score-part.png)
In the evaluation stage of the pipeline, you compute the accuracy of each of the four models. For this pipeline, use **Evaluate Model** to compare examples that have the same misclassification cost.
The **Evaluate Model** module can compute the performance metrics for as many as two scored models. So you can use one instance of **Evaluate Model** to evaluate the two SVM models and another instance of **Evaluate Model** to evaluate the two Boosted Decision Tree models.
Notice that the replicated test dataset is used as the input for **Score Model**. In other words, the final accuracy scores include the cost for getting the labels wrong.
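To see why scoring on the replicated test set builds the cost into the accuracy number, consider this small sketch: replicating each high-risk row five times gives a misclassified high-risk example five times the weight of a misclassified low-risk one. The toy labels below are illustrative only.

```Python
import numpy as np

# Toy labels and predictions: 1 = low risk, 2 = high risk
y_true = np.array([1, 1, 2, 2])
y_pred = np.array([1, 2, 1, 2])   # one low->high error (cost 1), one high->low error (cost 5)

# Cost-weighted accuracy computed directly from the per-example costs
costs = np.where(y_true == 2, 5, 1)
cost_weighted_acc = np.sum(costs * (y_true == y_pred)) / costs.sum()

# The same number, obtained by replicating high-risk rows five times (as the pipeline does)
rep = np.repeat(np.arange(len(y_true)), np.where(y_true == 2, 5, 1))
replicated_acc = np.mean(y_true[rep] == y_pred[rep])

print(cost_weighted_acc, replicated_acc)   # both 0.5 (6 of 12 weighted examples correct)
```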
## Combine multiple results
The **Evaluate Model** module produces a table with a single row that contains various metrics. To create a single set of accuracy results, we first use **Add Rows** to combine the results into a single table. We then use the following Python script in the **Execute Python Script** module to add the model name and training approach for each row in the table of results:
```Python
import pandas as pd

def azureml_main(dataframe1 = None, dataframe2 = None):
    # Label each row of the combined metrics table with its algorithm and training approach
    new_cols = pd.DataFrame(
        columns=["Algorithm", "Training"],
        data=[
            ["SVM", "weighted"],
            ["SVM", "unweighted"],
            ["Boosted Decision Tree", "weighted"],
            ["Boosted Decision Tree", "unweighted"]
        ])

    # Prepend the new columns to the evaluation results
    result = pd.concat([new_cols, dataframe1], axis=1)
    return result,
```
## Results
To view the results of the pipeline, right-click the output of the last **Select Columns in Dataset** module and select **Visualize**.
![Visualize output](media/how-to-designer-sample-classification-credit-risk-cost-sensitive/sample4-lastselect-1225.png)
The first column lists the machine learning algorithm used to generate the model.
The second column indicates the type of the training set.
The third column contains the cost-sensitive accuracy value.
From these results, you can see that the best accuracy is provided by the model that was created with **Two-Class Support Vector Machine** and trained on the replicated training dataset.
## Next steps
Explore the other samples available for the designer:
- [Sample 1 - Regression: Predict an automobile's price](how-to-designer-sample-regression-automobile-price-basic.md)
- [Sample 2 - Regression: Compare algorithms for automobile price prediction](how-to-designer-sample-regression-automobile-price-compare-algorithms.md)
- [Sample 3 - Classification with feature selection: Income Prediction](how-to-designer-sample-classification-predict-income.md)
- [Sample 5 - Classification: Predict churn](how-to-designer-sample-classification-churn.md)
- [Sample 6 - Classification: Predict flight delays](how-to-designer-sample-classification-flight-delay.md)
- [Sample 7 - Text Classification: Wikipedia SP 500 Dataset](how-to-designer-sample-text-classification.md)

View file

@@ -0,0 +1,117 @@
---
title: 'Designer: Predict flight delay example'
titleSuffix: Azure Machine Learning
description: Build a classifier and use custom R code to predict flight delays with Azure Machine Learning designer.
services: machine-learning
ms.service: machine-learning
ms.subservice: core
ms.topic: sample
author: likebupt
ms.author: keli19
ms.reviewer: peterlu
ms.date: 12/25/2019
---
# Build a classifier & use R to predict flight delays with Azure Machine Learning designer
**Designer (preview) sample 6**
This pipeline uses historical flight and weather data to predict if a scheduled passenger flight will be delayed by more than 15 minutes. This problem can be approached as a classification problem, predicting two classes: delayed, or on time.
Here's the final pipeline graph for this sample:
[![Graph of the pipeline](media/how-to-designer-sample-classification-flight-delay/pipeline-graph.png)](media/how-to-designer-sample-classification-flight-delay/pipeline-graph.png#lightbox)
## Data
This sample uses the **Flight Delays Data** dataset. It's part of the TranStats data collection from the U.S. Department of Transportation. The dataset contains flight delay information from April to October 2013. The dataset has been pre-processed as follows:
* Filtered to include the 70 busiest airports in the continental United States.
* Relabeled canceled flights as delayed by more than 15 mins.
* Filtered out diverted flights.
* Selected 14 columns.
To supplement the flight data, the **Weather Dataset** is used. The weather data contains hourly, land-based weather observations from NOAA, and represents observations from airport weather stations, covering the same time period as the flights dataset. It has been pre-processed as follows:
* Weather station IDs were mapped to corresponding airport IDs.
* Weather stations not associated with the 70 busiest airports were removed.
* The Date column was split into separate columns: Year, Month, and Day.
* Selected 26 columns.
## Pre-process the data
A dataset usually requires some pre-processing before it can be analyzed.
![data-process](./media/how-to-designer-sample-classification-flight-delay/data-process.png)
### Flight data
The columns **Carrier**, **OriginAirportID**, and **DestAirportID** are saved as integers. However, they're categorical attributes, so use the **Edit Metadata** module to convert them to categorical.
![edit-metadata](./media/how-to-designer-sample-classification-flight-delay/edit-metadata.png)
Then use the **Select Columns in Dataset** module to exclude columns that are possible target leakers: **DepDelay**, **DepDel15**, **ArrDelay**, **Canceled**, and **Year**.
To join the flight records with the hourly weather records, use the scheduled departure time as one of the join keys. To do the join, the CSRDepTime column must be rounded down to the nearest hour, which is done in the **Execute R Script** module.
### Weather data
Columns that have a large proportion of missing values are excluded using the **Project Columns** module. These columns include all string-valued columns: **ValueForWindCharacter**, **WetBulbFarenheit**, **WetBulbCelsius**, **PressureTendency**, **PressureChange**, **SeaLevelPressure**, and **StationPressure**.
The **Clean Missing Data** module is then applied to the remaining columns to remove rows with missing data.
Weather observation times are rounded up to the nearest full hour. Scheduled flight times and the weather observation times are rounded in opposite directions to ensure the model uses only weather before the flight time.
Since weather data is reported in local time, time zone differences are accounted for by subtracting the time zone columns from the scheduled departure time and the weather observation time. These operations are done using the **Execute R Script** module.
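The sample implements these adjustments in **Execute R Script** modules. As a rough illustration only, the same logic could be written in pandas along these lines; the weather column names and the HHMM time format are assumptions.

```Python
import numpy as np
import pandas as pd

def prepare_times(flights: pd.DataFrame, weather: pd.DataFrame):
    # Sketch of the logic the sample implements in Execute R Script modules.
    # Weather column names (ObservationTime, TimeZone) and the HHMM format are assumptions.
    flights = flights.copy()
    weather = weather.copy()

    # Scheduled departure time is stored as HHMM (for example, 1437): round DOWN to the hour.
    flights["DepHour"] = flights["CSRDepTime"] // 100

    # Round weather observation times UP to the next full hour, so a joined
    # observation never comes from after the scheduled departure.
    weather["ObsHour"] = np.ceil(weather["ObservationTime"] / 100).astype(int)

    # Weather is reported in local time: subtract the time-zone offsets so both
    # join keys refer to the same clock.
    flights["DepHourAdjusted"] = flights["DepHour"] - flights["TimeZone"]
    weather["ObsHourAdjusted"] = weather["ObsHour"] - weather["TimeZone"]
    return flights, weather
```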
### Joining datasets
Flight records are joined with weather data at origin of the flight (**OriginAirportID**) using the **Join Data** module.
![join flight and weather by origin](./media/how-to-designer-sample-classification-flight-delay/join-origin.png)
Flight records are joined with weather data using the destination of the flight (**DestAirportID**).
![Join flight and weather by destination](./media/how-to-designer-sample-classification-flight-delay/join-destination.png)
### Preparing training and test samples
The **Split Data** module splits the data into April through September records for training, and October records for test.
![Split training and test data](./media/how-to-designer-sample-classification-flight-delay/split.png)
Year, month, and timezone columns are removed from the training dataset using the Select Columns module.
## Define features
In machine learning, features are individual measurable properties of something you're interested in. Finding a strong set of features requires experimentation and domain knowledge. Some features are better for predicting the target than others. Also, some features may have a strong correlation with other features, and won't add new information to the model. These features can be removed.
To build a model, you can use all the features available, or select a subset of the features.
## Choose and apply a learning algorithm
Create a model using the **Two-Class Logistic Regression** module and train it on the training dataset.
The result of the **Train Model** module is a trained classification model that can be used to score new samples to make predictions. Use the test set to generate scores from the trained models. Then use the **Evaluate Model** module to analyze and compare the quality of the models.
After you run the pipeline, you can view the output from the **Score Model** module by clicking the output port and selecting **Visualize**. The output includes the scored labels and the probabilities for the labels.
Finally, to test the quality of the results, add the **Evaluate Model** module to the pipeline canvas, and connect its left input port to the output of the **Score Model** module. Run the pipeline and view the output of the **Evaluate Model** module by clicking the output port and selecting **Visualize**.
## Evaluate
The logistic regression model has an AUC of 0.631 on the test set.
![evaluate](media/how-to-designer-sample-classification-flight-delay/sample6-evaluate-1225.png)
## Next steps
Explore the other samples available for the designer:
- [Sample 1 - Regression: Predict an automobile's price](how-to-designer-sample-regression-automobile-price-basic.md)
- [Sample 2 - Regression: Compare algorithms for automobile price prediction](how-to-designer-sample-regression-automobile-price-compare-algorithms.md)
- [Sample 3 - Classification with feature selection: Income Prediction](how-to-designer-sample-classification-predict-income.md)
- [Sample 4 - Classification: Predict credit risk (cost sensitive)](how-to-designer-sample-classification-credit-risk-cost-sensitive.md)
- [Sample 5 - Classification: Predict churn](how-to-designer-sample-classification-churn.md)
- [Sample 7 - Text Classification: Wikipedia SP 500 Dataset](how-to-designer-sample-text-classification.md)

View file

@@ -0,0 +1,72 @@
---
title: 'Designer: Classify, predict income example'
titleSuffix: Azure Machine Learning
description: Follow this example to build a no-code classifier to predict income with Azure Machine Learning designer.
services: machine-learning
ms.service: machine-learning
ms.subservice: core
ms.topic: sample
author: likebupt
ms.author: keli19
ms.reviewer: peterlu
ms.date: 02/22/2020
---
# Build a classifier & use feature selection to predict income with Azure Machine Learning designer
**Designer (preview) sample 3**
Learn how to build a machine learning classifier without writing a single line of code using the designer (preview). This sample trains a **two-class boosted decision tree** to predict adult census income (>50K or <=50K).
Because the question being answered is "Which one?", this is called a classification problem. However, you can apply the same fundamental process to tackle any type of machine learning problem: regression, classification, clustering, and so on.
Here's the final pipeline graph for this sample:
![Graph of the pipeline](./media/how-to-designer-sample-classification-predict-income/overall-graph.png)
## Data
The dataset contains 14 features and one label column. There are multiple types of features, including numerical and categorical. The following diagram shows an excerpt from the dataset:
![data](media/how-to-designer-sample-classification-predict-income/sample3-dataset-1225.png)
## Pipeline summary
Follow these steps to create the pipeline:
1. Drag the Adult Census Income Binary dataset module into the pipeline canvas.
1. Add a **Split Data** module to create the training and test sets. Set the fraction of rows in the first output dataset to 0.7. This setting specifies that 70% of the data will be output to the left port of the module and the rest to the right port. We use the left dataset for training and the right one for testing.
1. Add the **Filter Based Feature Selection** module to select five features by using the **PearsonCorrelation** scoring method.
1. Add a **Two-Class Boosted Decision Tree** module to initialize a boosted decision tree classifier.
1. Add a **Train Model** module. Connect the classifier from the previous step to the left input port of **Train Model**, and connect the filtered dataset from the **Filter Based Feature Selection** module as the training dataset. **Train Model** trains the classifier.
1. Add the **Select Columns Transformation** and **Apply Transformation** modules to apply the same transformation (filter-based feature selection) to the test dataset.
![apply-transformation](./media/how-to-designer-sample-classification-predict-income/transformation.png)
1. Add a **Score Model** module and connect the **Train Model** module to it. Then connect the test set (the output of the **Apply Transformation** module, which applies the same feature selection to the test set) to **Score Model**. **Score Model** makes the predictions. You can select its output port to see the predictions and the positive class probabilities.
   This pipeline has two **Score Model** modules. The one on the right excludes the label column before making predictions, which prepares the pipeline for deployment as a real-time endpoint, because the web service input expects only features, not the label. For a rough scripted equivalent of this pipeline, see the sketch after this list.
1. Add an **Evaluate Model** module and connect the scored dataset to its left input port. To see the evaluation results, select the output port of the **Evaluate Model** module and select **Visualize**.
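The pipeline above is built entirely in the designer. As a rough scripted analogue of the same idea, the scikit-learn sketch below selects the top five features, trains a boosted tree, and applies the identical selection to the test set before scoring. The file name, one-hot encoding, and the `f_classif` score (standing in for PearsonCorrelation) are assumptions.

```Python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

census = pd.read_csv("adult_census_income.csv")      # hypothetical file name
X = pd.get_dummies(census.drop(columns="income"))    # one-hot encode categorical features
y = census["income"]

# Split Data: 70% training / 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)

# Filter Based Feature Selection (top 5) + Two-Class Boosted Decision Tree,
# wrapped in one pipeline so the SAME selection is applied to the test set
model = make_pipeline(
    SelectKBest(score_func=f_classif, k=5),          # stand-in for the PearsonCorrelation scoring method
    GradientBoostingClassifier(random_state=0),      # stand-in for Two-Class Boosted Decision Tree
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```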
## Results
![Evaluate the results](media/how-to-designer-sample-classification-predict-income/sample3-evaluate-1225.png)
In the evaluation results, you can see metrics such as the ROC curve, the precision-recall curve, and the confusion matrix.
## Clean up resources
[!INCLUDE [aml-ui-cleanup](../../includes/aml-ui-cleanup.md)]
## Next steps
Explore the other samples available for the designer:
- [Sample 1 - Regression: Predict an automobile's price](how-to-designer-sample-regression-automobile-price-basic.md)
- [Sample 2 - Regression: Compare algorithms for automobile price prediction](how-to-designer-sample-regression-automobile-price-compare-algorithms.md)
- [Sample 4 - Classification: Predict credit risk (cost sensitive)](how-to-designer-sample-classification-credit-risk-cost-sensitive.md)
- [Sample 5 - Classification: Predict churn](how-to-designer-sample-classification-churn.md)
- [Sample 6 - Classification: Predict flight delays](how-to-designer-sample-classification-flight-delay.md)
- [Sample 7 - Text Classification: Wikipedia SP 500 Dataset](how-to-designer-sample-text-classification.md)

View file

@@ -0,0 +1,85 @@
---
title: 'Designer: Predict car prices (basic) example'
titleSuffix: Azure Machine Learning
description: Build an ML regression model to predict an automobile's price without writing a single line of code with Azure Machine Learning designer.
services: machine-learning
ms.service: machine-learning
ms.subservice: core
ms.topic: sample
author: likebupt
ms.author: keli19
ms.reviewer: peterlu
ms.date: 02/11/2020
---
# Use regression to predict car prices with Azure Machine Learning designer
**Designer (preview) sample 1**
Learn how to build a machine learning regression model without writing a single line of code using the designer (preview).
This pipeline trains a **linear regressor** to predict a car's price based on technical features such as make, model, horsepower, and size. Because you're trying to answer the question "How much?", this is called a regression problem. However, you can apply the same fundamental steps in this example to tackle any type of machine learning problem, whether it's regression, classification, clustering, and so on.
The fundamental steps of training a machine learning model are:
1. Get the data
1. Pre-process the data
1. Train the model
1. Evaluate the model
Here's the final, completed graph of the pipeline. This article provides the rationale for all the modules so you can make similar decisions on your own.
![Graph of the pipeline](./media/how-to-designer-sample-regression-automobile-price-basic/overall-graph.png)
## Prerequisites
[!INCLUDE [aml-ui-prereq](../../includes/aml-ui-prereq.md)]
4. Select sample 1 to open it.
## Get the data
This sample uses the **Automobile price data (Raw)** dataset, which is from the UCI Machine Learning Repository. The dataset contains 26 columns that contain information about automobiles, including make, model, price, vehicle features (like the number of cylinders), MPG, and an insurance risk score. The goal of this sample is to predict the price of the car.
## Pre-process the data
The main data preparation tasks include data cleaning, integration, transformation, reduction, and discretization or quantization. In the designer, you can find modules to perform these operations and other data pre-processing tasks in the **Data Transformation** group in the left panel.
Use the **Select Columns in Dataset** module to exclude the **normalized-losses** column, which has many missing values. Then use **Clean Missing Data** to remove the rows that have missing values. This helps to create a clean set of training data.
![Data pre-processing](./media/how-to-designer-sample-regression-automobile-price-basic/data-processing.png)
## Train the model
Machine learning problems vary. Common machine learning tasks include classification, clustering, regression, and recommender systems, each of which might require a different algorithm. Your choice of algorithm often depends on the requirements of the use case. After you pick an algorithm, you need to tune its parameters to train a more accurate model. You then need to evaluate all models based on metrics like accuracy, intelligibility, and efficiency.
Since the goal of this sample is to predict automobile prices, and because the label column (price) is continuous data, a regression model can be a good choice. We use **Linear Regression** for this pipeline.
Use the **Split Data** module to randomly divide the input data so that the training dataset contains 70% of the original data and the testing dataset contains 30% of the original data.
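If you'd like to see the same steps expressed in code, here's a rough scikit-learn sketch. It's an illustrative analogue of the designer modules, not their implementation: the file name is hypothetical, and the column names follow the UCI automobile dataset.

```Python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

autos = pd.read_csv("automobile_price_raw.csv")        # hypothetical file name

# Select Columns in Dataset: drop normalized-losses; Clean Missing Data: drop rows with missing values
autos = autos.drop(columns="normalized-losses").dropna()

X = pd.get_dummies(autos.drop(columns="price"))         # one-hot encode make, fuel type, and other categorical columns
y = autos["price"]

# Split Data: 70% training / 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)

# Linear Regression -> Train Model -> Score Model -> Evaluate Model
model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)
print("MAE:", mean_absolute_error(y_test, predictions))
```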
## Test, evaluate, and compare
Split the dataset and use different datasets to train and test the model to make the evaluation of the model more objective.
After the model is trained, you can use the **Score Model** and **Evaluate Model** modules to generate predicted results and evaluate the models.
**Score Model** generates predictions for the test dataset by using the trained model. To check the result, select the output port of **Score Model** and then select **Visualize**.
![Score result](./media/how-to-designer-sample-regression-automobile-price-basic/sample1-score-1225.png)
Pass the scores to the **Evaluate Model** module to generate evaluation metrics. To check the result, select the output port of the **Evaluate Model** and then select **Visualize**.
![Evaluate result](./media/how-to-designer-sample-regression-automobile-price-basic/sample1-evaluate-1225.png)
## Next steps
Explore the other samples available for the designer:
- [Sample 2 - Regression: Compare algorithms for automobile price prediction](how-to-designer-sample-regression-automobile-price-compare-algorithms.md)
- [Sample 3 - Classification with feature selection: Income Prediction](how-to-designer-sample-classification-predict-income.md)
- [Sample 4 - Classification: Predict credit risk (cost sensitive)](how-to-designer-sample-classification-credit-risk-cost-sensitive.md)
- [Sample 5 - Classification: Predict churn](how-to-designer-sample-classification-churn.md)
- [Sample 6 - Classification: Predict flight delays](how-to-designer-sample-classification-flight-delay.md)
- [Sample 7 - Text Classification: Wikipedia SP 500 Dataset](how-to-designer-sample-text-classification.md)

View file

@@ -0,0 +1,83 @@
---
title: 'Designer: Predict car prices (advanced) example'
titleSuffix: Azure Machine Learning
description: Build & compare multiple ML regression models to predict an automobile's price based on technical features with Azure Machine Learning designer.
services: machine-learning
ms.service: machine-learning
ms.subservice: core
ms.topic: sample
author: likebupt
ms.author: keli19
ms.reviewer: peterlu
ms.date: 12/25/2019
---
# Train & compare multiple regression models to predict car prices with Azure Machine Learning designer
**Designer (preview) sample 2**
Learn how to build a machine learning pipeline without writing a single line of code using the designer (preview). This sample trains and compares multiple regression models to predict a car's price based on its technical features. We'll provide the rationale for the choices made in this pipeline so you can tackle your own machine learning problems.
If you're just getting started with machine learning, take a look at the [basic version](how-to-designer-sample-regression-automobile-price-basic.md) of this pipeline.
Here's the completed graph for this pipeline:
[![Graph of the pipeline](./media/how-to-designer-sample-regression-automobile-price-compare-algorithms/graph.png)](./media/how-to-designer-sample-regression-automobile-price-compare-algorithms/graph.png#lightbox)
## Pipeline summary
Use the following steps to build the machine learning pipeline:
1. Get the data.
1. Pre-process the data.
1. Train the model.
1. Test, evaluate, and compare the models.
## Get the data
This sample uses the **Automobile price data (Raw)** dataset, which is from the UCI Machine Learning Repository. This dataset contains 26 columns that contain information about automobiles, including make, model, price, vehicle features (like the number of cylinders), MPG, and an insurance risk score.
## Pre-process the data
The main data preparation tasks include data cleaning, integration, transformation, reduction, and discretization or quantization. In the designer, you can find modules to perform these operations and other data pre-processing tasks in the **Data Transformation** group in the left panel.
Use the **Select Columns in Dataset** module to exclude the **normalized-losses** column, which has many missing values. We then use **Clean Missing Data** to remove the rows that have missing values. This helps to create a clean set of training data.
![Data pre-processing](./media/how-to-designer-sample-regression-automobile-price-compare-algorithms/data-processing.png)
## Train the model
Machine learning problems vary. Common machine learning tasks include classification, clustering, regression, and recommender systems, each of which might require a different algorithm. Your choice of algorithm often depends on the requirements of the use case. After you pick an algorithm, you need to tune its parameters to train a more accurate model. You then need to evaluate all models based on metrics like accuracy, intelligibility, and efficiency.
Because the goal of this pipeline is to predict automobile prices, and because the label column (price) contains real numbers, a regression model is a good choice.
To compare the performance of different algorithms, we use two nonlinear algorithms, **Boosted Decision Tree Regression** and **Decision Forest Regression**, to build models. Both algorithms have parameters that you can change, but this sample uses the default values for this pipeline.
Use the **Split Data** module to randomly divide the input data so that the training dataset contains 70% of the original data and the testing dataset contains 30% of the original data.
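As an illustrative analogue, here's a short scikit-learn sketch of the same comparison, with `GradientBoostingRegressor` and `RandomForestRegressor` standing in for the two designer modules. The file name and preprocessing details are assumptions.

```Python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

autos = pd.read_csv("automobile_price_raw.csv")                 # hypothetical file name
autos = autos.drop(columns="normalized-losses").dropna()

X = pd.get_dummies(autos.drop(columns="price"))
y = autos["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)

# Default parameters, as in the sample pipeline
models = {
    "Boosted Decision Tree Regression": GradientBoostingRegressor(random_state=0),
    "Decision Forest Regression": RandomForestRegressor(random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(f"{name}: RMSE = {rmse:.2f}")
```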
## Test, evaluate, and compare the models
You use two different sets of randomly chosen data to train and then test the model, as described in the previous section. Using separate datasets for training and testing makes the evaluation of the model more objective.
After the model is trained, use the **Score Model** and **Evaluate Model** modules to generate predicted results and evaluate the models. **Score Model** generates predictions for the test dataset by using the trained model. Then pass the scores to **Evaluate Model** to generate evaluation metrics.
Here are the results:
![Compare the results](./media/how-to-designer-sample-regression-automobile-price-compare-algorithms/result.png)
These results show that the model built with **Boosted Decision Tree Regression** has a lower root mean squared error than the model built with **Decision Forest Regression**.
## Next steps
Explore the other samples available for the designer:
- [Sample 1 - Regression: Predict an automobile's price](how-to-designer-sample-regression-automobile-price-basic.md)
- [Sample 3 - Classification with feature selection: Income Prediction](how-to-designer-sample-classification-predict-income.md)
- [Sample 4 - Classification: Predict credit risk (cost sensitive)](how-to-designer-sample-classification-credit-risk-cost-sensitive.md)
- [Sample 5 - Classification: Predict churn](how-to-designer-sample-classification-churn.md)
- [Sample 6 - Classification: Predict flight delays](how-to-designer-sample-classification-flight-delay.md)
- [Sample 7 - Text Classification: Wikipedia SP 500 Dataset](how-to-designer-sample-text-classification.md)

View file

@@ -0,0 +1,110 @@
---
title: 'Designer: Predict company category example'
titleSuffix: Azure Machine Learning
description: Build a multiclass logistic regression classifier to predict the company category with the Wikipedia SP 500 dataset using Azure Machine Learning designer.
services: machine-learning
ms.service: machine-learning
ms.subservice: core
ms.topic: sample
author: likebupt
ms.author: keli19
ms.reviewer: peterlu
ms.date: 02/11/2020
---
# Build a classifier to predict company category using Azure Machine Learning designer
**Designer (preview) sample 7**
This sample demonstrates how to use text analytics modules to build a text classification pipeline in Azure Machine Learning designer (preview).
The goal of text classification is to assign some piece of text to one or more predefined classes or categories. The piece of text could be a document, news article, search query, email, tweet, support ticket, customer feedback, or user product review. Applications of text classification include categorizing newspaper articles and news wire content into topics, organizing web pages into hierarchical categories, filtering spam email, sentiment analysis, predicting user intent from search queries, routing support tickets, and analyzing customer feedback.
This pipeline trains a **multiclass logistic regression classifier** to predict the company category with the **Wikipedia SP 500 dataset**, which is derived from Wikipedia.
The fundamental steps of training a machine learning model with text data are:
1. Get the data
1. Pre-process the text data
1. Feature engineering
   Convert text features into numerical features by using feature-extraction modules such as **Feature Hashing**, or extract n-gram features from the text data.
1. Train the model
1. Score dataset
1. Evaluate the model
Here's the final, completed graph of the pipeline we'll be working on. We'll provide the rationale for all the modules so you can make similar decisions on your own.
[![Graph of the pipeline](./media/how-to-designer-sample-text-classification/nlp-modules-overall.png)](./media/how-to-designer-sample-text-classification/nlp-modules-overall.png#lightbox)
## Data
In this pipeline, we use the **Wikipedia SP 500** dataset. The dataset is derived from Wikipedia (https://www.wikipedia.org/) based on articles of each S&P 500 company. Before uploading to Azure Machine Learning designer, the dataset was processed as follows:
- Extract text content for each specific company
- Remove wiki formatting
- Remove non-alphanumeric characters
- Convert all text to lowercase
- Add known company categories
Articles could not be found for some companies, so the number of records is less than 500.
## Pre-process the text data
We use the **Preprocess Text** module to preprocess the text data, which includes detecting sentences, tokenizing them, and so on. You can find all supported options in the [**Preprocess Text**](algorithm-module-reference/preprocess-text.md) article.
After pre-processing text data, we use the **Split Data** module to randomly divide the input data so that the training dataset contains 50% of the original data and the testing dataset contains 50% of the original data.
## Feature Engineering
In this sample, we use two methods to perform feature engineering.
### Feature Hashing
We used the [**Feature Hashing**](algorithm-module-reference/feature-hashing.md) module to convert the plain text of the articles to integers and used the integer values as input features to the model.
The **Feature Hashing** module can be used to convert variable-length text documents to equal-length numeric feature vectors, using the 32-bit murmurhash v3 hashing method provided by the Vowpal Wabbit library. The objective of using feature hashing is dimensionality reduction; also feature hashing makes the lookup of feature weights faster at classification time because it uses hash value comparison instead of string comparison.
In the sample pipeline, we set the number of hashing bits to 14 and set the number of n-grams to 2. With these settings, the hash table can hold 2^14 entries, in which each hashing feature represents one or more n-gram features and its value represents the occurrence frequency of that n-gram in the text instance. For many problems, a hash table of this size is more than adequate, but in some cases, more space might be needed to avoid collisions. Evaluate the performance of your machine learning solution using different numbers of hashing bits.
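Outside the designer, the same idea can be sketched with scikit-learn's `HashingVectorizer`. It's only an analogue: the designer module uses Vowpal Wabbit's murmurhash-based implementation, while this sketch uses scikit-learn's hashing, but the settings mirror the ones above (2^14 features, 1- and 2-grams), and the example documents are made up.

```Python
from sklearn.feature_extraction.text import HashingVectorizer

docs = [
    "company designs and manufactures consumer electronics",
    "bank provides retail and commercial lending services",
]

# 14 hashing bits -> 2**14 feature columns; unigrams and bigrams
vectorizer = HashingVectorizer(n_features=2**14, ngram_range=(1, 2), alternate_sign=False)
features = vectorizer.transform(docs)
print(features.shape)   # (2, 16384) sparse matrix of n-gram occurrence weights
```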
### Extract N-Gram Feature from Text
An n-gram is a contiguous sequence of n terms from a given sequence of text. An n-gram of size 1 is referred to as a unigram; an n-gram of size 2 is a bigram; an n-gram of size 3 is a trigram. N-grams of larger sizes are sometimes referred to by the value of n, for instance, "four-gram", "five-gram", and so on.
We used the [**Extract N-Gram Feature from Text**](algorithm-module-reference/extract-n-gram-features-from-text.md) module as another solution for feature engineering. This module first extracts the set of n-grams. In addition to the n-grams, it counts the number of documents in which each n-gram appears (DF). In this sample, the TF-IDF metric is used to calculate feature values. The module then converts unstructured text data into equal-length numeric feature vectors, where each feature represents the TF-IDF of an n-gram in a text instance.
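For a scripted analogue of this approach, scikit-learn's `TfidfVectorizer` works similarly: it builds the n-gram vocabulary and document frequencies on the training text, and the fitted vocabulary is then reused unchanged to transform the test text, much like connecting the result vocabulary to the scoring flow in **ReadOnly** mode. The example documents are made up.

```Python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["pharmaceutical company develops vaccines",
              "airline operates passenger flights"]
test_docs = ["company operates regional passenger flights"]

# Extract 1- and 2-grams and weight them by TF-IDF
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_docs)   # builds the vocabulary and document frequencies

# At scoring time, the fitted vocabulary is reused unchanged (read-only)
X_test = vectorizer.transform(test_docs)
print(X_train.shape, X_test.shape)
```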
After converting text data into numeric feature vectors, a **Select Columns in Dataset** module is used to remove the text data from the dataset.
## Train the model
Your choice of algorithm often depends on the requirements of the use case.
Because the goal of this pipeline is to predict the category of a company, a multiclass classifier model is a good choice. Considering that the number of features is large and these features are sparse, we use the **Multiclass Logistic Regression** model for this pipeline.
## Test, evaluate, and compare
We split the dataset and use different datasets to train and test the model to make the evaluation of the model more objective.
After the model is trained, we use the **Score Model** and **Evaluate Model** modules to generate predicted results and evaluate the models. However, before using the **Score Model** module, you must perform the same feature engineering on the test data that you performed during training.
For the **Feature Hashing** module, feature engineering is as easy to perform in the scoring flow as in the training flow: use the **Feature Hashing** module directly to process the input text data.
For the **Extract N-Gram Feature from Text** module, connect the **Result Vocabulary** output from the training dataflow to the **Input Vocabulary** on the scoring dataflow, and set the **Vocabulary mode** parameter to **ReadOnly**.
[![Graph of n-gram score](./media/how-to-designer-sample-text-classification/n-gram.png)](./media/how-to-designer-sample-text-classification/n-gram.png)
After finishing the engineering step, **Score Model** can be used to generate predictions for the test dataset by using the trained model. To check the result, select the output port of **Score Model** and then select **Visualize**.
We then pass the scores to the **Evaluate Model** module to generate evaluation metrics. **Evaluate Model** has two input ports, so we can evaluate and compare scored datasets that are generated with different methods. In this sample, we compare the performance of the results generated with the feature hashing method and the n-gram method.
To check the result, select the output port of the **Evaluate Model** and then select **Visualize**.
## Next steps
Explore the other samples available for the designer:
- [Sample 1 - Regression: Predict an automobile's price](how-to-designer-sample-regression-automobile-price-basic.md)
- [Sample 2 - Regression: Compare algorithms for automobile price prediction](how-to-designer-sample-regression-automobile-price-compare-algorithms.md)
- [Sample 3 - Classification with feature selection: Income Prediction](how-to-designer-sample-classification-predict-income.md)
- [Sample 4 - Classification: Predict credit risk (cost sensitive)](how-to-designer-sample-classification-credit-risk-cost-sensitive.md)
- [Sample 5 - Classification: Predict churn](how-to-designer-sample-classification-churn.md)
- [Sample 6 - Classification: Predict flight delays](how-to-designer-sample-classification-flight-delay.md)

36 binary image files added (not shown).