529 lines
96 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# NimbusML Column Selection Syntax - Flight Schedule\n",
|
|
"\n",
|
|
"In this example, we present the column syntax introduced by NimbusML. While building a pipeline, users can apply different transformations to different columns sequentially. Much like building a deep learning model, users add one \"layer\" after another by specifying the input and output column names for each operator, which avoids confusion and makes the computation graph easy to visualize.\n",
|
|
"\n",
|
|
"The problem we solve in this example is a binary classification task: predicting whether a flight will be delayed. The training and test data were created manually."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import time\n",
|
|
"import pandas as pd\n",
|
|
"import numpy as np\n",
|
|
"from IPython.display import display,IFrame,Image\n",
|
|
"from nimbusml.feature_extraction.categorical import OneHotVectorizer\n",
|
|
"from nimbusml.ensemble import LightGbmBinaryClassifier\n",
|
|
"from nimbusml.preprocessing import missing_values\n",
|
|
"from nimbusml.preprocessing.schema import ColumnConcatenator\n",
|
|
"from nimbusml import Pipeline, FileDataStream \n",
|
|
"from nimbusml.datasets import get_dataset"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"In this tutorial, we build the model with a NimbusML pipeline trained on a NimbusML FileDataStream."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Loading Data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Getting data file path \n",
|
|
"train_file = get_dataset('fstrain').as_filepath()\n",
|
|
"test_file = get_dataset('fstest').as_filepath()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"The input data consists of both categorical variables ('UniqueCarrier', 'Origin', 'Dest') and numeric variables ('Month', 'DayOfMonth', 'DayOfWeek', 'DepTime', 'Distance', 'DepDelay').\n",
|
|
"\n",
|
|
"|Month|DayOfMonth|DayOfWeek|DepTime|Distance|UniqueCarrier|Origin|Dest|DepDelay|Label|\n",
|
|
"|-----|----------|---------|-------|--------|-------------|------|----|--------|-----|\n",
|
|
"|1|2|2|1525|293|WN|DAL|LBB|12|0|\n",
|
|
"|1|2|2|940|192|WN|HOU|SAT|-4|0|\n",
|
|
"|1|2|2|700|1044|WN|MCI|PHX|0|0|"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We want to:\n",
" 1. Apply missing_values.Handler to \"UniqueCarrier\"\n",
" 2. Apply OneHotVectorizer to all raw categorical columns, including column \"UniqueCarrier\" after step (1)\n",
" 3. Apply missing_values.Handler to all numeric features, i.e. (\"Month\", \"DayOfMonth\", \"DayOfWeek\", \"DepTime\", \"Distance\", \"DepDelay\")."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"The following figure illustrates the steps above. Each item corresponds to a column, and each arrow corresponds to a transformation or learner."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"image/png": "\n",
|
|
"text/plain": [
|
|
"<IPython.core.display.Image object>"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
}
|
|
],
|
|
"source": [
|
|
"display(Image(filename='images/FDFigure.png'))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Using the new NimbusML column syntax, the pipeline described in the figure above can be implemented easily. Users can define the input and output column names for each operator using **columns = {\"output_column_name\": \"input_column_name(s)\"}** or the \"<<\" operator."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Training"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"categorical_columns = ['UniqueCarrier', 'Origin', 'Dest']\n",
|
|
"numeric_columns = ['Month', 'DayOfMonth', 'DayOfWeek', 'DepTime', 'Distance', 'DepDelay']\n",
|
|
"\n",
|
|
"ppl = Pipeline([\n",
|
|
" OneHotVectorizer(columns = categorical_columns),\n",
|
|
" missing_values.Handler(columns = {'UniqueCarrier_Handler':'UniqueCarrier'}), \n",
|
|
" missing_values.Handler(columns = numeric_columns), \n",
|
|
"\n",
|
|
" # After the feature transformation, we add a LightGbm learner.\n",
|
|
" LightGbmBinaryClassifier(feature = categorical_columns + numeric_columns + ['UniqueCarrier_Handler'],\n",
|
|
" label = 'Label') \n",
|
|
" ])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Syntax for Transform: Dictionary"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"When initializing a transform, users can specify both the input and the output column names. For instance, \n",
|
|
"\n",
|
|
" missing_values.Handler(columns = {'UniqueCarrier_Handler':'UniqueCarrier'})\n",
|
|
"or\n",
|
|
" \n",
|
|
" missing_values.Handler() << {'UniqueCarrier_Handler':'UniqueCarrier'}\n",
|
|
"\n",
|
|
"indicates that:\n",
|
|
"1. The missing_values.Handler will select column \"UniqueCarrier\" for the execution, i.e. **input = 'UniqueCarrier'**\n",
|
|
"2. The transformed data will be saved to a new column named \"UniqueCarrier_Handler\", i.e. **output = 'UniqueCarrier_Handler'**\n",
|
|
"\n",
|
|
"Note that most transforms do not accept a dictionary with a string as key and a list as value, such as {\"num_features\": numeric_columns}. The exception is ColumnConcatenator, which concatenates multiple columns into a new column. \n",
|
|
"\n",
|
|
"After this transform, in addition to the original columns in the input dataset, a new column named \"UniqueCarrier_Handler\" will be generated. If the user specifies the same output name as the input, the column will be overwritten.\n",
|
|
"\n",
|
|
"One special feature of NimbusML is that a column may hold an array (vector) rather than a single value. For instance, a dataset with three columns in NimbusML may look like:\n",
|
|
"\n",
|
|
"|Month,DayOfMonth,DayOfWeek|Origin,Dest|Features|\n",
|
|
"|----------------------------|-------------|----------|\n",
|
|
"|1,2,2,1525,293 |WN,DAL |12,0 |\n",
|
|
"|1,2,2,940,192 |WN,HOU |-4,0 |\n",
|
|
"|1,2,2,700,1044 |WN,MCI |0,0 |\n",
|
|
"\n",
|
|
"In this sense, the outputs of OneHotVectorizer(columns = [\"UniqueCarrier\"]) are concatenated into a single vector-valued column named \"UniqueCarrier\", e.g. [0,0,0,1]. A vector-valued output can be used in the next step by passing its column name as input.\n",
|
|
"\n",
|
|
"Note: In NimbusML, some transformations, such as ColumnDropper, don't allow renaming because they generate no new columns; for those, the dictionary form is not allowed. A few transformations, such as ColumnConcatenator, need multiple input columns to generate a new column; for those, a dictionary with a list as value is allowed, e.g. {'new_col':[concate_col1, concate_col2]}. For more details about the usage of each transformation, please refer to the API documentation. "
|
|
]
|
|
},
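The idea of a vector-valued column can be sketched in plain Python. This is a toy stand-in, not NimbusML's OneHotVectorizer: the helper name `one_hot_column` is hypothetical, the carrier codes other than 'WN' are made up for illustration, and indexing categories in order of first appearance is an assumption of this sketch.

```python
def one_hot_column(values):
    """Replace each categorical value with a one-hot vector.

    Toy stand-in for illustration; categories are indexed in order of
    first appearance, which is an assumption made for this sketch.
    """
    categories = list(dict.fromkeys(values))  # unique values, order preserved
    index = {category: i for i, category in enumerate(categories)}
    return [
        [1 if index[value] == i else 0 for i in range(len(categories))]
        for value in values
    ]


# Illustrative carrier codes; each row ends up holding a whole vector.
carriers = ['WN', 'AA', 'WN', 'DL']
vectors = one_hot_column(carriers)
for carrier, vector in zip(carriers, vectors):
    print(carrier, vector)
```

Each entry of `vectors` plays the role of one cell in a vector-valued column: a single named column whose values are arrays rather than scalars.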
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Syntax for Transform: List"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Users can also specify the column names using a list. For instance,\n",
|
|
"\n",
|
|
" OneHotVectorizer(columns = ['UniqueCarrier', 'Origin', 'Dest'])\n",
|
|
"indicates that:\n",
|
|
"1. The OneHotVectorizer will select columns ['UniqueCarrier', 'Origin', 'Dest'] for the execution, i.e. **input = ['UniqueCarrier', 'Origin', 'Dest']**\n",
|
|
"2. The transformed data will be saved to columns with the same names, overwriting the original input columns.\n",
|
|
"\n",
|
|
"For case 2, all the output columns will have the same names as the inputs, i.e. they replace the input columns in the original data frame. Thus, the above syntax is equivalent to:\n",
|
|
"\n",
|
|
" OneHotVectorizer(columns = {'UniqueCarrier':'UniqueCarrier' , 'Origin':'Origin', 'Dest':'Dest'})\n",
|
|
"\n",
|
|
"For each operator, just like when creating a neural network, the inputs and outputs are specified in a dictionary or list. If the input column names are not specified, all the columns from the previous transformation will be used. \n",
|
|
"\n",
|
|
"For more details about the column operations for transforms, please refer to our [documentation](https://docs.microsoft.com/en-us/nimbusml/concepts/columns)."
|
|
]
|
|
},
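The equivalence between the list form and the dictionary form can be sketched in plain Python. This is a toy illustration rather than NimbusML's actual implementation, and the helper name `to_column_mapping` is hypothetical.

```python
def to_column_mapping(columns):
    """Normalize a column spec to {output_name: input_name}.

    Hypothetical helper for illustration only; not part of NimbusML.
    """
    if isinstance(columns, dict):
        # Dictionary form: keys are output names, values are input names.
        return dict(columns)
    # List form: each column is transformed in place (output == input).
    return {name: name for name in columns}


list_form = to_column_mapping(['UniqueCarrier', 'Origin', 'Dest'])
dict_form = to_column_mapping({'UniqueCarrier_Handler': 'UniqueCarrier'})

print(list_form)
print(dict_form)
```

Under this reading, a list is simply shorthand for a dictionary in which every output name equals its input name, which is why the list form overwrites the input columns.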
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Syntax for Learner: Role"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"For most learners, NimbusML introduces a class **Role**. Users can specify the roles for different columns in the column syntax using a dictionary, for example:\n",
|
|
"\n",
|
|
" from nimbusml import Role\n",
|
|
" LightGbmBinaryClassifier(columns = {Role.Feature: categorical_columns + numeric_columns, Role.Label:'Label'})\n",
|
|
" \n",
|
|
"or\n",
|
|
"\n",
|
|
" LightGbmBinaryClassifier(feature = categorical_columns + numeric_columns, label = 'Label')\n",
|
|
"\n",
|
|
"This indicates that the input features for LightGbmBinaryClassifier are the columns in categorical_columns + numeric_columns, and that the label column is 'Label'. Other roles include Role.GroupId and Role.Weight. If the label role is specified, users can call ppl.fit(data) directly without passing y. \n",
|
|
"\n",
|
|
"For more details about the column operations for learners, please refer to our [documentation](https://docs.microsoft.com/en-us/nimbusml/concepts/roles)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Creating FileDataStream\n",
|
|
"data_stream_train = FileDataStream.read_csv(train_file, sep=',', numeric_dtype=np.float32)\n",
|
|
"data_stream_test = FileDataStream.read_csv(test_file, sep=',', numeric_dtype=np.float32)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 6,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Not adding a normalizer.\n",
|
|
"Auto-tuning parameters: UseCat = False\n",
|
|
"Auto-tuning parameters: LearningRate = 0.2\n",
|
|
"Auto-tuning parameters: NumLeaves = 20\n",
|
|
"Auto-tuning parameters: MinDataPerLeaf = 20\n",
|
|
"LightGBM objective=binary\n",
|
|
"Not training a calibrator because it is not needed.\n",
|
|
"Elapsed time: 00:00:03.7730095\n",
|
|
"Not adding a normalizer.\n",
|
|
"Auto-tuning parameters: UseCat = False\n",
|
|
"Auto-tuning parameters: LearningRate = 0.2\n",
|
|
"Auto-tuning parameters: NumLeaves = 20\n",
|
|
"Auto-tuning parameters: MinDataPerLeaf = 20\n",
|
|
"LightGBM objective=binary\n",
|
|
"Not training a calibrator because it is not needed.\n",
|
|
"Elapsed time: 00:00:02.7034011\n",
|
|
"Not adding a normalizer.\n",
|
|
"Auto-tuning parameters: UseCat = False\n",
|
|
"Auto-tuning parameters: LearningRate = 0.2\n",
|
|
"Auto-tuning parameters: NumLeaves = 20\n",
|
|
"Auto-tuning parameters: MinDataPerLeaf = 20\n",
|
|
"LightGBM objective=binary\n",
|
|
"Not training a calibrator because it is not needed.\n",
|
|
"Elapsed time: 00:00:02.5283012\n",
|
|
"Not adding a normalizer.\n",
|
|
"Auto-tuning parameters: UseCat = False\n",
|
|
"Auto-tuning parameters: LearningRate = 0.2\n",
|
|
"Auto-tuning parameters: NumLeaves = 20\n",
|
|
"Auto-tuning parameters: MinDataPerLeaf = 20\n",
|
|
"LightGBM objective=binary\n",
|
|
"Not training a calibrator because it is not needed.\n",
|
|
"Elapsed time: 00:00:02.4300194\n",
|
|
"Not adding a normalizer.\n",
|
|
"Auto-tuning parameters: UseCat = False\n",
|
|
"Auto-tuning parameters: LearningRate = 0.2\n",
|
|
"Auto-tuning parameters: NumLeaves = 20\n",
|
|
"Auto-tuning parameters: MinDataPerLeaf = 20\n",
|
|
"LightGBM objective=binary\n",
|
|
"Not training a calibrator because it is not needed.\n",
|
|
"Elapsed time: 00:00:02.4320082\n",
|
|
"Not adding a normalizer.\n",
|
|
"Auto-tuning parameters: UseCat = False\n",
|
|
"Auto-tuning parameters: LearningRate = 0.2\n",
|
|
"Auto-tuning parameters: NumLeaves = 20\n",
|
|
"Auto-tuning parameters: MinDataPerLeaf = 20\n",
|
|
"LightGBM objective=binary\n",
|
|
"Not training a calibrator because it is not needed.\n",
|
|
"Elapsed time: 00:00:02.3711266\n",
|
|
"Not adding a normalizer.\n",
|
|
"Auto-tuning parameters: UseCat = False\n",
|
|
"Auto-tuning parameters: LearningRate = 0.2\n",
|
|
"Auto-tuning parameters: NumLeaves = 20\n",
|
|
"Auto-tuning parameters: MinDataPerLeaf = 20\n",
|
|
"LightGBM objective=binary\n",
|
|
"Not training a calibrator because it is not needed.\n",
|
|
"Elapsed time: 00:00:02.3977217\n",
|
|
"Not adding a normalizer.\n",
|
|
"Auto-tuning parameters: UseCat = False\n",
|
|
"Auto-tuning parameters: LearningRate = 0.2\n",
|
|
"Auto-tuning parameters: NumLeaves = 20\n",
|
|
"Auto-tuning parameters: MinDataPerLeaf = 20\n",
|
|
"LightGBM objective=binary\n",
|
|
"Not training a calibrator because it is not needed.\n",
|
|
"Elapsed time: 00:00:02.4655428\n",
|
|
"2.49 s ± 104 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# Training with NimbusML file stream\n",
|
|
"%timeit ppl.fit(data_stream_train) "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Testing"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Similarly, we can call pipeline.test() to generate predictions and performance metrics."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 7,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Wall time: 1.86 s\n",
|
|
"Prediction for first 5 rows:\n"
|
|
]
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>PredictedLabel</th>\n",
|
|
" <th>Probability</th>\n",
|
|
" <th>Score</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>0.0</td>\n",
|
|
" <td>0.000176</td>\n",
|
|
" <td>-17.291693</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>0.0</td>\n",
|
|
" <td>0.000172</td>\n",
|
|
" <td>-17.333096</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>0.0</td>\n",
|
|
" <td>0.000168</td>\n",
|
|
" <td>-17.378891</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>0.0</td>\n",
|
|
" <td>0.000175</td>\n",
|
|
" <td>-17.305880</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>0.0</td>\n",
|
|
" <td>0.000171</td>\n",
|
|
" <td>-17.343319</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" PredictedLabel Probability Score\n",
|
|
"0 0.0 0.000176 -17.291693\n",
|
|
"1 0.0 0.000172 -17.333096\n",
|
|
"2 0.0 0.000168 -17.378891\n",
|
|
"3 0.0 0.000175 -17.305880\n",
|
|
"4 0.0 0.000171 -17.343319"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Evaluation of the model using .test(): \n"
|
|
]
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>AUC</th>\n",
|
|
" <th>Accuracy</th>\n",
|
|
" <th>Positive precision</th>\n",
|
|
" <th>Positive recall</th>\n",
|
|
" <th>Negative precision</th>\n",
|
|
" <th>Negative recall</th>\n",
|
|
" <th>Log-loss</th>\n",
|
|
" <th>Log-loss reduction</th>\n",
|
|
" <th>Test-set entropy (prior Log-Loss/instance)</th>\n",
|
|
" <th>F1 Score</th>\n",
|
|
" <th>AUPRC</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>0.754464</td>\n",
|
|
" <td>0.86</td>\n",
|
|
" <td>0.75</td>\n",
|
|
" <td>0.1875</td>\n",
|
|
" <td>0.864583</td>\n",
|
|
" <td>0.988095</td>\n",
|
|
" <td>1.739584</td>\n",
|
|
" <td>-174.248426</td>\n",
|
|
" <td>0.63431</td>\n",
|
|
" <td>0.3</td>\n",
|
|
" <td>0.337565</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" AUC Accuracy Positive precision Positive recall \\\n",
|
|
"0 0.754464 0.86 0.75 0.1875 \n",
|
|
"\n",
|
|
" Negative precision Negative recall Log-loss Log-loss reduction \\\n",
|
|
"0 0.864583 0.988095 1.739584 -174.248426 \n",
|
|
"\n",
|
|
" Test-set entropy (prior Log-Loss/instance) F1 Score AUPRC \n",
|
|
"0 0.63431 0.3 0.337565 "
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
}
|
|
],
|
|
"source": [
|
|
"# Testing using file stream\n",
|
|
"%time metrics, scores = ppl.test(data_stream_test,\"Label\", output_scores = True) \n",
|
|
"\n",
|
|
"print(\"Prediction for first 5 rows:\")\n",
|
|
"display(scores[0:5])\n",
|
|
"\n",
|
|
"print(\"Evaluation of the model using .test(): \")\n",
|
|
"display(metrics)"
|
|
]
|
|
}
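Two of the reported metrics can be cross-checked from the others with a little arithmetic. The sketch below uses the values from the metrics table above. The F1 formula is the standard harmonic mean of precision and recall; the log-loss-reduction formula is an assumption about how that metric is defined, chosen because it reproduces the reported value. Both function names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def log_loss_reduction(log_loss, prior_log_loss):
    """Relative improvement over the prior baseline, in percent.

    Assumed definition for this sketch; negative values mean the
    model's log-loss is worse than the prior's.
    """
    return 100 * (prior_log_loss - log_loss) / prior_log_loss


# Values copied from the metrics table above.
f1 = f1_score(precision=0.75, recall=0.1875)
reduction = log_loss_reduction(log_loss=1.739584, prior_log_loss=0.63431)

print(round(f1, 4))
print(round(reduction, 2))
```

The negative log-loss reduction suggests that, although accuracy is 0.86, the predicted probabilities are poorly calibrated on this small test set: the model's log-loss is higher than that of a baseline that always predicts the label frequency.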
|
|
],
|
|
"metadata": {
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.6.4"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
}
|