mirror of https://github.com/microsoft/SynapseML.git
docs: update notebooks - bring back fabric reviewers changes. (#1979)
* update doc for fabric
* use previous multivariate anomaly detection notebook
* revert change
* bring back reviewers changes
* use master isolationForestNotebook
* format and doc bug fix
* fix Multivariate Anomaly Detection doc version
* Update notebooks/features/cognitive_services/CognitiveServices - Multivariate Anomaly Detection.ipynb Co-authored-by: Mark Hamilton <mhamilton723@gmail.com>
* Update notebooks/features/lightgbm/LightGBM - Overview.ipynb Co-authored-by: Mark Hamilton <mhamilton723@gmail.com>
* Update notebooks/features/classification/Classification - Before and After SynapseML.ipynb
* Update notebooks/features/responsible_ai/Interpretability - Tabular SHAP explainer.ipynb
---------
This commit is contained in:
Parent
46a56e1c8f
Commit
094c180c3e
|
@ -1,27 +1,50 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Classification - Before and After SynapseML\n",
|
||||
"\n",
|
||||
"### 1. Introduction\n",
|
||||
"\n",
|
||||
"<p><img src=\"https://images-na.ssl-images-amazon.com/images/G/01/img16/books/bookstore/landing-page/1000638_books_landing-page_bookstore-photo-01.jpg\" style=\"width: 500px;\" title=\"Image from https://images-na.ssl-images-amazon.com/images/G/01/img16/books/bookstore/landing-page/1000638_books_landing-page_bookstore-photo-01.jpg\" /><br /></p>\n",
|
||||
"\n",
|
||||
"In this tutorial, we perform the same classification task in two\n",
|
||||
"# Classification - before and after SynapseML"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"tags": [
|
||||
"hide-synapse-internal"
|
||||
]
|
||||
},
|
||||
"source": [
|
||||
"<p><img src=\"https://images-na.ssl-images-amazon.com/images/G/01/img16/books/bookstore/landing-page/1000638_books_landing-page_bookstore-photo-01.jpg\" style=\"width: 500px;\" title=\"Image from https://images-na.ssl-images-amazon.com/images/G/01/img16/books/bookstore/landing-page/1000638_books_landing-page_bookstore-photo-01.jpg\" /><br /></p>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In this article, you perform the same classification task in two\n",
|
||||
"different ways: once using plain **`pyspark`** and once using the\n",
|
||||
"**`synapseml`** library. The two methods yield the same performance,\n",
|
||||
"but one of the two libraries is drastically simpler to use and iterate\n",
|
||||
"on (can you guess which one?).\n",
|
||||
"but highlights the simplicity of using `synapseml` compared to `pyspark`.\n",
|
||||
"\n",
|
||||
"The task is simple: Predict whether a user's review of a book sold on\n",
|
||||
"Amazon is good (rating > 3) or bad based on the text of the review. We\n",
|
||||
"accomplish this by training LogisticRegression learners with different\n",
|
||||
"The task is to predict whether a customer's review of a book sold on\n",
|
||||
"Amazon is good (rating > 3) or bad based on the text of the review. You\n",
|
||||
"accomplish it by training LogisticRegression learners with different\n",
|
||||
"hyperparameters and choosing the best model."
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Setup\n",
|
||||
"Import necessary Python libraries and get a spark session."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
|
@ -35,12 +58,13 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### 2. Read the data\n",
|
||||
"## Read the data\n",
|
||||
"\n",
|
||||
"We download and read in the data. We show a sample below:"
|
||||
"Download and read in the data."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -56,16 +80,16 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### 3. Extract more features and process data\n",
|
||||
"## Extract features and process data\n",
|
||||
"\n",
|
||||
"Real data however is more complex than the above dataset. It is common\n",
|
||||
"for a dataset to have features of multiple types: text, numeric,\n",
|
||||
"categorical. To illustrate how difficult it is to work with these\n",
|
||||
"datasets, we add two numerical features to the dataset: the **word\n",
|
||||
"count** of the review and the **mean word length**."
|
||||
"Real data is more complex than the above dataset. It's common\n",
|
||||
"for a dataset to have features of multiple types, such as text, numeric, and\n",
|
||||
"categorical. To illustrate how difficult it is to work with these\n",
|
||||
"datasets, add two numerical features to the dataset: the **word count** of the review and the **mean word length**."
|
||||
]
|
||||
},
|
||||
{
|
||||
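Note on the two numeric features described in this hunk: the plain-Python sketch below shows one way a word count and a mean word length could be computed for a single review string. The function name `review_features` and the whitespace tokenization are assumptions made for illustration; the notebook derives these columns on a Spark DataFrame, and that code is not shown in this hunk.

```python
from typing import Tuple


def review_features(text: str) -> Tuple[int, float]:
    """Return (word_count, mean_word_length) for one review.

    Words are split on whitespace; an empty review yields (0, 0.0).
    """
    words = text.split()
    if not words:
        return 0, 0.0
    mean_length = sum(len(word) for word in words) / len(words)
    return len(words), mean_length


if __name__ == "__main__":
    # Hypothetical review text, used only to exercise the function.
    print(review_features("A great book"))
```

Returning `(0, 0.0)` for an empty review avoids a division by zero; whether the notebook handles empty reviews the same way is not visible here.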
|
@ -142,25 +166,22 @@
|
|||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### 4a. Classify using pyspark\n",
|
||||
"## Classify using pyspark\n",
|
||||
"\n",
|
||||
"To choose the best LogisticRegression classifier using the `pyspark`\n",
|
||||
"library, need to *explicitly* perform the following steps:\n",
|
||||
"library, you need to *explicitly* perform the following steps:\n",
|
||||
"\n",
|
||||
"1. Process the features:\n",
|
||||
" * Tokenize the text column\n",
|
||||
" * Hash the tokenized column into a vector using hashing\n",
|
||||
" * Merge the numeric features with the vector in the step above\n",
|
||||
" * Merge the numeric features with the vector\n",
|
||||
"2. Process the label column: cast it into the proper type.\n",
|
||||
"3. Train multiple LogisticRegression algorithms on the `train` dataset\n",
|
||||
" with different hyperparameters\n",
|
||||
"4. Compute the area under the ROC curve for each of the trained models\n",
|
||||
" and select the model with the highest metric as computed on the\n",
|
||||
" `test` dataset\n",
|
||||
"5. Evaluate the best model on the `validation` set\n",
|
||||
"\n",
|
||||
"As you can see below, there is a lot of work involved and a lot of\n",
|
||||
"steps where something can go wrong!"
|
||||
"5. Evaluate the best model on the `validation` set"
|
||||
]
|
||||
},
|
||||
{
|
||||
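Note on step 1 of the list in this hunk: the feature-processing steps (tokenize, hash, merge) can be sketched in plain Python as follows. This is an illustration of the hashing trick only, not the notebook's `pyspark` code, which is not shown in this hunk. The helper names, the `num_features` default, and the use of `zlib.crc32` as the hash function are all assumptions; Spark's own hashing transformer uses a different hash function.

```python
import zlib
from typing import List, Sequence


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def hash_tokens(tokens: Sequence[str], num_features: int = 16) -> List[float]:
    """Map tokens to a fixed-length count vector with the hashing trick."""
    vector = [0.0] * num_features
    for token in tokens:
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1.0
    return vector


def assemble(text: str, numeric: Sequence[float], num_features: int = 16) -> List[float]:
    """Concatenate the hashed text vector with the numeric features."""
    return hash_tokens(tokenize(text), num_features) + list(numeric)


if __name__ == "__main__":
    # Hypothetical review with word count 3 and mean word length 4.0.
    print(assemble("good good book", [3.0, 4.0], num_features=8))
```

Because the vector length is fixed, distinct tokens can collide in the same slot; a larger `num_features` makes collisions less likely at the cost of a wider vector.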
|
@ -235,16 +256,16 @@
|
|||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### 4b. Classify using synapseml\n",
|
||||
"## Classify using SynapseML\n",
|
||||
"\n",
|
||||
"Life is a lot simpler when using `synapseml`!\n",
|
||||
"The pipeline can be simplified by using SynapseML:\n",
|
||||
"\n",
|
||||
"1. The **`TrainClassifier`** Estimator featurizes the data internally,\n",
|
||||
" as long as the columns selected in the `train`, `test`, `validation`\n",
|
||||
" dataset represent the features\n",
|
||||
"\n",
|
||||
"2. The **`FindBestModel`** Estimator finds the best model from a pool of\n",
|
||||
" trained models by finding the model which performs best on the `test`\n",
|
||||
" trained models by finding the model that performs best on the `test`\n",
|
||||
" dataset given the specified metric\n",
|
||||
"\n",
|
||||
"3. The **`ComputeModelStatistics`** Transformer computes the different\n",
|
||||
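Note on the model-selection step described in this hunk: the idea of scoring several trained models on the `test` dataset and keeping the one with the highest area under the ROC curve can be sketched in plain Python. This is not SynapseML's `FindBestModel` implementation; `roc_auc`, `find_best`, and the toy models below are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def find_best(
    models: Dict[str, Callable[[Sequence[float]], float]],
    features: Sequence[Sequence[float]],
    labels: Sequence[int],
) -> Tuple[str, float]:
    """Return (name, auc) of the model with the highest AUC on the given data."""
    results = {
        name: roc_auc(labels, [model(x) for x in features])
        for name, model in models.items()
    }
    best = max(results, key=results.get)
    return best, results[best]


if __name__ == "__main__":
    # Toy "models": each maps a feature row to a score.
    candidates = {"by_length": lambda x: x[0], "inverse": lambda x: -x[0]}
    test_features = [(1.0,), (2.0,), (3.0,), (4.0,)]
    test_labels = [0, 0, 1, 1]
    print(find_best(candidates, test_features, test_labels))
```

Following the steps listed in the notebook, selection would use the `test` dataset, and the chosen model would then be evaluated once on the `validation` dataset so that the reported metric is not biased by the selection itself.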
|
|
|
@ -11,7 +11,17 @@
|
|||
}
|
||||
},
|
||||
"source": [
|
||||
"# A 5-minute tour of SynapseML"
|
||||
"# Build your first SynapseML model\n",
|
||||
"This tutorial provides a brief introduction to SynapseML. In particular, we use SynapseML to create two different pipelines for sentiment analysis. The first pipeline combines a text featurization stage with LightGBM regression to predict ratings based on review text from a dataset containing book reviews from Amazon. The second pipeline shows how to use prebuilt models through the Azure Cognitive Services to solve this problem without training data."
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Set up the environment\n",
|
||||
"Import SynapseML libraries and initialize your Spark session."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -39,6 +49,7 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"nteract": {
|
||||
|
@ -48,7 +59,8 @@
|
|||
}
|
||||
},
|
||||
"source": [
|
||||
"# Step 1: Load our Dataset"
|
||||
"## Load a dataset\n",
|
||||
"Load your dataset and split it into train and test sets."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -77,6 +89,7 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
|
@ -87,7 +100,8 @@
|
|||
}
|
||||
},
|
||||
"source": [
|
||||
"# Step 2: Make our Model"
|
||||
"## Create the training pipeline\n",
|
||||
"Create a pipeline that featurizes data using `TextFeaturizer` from the `synapse.ml.featurize.text` library and derives a rating using the `LightGBMRegressor` function."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -116,6 +130,7 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
|
@ -126,7 +141,8 @@
|
|||
}
|
||||
},
|
||||
"source": [
|
||||
"# Step 3: Predict!"
|
||||
"## Predict the output of the test data\n",
|
||||
"Call the `transform` function on the model to predict and display the output of the test data as a dataframe."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -146,6 +162,7 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
|
@ -156,7 +173,8 @@
|
|||
}
|
||||
},
|
||||
"source": [
|
||||
"# Alternate route: Let the Cognitive Services handle it"
|
||||
"## Use Cognitive Services to transform data in one step\n",
|
||||
"Alternatively, for these kinds of tasks that have a prebuilt solution, you can use SynapseML's integration with Cognitive Services to transform your data in one step."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -181,7 +199,9 @@
|
|||
"model = TextSentiment(\n",
|
||||
" textCol=\"text\",\n",
|
||||
" outputCol=\"sentiment\",\n",
|
||||
" subscriptionKey=find_secret(\"cognitive-api-key\"),\n",
|
||||
" subscriptionKey=find_secret(\n",
|
||||
" \"cognitive-api-key\"\n",
|
||||
" ), # Replace the call to find_secret with your key as a python string.\n",
|
||||
").setLocation(\"eastus\")\n",
|
||||
"\n",
|
||||
"display(model.transform(test))"
|
||||
|
|
|
@ -1,6 +1,7 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
|
@ -17,7 +18,7 @@
|
|||
"source": [
|
||||
"# Tutorial: Create a custom search engine and question-answering system\n",
|
||||
"\n",
|
||||
"In this tutorial, learn how to index and query large data loaded from a Spark cluster. You'll set up a Jupyter Notebook that performs the following actions:\n",
|
||||
"In this tutorial, learn how to index and query large data loaded from a Spark cluster. You will set up a Jupyter Notebook that performs the following actions:\n",
|
||||
"\n",
|
||||
"> + Load various forms (invoices) into a data frame in an Apache Spark session\n",
|
||||
"> + Analyze them to determine their features\n",
|
||||
|
@ -27,6 +28,7 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
|
@ -48,7 +50,39 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 0,
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"from pyspark.sql import SparkSession\n",
|
||||
"from synapse.ml.core.platform import running_on_synapse, find_secret\n",
|
||||
"\n",
|
||||
"# Bootstrap Spark Session\n",
|
||||
"spark = SparkSession.builder.getOrCreate()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"tags": [
|
||||
"hide-synapse-internal"
|
||||
]
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"if running_on_synapse():\n",
|
||||
" from notebookutils.visualization import display\n",
|
||||
" import subprocess\n",
|
||||
" import sys\n",
|
||||
"\n",
|
||||
" subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", \"openai\"])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
"cellMetadata": {
|
||||
|
@ -63,36 +97,32 @@
|
|||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"from pyspark.sql import SparkSession\n",
|
||||
"from synapse.ml.core.platform import running_on_synapse, find_secret\n",
|
||||
"\n",
|
||||
"# Bootstrap Spark Session\n",
|
||||
"spark = SparkSession.builder.getOrCreate()\n",
|
||||
"if running_on_synapse():\n",
|
||||
" from notebookutils.visualization import display\n",
|
||||
" import subprocess\n",
|
||||
" import sys\n",
|
||||
"\n",
|
||||
" subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", \"openai\"])\n",
|
||||
"\n",
|
||||
"cognitive_key = find_secret(\"cognitive-api-key\")\n",
|
||||
"cognitive_key = find_secret(\n",
|
||||
" \"cognitive-api-key\"\n",
|
||||
") # Replace the call to find_secret with your key as a python string. e.g. cognitive_key=\"27snaiw...\"\n",
|
||||
"cognitive_location = \"eastus\"\n",
|
||||
"\n",
|
||||
"translator_key = find_secret(\"translator-key\")\n",
|
||||
"translator_key = find_secret(\n",
|
||||
" \"translator-key\"\n",
|
||||
") # Replace the call to find_secret with your key as a python string.\n",
|
||||
"translator_location = \"eastus\"\n",
|
||||
"\n",
|
||||
"search_key = find_secret(\"azure-search-key\")\n",
|
||||
"search_key = find_secret(\n",
|
||||
" \"azure-search-key\"\n",
|
||||
") # Replace the call to find_secret with your key as a python string.\n",
|
||||
"search_service = \"mmlspark-azure-search\"\n",
|
||||
"search_index = \"form-demo-index-5\"\n",
|
||||
"\n",
|
||||
"openai_key = find_secret(\"openai-api-key\")\n",
|
||||
"openai_key = find_secret(\n",
|
||||
" \"openai-api-key\"\n",
|
||||
") # Replace the call to find_secret with your key as a python string.\n",
|
||||
"openai_service_name = \"synapseml-openai\"\n",
|
||||
"openai_deployment_name = \"gpt-35-turbo\"\n",
|
||||
"openai_url = f\"https://{openai_service_name}.openai.azure.com/\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
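Note on the "Replace the call to find_secret with your key" comments in the preceding code cell: rather than pasting a key into the notebook as a literal, one option is to read it from an environment variable. The sketch below assumes hypothetical variable names such as `COGNITIVE_API_KEY`; neither these names nor the `get_key` helper come from the notebook or from SynapseML.

```python
import os


def get_key(env_var: str) -> str:
    """Read a service key from an environment variable, failing loudly if unset."""
    value = os.environ.get(env_var, "")
    if not value:
        raise KeyError(f"Set the {env_var} environment variable to your service key")
    return value


def openai_endpoint(service_name: str) -> str:
    """Build the endpoint URL the same way the cell's f-string does."""
    return f"https://{service_name}.openai.azure.com/"


if __name__ == "__main__":
    # Hypothetical variable name and placeholder value, for demonstration only.
    os.environ.setdefault("COGNITIVE_API_KEY", "placeholder-key")
    cognitive_key = get_key("COGNITIVE_API_KEY")
    print(openai_endpoint("synapseml-openai"))
```

Failing with a clear error when the variable is unset tends to be easier to debug than passing an empty key to a service call.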
|
@ -114,7 +144,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 0,
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
"cellMetadata": {
|
||||
|
@ -155,6 +185,7 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
|
@ -169,13 +200,17 @@
|
|||
},
|
||||
"pycharm": {
|
||||
"name": "#%% md\n"
|
||||
}
|
||||
},
|
||||
"tags": [
|
||||
"hide-synapse-internal"
|
||||
]
|
||||
},
|
||||
"source": [
|
||||
"<img src=\"https://mmlsparkdemo.blob.core.windows.net/ignite2021/form_svgs/Invoice11205.svg\" width=\"40%\"/>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
|
@ -197,7 +232,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 0,
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
"cellMetadata": {
|
||||
|
@ -230,6 +265,7 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
|
@ -253,7 +289,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 0,
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
"cellMetadata": {
|
||||
|
@ -284,6 +320,7 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
|
@ -303,7 +340,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 0,
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
"cellMetadata": {
|
||||
|
@ -331,6 +368,7 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
|
@ -352,7 +390,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 0,
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
"cellMetadata": {
|
||||
|
@ -388,6 +426,7 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
|
@ -407,7 +446,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 0,
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
"cellMetadata": {
|
||||
|
@ -457,7 +496,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 0,
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
"cellMetadata": {
|
||||
|
@ -476,6 +515,7 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
|
@ -490,12 +530,12 @@
|
|||
}
|
||||
},
|
||||
"source": [
|
||||
"## 7 - Infer vendor adress continent with OpenAI"
|
||||
"## 7 - Infer vendor address continent with OpenAI"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 0,
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
"cellMetadata": {
|
||||
|
@ -536,7 +576,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 0,
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
"cellMetadata": {
|
||||
|
@ -555,6 +595,7 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
|
@ -574,7 +615,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 0,
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
"cellMetadata": {
|
||||
|
@ -606,6 +647,7 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
|
@ -625,7 +667,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 0,
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
"cellMetadata": {
|
||||
|
@ -651,6 +693,7 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
|
@ -665,14 +708,24 @@
|
|||
}
|
||||
},
|
||||
"source": [
|
||||
"## 10 - Build a simple chatbot that can use Azure Search as a tool 🧠🔧\n",
|
||||
"#\n",
|
||||
"## 10 - Build a chatbot that can use Azure Search as a tool 🧠🔧"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"tags": [
|
||||
"hide-synapse-internal"
|
||||
]
|
||||
},
|
||||
"source": [
|
||||
"<img src=\"https://mmlspark.blob.core.windows.net/graphics/notebooks/chatbot_flow_2.svg\" width=\"40%\" />"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 0,
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
"cellMetadata": {
|
||||
|
@ -759,6 +812,7 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
|
@ -778,7 +832,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 0,
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
"cellMetadata": {
|
||||
|
@ -797,6 +851,7 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
|
@ -816,7 +871,7 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 0,
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
"cellMetadata": {
|
||||
|
@ -862,4 +917,4 @@
|
|||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 0
|
||||
}
|
||||
}
|
||||
|
|
|
@ -6,72 +6,43 @@
|
|||
"metadata": {},
|
||||
"source": [
|
||||
"# Recipe: Cognitive Services - Multivariate Anomaly Detection \n",
|
||||
"This recipe shows how you can use SynapseML and Azure Cognitive Services on Apache Spark for multivariate anomaly detection. Multivariate anomaly detection allows for the detection of anomalies among many variables or time series, taking into account all the inter-correlations and dependencies between the different variables. In this scenario, we use SynapseML to train a model for multivariate anomaly detection using the Azure Cognitive Services, and we then use the model to infer multivariate anomalies within a dataset containing synthetic measurements from three IoT sensors. \n",
|
||||
"This recipe shows how you can use SynapseML and Azure Cognitive Services on Apache Spark for multivariate anomaly detection. Multivariate anomaly detection allows for the detection of anomalies among many variables or timeseries, taking into account all the inter-correlations and dependencies between the different variables. In this scenario, we use SynapseML to train a model for multivariate anomaly detection using the Azure Cognitive Services, and we then use the model to infer multivariate anomalies within a dataset containing synthetic measurements from three IoT sensors. \n",
|
||||
"\n",
|
||||
"To learn more about the Anomaly Detector Cognitive Service please refer to [ this documentation page](https://docs.microsoft.com/en-us/azure/cognitive-services/anomaly-detector/). \n",
|
||||
"\n",
|
||||
"### Prerequisites\n",
|
||||
"- An Azure subscription - [Create one for free](https://azure.microsoft.com/en-us/free/)\n",
|
||||
"\n",
|
||||
"### Setup\n",
|
||||
"#### Create an Anomaly Detector resource\n",
|
||||
"Follow the instructions below to create an `Anomaly Detector` resource using the Azure portal or alternatively, you can also use the Azure CLI to create this resource.\n",
|
||||
"\n",
|
||||
"- In the Azure Portal, click `Create` in your resource group, and then type `Anomaly Detector`. Click on the Anomaly Detector resource.\n",
|
||||
"- Give the resource a name, and ideally use the same region as the rest of your resource group. Use the default options for the rest, and then click `Review + Create` and then `Create`.\n",
|
||||
"- Once the Anomaly Detector resource is created, open it and click on the `Keys and Endpoints` panel on the left. Copy the key for the Anomaly Detector resource into the `ANOMALY_API_KEY` environment variable, or store it in the `anomalyKey` variable in the cell below.\n",
|
||||
"\n",
|
||||
"#### Create a Storage Account resource\n",
|
||||
"In order to save intermediate data, you will need to create an Azure Blob Storage Account. Within that storage account, create a container for storing the intermediate data. Make note of the container name, and copy the connection string to that container. You will need this later to populate the `containerName` variable and the `BLOB_CONNECTION_STRING` environment variable."
|
||||
"To learn more about the Anomaly Detector Cognitive Service, refer to [this documentation page](https://docs.microsoft.com/azure/cognitive-services/anomaly-detector/). "
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Enter your service keys\n",
|
||||
"Let's start by setting up the environment variables for our service keys. The next cell sets the `ANOMALY_API_KEY` and the `BLOB_CONNECTION_STRING` environment variables based on the values stored in our Azure Key Vault. If you are running this in your own environment, make sure you set these environment variables before you proceed."
|
||||
"\n",
|
||||
"## Setup\n",
|
||||
"### Create an Anomaly Detector resource\n",
|
||||
"Follow the instructions to create an `Anomaly Detector` resource using the Azure portal. Alternatively, you can use the Azure CLI to create this resource.\n",
|
||||
"\n",
|
||||
"- In the Azure portal, click `Create` in your resource group, and then type `Anomaly Detector`. Click on the Anomaly Detector resource.\n",
|
||||
"- Give the resource a name, and ideally use the same region as the rest of your resource group. Use the default options for the rest, and then click `Review + Create` and then `Create`.\n",
|
||||
"- Once the Anomaly Detector resource is created, open it and click on the `Keys and Endpoints` panel on the left. Copy the key for the Anomaly Detector resource into the `ANOMALY_API_KEY` environment variable, or store it in the `anomalyKey` variable.\n",
|
||||
"\n",
|
||||
"### Create a Storage Account resource\n",
|
||||
"In order to save intermediate data, you need to create an Azure Blob Storage Account. Within that storage account, create a container for storing the intermediate data. Make note of the container name, and copy the connection string to that container. You need it later to populate the `containerName` variable and the `BLOB_CONNECTION_STRING` environment variable."
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Enter your service keys\n",
|
||||
"Let's start by setting up the environment variables for our service keys. The next cell sets the `ANOMALY_API_KEY` and the `BLOB_CONNECTION_STRING` environment variables based on the values stored in our Azure Key Vault. If you're running this tutorial in your own environment, make sure you set these environment variables before you proceed."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<style scoped>\n",
|
||||
" .ansiout {\n",
|
||||
" display: block;\n",
|
||||
" unicode-bidi: embed;\n",
|
||||
" white-space: pre-wrap;\n",
|
||||
" word-wrap: break-word;\n",
|
||||
" word-break: break-all;\n",
|
||||
" font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n",
|
||||
" font-size: 13px;\n",
|
||||
" color: #555;\n",
|
||||
" margin-left: 4px;\n",
|
||||
" line-height: 19px;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<div class=\"ansiout\"></div>"
|
||||
]
|
||||
},
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+output": {
|
||||
"addedWidgets": {},
|
||||
"arguments": {},
|
||||
"data": "<div class=\"ansiout\"></div>",
|
||||
"datasetInfos": [],
|
||||
"metadata": {},
|
||||
"removedWidgets": [],
|
||||
"type": "html"
|
||||
}
|
||||
},
|
||||
"output_type": "display_data"
|
||||
}
|
||||
],
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"from pyspark.sql import SparkSession\n",
|
||||
|
@ -92,48 +63,14 @@
|
|||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<style scoped>\n",
|
||||
" .ansiout {\n",
|
||||
" display: block;\n",
|
||||
" unicode-bidi: embed;\n",
|
||||
" white-space: pre-wrap;\n",
|
||||
" word-wrap: break-word;\n",
|
||||
" word-break: break-all;\n",
|
||||
" font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n",
|
||||
" font-size: 13px;\n",
|
||||
" color: #555;\n",
|
||||
" margin-left: 4px;\n",
|
||||
" line-height: 19px;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<div class=\"ansiout\"></div>"
|
||||
]
|
||||
},
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+output": {
|
||||
"addedWidgets": {},
|
||||
"arguments": {},
|
||||
"data": "<div class=\"ansiout\"></div>",
|
||||
"datasetInfos": [],
|
||||
"metadata": {},
|
||||
"removedWidgets": [],
|
||||
"type": "html"
|
||||
}
|
||||
},
|
||||
"output_type": "display_data"
|
||||
}
|
||||
],
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# An Anomaly Detector subscription key\n",
|
||||
"anomalyKey = find_secret(\"anomaly-api-key\")\n",
|
||||
"anomalyKey = find_secret(\"anomaly-api-key\") # use your own anomaly api key\n",
|
||||
"# Your storage account name\n",
|
||||
"storageName = \"anomalydetectiontest\"\n",
|
||||
"storageName = \"anomalydetectiontest\" # use your own storage account name\n",
|
||||
"# A connection string to your blob storage account\n",
|
||||
"storageKey = find_secret(\"madtest-storage-key\")\n",
|
||||
"storageKey = find_secret(\"madtest-storage-key\") # use your own storage key\n",
|
||||
"# A place to save intermediate MVAD results\n",
|
||||
"intermediateSaveDir = (\n",
|
||||
" \"wasbs://madtest@anomalydetectiontest.blob.core.windows.net/intermediateData\"\n",
|
||||
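Note on the `intermediateSaveDir` value in this hunk: the `wasbs://` path combines the container name, the storage account name, and a directory. A minimal sketch of building it from those parts follows; the helper name `wasbs_path` is an assumption for illustration and is not part of the notebook.

```python
def wasbs_path(container: str, storage_account: str, directory: str) -> str:
    """Build a wasbs:// URI of the form used for intermediateSaveDir."""
    return (
        f"wasbs://{container}@{storage_account}.blob.core.windows.net/"
        f"{directory.lstrip('/')}"
    )


if __name__ == "__main__":
    # Values taken from the cell above; replace them with your own resources.
    print(wasbs_path("madtest", "anomalydetectiontest", "intermediateData"))
```

Deriving the path from `storageName` and the container name keeps the two in sync when you switch to your own storage account.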
|
@ -143,10 +80,11 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"First we will connect to our storage account so that anomaly detector can save intermediate results there:"
|
||||
"First we connect to our storage account so that anomaly detector can save intermediate results there:"
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -177,49 +115,8 @@
|
|||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
"inputWidgets": {},
|
||||
"nuid": "201891b5-7ec3-4350-bdfa-306a265d2b44",
|
||||
"showTitle": false,
|
||||
"title": ""
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<style scoped>\n",
|
||||
" .ansiout {\n",
|
||||
" display: block;\n",
|
||||
" unicode-bidi: embed;\n",
|
||||
" white-space: pre-wrap;\n",
|
||||
" word-wrap: break-word;\n",
|
||||
" word-break: break-all;\n",
|
||||
" font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n",
|
||||
" font-size: 13px;\n",
|
||||
" color: #555;\n",
|
||||
" margin-left: 4px;\n",
|
||||
" line-height: 19px;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<div class=\"ansiout\"></div>"
|
||||
]
|
||||
},
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+output": {
|
||||
"addedWidgets": {},
|
||||
"arguments": {},
|
||||
"data": "<div class=\"ansiout\"></div>",
|
||||
"datasetInfos": [],
|
||||
"metadata": {},
|
||||
"removedWidgets": [],
|
||||
"type": "html"
|
||||
}
|
||||
},
|
||||
"output_type": "display_data"
|
||||
}
|
||||
],
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import numpy as np\n",
|
||||
"import pandas as pd\n",
|
||||
|
@ -235,6 +132,7 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
|
@ -271,6 +169,7 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
|
@ -281,55 +180,14 @@
|
|||
}
|
||||
},
|
||||
"source": [
|
||||
"We can now create an `estimator` object, which will be used to train our model. In the cell below, we specify the start and end times for the training data. We also specify the input columns to use, and the name of the column that contains the timestamps. Finally, we specify the number of data points to use in the anomaly detection sliding window, and we set the connection string to the Azure Blob Storage Account. "
|
||||
"We can now create an `estimator` object, which is used to train our model. We specify the start and end times for the training data. We also specify the input columns to use, and the name of the column that contains the timestamps. Finally, we specify the number of data points to use in the anomaly detection sliding window, and we set the connection string to the Azure Blob Storage Account. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
"inputWidgets": {},
|
||||
"nuid": "38beb5f0-8a46-439e-886f-3ffd06066e8c",
|
||||
"showTitle": false,
|
||||
"title": ""
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<style scoped>\n",
|
||||
" .ansiout {\n",
|
||||
" display: block;\n",
|
||||
" unicode-bidi: embed;\n",
|
||||
" white-space: pre-wrap;\n",
|
||||
" word-wrap: break-word;\n",
|
||||
" word-break: break-all;\n",
|
||||
" font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n",
|
||||
" font-size: 13px;\n",
|
||||
" color: #555;\n",
|
||||
" margin-left: 4px;\n",
|
||||
" line-height: 19px;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<div class=\"ansiout\"></div>"
|
||||
]
|
||||
},
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+output": {
|
||||
"addedWidgets": {},
|
||||
"arguments": {},
|
||||
"data": "<div class=\"ansiout\"></div>",
|
||||
"datasetInfos": [],
|
||||
"metadata": {},
|
||||
"removedWidgets": [],
|
||||
"type": "html"
|
||||
}
|
||||
},
|
||||
"output_type": "display_data"
|
||||
}
|
||||
],
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"trainingStartTime = \"2020-06-01T12:00:00Z\"\n",
|
||||
"trainingEndTime = \"2020-07-02T17:55:00Z\"\n",
|
||||
|
@ -350,6 +208,7 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
|
@ -359,49 +218,8 @@
|
|||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
"inputWidgets": {},
|
||||
"nuid": "820249ea-8520-458e-9365-ad15e8d3583e",
|
||||
"showTitle": false,
|
||||
"title": ""
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<style scoped>\n",
|
||||
" .ansiout {\n",
|
||||
" display: block;\n",
|
||||
" unicode-bidi: embed;\n",
|
||||
" white-space: pre-wrap;\n",
|
||||
" word-wrap: break-word;\n",
|
||||
" word-break: break-all;\n",
|
||||
" font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n",
|
||||
" font-size: 13px;\n",
|
||||
" color: #555;\n",
|
||||
" margin-left: 4px;\n",
|
||||
" line-height: 19px;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<div class=\"ansiout\"></div>"
|
||||
]
|
||||
},
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+output": {
|
||||
"addedWidgets": {},
|
||||
"arguments": {},
|
||||
"data": "<div class=\"ansiout\"></div>",
|
||||
"datasetInfos": [],
|
||||
"metadata": {},
|
||||
"removedWidgets": [],
|
||||
"type": "html"
|
||||
}
|
||||
},
|
||||
"output_type": "display_data"
|
||||
}
|
||||
],
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"model = estimator.fit(df)"
|
||||
]
|
||||
|
@ -411,7 +229,7 @@
|
|||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Once the training is done, we can now use the model for inference. The code in the next cell specifies the start and end times for the data we would like to detect the anomalies in. It will then show the results."
|
||||
"Once the training is done, we can now use the model for inference. The code in the next cell specifies the start and end times for the data we would like to detect the anomalies in. "
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -448,13 +266,11 @@
|
|||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"When we called `.show(5)` in the previous cell, it showed us the first five rows in the dataframe. The results were all `null` because they were not inside the inference window.\n",
|
||||
"When we called `.show(5)` in the previous cell, it showed us the first five rows in the dataframe. The results were all `null` because they weren't inside the inference window.\n",
|
||||
"\n",
|
||||
"To show the results only for the inferred data, let's select the columns we need. We can then sort the rows in the dataframe in ascending order, and filter the result to only show the rows that are in the range of the inference window. In our case `inferenceEndTime` is the same as the last row in the dataframe, so we can ignore that. \n",
|
||||
"\n",
|
||||
"Finally, to be able to better plot the results, lets convert the Spark dataframe to a Pandas dataframe.\n",
|
||||
"\n",
|
||||
"This is what the next cell does:"
|
||||
"Finally, to be able to better plot the results, let's convert the Spark dataframe to a Pandas dataframe.\n"
|
||||
]
|
||||
},
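The filtering step described in the cell above can be illustrated outside Spark. The following is a minimal sketch in plain Python, not the notebook's own code: the helper name `rows_in_inference_window`, the sample rows, and the inference start value are assumptions made for illustration.

```python
from datetime import datetime

# Timestamp layout used by the sample data (ISO 8601 with a trailing "Z").
TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def rows_in_inference_window(rows, inference_start):
    """Keep rows at or after inference_start, sorted by ascending timestamp.

    Hypothetical helper: the end of the window is not checked, mirroring the
    case where the window ends at the last row of the data.
    """
    start = datetime.strptime(inference_start, TS_FORMAT)

    def parsed(row):
        return datetime.strptime(row["timestamp"], TS_FORMAT)

    kept = [row for row in rows if parsed(row) >= start]
    return sorted(kept, key=parsed)


# Assumed sample rows; the row before the window start has no result.
sample_rows = [
    {"timestamp": "2020-07-02T18:05:00Z", "isAnomaly": True},
    {"timestamp": "2020-07-02T17:55:00Z", "isAnomaly": None},
    {"timestamp": "2020-07-02T18:00:00Z", "isAnomaly": False},
]
window_rows = rows_in_inference_window(sample_rows, "2020-07-02T18:00:00Z")
```

Parsing the timestamps rather than comparing strings keeps the comparison correct even if the inputs are not all in the same textual layout.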
|
||||
{
|
||||
|
@ -468,207 +284,7 @@
|
|||
"title": ""
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<style scoped>\n",
|
||||
" .ansiout {\n",
|
||||
" display: block;\n",
|
||||
" unicode-bidi: embed;\n",
|
||||
" white-space: pre-wrap;\n",
|
||||
" word-wrap: break-word;\n",
|
||||
" word-break: break-all;\n",
|
||||
" font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n",
|
||||
" font-size: 13px;\n",
|
||||
" color: #555;\n",
|
||||
" margin-left: 4px;\n",
|
||||
" line-height: 19px;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<div class=\"ansiout\">/databricks/spark/python/pyspark/sql/pandas/conversion.py:92: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:\n",
|
||||
" Unable to convert the field contributors. If this column is not necessary, you may consider dropping it or converting to primitive type before the conversion.\n",
|
||||
"Direct cause: Unsupported type in conversion to Arrow: ArrayType(StructType(List(StructField(contributionScore,DoubleType,true),StructField(variable,StringType,true))),true)\n",
|
||||
"Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.\n",
|
||||
" warnings.warn(msg)\n",
|
||||
"Out[8]: </div>"
|
||||
]
|
||||
},
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+output": {
|
||||
"addedWidgets": {},
|
||||
"arguments": {},
|
||||
"data": "<div class=\"ansiout\">/databricks/spark/python/pyspark/sql/pandas/conversion.py:92: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:\n Unable to convert the field contributors. If this column is not necessary, you may consider dropping it or converting to primitive type before the conversion.\nDirect cause: Unsupported type in conversion to Arrow: ArrayType(StructType(List(StructField(contributionScore,DoubleType,true),StructField(variable,StringType,true))),true)\nAttempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.\n warnings.warn(msg)\nOut[8]: </div>",
|
||||
"datasetInfos": [],
|
||||
"metadata": {},
|
||||
"removedWidgets": [],
|
||||
"type": "html"
|
||||
}
|
||||
},
|
||||
"output_type": "display_data"
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>timestamp</th>\n",
|
||||
" <th>sensor_1</th>\n",
|
||||
" <th>sensor_2</th>\n",
|
||||
" <th>sensor_3</th>\n",
|
||||
" <th>contributors</th>\n",
|
||||
" <th>isAnomaly</th>\n",
|
||||
" <th>severity</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>2020-07-02T18:00:00Z</td>\n",
|
||||
" <td>1.069680</td>\n",
|
||||
" <td>0.393173</td>\n",
|
||||
" <td>3.129125</td>\n",
|
||||
" <td>None</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>0.00000</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>2020-07-02T18:05:00Z</td>\n",
|
||||
" <td>0.932784</td>\n",
|
||||
" <td>0.214959</td>\n",
|
||||
" <td>3.077339</td>\n",
|
||||
" <td>[(0.5516611337661743, series_1), (0.3133429884...</td>\n",
|
||||
" <td>True</td>\n",
|
||||
" <td>0.06478</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>2020-07-02T18:10:00Z</td>\n",
|
||||
" <td>1.012214</td>\n",
|
||||
" <td>0.466037</td>\n",
|
||||
" <td>2.909561</td>\n",
|
||||
" <td>None</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>0.00000</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>2020-07-02T18:15:00Z</td>\n",
|
||||
" <td>1.122182</td>\n",
|
||||
" <td>0.398438</td>\n",
|
||||
" <td>3.029489</td>\n",
|
||||
" <td>None</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>0.00000</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>4</th>\n",
|
||||
" <td>2020-07-02T18:20:00Z</td>\n",
|
||||
" <td>1.091310</td>\n",
|
||||
" <td>0.282137</td>\n",
|
||||
" <td>2.948016</td>\n",
|
||||
" <td>None</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>0.00000</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>...</th>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>995</th>\n",
|
||||
" <td>2020-07-06T04:55:00Z</td>\n",
|
||||
" <td>-0.443438</td>\n",
|
||||
" <td>0.768980</td>\n",
|
||||
" <td>-0.710800</td>\n",
|
||||
" <td>None</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>0.00000</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>996</th>\n",
|
||||
" <td>2020-07-06T05:00:00Z</td>\n",
|
||||
" <td>-0.529400</td>\n",
|
||||
" <td>0.822140</td>\n",
|
||||
" <td>-0.944681</td>\n",
|
||||
" <td>None</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>0.00000</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>997</th>\n",
|
||||
" <td>2020-07-06T05:05:00Z</td>\n",
|
||||
" <td>-0.377911</td>\n",
|
||||
" <td>0.738591</td>\n",
|
||||
" <td>-0.871468</td>\n",
|
||||
" <td>None</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>0.00000</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>998</th>\n",
|
||||
" <td>2020-07-06T05:10:00Z</td>\n",
|
||||
" <td>-0.501993</td>\n",
|
||||
" <td>0.727775</td>\n",
|
||||
" <td>-0.786263</td>\n",
|
||||
" <td>None</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>0.00000</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>999</th>\n",
|
||||
" <td>2020-07-06T05:15:00Z</td>\n",
|
||||
" <td>-0.404138</td>\n",
|
||||
" <td>0.806980</td>\n",
|
||||
" <td>-0.883521</td>\n",
|
||||
" <td>None</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>0.00000</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"<p>1000 rows × 7 columns</p>\n",
|
||||
"</div>"
|
||||
]
|
||||
},
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+output": {
|
||||
"addedWidgets": {},
|
||||
"arguments": {},
|
||||
"data": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>timestamp</th>\n <th>sensor_1</th>\n <th>sensor_2</th>\n <th>sensor_3</th>\n <th>contributors</th>\n <th>isAnomaly</th>\n <th>severity</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>2020-07-02T18:00:00Z</td>\n <td>1.069680</td>\n <td>0.393173</td>\n <td>3.129125</td>\n <td>None</td>\n <td>False</td>\n <td>0.00000</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2020-07-02T18:05:00Z</td>\n <td>0.932784</td>\n <td>0.214959</td>\n <td>3.077339</td>\n <td>[(0.5516611337661743, series_1), (0.3133429884...</td>\n <td>True</td>\n <td>0.06478</td>\n </tr>\n <tr>\n <th>2</th>\n <td>2020-07-02T18:10:00Z</td>\n <td>1.012214</td>\n <td>0.466037</td>\n <td>2.909561</td>\n <td>None</td>\n <td>False</td>\n <td>0.00000</td>\n </tr>\n <tr>\n <th>3</th>\n <td>2020-07-02T18:15:00Z</td>\n <td>1.122182</td>\n <td>0.398438</td>\n <td>3.029489</td>\n <td>None</td>\n <td>False</td>\n <td>0.00000</td>\n </tr>\n <tr>\n <th>4</th>\n <td>2020-07-02T18:20:00Z</td>\n <td>1.091310</td>\n <td>0.282137</td>\n <td>2.948016</td>\n <td>None</td>\n <td>False</td>\n <td>0.00000</td>\n </tr>\n <tr>\n <th>...</th>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n </tr>\n <tr>\n <th>995</th>\n <td>2020-07-06T04:55:00Z</td>\n <td>-0.443438</td>\n <td>0.768980</td>\n <td>-0.710800</td>\n <td>None</td>\n <td>False</td>\n <td>0.00000</td>\n </tr>\n <tr>\n <th>996</th>\n <td>2020-07-06T05:00:00Z</td>\n <td>-0.529400</td>\n <td>0.822140</td>\n <td>-0.944681</td>\n <td>None</td>\n <td>False</td>\n <td>0.00000</td>\n </tr>\n <tr>\n <th>997</th>\n <td>2020-07-06T05:05:00Z</td>\n <td>-0.377911</td>\n <td>0.738591</td>\n 
<td>-0.871468</td>\n <td>None</td>\n <td>False</td>\n <td>0.00000</td>\n </tr>\n <tr>\n <th>998</th>\n <td>2020-07-06T05:10:00Z</td>\n <td>-0.501993</td>\n <td>0.727775</td>\n <td>-0.786263</td>\n <td>None</td>\n <td>False</td>\n <td>0.00000</td>\n </tr>\n <tr>\n <th>999</th>\n <td>2020-07-06T05:15:00Z</td>\n <td>-0.404138</td>\n <td>0.806980</td>\n <td>-0.883521</td>\n <td>None</td>\n <td>False</td>\n <td>0.00000</td>\n </tr>\n </tbody>\n</table>\n<p>1000 rows × 7 columns</p>\n</div>",
|
||||
"datasetInfos": [],
|
||||
"metadata": {},
|
||||
"removedWidgets": [],
|
||||
"textData": null,
|
||||
"type": "htmlSandbox"
|
||||
}
|
||||
},
|
||||
"output_type": "display_data"
|
||||
}
|
||||
],
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"rdf = (\n",
|
||||
" result.select(\n",
|
||||
|
@ -687,10 +303,11 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Let's now format the `interpretation` column that stores the contribution score from each sensor to the detected anomalies. The next cell formats this data, and splits the contribution score of each sensor into its own column."
|
||||
"Let's now format the `contributors` column that stores the contribution score from each sensor to the detected anomalies. The next cell formats this data, and splits the contribution score of each sensor into its own column."
|
||||
]
|
||||
},
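As a rough illustration of splitting the contribution scores into one column per sensor, here is a minimal sketch in plain Python. It is not the notebook's `parse` function: the helper name `split_contributors`, the sample rows, and the rounded scores are assumptions, and the `(score, variable)` pair layout is inferred from the output shown in this diff.

```python
def split_contributors(contributors, variables):
    """Return one contribution score per variable, 0.0 when none is reported.

    `contributors` is assumed to be a list of (score, variable) pairs,
    or None for rows that were not flagged as anomalies.
    """
    scores = {name: 0.0 for name in variables}
    for score, name in contributors or []:
        scores[name] = float(score)
    return scores


# Assumed sample rows shaped like the dataframe preview in this diff.
variables = ["series_0", "series_1", "series_2"]
rows = [
    {"timestamp": "2020-07-02T18:00:00Z", "contributors": None},
    {
        "timestamp": "2020-07-02T18:05:00Z",
        "contributors": [(0.55, "series_1"), (0.31, "series_0"), (0.14, "series_2")],
    },
]

# Replace the nested column with one flat column per variable.
flat_rows = [
    {"timestamp": row["timestamp"], **split_contributors(row["contributors"], variables)}
    for row in rows
]
```

Filling missing scores with 0.0 matches the preview, where rows that are not anomalies show zero contribution for every series.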
|
||||
{
|
||||
|
@ -704,226 +321,7 @@
|
|||
"title": ""
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<style scoped>\n",
|
||||
" .ansiout {\n",
|
||||
" display: block;\n",
|
||||
" unicode-bidi: embed;\n",
|
||||
" white-space: pre-wrap;\n",
|
||||
" word-wrap: break-word;\n",
|
||||
" word-break: break-all;\n",
|
||||
" font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n",
|
||||
" font-size: 13px;\n",
|
||||
" color: #555;\n",
|
||||
" margin-left: 4px;\n",
|
||||
" line-height: 19px;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<div class=\"ansiout\">Out[9]: </div>"
|
||||
]
|
||||
},
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+output": {
|
||||
"addedWidgets": {},
|
||||
"arguments": {},
|
||||
"data": "<div class=\"ansiout\">Out[9]: </div>",
|
||||
"datasetInfos": [],
|
||||
"metadata": {},
|
||||
"removedWidgets": [],
|
||||
"type": "html"
|
||||
}
|
||||
},
|
||||
"output_type": "display_data"
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>timestamp</th>\n",
|
||||
" <th>sensor_1</th>\n",
|
||||
" <th>sensor_2</th>\n",
|
||||
" <th>sensor_3</th>\n",
|
||||
" <th>isAnomaly</th>\n",
|
||||
" <th>severity</th>\n",
|
||||
" <th>series_0</th>\n",
|
||||
" <th>series_1</th>\n",
|
||||
" <th>series_2</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>2020-07-02T18:00:00Z</td>\n",
|
||||
" <td>1.069680</td>\n",
|
||||
" <td>0.393173</td>\n",
|
||||
" <td>3.129125</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>0.00000</td>\n",
|
||||
" <td>0.000000</td>\n",
|
||||
" <td>0.000000</td>\n",
|
||||
" <td>0.000000</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>2020-07-02T18:05:00Z</td>\n",
|
||||
" <td>0.932784</td>\n",
|
||||
" <td>0.214959</td>\n",
|
||||
" <td>3.077339</td>\n",
|
||||
" <td>True</td>\n",
|
||||
" <td>0.06478</td>\n",
|
||||
" <td>0.313343</td>\n",
|
||||
" <td>0.551661</td>\n",
|
||||
" <td>0.134996</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>2020-07-02T18:10:00Z</td>\n",
|
||||
" <td>1.012214</td>\n",
|
||||
" <td>0.466037</td>\n",
|
||||
" <td>2.909561</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>0.00000</td>\n",
|
||||
" <td>0.000000</td>\n",
|
||||
" <td>0.000000</td>\n",
|
||||
" <td>0.000000</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>2020-07-02T18:15:00Z</td>\n",
|
||||
" <td>1.122182</td>\n",
|
||||
" <td>0.398438</td>\n",
|
||||
" <td>3.029489</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>0.00000</td>\n",
|
||||
" <td>0.000000</td>\n",
|
||||
" <td>0.000000</td>\n",
|
||||
" <td>0.000000</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>4</th>\n",
|
||||
" <td>2020-07-02T18:20:00Z</td>\n",
|
||||
" <td>1.091310</td>\n",
|
||||
" <td>0.282137</td>\n",
|
||||
" <td>2.948016</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>0.00000</td>\n",
|
||||
" <td>0.000000</td>\n",
|
||||
" <td>0.000000</td>\n",
|
||||
" <td>0.000000</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>...</th>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>995</th>\n",
|
||||
" <td>2020-07-06T04:55:00Z</td>\n",
|
||||
" <td>-0.443438</td>\n",
|
||||
" <td>0.768980</td>\n",
|
||||
" <td>-0.710800</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>0.00000</td>\n",
|
||||
" <td>0.000000</td>\n",
|
||||
" <td>0.000000</td>\n",
|
||||
" <td>0.000000</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>996</th>\n",
|
||||
" <td>2020-07-06T05:00:00Z</td>\n",
|
||||
" <td>-0.529400</td>\n",
|
||||
" <td>0.822140</td>\n",
|
||||
" <td>-0.944681</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>0.00000</td>\n",
|
||||
" <td>0.000000</td>\n",
|
||||
" <td>0.000000</td>\n",
|
||||
" <td>0.000000</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>997</th>\n",
|
||||
" <td>2020-07-06T05:05:00Z</td>\n",
|
||||
" <td>-0.377911</td>\n",
|
||||
" <td>0.738591</td>\n",
|
||||
" <td>-0.871468</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>0.00000</td>\n",
|
||||
" <td>0.000000</td>\n",
|
||||
" <td>0.000000</td>\n",
|
||||
" <td>0.000000</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>998</th>\n",
|
||||
" <td>2020-07-06T05:10:00Z</td>\n",
|
||||
" <td>-0.501993</td>\n",
|
||||
" <td>0.727775</td>\n",
|
||||
" <td>-0.786263</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>0.00000</td>\n",
|
||||
" <td>0.000000</td>\n",
|
||||
" <td>0.000000</td>\n",
|
||||
" <td>0.000000</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>999</th>\n",
|
||||
" <td>2020-07-06T05:15:00Z</td>\n",
|
||||
" <td>-0.404138</td>\n",
|
||||
" <td>0.806980</td>\n",
|
||||
" <td>-0.883521</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>0.00000</td>\n",
|
||||
" <td>0.000000</td>\n",
|
||||
" <td>0.000000</td>\n",
|
||||
" <td>0.000000</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"<p>1000 rows × 9 columns</p>\n",
|
||||
"</div>"
|
||||
]
|
||||
},
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+output": {
|
||||
"addedWidgets": {},
|
||||
"arguments": {},
|
||||
"data": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>timestamp</th>\n <th>sensor_1</th>\n <th>sensor_2</th>\n <th>sensor_3</th>\n <th>isAnomaly</th>\n <th>severity</th>\n <th>series_0</th>\n <th>series_1</th>\n <th>series_2</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>2020-07-02T18:00:00Z</td>\n <td>1.069680</td>\n <td>0.393173</td>\n <td>3.129125</td>\n <td>False</td>\n <td>0.00000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2020-07-02T18:05:00Z</td>\n <td>0.932784</td>\n <td>0.214959</td>\n <td>3.077339</td>\n <td>True</td>\n <td>0.06478</td>\n <td>0.313343</td>\n <td>0.551661</td>\n <td>0.134996</td>\n </tr>\n <tr>\n <th>2</th>\n <td>2020-07-02T18:10:00Z</td>\n <td>1.012214</td>\n <td>0.466037</td>\n <td>2.909561</td>\n <td>False</td>\n <td>0.00000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n </tr>\n <tr>\n <th>3</th>\n <td>2020-07-02T18:15:00Z</td>\n <td>1.122182</td>\n <td>0.398438</td>\n <td>3.029489</td>\n <td>False</td>\n <td>0.00000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n </tr>\n <tr>\n <th>4</th>\n <td>2020-07-02T18:20:00Z</td>\n <td>1.091310</td>\n <td>0.282137</td>\n <td>2.948016</td>\n <td>False</td>\n <td>0.00000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n </tr>\n <tr>\n <th>...</th>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n </tr>\n <tr>\n <th>995</th>\n <td>2020-07-06T04:55:00Z</td>\n <td>-0.443438</td>\n <td>0.768980</td>\n <td>-0.710800</td>\n <td>False</td>\n <td>0.00000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n </tr>\n 
<tr>\n <th>996</th>\n <td>2020-07-06T05:00:00Z</td>\n <td>-0.529400</td>\n <td>0.822140</td>\n <td>-0.944681</td>\n <td>False</td>\n <td>0.00000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n </tr>\n <tr>\n <th>997</th>\n <td>2020-07-06T05:05:00Z</td>\n <td>-0.377911</td>\n <td>0.738591</td>\n <td>-0.871468</td>\n <td>False</td>\n <td>0.00000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n </tr>\n <tr>\n <th>998</th>\n <td>2020-07-06T05:10:00Z</td>\n <td>-0.501993</td>\n <td>0.727775</td>\n <td>-0.786263</td>\n <td>False</td>\n <td>0.00000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n </tr>\n <tr>\n <th>999</th>\n <td>2020-07-06T05:15:00Z</td>\n <td>-0.404138</td>\n <td>0.806980</td>\n <td>-0.883521</td>\n <td>False</td>\n <td>0.00000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n </tr>\n </tbody>\n</table>\n<p>1000 rows × 9 columns</p>\n</div>",
|
||||
"datasetInfos": [],
|
||||
"metadata": {},
|
||||
"removedWidgets": [],
|
||||
"textData": null,
|
||||
"type": "htmlSandbox"
|
||||
}
|
||||
},
|
||||
"output_type": "display_data"
|
||||
}
|
||||
],
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def parse(x):\n",
|
||||
" if len(x) > 0:\n",
|
||||
|
@ -1076,13 +474,15 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<img width=\"1300\" src=\"https://mmlspark.blob.core.windows.net/graphics/mvad_plot.png\"/>"
|
||||
"<img width=\"1300\" src=\"https://mmlspark.blob.core.windows.net/graphics/multivariate-anomaly-detection-plot.png\"/>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
|
@ -1093,11 +493,11 @@
|
|||
}
|
||||
},
|
||||
"source": [
|
||||
"The plots above show the raw data from the sensors (inside the inference window) in orange, green, and blue. The red vertical lines in the first figure show the detected anomalies that have a severity greater than or equal to `minSeverity`. \n",
|
||||
"The plots show the raw data from the sensors (inside the inference window) in orange, green, and blue. The red vertical lines in the first figure show the detected anomalies that have a severity greater than or equal to `minSeverity`. \n",
|
||||
"\n",
|
||||
"The second plot shows the severity score of all the detected anomalies, with the `minSeverity` threshold shown in the dotted red line.\n",
|
||||
"\n",
|
||||
"Finally, the last plot shows the contribution of the data from each sensor to the detected anomalies. This helps us diagnose and understand the most likely cause of each anomaly."
|
||||
"Finally, the last plot shows the contribution of the data from each sensor to the detected anomalies. It helps us diagnose and understand the most likely cause of each anomaly."
|
||||
]
|
||||
}
|
||||
],
|
||||
|
@ -1113,9 +513,9 @@
|
|||
"widgets": {}
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.8.5 ('base')",
|
||||
"display_name": "dev",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
"name": "dev"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
|
@ -1127,12 +527,7 @@
|
|||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.8.5"
|
||||
},
|
||||
"vscode": {
|
||||
"interpreter": {
|
||||
"hash": "601a75c4c141f401603984f1538447337114e368c54c4d5b589ea94315afdca2"
|
||||
}
|
||||
"version": "3.7.12"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
|
|
|
@ -16,7 +16,7 @@
|
|||
}
|
||||
},
|
||||
"source": [
|
||||
"# Azure OpenAI for Big Data\n",
|
||||
"# Azure OpenAI for big data\n",
|
||||
"\n",
|
||||
"The Azure OpenAI service can be used to solve a large number of natural language tasks through prompting the completion API. To make it easier to scale your prompting workflows from a few examples to large datasets of examples, we have integrated the Azure OpenAI service with the distributed machine learning library [SynapseML](https://www.microsoft.com/en-us/research/blog/synapseml-a-simple-multilingual-and-massively-parallel-machine-learning-library/). This integration makes it easy to use the [Apache Spark](https://spark.apache.org/) distributed computing framework to process millions of prompts with the OpenAI service. This tutorial shows how to apply large language models at a distributed scale using Azure OpenAI and Azure Synapse Analytics. "
|
||||
]
|
||||
|
@ -469,7 +469,7 @@
|
|||
"The example makes several requests to the service, one for each prompt. To complete multiple prompts in a single request, use batch mode. First, in the OpenAICompletion object, instead of setting the Prompt column to \"Prompt\", specify \"batchPrompt\" for the BatchPrompt column.\n",
|
||||
"To do so, create a dataframe with a list of prompts per row.\n",
|
||||
"\n",
|
||||
"**Note** that as of this writing there is currently a limit of 20 prompts in a single request, and a hard limit of 2048 \"tokens\", or approximately 1500 words."
|
||||
"As of this writing, there is a limit of 20 prompts in a single request, and a hard limit of 2048 \"tokens\", or approximately 1500 words."
|
||||
]
|
||||
},
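One way to respect the 20-prompt limit when building the list-of-prompts rows is to chunk the prompts before creating the dataframe. This is a minimal sketch in plain Python; the helper name `batch_prompts` and the sample prompts are assumptions for illustration, and the token limit is not checked here.

```python
def batch_prompts(prompts, max_batch_size=20):
    """Split prompts into consecutive lists of at most max_batch_size items."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return [
        prompts[start : start + max_batch_size]
        for start in range(0, len(prompts), max_batch_size)
    ]


# Assumed sample data: 45 short prompts become batches of 20, 20, and 5.
prompts = [f"Translate sentence {i} to French:" for i in range(45)]
batches = batch_prompts(prompts)
batch_sizes = [len(batch) for batch in batches]
```

Each inner list could then become one row of the batch prompt column, so that each request carries at most 20 prompts.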
|
||||
{
|
||||
|
|
|
@ -1,14 +1,29 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Cognitive Services"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"tags": [
|
||||
"hide-synapse-internal"
|
||||
]
|
||||
},
|
||||
"source": [
|
||||
"<image width=\"200\" alt-text=\"icon\" src=\"https://mmlspark.blob.core.windows.net/graphics/Readme/cog_services_on_spark_2.svg\" />"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Cognitive Services\n",
|
||||
"\n",
|
||||
"<image width=\"200\" alt-text=\"icon\" src=\"https://mmlspark.blob.core.windows.net/graphics/Readme/cog_services_on_spark_2.svg\" />\n",
|
||||
"\n",
|
||||
"[Azure Cognitive Services](https://azure.microsoft.com/services/cognitive-services/) are a suite of APIs, SDKs, and services available to help developers build intelligent applications without having direct AI or data science skills or knowledge by enabling developers to easily add cognitive features into their applications. The goal of Azure Cognitive Services is to help developers create applications that can see, hear, speak, understand, and even begin to reason. The catalog of services within Azure Cognitive Services can be categorized into five main pillars - Vision, Speech, Language, Web Search, and Decision.\n",
|
||||
"\n",
|
||||
"## Usage\n",
|
||||
|
@ -78,12 +93,17 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"metadata": {
|
||||
"tags": [
|
||||
"hide-synapse-internal"
|
||||
]
|
||||
},
|
||||
"source": [
|
||||
"## Prerequisites\n",
|
||||
"\n",
|
||||
"1. Follow the steps in [Getting started](https://docs.microsoft.com/en-us/azure/cognitive-services/big-data/getting-started) to set up your Azure Databricks and Cognitive Services environment. This tutorial shows you how to install SynapseML and how to create your Spark cluster in Databricks.\n",
|
||||
"1. Follow the steps in [Getting started](https://docs.microsoft.com/azure/cognitive-services/big-data/getting-started) to set up your Azure Databricks and Cognitive Services environment. This tutorial shows you how to install SynapseML and how to create your Spark cluster in Databricks.\n",
|
||||
"1. After you create a new notebook in Azure Databricks, copy the **Shared code** below and paste it into a new cell in your notebook.\n",
|
||||
"1. Choose a service sample below, then copy and paste it into a second new cell in your notebook.\n",
|
||||
"1. Replace any of the service subscription key placeholders with your own key.\n",
|
||||
|
@ -139,31 +159,42 @@
|
|||
"from synapse.ml.cognitive import *\n",
|
||||
"\n",
|
||||
"# A general Cognitive Services key for Text Analytics, Computer Vision and Form Recognizer (or use separate keys that belong to each service)\n",
|
||||
"service_key = find_secret(\"cognitive-api-key\")\n",
|
||||
"service_key = find_secret(\n",
|
||||
" \"cognitive-api-key\"\n",
|
||||
") # Replace the call to find_secret with your key as a python string. e.g. service_key=\"27snaiw...\"\n",
|
||||
"service_loc = \"eastus\"\n",
|
||||
"\n",
|
||||
"# A Bing Search v7 subscription key\n",
|
||||
"bing_search_key = find_secret(\"bing-search-key\")\n",
|
||||
"bing_search_key = find_secret(\n",
|
||||
" \"bing-search-key\"\n",
|
||||
") # Replace the call to find_secret with your key as a python string.\n",
|
||||
"\n",
|
||||
"# An Anomaly Dectector subscription key\n",
|
||||
"anomaly_key = find_secret(\"anomaly-api-key\")\n",
|
||||
"# An Anomaly Detector subscription key\n",
|
||||
"anomaly_key = find_secret(\n",
|
||||
" \"anomaly-api-key\"\n",
|
||||
") # Replace the call to find_secret with your key as a python string.\n",
|
||||
"anomaly_loc = \"westus2\"\n",
|
||||
"\n",
|
||||
"# A Translator subscription key\n",
|
||||
"translator_key = find_secret(\"translator-key\")\n",
|
||||
"translator_key = find_secret(\n",
|
||||
" \"translator-key\"\n",
|
||||
") # Replace the call to find_secret with your key as a python string.\n",
|
||||
"translator_loc = \"eastus\"\n",
|
||||
"\n",
|
||||
"# An Azure search key\n",
|
||||
"search_key = find_secret(\"azure-search-key\")"
|
||||
"search_key = find_secret(\n",
|
||||
" \"azure-search-key\"\n",
|
||||
") # Replace the call to find_secret with your key as a python string."
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Text Analytics sample\n",
|
||||
"\n",
|
||||
"The [Text Analytics](https://azure.microsoft.com/en-us/services/cognitive-services/text-analytics/) service provides several algorithms for extracting intelligent insights from text. For example, we can find the sentiment of given input text. The service will return a score between 0.0 and 1.0 where low scores indicate negative sentiment and high score indicates positive sentiment. This sample uses three simple sentences and returns the sentiment for each."
|
||||
"The [Text Analytics](https://azure.microsoft.com/services/cognitive-services/text-analytics/) service provides several algorithms for extracting intelligent insights from text. For example, we can find the sentiment of given input text. The service will return a score between 0.0 and 1.0 where low scores indicate negative sentiment and high score indicates positive sentiment. This sample uses three simple sentences and returns the sentiment for each."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -208,7 +239,7 @@
|
|||
"source": [
|
||||
"## Text Analytics for Health Sample\n",
|
||||
"\n",
|
||||
"The [Text Analytics for Health Service](https://docs.microsoft.com/en-us/azure/cognitive-services/language-service/text-analytics-for-health/overview?tabs=ner) extracts and labels relevant medical information from unstructured texts such as doctor's notes, discharge summaries, clinical documents, and electronic health records."
|
||||
"The [Text Analytics for Health Service](https://docs.microsoft.com/azure/cognitive-services/language-service/text-analytics-for-health/overview?tabs=ner) extracts and labels relevant medical information from unstructured texts such as doctor's notes, discharge summaries, clinical documents, and electronic health records."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -238,11 +269,12 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Translator sample\n",
|
||||
"[Translator](https://azure.microsoft.com/en-us/services/cognitive-services/translator/) is a cloud-based machine translation service and is part of the Azure Cognitive Services family of cognitive APIs used to build intelligent apps. Translator is easy to integrate in your applications, websites, tools, and solutions. It allows you to add multi-language user experiences in 90 languages and dialects and can be used for text translation with any operating system. In this sample, we do a simple text translation by providing the sentences you want to translate and target languages you want to translate to."
|
||||
"[Translator](https://azure.microsoft.com/services/cognitive-services/translator/) is a cloud-based machine translation service and is part of the Azure Cognitive Services family of cognitive APIs used to build intelligent apps. Translator is easy to integrate in your applications, websites, tools, and solutions. It allows you to add multi-language user experiences in 90 languages and dialects and can be used for text translation with any operating system. In this sample, we do a simple text translation by providing the sentences you want to translate and target languages you want to translate to."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -281,11 +313,12 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Form Recognizer sample\n",
|
||||
"[Form Recognizer](https://azure.microsoft.com/en-us/services/form-recognizer/) is a part of Azure Applied AI Services that lets you build automated data processing software using machine learning technology. Identify and extract text, key/value pairs, selection marks, tables, and structure from your documents—the service outputs structured data that includes the relationships in the original file, bounding boxes, confidence and more. In this sample, we analyze a business card image and extract its information into structured data."
|
||||
"[Form Recognizer](https://azure.microsoft.com/services/form-recognizer/) is a part of Azure Applied AI Services that lets you build automated data processing software using machine learning technology. Identify and extract text, key/value pairs, selection marks, tables, and structure from your documents. The service outputs structured data that includes the relationships in the original file, bounding boxes, confidence and more. In this sample, we analyze a business card image and extract its information into structured data."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -328,12 +361,13 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Computer Vision sample\n",
|
||||
"\n",
|
||||
"[Computer Vision](https://azure.microsoft.com/en-us/services/cognitive-services/computer-vision/) analyzes images to identify structure such as faces, objects, and natural-language descriptions. In this sample, we tag a list of images. Tags are one-word descriptions of things in the image like recognizable objects, people, scenery, and actions."
|
||||
"[Computer Vision](https://azure.microsoft.com/services/cognitive-services/computer-vision/) analyzes images to identify structure such as faces, objects, and natural-language descriptions. In this sample, we tag a list of images. Tags are one-word descriptions of things in the image like recognizable objects, people, scenery, and actions."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -355,7 +389,7 @@
|
|||
" ],\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"# Run the Computer Vision service. Analyze Image extracts infortmation from/about the images.\n",
|
||||
"# Run the Computer Vision service. Analyze Image extracts information from/about the images.\n",
|
||||
"analysis = (\n",
|
||||
" AnalyzeImage()\n",
|
||||
" .setLocation(service_loc)\n",
|
||||
|
@ -373,12 +407,13 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Bing Image Search sample\n",
|
||||
"\n",
|
||||
"[Bing Image Search](https://azure.microsoft.com/en-us/services/cognitive-services/bing-image-search-api/) searches the web to retrieve images related to a user's natural language query. In this sample, we use a text query that looks for images with quotes. It returns a list of image URLs that contain photos related to our query."
|
||||
"[Bing Image Search](https://azure.microsoft.com/services/cognitive-services/bing-image-search-api/) searches the web to retrieve images related to a user's natural language query. In this sample, we use a text query that looks for images with quotes. It returns a list of image URLs that contain photos related to our query."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -418,11 +453,12 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Speech-to-Text sample\n",
|
||||
"The [Speech-to-text](https://azure.microsoft.com/en-us/services/cognitive-services/speech-services/) service converts streams or files of spoken audio to text. In this sample, we transcribe one audio file."
|
||||
"The [Speech-to-text](https://azure.microsoft.com/services/cognitive-services/speech-services/) service converts streams or files of spoken audio to text. In this sample, we transcribe one audio file."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -452,11 +488,12 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Text-to-Speech sample\n",
|
||||
"[Text to speech](https://azure.microsoft.com/en-us/services/cognitive-services/text-to-speech/#overview) is a service that allows one to build apps and services that speak naturally, choosing from more than 270 neural voices across 119 languages and variants."
|
||||
"[Text to speech](https://azure.microsoft.com/services/cognitive-services/text-to-speech/#overview) is a service that allows one to build apps and services that speak naturally, choosing from more than 270 neural voices across 119 languages and variants."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -498,12 +535,13 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Anomaly Detector sample\n",
|
||||
"\n",
|
||||
"[Anomaly Detector](https://azure.microsoft.com/en-us/services/cognitive-services/anomaly-detector/) is great for detecting irregularities in your time series data. In this sample, we use the service to find anomalies in the entire time series."
|
||||
"[Anomaly Detector](https://azure.microsoft.com/services/cognitive-services/anomaly-detector/) is great for detecting irregularities in your time series data. In this sample, we use the service to find anomalies in the entire time series."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -576,7 +614,7 @@
|
|||
" )\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"# Create a dataframe with spcificies which countries we want data on\n",
|
||||
"# Create a dataframe with specifies which countries we want data on\n",
|
||||
"df = spark.createDataFrame([(\"br\",), (\"usa\",)], [\"country\"]).withColumn(\n",
|
||||
" \"request\", http_udf(world_bank_request)(col(\"country\"))\n",
|
||||
")\n",
|
||||
|
@ -603,7 +641,11 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"metadata": {
|
||||
"tags": [
|
||||
"hide-synapse-internal"
|
||||
]
|
||||
},
|
||||
"source": [
|
||||
"## Azure Cognitive search sample\n",
|
||||
"\n",
|
||||
|
@ -613,7 +655,11 @@
|
|||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"metadata": {
|
||||
"tags": [
|
||||
"hide-synapse-internal"
|
||||
]
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"search_service = \"mmlspark-azure-search\"\n",
|
||||
|
|
|
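The shared-code cell in the notebook diff above suggests replacing each `find_secret(...)` call with a key written as a Python string. A minimal sketch of one alternative, assuming the keys are exported as environment variables instead of being pasted into the notebook; the helper `resolve_key` and the variable name `COGNITIVE_API_KEY` are hypothetical and not part of SynapseML:

```python
import os


def resolve_key(env_var, fallback=None):
    """Return a service key from the environment, or fall back to a literal.

    Hypothetical helper for illustration only; not part of SynapseML.
    """
    value = os.environ.get(env_var, fallback)
    if not value:
        raise ValueError(f"No key found: set {env_var} or pass a fallback")
    return value


# Prefer the environment variable; fall back to a placeholder string.
service_key = resolve_key("COGNITIVE_API_KEY", fallback="<your-key-here>")
```

Keeping the key out of the notebook source makes it less likely to be committed by accident, at the cost of one extra setup step per environment.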
@ -1,42 +1,23 @@
|
|||
{
|
||||
"metadata": {
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.8.5"
|
||||
},
|
||||
"orig_nbformat": 2,
|
||||
"kernelspec": {
|
||||
"name": "python385jvsc74a57bd072be13fef265c65d19cf428fd1b09dd31615eed186d1dccdebb6e555960506ee",
|
||||
"display_name": "Python 3.8.5 64-bit (conda)"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2,
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# LightGBM"
|
||||
],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"[LightGBM](https://github.com/Microsoft/LightGBM) is an open-source,\n",
|
||||
"distributed, high-performance gradient boosting (GBDT, GBRT, GBM, or\n",
|
||||
"MART) framework. This framework specializes in creating high-quality and\n",
|
||||
"GPU enabled decision tree algorithms for ranking, classification, and\n",
|
||||
"many other machine learning tasks. LightGBM is part of Microsoft's\n",
|
||||
"[DMTK](http://github.com/microsoft/dmtk) project.\n",
|
||||
"[DMTK](https://github.com/microsoft/dmtk) project.\n",
|
||||
"\n",
|
||||
"### Advantages of LightGBM\n",
|
||||
"\n",
|
||||
|
@ -56,33 +37,33 @@
|
|||
"\n",
|
||||
"### LightGBM Usage:\n",
|
||||
"\n",
|
||||
"- LightGBMClassifier: used for building classification models. For example, to predict whether a company will bankrupt or not, we could build a binary classification model with LightGBMClassifier.\n",
|
||||
"- LightGBMClassifier: used for building classification models. For example, to predict whether a company enters bankruptcy or not, we could build a binary classification model with LightGBMClassifier.\n",
|
||||
"- LightGBMRegressor: used for building regression models. For example, to predict the house price, we could build a regression model with LightGBMRegressor.\n",
|
||||
"- LightGBMRanker: used for building ranking models. For example, to predict website searching result relevance, we could build a ranking model with LightGBMRanker."
|
||||
],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Bankruptcy Prediction with LightGBM Classifier\n",
|
||||
"\n",
|
||||
"<img src=\"https://mmlspark.blob.core.windows.net/graphics/Documentation/bankruptcy image.png\" width=\"800\" style=\"float: center;\"/>\n",
|
||||
"\n",
|
||||
"In this example, we use LightGBM to build a classification model in order to predict bankruptcy."
|
||||
],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Read dataset"
|
||||
],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from pyspark.sql import SparkSession\n",
|
||||
"\n",
|
||||
|
@ -92,13 +73,13 @@
|
|||
"from synapse.ml.core.platform import *\n",
|
||||
"\n",
|
||||
"from synapse.ml.core.platform import materializing_display as display"
|
||||
],
|
||||
"outputs": [],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"df = (\n",
|
||||
" spark.read.format(\"csv\")\n",
|
||||
|
@ -112,45 +93,45 @@
|
|||
"print(\"records read: \" + str(df.count()))\n",
|
||||
"print(\"Schema: \")\n",
|
||||
"df.printSchema()"
|
||||
],
|
||||
"outputs": [],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"display(df)"
|
||||
],
|
||||
"outputs": [],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Split the dataset into train and test"
|
||||
],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"train, test = df.randomSplit([0.85, 0.15], seed=1)"
|
||||
],
|
||||
"outputs": [],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Add featurizer to convert features to vector"
|
||||
],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from pyspark.ml.feature import VectorAssembler\n",
|
||||
"\n",
|
||||
|
@ -158,65 +139,74 @@
|
|||
"featurizer = VectorAssembler(inputCols=feature_cols, outputCol=\"features\")\n",
|
||||
"train_data = featurizer.transform(train)[\"Bankrupt?\", \"features\"]\n",
|
||||
"test_data = featurizer.transform(test)[\"Bankrupt?\", \"features\"]"
|
||||
],
|
||||
"outputs": [],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Check if the data is unbalanced"
|
||||
],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"display(train_data.groupBy(\"Bankrupt?\").count())"
|
||||
],
|
||||
"outputs": [],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Model Training"
|
||||
],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from synapse.ml.lightgbm import LightGBMClassifier\n",
|
||||
"\n",
|
||||
"model = LightGBMClassifier(\n",
|
||||
" objective=\"binary\", featuresCol=\"features\", labelCol=\"Bankrupt?\", isUnbalance=True\n",
|
||||
")"
|
||||
],
|
||||
"outputs": [],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"model = model.fit(train_data)"
|
||||
],
|
||||
"outputs": [],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"tags": [
|
||||
"hide-synapse-internal"
|
||||
]
|
||||
},
|
||||
"source": [
|
||||
"By calling \"saveNativeModel\", it allows you to extract the underlying lightGBM model for fast deployment after you train on Spark."
|
||||
],
|
||||
"metadata": {}
|
||||
"\"saveNativeModel\" allows you to extract the underlying lightGBM model for fast deployment after you train on Spark."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"tags": [
|
||||
"hide-synapse-internal"
|
||||
]
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from synapse.ml.lightgbm import LightGBMClassificationModel\n",
|
||||
"\n",
|
||||
|
@ -226,29 +216,29 @@
|
|||
" \"/models/lgbmclassifier.model\"\n",
|
||||
" )\n",
|
||||
"if running_on_synapse_internal():\n",
|
||||
" model.saveNativeModel(\"Files/models/lgbmclassifier.model\")\n",
|
||||
" model = LightGBMClassificationModel.loadNativeModelFromFile(\n",
|
||||
" \"Files/models/lgbmclassifier.model\"\n",
|
||||
" )\n",
|
||||
" model.saveNativeModel(\"Files/models/lgbmclassifier.model\")\n",
|
||||
" model = LightGBMClassificationModel.loadNativeModelFromFile(\n",
|
||||
" \"Files/models/lgbmclassifier.model\"\n",
|
||||
" )\n",
|
||||
"else:\n",
|
||||
" model.saveNativeModel(\"/tmp/lgbmclassifier.model\")\n",
|
||||
" model = LightGBMClassificationModel.loadNativeModelFromFile(\n",
|
||||
" \"/tmp/lgbmclassifier.model\"\n",
|
||||
" )"
|
||||
],
|
||||
"outputs": [],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Feature Importances Visualization"
|
||||
],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import pandas as pd\n",
|
||||
"import matplotlib.pyplot as plt\n",
|
||||
|
@ -273,30 +263,30 @@
|
|||
"plt.xlabel(\"importances\")\n",
|
||||
"plt.ylabel(\"features\")\n",
|
||||
"plt.show()"
|
||||
],
|
||||
"outputs": [],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Model Prediction"
|
||||
],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"predictions = model.transform(test_data)\n",
|
||||
"predictions.limit(10).toPandas()"
|
||||
],
|
||||
"outputs": [],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from synapse.ml.train import ComputeModelStatistics\n",
|
||||
"\n",
|
||||
|
@ -306,117 +296,116 @@
|
|||
" scoredLabelsCol=\"prediction\",\n",
|
||||
").transform(predictions)\n",
|
||||
"display(metrics)"
|
||||
],
|
||||
"outputs": [],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Quantile Regression for Drug Discovery with LightGBMRegressor\n",
|
||||
"\n",
|
||||
"<img src=\"https://mmlspark.blob.core.windows.net/graphics/Documentation/drug.png\" width=\"800\" style=\"float: center;\"/>\n",
|
||||
"\n",
|
||||
"In this example, we show how to use LightGBM to build a simple regression model."
|
||||
],
|
||||
"metadata": {}
|
||||
"In this example, we show how to use LightGBM to build a regression model."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Read dataset"
|
||||
],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"triazines = spark.read.format(\"libsvm\").load(\n",
|
||||
" \"wasbs://publicwasb@mmlspark.blob.core.windows.net/triazines.scale.svmlight\"\n",
|
||||
")"
|
||||
],
|
||||
"outputs": [],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# print some basic info\n",
|
||||
"print(\"records read: \" + str(triazines.count()))\n",
|
||||
"print(\"Schema: \")\n",
|
||||
"triazines.printSchema()\n",
|
||||
"display(triazines.limit(10))"
|
||||
],
|
||||
"outputs": [],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Split dataset into train and test"
|
||||
],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"train, test = triazines.randomSplit([0.85, 0.15], seed=1)"
|
||||
],
|
||||
"outputs": [],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Model Training"
|
||||
],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from synapse.ml.lightgbm import LightGBMRegressor\n",
|
||||
"\n",
|
||||
"model = LightGBMRegressor(\n",
|
||||
" objective=\"quantile\", alpha=0.2, learningRate=0.3, numLeaves=31\n",
|
||||
").fit(train)"
|
||||
],
|
||||
"outputs": [],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(model.getFeatureImportances())"
|
||||
],
|
||||
"outputs": [],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Model Prediction"
|
||||
],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"scoredData = model.transform(test)\n",
|
||||
"display(scoredData)"
|
||||
],
|
||||
"outputs": [],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from synapse.ml.train import ComputeModelStatistics\n",
|
||||
"\n",
|
||||
|
@ -424,27 +413,27 @@
|
|||
" evaluationMetric=\"regression\", labelCol=\"label\", scoresCol=\"prediction\"\n",
|
||||
").transform(scoredData)\n",
|
||||
"display(metrics)"
|
||||
],
|
||||
"outputs": [],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## LightGBM Ranker"
|
||||
],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Read dataset"
|
||||
],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"df = spark.read.format(\"parquet\").load(\n",
|
||||
" \"wasbs://publicwasb@mmlspark.blob.core.windows.net/lightGBMRanker_train.parquet\"\n",
|
||||
|
@ -454,20 +443,20 @@
|
|||
"print(\"Schema: \")\n",
|
||||
"df.printSchema()\n",
|
||||
"display(df.limit(10))"
|
||||
],
|
||||
"outputs": [],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Model Training"
|
||||
],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from synapse.ml.lightgbm import LightGBMRanker\n",
|
||||
"\n",
|
||||
|
@ -487,38 +476,57 @@
|
|||
" evalAt=[1, 3, 5],\n",
|
||||
" metric=\"ndcg\",\n",
|
||||
")"
|
||||
],
|
||||
"outputs": [],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"lgbm_ranker_model = lgbm_ranker.fit(df)"
|
||||
],
|
||||
"outputs": [],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Model Prediction"
|
||||
],
|
||||
"metadata": {}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dt = spark.read.format(\"parquet\").load(\n",
|
||||
" \"wasbs://publicwasb@mmlspark.blob.core.windows.net/lightGBMRanker_test.parquet\"\n",
|
||||
")\n",
|
||||
"predictions = lgbm_ranker_model.transform(dt)\n",
|
||||
"predictions.limit(10).toPandas()"
|
||||
],
|
||||
"outputs": [],
|
||||
"metadata": {}
|
||||
]
|
||||
}
|
||||
]
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.8.5 64-bit (conda)",
|
||||
"name": "python385jvsc74a57bd072be13fef265c65d19cf428fd1b09dd31615eed186d1dccdebb6e555960506ee"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.8.5"
|
||||
},
|
||||
"orig_nbformat": 2
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
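The regressor in the notebook diff above is configured with `objective="quantile"` and `alpha=0.2`. As a rough illustration of what that objective optimizes, the sketch below computes the pinball (quantile) loss in plain Python. The function name and sample values are made up for illustration, and this is not LightGBM's internal implementation:

```python
def pinball_loss(y_true, y_pred, alpha=0.2):
    """Mean pinball (quantile) loss for quantile level alpha."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # Under-prediction (diff >= 0) costs alpha per unit;
        # over-prediction costs (1 - alpha) per unit.
        total += alpha * diff if diff >= 0 else (alpha - 1.0) * diff
    return total / len(y_true)


labels = [1.0, 2.0, 3.0]
low = pinball_loss(labels, [0.0, 1.0, 2.0])   # every prediction is 1 too low
high = pinball_loss(labels, [2.0, 3.0, 4.0])  # every prediction is 1 too high
print(low, high)
```

With `alpha=0.2`, over-prediction is penalized four times as heavily as under-prediction, which pushes the fitted model toward the 20th percentile of the target rather than its mean.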
|
||||
|
|
|
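The ranker in the notebook diff above is evaluated with `metric="ndcg"` and `evalAt=[1, 3, 5]`. A minimal sketch of NDCG@k for a single query, assuming the common exponential gain `2**rel - 1` and a `log2` position discount; the function names and relevance labels are illustrative and not taken from LightGBM's code:

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked items."""
    return sum(
        (2 ** rel - 1) / math.log2(pos + 2)
        for pos, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """DCG@k normalized by the DCG@k of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels in the order the model ranked the documents.
ranked = [3, 2, 0, 1]
for k in (1, 3, 5):
    print(k, round(ndcg_at_k(ranked, k), 4))
```

In practice the metric is averaged over all query groups, which is why the ranker needs a group column in addition to the label.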
@ -1,29 +1,38 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## ONNX Inference on Spark\n",
|
||||
"# ONNX Inference on Spark\n",
|
||||
"\n",
|
||||
"In this example, we will train a LightGBM model, convert the model to ONNX format and use the converted model to infer some testing data on Spark.\n",
|
||||
"In this example, you train a LightGBM model and convert the model to [ONNX](https://onnx.ai/) format. Once converted, you use the model to infer some testing data on Spark.\n",
|
||||
"\n",
|
||||
"Python dependencies:\n",
|
||||
"This example uses the following Python packages and versions:\n",
|
||||
"\n",
|
||||
"- onnxmltools==1.7.0\n",
|
||||
"- lightgbm==3.2.1\n"
|
||||
"- `onnxmltools==1.7.0`\n",
|
||||
"- `lightgbm==3.2.1`\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Load training data"
|
||||
"## Load the example data\n",
|
||||
"\n",
|
||||
"To load the example data, add the following code examples to cells in your notebook and then run the cells:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from pyspark.sql import SparkSession\n",
|
||||
|
@ -34,13 +43,7 @@
|
|||
"from synapse.ml.core.platform import *\n",
|
||||
"\n",
|
||||
"from synapse.ml.core.platform import materializing_display as display"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
|
@ -61,10 +64,25 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Use LightGBM to train a model"
|
||||
"The output should look similar to the following table, though the values and number of rows may differ:\n",
|
||||
"\n",
|
||||
"| Interest Coverage Ratio | Net Income Flag | Equity to Liability |\n",
|
||||
"| ----- | ----- | ----- |\n",
|
||||
"| 0.5641 | 1.0 | 0.0165 |\n",
|
||||
"| 0.5702 | 1.0 | 0.0208 |\n",
|
||||
"| 0.5673 | 1.0 | 0.0165 |"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Use LightGBM to train a model"
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -103,10 +121,30 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Export the trained model to a LightGBM booster, convert it to ONNX format."
|
||||
"## Convert the model to ONNX format\n",
|
||||
"\n",
|
||||
"The following code exports the trained model to a LightGBM booster and then converts it to ONNX format:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"tags": [
|
||||
"hide-synapse-internal"
|
||||
]
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from synapse.ml.core.platform import running_on_binder\n",
|
||||
"\n",
|
||||
"if running_on_binder():\n",
|
||||
" !pip install lightgbm==3.2.1\n",
|
||||
" from IPython import get_ipython"
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -115,11 +153,6 @@
|
|||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from synapse.ml.core.platform import running_on_binder\n",
|
||||
"\n",
|
||||
"if running_on_binder():\n",
|
||||
" !pip install lightgbm==3.2.1\n",
|
||||
" from IPython import get_ipython\n",
|
||||
"import lightgbm as lgb\n",
|
||||
"from lightgbm import Booster, LGBMClassifier\n",
|
||||
"\n",
|
||||
|
@ -141,10 +174,11 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Load the ONNX payload into an `ONNXModel`, and inspect the model inputs and outputs."
|
||||
"After conversion, load the ONNX payload into an `ONNXModel` and inspect the model inputs and outputs:"
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -162,6 +196,7 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
|
@ -183,10 +218,13 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Create some testing data and transform the data through the ONNX model."
|
||||
"## Use the model for inference\n",
|
||||
"\n",
|
||||
"To perform inference with the model, the following code creates testing data and transforms the data through the ONNX model."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -217,11 +255,39 @@
|
|||
"\n",
|
||||
"display(onnx_ml.transform(testDf))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The output should look similar to the following table, though the values and number of rows may differ:\n",
|
||||
"\n",
|
||||
"| Index | Features | Prediction | Probability |\n",
|
||||
"| ----- | ----- | ----- | ----- |\n",
|
||||
"| 1 | `\"{\"type\":1,\"values\":[0.105...` | 0 | `\"{\"0\":0.835...` |\n",
|
||||
"| 2 | `\"{\"type\":1,\"values\":[0.814...` | 0 | `\"{\"0\":0.658...` |"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"celltoolbar": "Tags",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"name": "python"
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.8.8"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
|
|
|
@ -1,29 +1,42 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Exploring Art across Culture and Medium with Fast, Conditional, k-Nearest Neighbors\n",
|
||||
"\n",
|
||||
"<img src=\"https://mmlspark.blob.core.windows.net/graphics/art/cross_cultural_matches.jpg\" width=\"600\"/>\n",
|
||||
"\n",
|
||||
"This notebook serves as a guideline for match-finding via k-nearest-neighbors. In the code below, we will set up code that allows queries involving cultures and mediums of art amassed from the Metropolitan Museum of Art in NYC and the Rijksmuseum in Amsterdam."
|
||||
"This article serves as a guideline for match-finding via k-nearest-neighbors. You set up code that allows queries involving cultures and mediums of art amassed from the Metropolitan Museum of Art in NYC and the Rijksmuseum in Amsterdam."
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"metadata": {
|
||||
"tags": [
|
||||
"hide-synapse-internal"
|
||||
]
|
||||
},
|
||||
"source": [
|
||||
"### Overview of the BallTree\n",
|
||||
"The structure functioning behind the kNN model is a BallTree, which is a recursive binary tree where each node (or \"ball\") contains a partition of the points of data to be queried. Building a BallTree involves assigning data points to the \"ball\" whose center they are closest to (with respect to a certain specified feature), resulting in a structure that allows binary-tree-like traversal and lends itself to finding k-nearest neighbors at a BallTree leaf."
|
||||
"<img src=\"https://mmlspark.blob.core.windows.net/graphics/art/cross_cultural_matches.jpg\" width=\"600\"/>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Setup\n",
|
||||
"## Overview of the BallTree\n",
|
||||
"The structure functioning behind the KNN model is a BallTree, which is a recursive binary tree where each node (or \"ball\") contains a partition of the points of data to be queried. Building a BallTree involves assigning data points to the \"ball\" whose center they're closest to (with respect to a certain specified feature), resulting in a structure that allows binary-tree-like traversal and lends itself to finding k-nearest neighbors at a BallTree leaf."
|
||||
]
|
||||
},
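The BallTree description above can be illustrated with a small, pure-Python sketch. It is not SynapseML's implementation: the names `Ball`, `build`, and `knn` are hypothetical, Euclidean distance is assumed for simplicity, and the split rule (two far-apart pivots) is just one common choice. SynapseML's own tree may use a different similarity measure and construction rule.

```python
import heapq
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Point = Sequence[float]


@dataclass
class Ball:
    """One node of the tree: a center, a radius, and either points or two children."""
    center: Tuple[float, ...]
    radius: float
    points: List[Point]  # non-empty only at leaves
    left: Optional["Ball"] = None
    right: Optional["Ball"] = None


def build(points: List[Point], leaf_size: int = 2) -> Ball:
    """Recursively partition points into nested balls (illustrative split rule)."""
    dim = len(points[0])
    center = tuple(sum(p[i] for p in points) / len(points) for i in range(dim))
    radius = max(math.dist(center, p) for p in points)
    if len(points) <= leaf_size:
        return Ball(center, radius, list(points))
    # Pick two far-apart pivots, then assign each point to the closer pivot.
    a = max(points, key=lambda p: math.dist(points[0], p))
    b = max(points, key=lambda p: math.dist(a, p))
    left = [p for p in points if math.dist(p, a) <= math.dist(p, b)]
    right = [p for p in points if math.dist(p, a) > math.dist(p, b)]
    if not left or not right:  # all points coincide, so stop splitting
        return Ball(center, radius, list(points))
    return Ball(center, radius, [], build(left, leaf_size), build(right, leaf_size))


def knn(root: Ball, query: Point, k: int) -> List[Tuple[float, Tuple[float, ...]]]:
    """Return up to k (distance, point) pairs, nearest first."""
    best: list = []  # max-heap emulated with negated distances

    def visit(node: Ball) -> None:
        # No point inside this ball can be closer to the query than this bound.
        bound = max(0.0, math.dist(query, node.center) - node.radius)
        if len(best) == k and bound >= -best[0][0]:
            return  # prune: the ball cannot improve the current k best
        if node.left is None:
            for p in node.points:
                d = math.dist(query, p)
                if len(best) < k:
                    heapq.heappush(best, (-d, tuple(p)))
                elif d < -best[0][0]:
                    heapq.heapreplace(best, (-d, tuple(p)))
            return
        # Visit the child whose center is closer first to tighten the bound early.
        for child in sorted((node.left, node.right),
                            key=lambda c: math.dist(query, c.center)):
            visit(child)

    visit(root)
    return sorted((-neg_d, p) for neg_d, p in best)


sample_points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6), (10, 0), (2, 2)]
tree = build(sample_points)
nearest = knn(tree, (0.2, 0.1), k=3)
```

A conditional variant, as used by ConditionalKNN, would additionally skip candidate points whose label is not in the set of labels attached to the query; that filter is omitted here to keep the sketch short.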
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Setup\n",
|
||||
"Import necessary Python libraries and prepare dataset."
|
||||
]
|
||||
},
|
||||
|
@ -100,11 +113,12 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Define categories to be queried on\n",
|
||||
"We will be using two kNN models: one for culture, and one for medium. The categories for each grouping are defined below."
|
||||
"## Define categories to be queried on\n",
|
||||
"Two KNN models are used: one for culture, and one for medium."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -146,11 +160,12 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Define and fit ConditionalKNN models\n",
|
||||
"Below, we create ConditionalKNN models for both the medium and culture columns; each model takes in an output column, features column (feature vector), values column (cell values under the output column), and label column (the quality that the respective KNN is conditioned on)."
|
||||
"## Define and fit ConditionalKNN models\n",
|
||||
"Create ConditionalKNN models for both the medium and culture columns; each model takes in an output column, features column (feature vector), values column (cell values under the output column), and label column (the quality that the respective KNN is conditioned on)."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -190,11 +205,11 @@
|
|||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Define matching and visualizing methods\n",
|
||||
"## Define matching and visualizing methods\n",
|
||||
"\n",
|
||||
"After the initial dataset and category setup, we prepare methods that will query and visualize the conditional kNN's results. \n",
|
||||
"After the initial dataset and category setup, prepare methods that will query and visualize the conditional KNN's results.\n",
|
||||
"\n",
|
||||
"`addMatches()` will create a Dataframe with a handful of matches per category."
|
||||
"`addMatches()` creates a Dataframe with a handful of matches per category."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -213,6 +228,7 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
|
@ -260,11 +276,12 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Putting it all together\n",
|
||||
"Below, we define `test_all()` to take in the data, CKNN models, the art id values to query on, and the file path to save the output visualization to. The medium and culture models were previously trained and loaded."
|
||||
"## Putting it all together\n",
|
||||
"Define `test_all()` to take in the data, CKNN models, the art id values to query on, and the file path to save the output visualization to. The medium and culture models were previously trained and loaded."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -305,13 +322,23 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Demo\n",
|
||||
"The following cell performs batched queries given desired image IDs and a filename to save the visualization.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"## Demo\n",
|
||||
"The following cell performs batched queries given desired image IDs and a filename to save the visualization."
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"tags": [
|
||||
"hide-synapse-internal"
|
||||
]
|
||||
},
|
||||
"source": [
|
||||
"<img src=\"https://mmlspark.blob.core.windows.net/graphics/art/cross_cultural_matches.jpg\" width=\"600\"/>"
|
||||
]
|
||||
},
|
||||
|
|
|
@ -1,14 +1,22 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## HyperParameterTuning - Fighting Breast Cancer\n",
|
||||
"# HyperParameterTuning - Fighting Breast Cancer\n",
|
||||
"\n",
|
||||
"We can do distributed randomized grid search hyperparameter tuning with SynapseML.\n",
|
||||
"\n",
|
||||
"First, we import the packages"
|
||||
"This tutorial shows how SynapseML can be used to identify the best combination of hyperparameters for your chosen classifiers, ultimately resulting in more accurate and reliable models. In order to demonstrate this, we'll show how to perform distributed randomized grid search hyperparameter tuning to build a model to identify breast cancer. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 1 - Set up dependencies\n",
|
||||
"Start by importing pandas and setting up our Spark session."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -25,10 +33,11 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now let's read the data and split it to tuning and test sets:"
|
||||
"Next, read the data and split it into tuning and test sets."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -49,7 +58,7 @@
|
|||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Next, define the models that will be tuned:"
|
||||
"Define the models to be used."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -74,12 +83,14 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can specify the hyperparameters using the HyperparamBuilder.\n",
|
||||
"We can add either DiscreteHyperParam or RangeHyperParam hyperparameters.\n",
|
||||
"TuneHyperparameters will randomly choose values from a uniform distribution."
|
||||
"## 2 - Find the best model using AutoML\n",
|
||||
"\n",
|
||||
"Import SynapseML's AutoML classes from `synapse.ml.automl`.\n",
|
||||
"Specify the hyperparameters using the `HyperparamBuilder`. Add either `DiscreteHyperParam` or `RangeHyperParam` hyperparameters. `TuneHyperparameters` will randomly choose values from a uniform distribution:"
|
||||
]
|
||||
},
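To make the idea of randomized search concrete, here is a minimal, library-free sketch of sampling discrete and range hyperparameters uniformly and keeping the best-scoring combination. It does not use the SynapseML API: `sample_params`, `random_search`, `toy_score`, and the search space below are hypothetical stand-ins for `HyperparamBuilder`, `TuneHyperparameters`, and a real cross-validated metric.

```python
import random


def sample_params(space, rng):
    """Draw one configuration: lists are discrete choices, (low, high) tuples are ranges."""
    params = {}
    for name, spec in space.items():
        if isinstance(spec, list):
            params[name] = rng.choice(spec)  # uniform over the listed values
        else:
            low, high = spec
            if isinstance(low, int) and isinstance(high, int):
                params[name] = rng.randint(low, high)  # uniform integer, inclusive
            else:
                params[name] = rng.uniform(low, high)  # uniform float
    return params


def random_search(space, score, num_runs, seed=0):
    """Evaluate num_runs random configurations and return the best one with its score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(num_runs):
        params = sample_params(space, rng)
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


def toy_score(params):
    """Stand-in for a validation metric; peaks at max_depth=5, learning_rate=0.1."""
    return -((params["max_depth"] - 5) ** 2) - (params["learning_rate"] - 0.1) ** 2


search_space = {
    "max_depth": [2, 3, 5, 10],      # discrete hyperparameter
    "learning_rate": (0.01, 0.3),    # range hyperparameter
}
best_params, best_score = random_search(search_space, toy_score, num_runs=20, seed=42)
```

In the notebook, the scoring step corresponds to fitting each candidate model with cross-validation on the tuning set, and the runs are distributed across the Spark cluster rather than executed in a loop.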
|
||||
{
|
||||
|
@ -129,9 +140,11 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 3 - Evaluate the model\n",
|
||||
"We can view the best model's parameters and retrieve the underlying best model pipeline"
|
||||
]
|
||||
},
|
||||
|
|
|
@ -1,6 +1,7 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
|
@ -11,11 +12,11 @@
|
|||
}
|
||||
},
|
||||
"source": [
|
||||
"## Interpretability - Tabular SHAP explainer\n",
|
||||
"# Interpretability - Tabular SHAP explainer\n",
|
||||
"\n",
|
||||
"In this example, we use Kernel SHAP to explain a tabular classification model built from the Adults Census dataset.\n",
|
||||
"\n",
|
||||
"First we import the packages and define some UDFs we will need later."
|
||||
"First we import the packages and define some UDFs we need later."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -54,6 +55,7 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
|
@ -64,7 +66,7 @@
|
|||
}
|
||||
},
|
||||
"source": [
|
||||
"Now let's read the data and train a simple binary classification model."
|
||||
"Now let's read the data and train a binary classification model."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -126,6 +128,7 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
|
@ -159,6 +162,7 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
|
@ -169,7 +173,7 @@
|
|||
}
|
||||
},
|
||||
"source": [
|
||||
"We create a TabularSHAP explainer, set the input columns to all the features the model takes, specify the model and the target output column we are trying to explain. In this case, we are trying to explain the \"probability\" output which is a vector of length 2, and we are only looking at class 1 probability. Specify targetClasses to `[0, 1]` if you want to explain class 0 and 1 probability at the same time. Finally we sample 100 rows from the training data for background data, which is used for integrating out features in Kernel SHAP."
|
||||
"We create a TabularSHAP explainer, set the input columns to all the features the model takes, specify the model and the target output column we're trying to explain. In this case, we're trying to explain the \"probability\" output, which is a vector of length 2, and we're only looking at class 1 probability. Specify targetClasses to `[0, 1]` if you want to explain class 0 and 1 probability at the same time. Finally we sample 100 rows from the training data for background data, which is used for integrating out features in Kernel SHAP."
|
||||
]
|
||||
},
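The phrase "integrating out features" with background data can be shown with a small sketch. The code below computes exact Shapley values by enumerating every feature coalition, which is only feasible for a handful of features; Kernel SHAP approximates the same quantity with a weighted regression over sampled coalitions. The function `shapley_values` and the toy linear model are hypothetical and are not part of the SynapseML explainer API.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, background):
    """Exact Shapley values of f at x, averaging absent features over background rows."""
    n = len(x)

    def value(present):
        # Features in `present` take their value from x; the rest come from each
        # background row, and the model output is averaged (the "integrating out" step).
        total = 0.0
        for row in background:
            z = [x[i] if i in present else row[i] for i in range(n)]
            total += f(z)
        return total / len(background)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                contribution += weight * (value(s | {i}) - value(s))
        phi.append(contribution)
    return phi


weights = [2.0, -1.0, 0.5]


def linear_model(z):
    """Toy model standing in for the class 1 probability output."""
    return sum(w * v for w, v in zip(weights, z))


instance = [1.0, 2.0, 3.0]
background_rows = [[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]]
phi = shapley_values(linear_model, instance, background_rows)
```

For a linear model, each value reduces to the weight times the difference between the instance's feature and the background mean of that feature, and the values sum to the model output at the instance minus the average output over the background rows.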
|
||||
{
|
||||
|
@ -242,6 +246,7 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
|
@ -308,6 +313,7 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"application/vnd.databricks.v1+cell": {
|
||||
|
@ -315,10 +321,13 @@
|
|||
"nuid": "8f22fceb-0fc0-4a86-a0ca-2a7b47b4795a",
|
||||
"showTitle": false,
|
||||
"title": ""
|
||||
}
|
||||
},
|
||||
"tags": [
|
||||
"hide-synapse-internal"
|
||||
]
|
||||
},
|
||||
"source": [
|
||||
"Your results will look like:\n",
|
||||
"Your results should look like:\n",
|
||||
"\n",
|
||||
"<img src=\"https://mmlspark.blob.core.windows.net/graphics/explainers/tabular-shap.png\" style=\"float: right;\"/>"
|
||||
]
|
||||
|
|