azureml-examples/tutorials/get-started-notebooks/explore-data.ipynb

{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "nteract": {
     "transient": {
      "deleting": false
     }
    }
   },
   "source": [
    "# Tutorial: Upload, access and explore your data in Azure Machine Learning\n",
    "\n",
    "In this tutorial you learn how to:\n",
    "\n",
    "> * Upload your data to cloud storage\n",
    "> * Create an Azure Machine Learning data asset\n",
    "> * Access your data in a notebook for interactive development\n",
    "> * Create new versions of data assets\n",
    "\n",
    "The start of a machine learning project typically involves exploratory data analysis (EDA), data-preprocessing (cleaning, feature engineering), and the building of Machine Learning model prototypes to validate hypotheses. This _prototyping_ project phase is highly interactive. It lends itself to development in an IDE or a Jupyter notebook, with a _Python interactive console_. This tutorial describes these ideas.\n",
    "\n",
    "## Prerequisites\n",
    "\n",
    "* If you opened this notebook from Azure Machine Learning studio, you need a compute instance to run the code. If you don't have a compute instance, select **Create compute** on the toolbar to first create one.  You can use all the default settings.  \n",
    "\n",
    "    ![Create compute](./media/create-compute.png)\n",
    "\n",
    "* If you're seeing this notebook elsewhere, complete [Create resources you need to get started](https://docs.microsoft.com/azure/machine-learning/quickstart-create-resources) to create an Azure Machine Learning workspace and a compute instance.\n",
    "\n",
    "## Set your kernel\n",
    "\n",
    "* If your compute instance is stopped, start it now.  \n",
    "        \n",
    "    ![Start compute](./media/start-compute.png)\n",
    "\n",
    "* Once your compute instance is running, make sure the that the kernel, found on the top right, is `Python 3.10 - SDK v2`.  If not, use the dropdown to select this kernel.\n",
    "\n",
    "    ![Set the kernel](./media/set-kernel.png)\n",
    "\n",
    "### Download the data used in this tutorial\n",
    "\n",
    "For data ingestion, the Azure Data Explorer handles raw data in [these formats](https://learn.microsoft.com/azure/data-explorer/ingestion-supported-formats). This tutorial uses this [CSV-format credit card client data sample](https://azuremlexamples.blob.core.windows.net/datasets/credit_card/default_of_credit_card_clients.csv). We see the steps proceed in an Azure Machine Learning resource. In that resource, we'll create a local folder with the suggested name of **data** directly under the folder where this notebook is located.\n",
    "\n",
    "> [!NOTE]\n",
    "> This tutorial depends on data placed in an Azure Machine Learning resource folder location. For this tutorial, 'local' means a folder location in that Azure Machine Learning resource. \n",
    "\n",
    "1. Select **Open terminal** below the three dots, as shown in this image:\n",
    "\n",
    "    ![Open terminal](./media/open-terminal.png)\n",
    "\n",
    "1. The terminal window opens in a new tab. \n",
    "1. Make sure you `cd` to the same folder where this notebook is located.  For example, if the notebook is in a folder named **get-started-notebooks**:\n",
    "\n",
    "    ```\n",
    "    cd get-started-notebooks    #  modify this to the path where your notebook is located\n",
    "    ```\n",
    "\n",
    "1. Enter these commands in the terminal window to copy the data to your compute instance:\n",
    "\n",
    "    ```\n",
    "    mkdir data\n",
    "    cd data                     # the sub-folder where you'll store the data\n",
    "    wget https://azuremlexamples.blob.core.windows.net/datasets/credit_card/default_of_credit_card_clients.csv\n",
    "    ```\n",
    "1. You can now close the terminal window.\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "nteract": {
     "transient": {
      "deleting": false
     }
    }
   },
   "source": [
    "[Learn more about this data on the UCI Machine Learning Repository.](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients)\n",
    "\n",
    "## Create handle to workspace\n",
    "\n",
    "Before we dive in the code, you need a way to reference your workspace. You'll create `ml_client` for a handle to the workspace.  You'll then use `ml_client` to manage resources and jobs.\n",
    "\n",
    "In the next cell, enter your Subscription ID, Resource Group name and Workspace name. To find these values:\n",
    "\n",
    "1. In the upper right Azure Machine Learning studio toolbar, select your workspace name.\n",
    "1. Copy the value for workspace, resource group and subscription ID into the code.\n",
    "1. You'll need to copy one value, close the area and paste, then come back for the next one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "gather": {
     "logged": 1675966726847
    },
    "jupyter": {
     "outputs_hidden": false,
     "source_hidden": false
    },
    "nteract": {
     "transient": {
      "deleting": false
     }
    }
   },
   "outputs": [],
   "source": [
    "from azure.ai.ml import MLClient\n",
    "from azure.identity import DefaultAzureCredential\n",
    "from azure.ai.ml.entities import Data\n",
    "from azure.ai.ml.constants import AssetTypes\n",
    "\n",
    "# authenticate\n",
    "credential = DefaultAzureCredential()\n",
    "\n",
    "# Get a handle to the workspace\n",
    "ml_client = MLClient(\n",
    "    credential=credential,\n",
    "    subscription_id=\"<SUBSCRIPTION_ID>\",\n",
    "    resource_group_name=\"<RESOURCE_GROUP>\",\n",
    "    workspace_name=\"<AML_WORKSPACE_NAME>\",\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> [!NOTE]\n",
    "> Creating MLClient will not connect to the workspace. The client initialization is lazy, it will wait for the first time it needs to make a call (this will happen in the next code cell)."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "nteract": {
     "transient": {
      "deleting": false
     }
    }
   },
   "source": [
    "\n",
    "## Upload data to cloud storage\n",
    "\n",
    "Azure Machine Learning uses Uniform Resource Identifiers (URIs), which point to storage locations in the cloud. A URI makes it easy to access data in notebooks and jobs. Data URI formats look similar to the web URLs that you use in your web browser to access web pages. For example:\n",
    "\n",
    "* Access data from public https server: `https://<account_name>.blob.core.windows.net/<container_name>/<folder>/<file>`\n",
    "* Access data from Azure Data Lake Gen 2: `abfss://<file_system>@<account_name>.dfs.core.windows.net/<folder>/<file>`\n",
    "\n",
    "An Azure Machine Learning data asset is similar to web browser bookmarks (favorites). Instead of remembering long storage paths (URIs) that point to your most frequently used data, you can create a data asset, and then access that asset with a friendly name.\n",
    "\n",
    "Data asset creation also creates a *reference* to the data source location, along with a copy of its metadata. Because the data remains in its existing location, you incur no extra storage cost, and don't risk data source integrity. You can create Data assets from Azure Machine Learning datastores, Azure Storage, public URLs, and local files.\n",
    "\n",
    "> [!TIP]\n",
    "> For smaller-size data uploads, Azure Machine Learning data asset creation works well for data uploads from local machine resources to cloud storage. This approach avoids the need for extra tools or utilities. However, a larger-size data upload might require a dedicated tool or utility - for example, **azcopy**. The azcopy command-line tool moves data to and from Azure Storage. Learn more about [azcopy](https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10).\n",
    "\n",
    "The next notebook cell creates the data asset. The code sample uploads the raw data file to the designated cloud storage resource.  \n",
    "\n",
    "Each time you create a data asset, you need a unique version for it.  If the version already exists, you'll get an error.  In this code, we're using the \"initial\" for the first read of the data.  If that version already exists, we'll skip creating it again.\n",
    "\n",
    "You can also omit the **version** parameter, and a version number is generated for you, starting with 1 and then incrementing from there. \n",
    "\n",
    "In this tutorial, we use the name \"initial\" as the first version. The [Create production machine learning pipelines](pipeline.ipynb) tutorial will also use this version of the data, so here we are using a value that you'll see again in that tutorial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "gather": {
     "logged": 1675461156382
    },
    "jupyter": {
     "outputs_hidden": false,
     "source_hidden": false
    },
    "nteract": {
     "transient": {
      "deleting": false
     }
    }
   },
   "outputs": [],
   "source": [
    "from azure.ai.ml.entities import Data\n",
    "from azure.ai.ml.constants import AssetTypes\n",
    "\n",
    "# update the 'my_path' variable to match the location of where you downloaded the data on your\n",
    "# local filesystem\n",
    "\n",
    "my_path = \"./data/default_of_credit_card_clients.csv\"\n",
    "# set the version number of the data asset\n",
    "v1 = \"initial\"\n",
    "\n",
    "my_data = Data(\n",
    "    name=\"credit-card\",\n",
    "    version=v1,\n",
    "    description=\"Credit card data\",\n",
    "    path=my_path,\n",
    "    type=AssetTypes.URI_FILE,\n",
    ")\n",
    "\n",
    "## create data asset if it doesn't already exist:\n",
    "try:\n",
    "    data_asset = ml_client.data.get(name=\"credit-card\", version=v1)\n",
    "    print(\n",
    "        f\"Data asset already exists. Name: {my_data.name}, version: {my_data.version}\"\n",
    "    )\n",
    "except:\n",
    "    ml_client.data.create_or_update(my_data)\n",
    "    print(f\"Data asset created. Name: {my_data.name}, version: {my_data.version}\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "nteract": {
     "transient": {
      "deleting": false
     }
    }
   },
   "source": [
    "You can see the uploaded data by selecting **Data** on the left. You'll see the data is uploaded and a data asset is created:\n",
    "\n",
    "![Image of data section of studio shows uploaded data](./media/access-and-explore-data.png)\n",
    "\n",
    "This data is named **credit-card**, and in the **Data assets** tab, we can see it in the **Name** column. This data uploaded to your workspace's default datastore named **workspaceblobstore**, seen in the **Data source** column. \n",
    "\n",
    "An Azure Machine Learning datastore is a *reference* to an *existing* storage account on Azure. A datastore offers these benefits:\n",
    "\n",
    "1. A common and easy-to-use API, to interact with different storage types (Blob/Files/Azure Data Lake Storage) and authentication methods.\n",
    "1. An easier way to discover useful datastores, when working as a team.\n",
    "1. In your scripts, a way to hide connection information for credential-based data access (service principal/SAS/key).\n",
    "\n",
    "\n",
    "## Access your data in a notebook\n",
    "\n",
    "Pandas directly support URIs - this example shows how to read a CSV file from an Azure Machine Learning Datastore:\n",
    "\n",
    "```\n",
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv(\"azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/<datastore_name>/paths/<folder>/<filename>.csv\")\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "nteract": {
     "transient": {
      "deleting": false
     }
    }
   },
   "source": [
    "However, as mentioned previously, it can become hard to remember these URIs. Additionally, you must manually substitute all **<_substring_>** values in the **pd.read_csv** command with the real values for your resources. \n",
    "\n",
    "You'll want to create data assets for frequently accessed data. Here's an easier way to access the CSV file in Pandas:\n",
    "\n",
    "> [!IMPORTANT]\n",
    "> In a notebook cell, execute this code to install the `azureml-fsspec` Python library in your Jupyter kernel:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "jupyter": {
     "outputs_hidden": false,
     "source_hidden": false
    },
    "nteract": {
     "transient": {
      "deleting": false
     }
    }
   },
   "outputs": [],
   "source": [
    "%pip install -U azureml-fsspec"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "gather": {
     "logged": 1675445030495
    },
    "jupyter": {
     "outputs_hidden": false,
     "source_hidden": false
    },
    "nteract": {
     "transient": {
      "deleting": false
     }
    }
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# get a handle of the data asset and print the URI\n",
    "data_asset = ml_client.data.get(name=\"credit-card\", version=v1)\n",
    "print(f\"Data asset URI: {data_asset.path}\")\n",
    "\n",
    "# read into pandas - note that you will see 2 headers in your data frame - that is ok, for now\n",
    "\n",
    "df = pd.read_csv(data_asset.path)\n",
    "df.head()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "nteract": {
     "transient": {
      "deleting": false
     }
    }
   },
   "source": [
    "Read [Access data from Azure cloud storage during interactive development](how-to-access-data-interactive.md) to learn more about data access in a notebook.\n",
    "\n",
    "## Create a new version of the data asset\n",
    "\n",
    "You might have noticed that the data needs a little light cleaning, to make it fit to train a machine learning model. It has:\n",
    "\n",
    "* two headers\n",
    "* a client ID column; we wouldn't use this feature in Machine Learning\n",
    "* spaces in the response variable name\n",
    "\n",
    "Also, compared to the CSV format, the Parquet file format becomes a better way to store this data. Parquet offers compression, and it maintains schema. Therefore, to clean the data and store it in Parquet, use:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "gather": {
     "logged": 1675445038545
    },
    "jupyter": {
     "outputs_hidden": false,
     "source_hidden": false
    },
    "nteract": {
     "transient": {
      "deleting": false
     }
    }
   },
   "outputs": [],
   "source": [
    "# read in data again, this time using the 2nd row as the header\n",
    "df = pd.read_csv(data_asset.path, header=1)\n",
    "# rename column\n",
    "df.rename(columns={\"default payment next month\": \"default\"}, inplace=True)\n",
    "# remove ID column\n",
    "df.drop(\"ID\", axis=1, inplace=True)\n",
    "\n",
    "# write file to filesystem\n",
    "df.to_parquet(\"./data/cleaned-credit-card.parquet\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "nteract": {
     "transient": {
      "deleting": false
     }
    }
   },
   "source": [
    "This table shows the structure of the data in the original **default_of_credit_card_clients.csv** file .CSV file downloaded in an earlier step. The uploaded data contains 23 explanatory variables and 1 response variable, as shown here:\n",
    "\n",
    "|Column Name(s) | Variable Type  |Description  |\n",
    "|---------|---------|---------|\n",
    "|X1     |   Explanatory      |    Amount of the given credit (NT dollar): it includes both the individual consumer credit and their family (supplementary) credit.    |\n",
    "|X2     |   Explanatory      |   Gender (1 = male; 2 = female).      |\n",
    "|X3     |   Explanatory      |   Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).      |\n",
    "|X4     |   Explanatory      |    Marital status (1 = married; 2 = single; 3 = others).     |\n",
    "|X5     |   Explanatory      |    Age (years).     |\n",
    "|X6-X11     | Explanatory        |  History of past payment. We tracked the past monthly payment records (from April to September  2005). -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.      |\n",
    "|X12-17     | Explanatory        |  Amount of bill statement (NT dollar) from April to September  2005.      |\n",
    "|X18-23     | Explanatory        |  Amount of previous payment (NT dollar) from April to September  2005.      |\n",
    "|Y     | Response        |    Default payment (Yes = 1, No = 0)     |\n",
    "\n",
    "Next, create a new _version_ of the data asset (the data automatically uploads to cloud storage).  For this version, we'll add a time value, so that each time this code is run, a different version number will be created.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "gather": {
     "logged": 1675382989789
    },
    "jupyter": {
     "outputs_hidden": false,
     "source_hidden": false
    },
    "nteract": {
     "transient": {
      "deleting": false
     }
    }
   },
   "outputs": [],
   "source": [
    "from azure.ai.ml.entities import Data\n",
    "from azure.ai.ml.constants import AssetTypes\n",
    "import time\n",
    "\n",
    "# Next, create a new *version* of the data asset (the data is automatically uploaded to cloud storage):\n",
    "v2 = \"cleaned\" + time.strftime(\"%Y.%m.%d.%H%M%S\", time.gmtime())\n",
    "my_path = \"./data/cleaned-credit-card.parquet\"\n",
    "\n",
    "# Define the data asset, and use tags to make it clear the asset can be used in training\n",
    "\n",
    "my_data = Data(\n",
    "    name=\"credit-card\",\n",
    "    version=v2,\n",
    "    description=\"Default of credit card clients data.\",\n",
    "    tags={\"training_data\": \"true\", \"format\": \"parquet\"},\n",
    "    path=my_path,\n",
    "    type=AssetTypes.URI_FILE,\n",
    ")\n",
    "\n",
    "## create the data asset\n",
    "\n",
    "my_data = ml_client.data.create_or_update(my_data)\n",
    "\n",
    "print(f\"Data asset created. Name: {my_data.name}, version: {my_data.version}\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "nteract": {
     "transient": {
      "deleting": false
     }
    }
   },
   "source": [
    "The cleaned parquet file is the latest version data source. This code shows the CSV version result set first, then the Parquet version:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "gather": {
     "logged": 1675383001940
    },
    "jupyter": {
     "outputs_hidden": false,
     "source_hidden": false
    },
    "nteract": {
     "transient": {
      "deleting": false
     }
    }
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# get a handle of the data asset and print the URI\n",
    "data_asset_v1 = ml_client.data.get(name=\"credit-card\", version=v1)\n",
    "data_asset_v2 = ml_client.data.get(name=\"credit-card\", version=v2)\n",
    "\n",
    "# print the v1 data\n",
    "print(f\"V1 Data asset URI: {data_asset_v1.path}\")\n",
    "v1df = pd.read_csv(data_asset_v1.path)\n",
    "print(v1df.head(5))\n",
    "\n",
    "# print the v2 data\n",
    "print(\n",
    "    \"_____________________________________________________________________________________________________________\\n\"\n",
    ")\n",
    "print(f\"V2 Data asset URI: {data_asset_v2.path}\")\n",
    "v2df = pd.read_parquet(data_asset_v2.path)\n",
    "print(v2df.head(5))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "nteract": {
     "transient": {
      "deleting": false
     }
    }
   },
   "source": [
    "## Next steps\n",
    "\n",
    "Read [Create data assets](https://learn.microsoft.com/azure/machine-learning/how-to-create-data-assets) for more information about data assets.\n",
    "\n",
    "Read [Create datastores](https://learn.microsoft.com/azure/machine-learning/how-to-datastore) to learn more about datastores.\n",
    "\n",
    "Continue with tutorials to learn how to develop a training script.\n",
    "\n",
    "> [Model development on a cloud workstation](https://learn.microsoft.com/azure/machine-learning/tutorial-cloud-workstation)"
   ]
  }
 ],
 "metadata": {
  "description": {
   "description": "Upload data to cloud storage, create a data asset, create new versions for data assets, use the data for interactive development."
  },
  "kernel_info": {
   "name": "python310-sdkv2"
  },
  "kernelspec": {
   "display_name": "Python 3.10 - SDK v2",
   "language": "python",
   "name": "python310-sdkv2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  },
  "microsoft": {
   "host": {
    "AzureML": {
     "notebookHasBeenCompleted": true
    }
   }
  },
  "nteract": {
   "version": "nteract-front-end@1.0.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}