* Import schedule samples added

* data import schedule samples added

* Corrected formatting

* lifecycle settings added

* Fixed test fail error

* fixed broken test

* updates

* Updated for auto-delete-settings
This commit is contained in:
AmarBadal 2023-05-25 10:09:27 -07:00 committed by GitHub
Parent f87d6591d5
Commit 400ade00e9
5 changed files: 501 additions and 6 deletions

View file

@@ -1,7 +1,8 @@
## Working with Data in Azure Machine Learning CLI 2.0
This repository contains example `YAML` files for creating `data` assets using Azure Machine Learning CLI 2.0. This directory includes:
- Sample `YAML` files for creating a `data` asset from a `datastore`. These examples use the `workspaceblobstore` datastore, which is created by default when a `workspace` is created. The examples use the shorthand `azureml` scheme for pointing to a path on the `datastore` with the syntax `azureml://datastores/${{datastore-name}}/paths/${{path_on_datastore}}`.<BR>
- Sample `YAML` files for creating a `data` asset by uploading a local file or folder.
- Sample `YAML` files for creating a `data` asset by using the `URI` of a file or folder on an Azure storage account or the `URL` of a file available in the public domain.
@@ -9,12 +10,14 @@ This repository contains example `YAML` files for creating `data` using Azure Ma
- Sample `YAML` files for creating a `data` asset by using an `MLTable` file on an Azure storage account or the `URL` of a file available in the public domain.
- To create a data asset using any of the sample `YAML` files provided for the above scenarios, execute the following command (a minimal illustrative YAML is sketched after the command):
```cli
> az ml data create -f <file-name>.yml
```
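As a sketch of what such a file might contain, here is a minimal `data` asset YAML for the datastore scenario; the name and path are illustrative placeholders, not files from this repo:
```yaml
# Minimal illustrative data asset definition (placeholder name and path)
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: my-sample-data
version: 1
type: uri_folder
path: azureml://datastores/workspaceblobstore/paths/sample-data/
```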
- Sample `YAML` files for creating a `data` asset by importing data from external data sources. These examples use the `workspaceblobstore` datastore, which is created by default when a `workspace` is created. The examples use the shorthand `azureml` scheme for pointing to a path on the `datastore` using the syntax `azureml://datastores/workspaceblobstore/paths/<my_path>/${{name}}`.
>__NOTE:__ Choose `path` as "azureml://datastores/${{datastore-name}}/paths/${{path_on_datastore}}" if you wish to cache the imported data in separate locations. This provides reproducibility but adds storage cost. If you wish to overwrite the data on successive imports, choose `path` as "azureml://datastores/${{datastore-name}}/paths/<my_path>"; this saves you from incurring duplicate storage cost, but you lose reproducibility because each newer version of the data asset overwrites the underlying data in the data path.
- Sample `YAML` files for importing data from Snowflake DB and creating a `data` asset.
@@ -22,11 +25,77 @@ This repository contains example `YAML` files for creating `data` using Azure Ma
- Sample `YAML` files for importing data from an Amazon S3 bucket and creating a `data` asset.
- To create a data asset using any of the sample `<source>.yml` files provided for data import from external data sources, execute the following command:
```cli
> az ml data import -f <file-name>.yml
```
> **NOTE: Ensure you have copied the sample data into your default storage account by running the [copy-data.sh](../../../setup/setup-repo/copy-data.sh) script (`azcopy` is required).**
- To create a data asset in the AzureML managed (HOBO) datastore, use snowflake-import-managed.yml, which points to a path on `workspacemanageddatastore`. Execute the following command:
```cli
> az ml data import -f <file-name>.yml
```
- To import a data asset on a schedule, there are two options: reference any of the `<source>.yml` files from a schedule YAML, as in simple_import-schedule.yml, or define an "inline schedule YAML" that specifies both the schedule and the import details in one single YAML, as in data_import_schedule_database_inline.yml:
```cli
> az ml schedule create -f <file-name>.yml
```
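- To confirm that the schedule was created, you can show it by name (a brief sketch, assuming the CLI v2 `az ml schedule show` command is available in your environment):
```cli
> az ml schedule show -n <schedule-name>
```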
- A data asset imported onto `workspacemanageddatastore` has data lifecycle management capability: a default value and condition are set for its "auto-delete-settings", and both can be altered.
- Use the following command to check the current auto-delete settings of a data asset imported onto `workspacemanageddatastore` (an illustrative excerpt of the output follows the command):
```cli
> az ml data show -n <imported-data-asset-name> -v <version>
```
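The output includes the current auto-delete settings. Here is an illustrative sketch of the relevant fields (layout and values are assumptions, not verbatim CLI output):
```yaml
# Hypothetical excerpt of `az ml data show` output
name: data_import_to_managed_datastore
version: '1'
auto_delete_setting:
  condition: created_greater_than   # or last_accessed_greater_than
  value: 30d
```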
- To update the settings:
  - Use the following command to change the value:
```cli
> az ml data update --name 'data_import_to_managed_datastore' --version '1' --set auto_delete_setting.value='45d'
```
- To update the settings:
  - Use the following command to change the condition. Valid values for the condition are 'created_greater_than' and 'last_accessed_greater_than':
```cli
> az ml data update --name 'data_import_to_managed_datastore' --version '1' --set auto_delete_setting.condition='created_greater_than'
```
- To update the settings:
  - Use the following command to change both the condition and the value:
```cli
> az ml data update --name 'data_import_to_managed_datastore' --version '1' --set auto_delete_setting.condition='created_greater_than' auto_delete_setting.value='30d'
```
- To delete the settings:
  - Use the following command to remove the auto-delete setting:
```cli
> az ml data update --name 'data_import_to_managed_datastore' --version '1' --remove auto_delete_setting
```
- To add the settings back:
  - Use the following command:
```cli
> az ml data update --name 'data_import_to_managed_datastore' --version '1' --set auto_delete_setting.condition='created_greater_than' auto_delete_setting.value='30d'
```
- Use the following commands to query all the imported data assets that have certain values for the condition or the value:
```cli
> az ml data list --name 'data_import_to_managed_datastore' --query "[?auto_delete_setting.value=='30d']"
> az ml data list --name 'data_import_to_managed_datastore' --query "[?auto_delete_setting.condition=='last_accessed_greater_than']"
```
To learn more about Azure Machine Learning CLI 2.0, [follow this link](https://docs.microsoft.com/azure/machine-learning/how-to-configure-cli).

View file

@@ -0,0 +1,18 @@
$schema: http://azureml/sdk-2-0/Schedule.json
name: schedule_data_import_inline
display_name: Schedule data import inline
description: Schedule data import inline
trigger:
type: cron
expression: "15 10 * * 1"
time_zone: "Pacific Standard Time" # optional - default will be UTC
import_data:
type: mltable
name: my_snowflake_ds
path: azureml://datastores/workspaceblobstore/paths/snowflake/${{name}}
source:
type: database
query: select * from my_table
connection: azureml:my_snowflake_connection

View file

@@ -0,0 +1,11 @@
$schema: http://azureml/sdk-2-0/Schedule.json
name: schedule_data_import
display_name: Schedule data import
description: Schedule data import
trigger:
type: cron
expression: "15 10 * * 1"
time_zone: "Pacific Standard Time" # optional - default will be UTC
import_data: ./snowflake-import.yml

View file

@@ -0,0 +1,14 @@
$schema: http://azureml/sdk-2-0/DataImport.json
# Supported connections include:
# Connection: azureml:<workspace_connection_name>
# Supported paths include:
# Datastore: azureml://datastores/<data_store_name>/paths/<my_path>/${{name}}
type: mltable
name: snowflake_sample
source:
type: database
query: select * from my_sample_table
connection: azureml:my_snowflakedb_connection
path: azureml://datastores/workspacemanageddatastore

View file

@@ -1,6 +1,7 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -39,6 +40,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -89,6 +91,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -141,6 +144,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -181,6 +185,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -209,6 +214,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -239,6 +245,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -485,6 +492,380 @@
"ml_client.data.import_data(data_import=data_import)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Importing data on a Schedule. \n",
"You can import data on a schedule created on a recurrence trigger or a cron trigger"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is an example of importing data from Snowflake on Recurrence trigger"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml.data_transfer import Database\n",
"from azure.ai.ml.constants import TimeZone\n",
"from azure.ai.ml.entities import (\n",
" ImportDataSchedule,\n",
" RecurrenceTrigger,\n",
" RecurrencePattern,\n",
")\n",
"from datetime import datetime\n",
"\n",
"source = Database(connection=\"azureml:my_sf_connection\", query=\"select * from my_table\")\n",
"\n",
"path = \"azureml://datastores/workspaceblobstore/paths/snowflake/schedule/${{name}}\"\n",
"\n",
"\n",
"my_data = DataImport(\n",
" type=\"mltable\", source=source, path=path, name=\"my_schedule_sfds_test\"\n",
")\n",
"\n",
"schedule_name = \"my_simple_sdk_create_schedule_recurrence\"\n",
"\n",
"schedule_start_time = datetime.utcnow()\n",
"\n",
"recurrence_trigger = RecurrenceTrigger(\n",
" frequency=\"day\",\n",
" interval=1,\n",
" schedule=RecurrencePattern(hours=1, minutes=[0, 1]),\n",
" start_time=schedule_start_time,\n",
" time_zone=TimeZone.UTC,\n",
")\n",
"\n",
"\n",
"import_schedule = ImportDataSchedule(\n",
" name=schedule_name, trigger=recurrence_trigger, import_data=my_data\n",
")\n",
"\n",
"ml_client.schedules.begin_create_or_update(import_schedule).result()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a similar example of creating an import data on a schedule - this time it is a Cron Trigger"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml.entities import CronTrigger, ImportDataSchedule\n",
"\n",
"source = Database(connection=\"azureml:my_sf_connection\", query=\"select * from my_table\")\n",
"\n",
"path = \"azureml://datastores/workspaceblobstore/paths/snowflake/schedule/${{name}}\"\n",
"\n",
"\n",
"my_data = DataImport(\n",
" type=\"mltable\", source=source, path=path, name=\"my_schedule_sfds_test\"\n",
")\n",
"\n",
"schedule_name = \"my_simple_sdk_create_schedule_cron\"\n",
"\n",
"cron_trigger = CronTrigger(\n",
" expression=\"15 10 * * 1\",\n",
" start_time=datetime.utcnow(),\n",
" end_time=\"2023-12-03T18:40:00\",\n",
")\n",
"import_schedule = ImportDataSchedule(\n",
" name=schedule_name, trigger=cron_trigger, import_data=my_data\n",
")\n",
"ml_client.schedules.begin_create_or_update(import_schedule).result()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"NOTE: The import schedule is a schedule, so all the other CRUD operations of Schedule are available on this as well."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Disable the schedule"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ml_client.schedules.begin_disable(name=schedule_name).result()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Check the detail of the schedule"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"created_schedule = ml_client.schedules.get(name=schedule_name)\n",
"[created_schedule.name]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"List schedules in a workspace"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"schedules = ml_client.schedules.list()\n",
"[s.name for s in schedules]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Enable a schedule"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ml_client.schedules.begin_enable(name=schedule_name).result()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Update a schedule"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Update trigger expression\n",
"import_schedule.trigger.expression = \"10 10 * * 1\"\n",
"import_schedule = ml_client.schedules.begin_create_or_update(\n",
" schedule=import_schedule\n",
").result()\n",
"print(import_schedule)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Delete the schedule"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Only disabled schedules can be deleted\n",
"ml_client.schedules.begin_disable(name=schedule_name).result()\n",
"ml_client.schedules.begin_delete(name=schedule_name).result()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Data Management on Workspace managed datastore:\n",
"\n",
"Data import can be performed to an AzureML managed HOBO storage called \"workspacemanageddatastore\" by specifying \n",
"path: azureml://datastores/workspacemanageddatastore\n",
"The datastore will be automatically back-filled if not present.\n",
"\n",
"When done so, it comes with the added benefit of data life cycle management.\n",
"\n",
"The following shows a simple data import on to the workspacemanageddatastore. Same can be done using the schedules defined above - "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml.entities import DataImport\n",
"from azure.ai.ml.data_transfer import Database\n",
"from azure.ai.ml import MLClient\n",
"\n",
"# Supported connections include:\n",
"# Connection: azureml:<workspace_connection_name>\n",
"# Supported paths include:\n",
"# Datastore: azureml://datastores/<data_store_name>/paths/<my_path>/${{name}}\n",
"\n",
"data_import = DataImport(\n",
" name=\"my_sf_managedasset\",\n",
" source=Database(\n",
" connection=\"azureml:my_snowflakedb_connection\",\n",
" query=\"select * from my_sample_table\",\n",
" ),\n",
" path=\"azureml://datastores/workspacemanageddatastore\",\n",
")\n",
"ml_client.data.import_data(data_import=data_import)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Following are the examples of doing data lifecycle management aka altering the auto_delete_settings"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Get imported data asset details:\n",
"```python\n",
"\n",
"\n",
"# Get data asset details\n",
"name = \"my_sf_managedasset\"\n",
"version = \"1\"\n",
"my_data = ml_client.data.get(name=name, version=version)\n",
"\n",
"```"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Update auto delete settings - "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"```python\n",
"from azure.ai.ml.entities import Data\n",
"from azure.ai.ml.entities._assets.auto_delete_setting import AutoDeleteSetting\n",
"from azure.ai.ml.constants import AssetTypes\n",
"\n",
"# update auto delete setting\n",
"name = \"my_sf_managedasset\"\n",
"version = \"1\"\n",
"my_data = ml_client.data.get(name=name, version=version)\n",
"my_data.auto_delete_setting = AutoDeleteSetting(\n",
" condition=\"created_greater_than\", value=\"45d\"\n",
")\n",
"my_data = ml_client.data.create_or_update(my_data)\n",
"print(\"Update auto delete setting:\", my_data)\n",
"```"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Update auto delete settings - "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"```python\n",
"from azure.ai.ml.entities import Data\n",
"from azure.ai.ml.entities._assets.auto_delete_setting import AutoDeleteSetting\n",
"from azure.ai.ml.constants import AssetTypes\n",
"\n",
"# update auto delete setting\n",
"name = \"my_sf_managedasset\"\n",
"version = \"1\"\n",
"my_data = ml_client.data.get(name=name, version=version)\n",
"my_data.auto_delete_setting = AutoDeleteSetting(\n",
" condition=\"last_accessed_greater_than\", value=\"30d\"\n",
")\n",
"my_data = ml_client.data.create_or_update(my_data)\n",
"print(\"Update auto delete setting:\", my_data)\n",
"```"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Delete auto delete settings - "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"```python\n",
"from azure.ai.ml.entities import Data\n",
"from azure.ai.ml.entities._assets.auto_delete_setting import AutoDeleteSetting\n",
"from azure.ai.ml.constants import AssetTypes\n",
"\n",
"# remove auto delete setting\n",
"name = \"my_sf_managedasset\"\n",
"version = \"1\"\n",
"my_data = ml_client.data.get(name=name, version=version)\n",
"my_data.auto_delete_setting = None\n",
"my_data = ml_client.data.create_or_update(my_data)\n",
"print(\"Remove auto delete setting:\", my_data)\n",
"\n",
"```"
]
},
{
"attachments": {},
"cell_type": "markdown",
@@ -528,6 +909,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -557,6 +939,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [