Mirror of https://github.com/microsoft/hi-ml.git
Parent 190e8d4312, commit f3ea7173d7
# Datasets
## Key concepts

We'll first outline a few concepts that are helpful for understanding datasets.

### Blob Storage
Firstly, there is [Azure Blob Storage](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction).
Each blob storage account has multiple containers - you can think of containers as big disks that store files.
The `hi-ml` package assumes that your datasets live in one of those containers, and each top level folder corresponds
to one dataset.
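For illustration, here is a minimal sketch (not part of `hi-ml`) that lists those top-level folders with the `azure-storage-blob` package. The account URL and container name are placeholders; each prefix printed would correspond to one dataset:

```python
from azure.identity import DefaultAzureCredential
from azure.storage.blob import ContainerClient

# Placeholders: use your own storage account and container name.
container = ContainerClient(
    account_url="https://mystorageaccount.blob.core.windows.net",
    container_name="datasets",
    credential=DefaultAzureCredential(),
)

# Blob storage has no real directories; walking with a "/" delimiter groups
# blobs by their top-level prefix, i.e. one prefix per dataset folder.
for item in container.walk_blobs(delimiter="/"):
    print(item.name)  # e.g. "my_folder/", "mnist/"
```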
### AzureML Data Stores

Secondly, there are data stores. This is a concept coming from Azure Machine Learning, described
[here](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-access-data). Data stores provide access to
one blob storage account. They exist so that the credentials to access blob storage do not have to be passed around.

One of these data stores is designated as the default data store.

### AzureML Datasets
Thirdly, there are datasets. Again, this is a concept coming from Azure Machine Learning. A dataset is defined by
* A data store
* A set of files accessed through that data store

You can view all datasets in your AzureML workspace by clicking on one of the icons in the
navigation bar of the AzureML studio.
### Preparing data

To simplify usage, the `hi-ml` package creates AzureML datasets for you. All you need to do is to

* Create a blob storage account for your data, and within it, a container for your data.
* Create a data store that points to that storage account, and store the credentials for the blob storage account in it (see the sketch below).
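The data store can be created in the AzureML studio UI; as a rough sketch, it can also be registered with the AzureML SDK v1. All names and the account key below are placeholders:

```python
from azureml.core import Workspace
from azureml.core.datastore import Datastore

ws = Workspace.from_config()  # assumes a workspace config file is available locally

# Register a datastore that points at the blob container and stores its credentials.
Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="my_datastore",
    container_name="datasets",
    account_name="mystorageaccount",
    account_key="<storage-account-key>",
)
```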
In your script, just reference the name of the folder, and the package will create a dataset for you.

The simplest way of specifying that your script uses a folder of data from blob storage is as follows: Add the
`input_datasets` argument to your call of `submit_to_azure_if_needed` like this:
```python
from health_azure import submit_to_azure_if_needed
run_info = submit_to_azure_if_needed(...,
                                     input_datasets=["my_folder"],
                                     default_datastore="my_datastore",
)
input_folder = run_info.input_datasets[0]
```
What will happen under the hood?

* The toolbox will check if there is already an AzureML dataset called "my_folder". If so, it will use that. If there
is no dataset of that name, it will create one from all the files in blob storage in folder "my_folder". The dataset
will be created using the data store provided, "my_datastore".
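If you want to double-check what was registered, here is a small sketch using the AzureML SDK v1, assuming you have access to the workspace configuration:

```python
from azureml.core import Dataset, Workspace

ws = Workspace.from_config()
# Retrieves the latest version of the dataset that the submission created or reused.
dataset = Dataset.get_by_name(ws, name="my_folder")
print(dataset.name, dataset.version)
```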
Just as with input datasets, your run can also write its results to an output dataset (or even more than one).
Output datasets are helpful if you would like to run, for example, a script that transforms one dataset into another.

You can use that via the `output_datasets` argument:
```python
from health_azure import submit_to_azure_if_needed
run_info = submit_to_azure_if_needed(...,
                                     input_datasets=["my_folder"],
                                     output_datasets=["new_dataset"],
                                     default_datastore="my_datastore",
)
input_folder = run_info.input_datasets[0]
output_folder = run_info.output_datasets[0]
```
Your script can now read files from `input_folder`, transform them, and write them to `output_folder`. The latter
will be a folder on the temp file system of the machine. At the end of the script, the contents of that temp folder
will be uploaded to blob storage, and registered as a dataset.
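As an illustration, the body of such a transformation script could be as simple as the following sketch, which copies every `.csv` file from the input dataset into the output dataset (the file pattern is just an example, and `run_info` comes from the call above):

```python
import shutil
from pathlib import Path

# run_info is the result of submit_to_azure_if_needed from the previous example.
input_folder = Path(run_info.input_datasets[0])
output_folder = Path(run_info.output_datasets[0])

# "Transform" the input dataset into the output dataset; here we simply copy CSV files.
for csv_file in input_folder.rglob("*.csv"):
    target = output_folder / csv_file.relative_to(input_folder)
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(csv_file, target)
```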
### Mounting and downloading

An input dataset can be downloaded before the start of the actual script run, or it can be mounted. When mounted,
the files are accessed via the network once needed - this is very helpful for large datasets where downloads would
create a long waiting time before the job starts.
```python
from health_azure import DatasetConfig, submit_to_azure_if_needed
input_dataset = DatasetConfig(name="my_folder", datastore="my_datastore", use_mounting=True)
output_dataset = DatasetConfig(name="new_dataset", datastore="my_datastore", use_mounting=True)
run_info = submit_to_azure_if_needed(...,
                                     input_datasets=[input_dataset],
                                     output_datasets=[output_dataset],
)
input_folder = run_info.input_datasets[0]
output_folder = run_info.output_datasets[0]
```
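Conversely, if you prefer the files to be fully present on disk before the script starts, you can request a download explicitly by setting `use_mounting=False`; a minimal sketch:

```python
from health_azure import DatasetConfig, submit_to_azure_if_needed

# use_mounting=False asks for the dataset to be downloaded up front rather than mounted.
input_dataset = DatasetConfig(name="my_folder", datastore="my_datastore", use_mounting=False)
run_info = submit_to_azure_if_needed(...,
                                     input_datasets=[input_dataset],
)
input_folder = run_info.input_datasets[0]
```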
### Local execution

For debugging, it is essential to have the ability to run a script on a local machine, outside of AzureML.
Clearly, your script needs to be able to access data in those runs too.

There are two ways of achieving that: Firstly, you can specify an equivalent local folder in the
`DatasetConfig` objects:
```python
from pathlib import Path
from health_azure import DatasetConfig, submit_to_azure_if_needed
input_dataset = DatasetConfig(name="my_folder",
                              datastore="my_datastore",
                              local_folder=Path("/datasets/my_folder_local"),
)
run_info = submit_to_azure_if_needed(...,
                                     input_datasets=[input_dataset],
)
input_folder = run_info.input_datasets[0]
```
Secondly, if `local_folder` is not specified, then the dataset will either be downloaded or mounted to a temporary folder locally, depending on the `use_mounting` flag. The path to it will be available in `run_info` as above.

```python
input_folder = run_info.input_datasets[0]
```
Note that mounting the dataset locally is only supported on Linux because it requires FUSE support.

Occasionally, scripts expect the input dataset at a fixed location, for example, data is always read from `/tmp/mnist`.
AzureML has the capability to download/mount a dataset to such a fixed location. With the `hi-ml` package, you can
trigger that behaviour via an additional option in the `DatasetConfig` objects:
```python
from health_azure import DatasetConfig, submit_to_azure_if_needed
input_dataset = DatasetConfig(name="my_folder",
                              datastore="my_datastore",
                              use_mounting=True,
                              target_folder="/tmp/mnist",
)
run_info = submit_to_azure_if_needed(...,
                                     input_datasets=[input_dataset],
)
# input_folder will now be "/tmp/mnist"
input_folder = run_info.input_datasets[0]
```
This is also true when running locally - if `local_folder` is not specified and an AzureML workspace can be found, then the dataset will be downloaded or mounted to the `target_folder`.
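A consumer script can therefore hard-code the fixed location, independent of how the data was made available; a small sketch:

```python
from pathlib import Path

# The dataset was requested with target_folder="/tmp/mnist", so the files are available here.
mnist_root = Path("/tmp/mnist")
for file in sorted(mnist_root.iterdir()):
    print(file.name)
```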
### Overwriting existing output datasets

When creating an output dataset with the same name as an existing dataset, the default behaviour of `hi-ml` is to overwrite the existing dataset. This is because, if a run fails during the upload stage, corrupt files may be created; allowing overwriting means that these corrupt datasets will not cause errors. If you wish to disable this behaviour, it can be controlled via the `overwrite_existing` parameter (only available with the AzureML SDK v1, hence the `strictly_aml_v1=True` setting):
```python
from health_azure import DatasetConfig, submit_to_azure_if_needed
output_dataset = DatasetConfig(name="my_folder",
                               datastore="my_datastore",
                               overwrite_existing=False,
)

# fails if output dataset already exists:
run_info = submit_to_azure_if_needed(...,
                                     output_datasets=[output_dataset],
                                     strictly_aml_v1=True,
)
```
### Dataset versions

AzureML datasets can have versions, starting at 1. You can view the different versions of a dataset in the AzureML
workspace. The `hi-ml` toolbox always uses the latest version of a dataset unless specified otherwise.
If you do need a specific version, use the `version` argument in the `DatasetConfig` objects:
```python
from health_azure import DatasetConfig, submit_to_azure_if_needed
input_dataset = DatasetConfig(name="my_folder",
                              datastore="my_datastore",
                              version=7,
)
run_info = submit_to_azure_if_needed(...,
                                     input_datasets=[input_dataset],
)
input_folder = run_info.input_datasets[0]
```
For reference, the `overwrite_existing` behaviour described above is implemented in the `DatasetConfig` class. The commit touches the constructor and the method that creates the AzureML output dataset; the relevant excerpts, with unrelated code elided, are:

```python
class DatasetConfig:
    def __init__(
        self,
        name: str,
        datastore: str = "",
        overwrite_existing: bool = True,
        version: Optional[int] = None,
        use_mounting: Optional[bool] = None,
        target_folder: Optional[PathOrString] = None,
        # ... further arguments elided ...
    ):
        """
        :param name: ...
            this will be the name given to the newly created dataset.
        :param datastore: The name of the AzureML datastore that holds the dataset. This can be empty if the AzureML
            workspace has only a single datastore, or if the default datastore should be used.
        :param overwrite_existing: Only applies to uploading datasets. If True, the dataset will be overwritten if it
            already exists. If False, the dataset creation will fail if the dataset already exists.
        :param version: The version of the dataset that should be used. This is only used for input datasets.
            If the version is not specified, the latest version will be used.
        :param use_mounting: If True, the dataset will be "mounted", that is, individual files will be read
            ...
        """
        if not name:  # guard reconstructed; the original check sits outside this excerpt
            raise ValueError("The name of the dataset must be a non-empty string.")
        self.name = name
        self.datastore = datastore
        self.overwrite_existing = overwrite_existing
        self.version = version
        self.use_mounting = use_mounting
        # If target_folder is "" then convert to None
        # ...
```

and, in the method that turns a `DatasetConfig` into an AzureML output dataset:

```python
        # Docstring and tail of the DatasetConfig method that creates the AzureML output dataset:
        """
        :param workspace: The AzureML workspace to read from.
        :param dataset_index: Suffix for using datasets as named inputs, the dataset will be marked OUTPUT_{index}
        :return: An AzureML OutputFileDatasetConfig object, representing the output dataset.
        """
        status = f"Output dataset {self.name} (index {dataset_index}) will be "
        datastore = get_datastore(workspace, self.datastore)
        # ... creation of the AzureML dataset object elided ...
        if self.use_mounting:
            # ...
            result = dataset.as_mount()
        else:
            status += "uploaded when the job completes."
            result = dataset.as_upload(overwrite=self.overwrite_existing)
        logging.info(status)
        return result
```