ENH: Expose v1 dataset overwrite (#853)

Closes #839
Peter Hessey 2023-03-31 10:35:23 +01:00 committed by GitHub
Parent 190e8d4312
Commit f3ea7173d7
No known key found for this signature
GPG key ID: 4AEE18F83AFDEB23
2 changed files with 61 additions and 12 deletions


@@ -1,16 +1,18 @@
# Datasets
## Key concepts
We'll first outline a few concepts that are helpful for understanding datasets.
### Blob Storage
Firstly, there is [Azure Blob Storage](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction).
Each blob storage account has multiple containers - you can think of containers as big disks that store files.
The `hi-ml` package assumes that your datasets live in one of those containers, and each top level folder corresponds
to one dataset.
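For example (the account, container, and folder names below are purely illustrative), a storage layout could look like this, with each top level folder of the container becoming one dataset:
```text
mystorageaccount
└── datasets             (container)
    ├── my_folder        (dataset "my_folder")
    │   ├── image_0.png
    │   └── image_1.png
    └── my_other_folder  (dataset "my_other_folder")
        └── labels.csv
```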
### AzureML Data Stores
Secondly, there are data stores. This is a concept coming from Azure Machine Learning, described
[here](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-access-data). Data stores provide access to
one blob storage account. They exist so that the credentials to access blob storage do not have to be passed around
@@ -24,6 +26,7 @@ One of these data stores is designated as the default data store.
### AzureML Datasets
Thirdly, there are datasets. Again, this is a concept coming from Azure Machine Learning. A dataset is defined by
* A data store
* A set of files accessed through that data store
@@ -31,7 +34,9 @@ You can view all datasets in your AzureML workspace by clicking on one of the ic
navigation bar of the AzureML studio.
### Preparing data
To simplify usage, the `hi-ml` package creates AzureML datasets for you. All you need to do is to
* Create a blob storage account for your data, and within it, a container for your data.
* Create a data store that points to that storage account, and store the credentials for the blob storage account in it
@@ -42,14 +47,18 @@ just reference the name of the folder, and the package will create a dataset for
The simplest way of specifying that your script uses a folder of data from blob storage is as follows: Add the
`input_datasets` argument to your call of `submit_to_azure_if_needed` like this:
```python
from health_azure import submit_to_azure_if_needed
run_info = submit_to_azure_if_needed(...,
                                     input_datasets=["my_folder"],
                                     default_datastore="my_datastore",
)
input_folder = run_info.input_datasets[0]
```
What will happen under the hood?
* The toolbox will check if there is already an AzureML dataset called "my_folder". If so, it will use that. If there
is no dataset of that name, it will create one from all the files in blob storage in folder "my_folder". The dataset
will be created using the data store provided, "my_datastore".
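For illustration, the check-or-create behaviour described above corresponds roughly to the following AzureML SDK v1 calls. This is only a sketch, not the toolbox's actual implementation; it assumes a workspace config file is available and re-uses the example names "my_folder" and "my_datastore":
```python
from azureml.core import Dataset, Workspace

workspace = Workspace.from_config()  # assumes a workspace config file on disk

try:
    # Re-use the dataset if one is already registered under this name.
    dataset = Dataset.get_by_name(workspace, name="my_folder")
except Exception:
    # Otherwise, build a FileDataset from the folder in the datastore and register it.
    datastore = workspace.datastores["my_datastore"]
    dataset = Dataset.File.from_files(path=(datastore, "my_folder"))
    dataset = dataset.register(workspace, name="my_folder")
```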
@@ -66,20 +75,24 @@ dataset (or even more than one).
Output datasets are helpful if you would like to run, for example, a script that transforms one dataset into another.
You can do that via the `output_datasets` argument:
```python
from health_azure import submit_to_azure_if_needed
run_info = submit_to_azure_if_needed(...,
                                     input_datasets=["my_folder"],
                                     output_datasets=["new_dataset"],
                                     default_datastore="my_datastore",
)
input_folder = run_info.input_datasets[0]
output_folder = run_info.output_datasets[0]
```
Your script can now read files from `input_folder`, transform them, and write them to `output_folder`. The latter
will be a folder on the temp file system of the machine. At the end of the script, the contents of that temp folder
will be uploaded to blob storage, and registered as a dataset.
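As a minimal, purely illustrative example (assuming the input dataset contains text files), such a transformation script could look like this:
```python
from pathlib import Path

# input_folder and output_folder come from run_info as shown above.
# Copy every .txt file from the input dataset to the output dataset, in upper case.
for source_file in Path(input_folder).glob("*.txt"):
    target_file = Path(output_folder) / source_file.name
    target_file.write_text(source_file.read_text().upper())
```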
### Mounting and downloading
An input dataset can be downloaded before the start of the actual script run, or it can be mounted. When mounted,
the files are accessed over the network as they are needed - this is very helpful for large datasets, where a full
download would create a long wait before the job starts.
@@ -98,29 +111,35 @@ input_dataset = DatasetConfig(name="my_folder", datastore="my_datastore", use_mo
output_dataset = DatasetConfig(name="new_dataset", datastore="my_datastore", use_mounting=True)
run_info = submit_to_azure_if_needed(...,
                                     input_datasets=[input_dataset],
                                     output_datasets=[output_dataset],
)
input_folder = run_info.input_datasets[0]
output_folder = run_info.output_datasets[0]
```
### Local execution
For debugging, it is essential to have the ability to run a script on a local machine, outside of AzureML.
Clearly, your script needs to be able to access data in those runs too.
There are two ways of achieving that: Firstly, you can specify an equivalent local folder in the
`DatasetConfig` objects:
```python
from pathlib import Path
from health_azure import DatasetConfig, submit_to_azure_if_needed
input_dataset = DatasetConfig(name="my_folder",
                              datastore="my_datastore",
                              local_folder=Path("/datasets/my_folder_local"),
)
run_info = submit_to_azure_if_needed(...,
                                     input_datasets=[input_dataset],
)
input_folder = run_info.input_datasets[0]
```
Secondly, if `local_folder` is not specified, then the dataset will either be downloaded or mounted to a temporary folder locally, depending on the `use_mounting` flag. The path to it will be available in `run_info` as above.
```python
input_folder = run_info.input_datasets[0]
```
@@ -132,30 +151,56 @@ Note that mounting the dataset locally is only supported on Linux because it req
Occasionally, scripts expect the input dataset at a fixed location, for example, data is always read from `/tmp/mnist`.
AzureML has the capability to download/mount a dataset to such a fixed location. With the `hi-ml` package, you can
trigger that behaviour via an additional option in the `DatasetConfig` objects:
```python
from health_azure import DatasetConfig, submit_to_azure_if_needed
input_dataset = DatasetConfig(name="my_folder",
                              datastore="my_datastore",
                              use_mounting=True,
                              target_folder="/tmp/mnist",
)
run_info = submit_to_azure_if_needed(...,
                                     input_datasets=[input_dataset],
)
# Input_folder will now be "/tmp/mnist"
input_folder = run_info.input_datasets[0]
```
This is also true when running locally - if `local_folder` is not specified and an AzureML workspace can be found, then the dataset will be downloaded or mounted to the `target_folder`.
### Overwriting existing output datasets
When creating an output dataset with the same name as an existing dataset, the default behaviour of `hi-ml` is to overwrite the existing dataset. This is because, if a run fails during the upload stage, corrupt files may be created; allowing overwriting means that such corrupt datasets will not cause errors. If you wish to disable this behaviour, it can be controlled via the `overwrite_existing` parameter (only available in SDK v1, hence setting `strictly_aml_v1=True`):
```python
from health_azure import DatasetConfig, submit_to_azure_if_needed
output_dataset = DatasetConfig(name="my_folder",
                               datastore="my_datastore",
                               overwrite_existing=False,
)
# fails if output dataset already exists:
run_info = submit_to_azure_if_needed(...,
                                     output_datasets=[output_dataset],
                                     strictly_aml_v1=True,
)
```
### Dataset versions
AzureML datasets can have versions, starting at 1. You can view the different versions of a dataset in the AzureML
workspace. The `hi-ml` toolbox always uses the latest version of a dataset unless a specific version is requested.
If you do need a specific version, use the `version` argument in the `DatasetConfig` objects:
```python
from health_azure import DatasetConfig, submit_to_azure_if_needed
input_dataset = DatasetConfig(name="my_folder",
                              datastore="my_datastore",
                              version=7,
)
run_info = submit_to_azure_if_needed(...,
                                     input_datasets=[input_dataset],
)
input_folder = run_info.input_datasets[0]
```


@@ -301,6 +301,7 @@ class DatasetConfig:
self,
name: str,
datastore: str = "",
overwrite_existing: bool = True,
version: Optional[int] = None,
use_mounting: Optional[bool] = None,
target_folder: Optional[PathOrString] = None,
@@ -311,6 +312,8 @@
this will be the name given to the newly created dataset.
:param datastore: The name of the AzureML datastore that holds the dataset. This can be empty if the AzureML
workspace has only a single datastore, or if the default datastore should be used.
:param overwrite_existing: Only applies to uploading datasets. If True, the dataset will be overwritten if it
already exists. If False, the dataset creation will fail if the dataset already exists.
:param version: The version of the dataset that should be used. This is only used for input datasets.
If the version is not specified, the latest version will be used.
:param use_mounting: If True, the dataset will be "mounted", that is, individual files will be read
@@ -331,6 +334,7 @@
raise ValueError("The name of the dataset must be a non-empty string.")
self.name = name
self.datastore = datastore
self.overwrite_existing = overwrite_existing
self.version = version
self.use_mounting = use_mounting
# If target_folder is "" then convert to None
@@ -447,7 +451,7 @@ class DatasetConfig:
:param workspace: The AzureML workspace to read from.
:param dataset_index: Suffix for using datasets as named inputs, the dataset will be marked OUTPUT_{index}
:return: An AzureML OutputFileDatasetConfig object, representing the output dataset.
"""
status = f"Output dataset {self.name} (index {dataset_index}) will be "
datastore = get_datastore(workspace, self.datastore)
@@ -464,7 +468,7 @@
result = dataset.as_mount()
else:
status += "uploaded when the job completes."
result = dataset.as_upload(overwrite=self.overwrite_existing)
logging.info(status)
return result