ENH: Expose v1 dataset overwrite (#853)

Closes #839
Peter Hessey 2023-03-31 10:35:23 +01:00 committed by GitHub
Parent 190e8d4312
Commit f3ea7173d7
No known key found for this signature
GPG key ID: 4AEE18F83AFDEB23
2 changed files with 61 additions and 12 deletions


@@ -1,16 +1,18 @@
# Datasets
## Key concepts
We'll first outline a few concepts that are helpful for understanding datasets.
### Blob Storage
Firstly, there is [Azure Blob Storage](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction).
Each blob storage account has multiple containers - you can think of containers as big disks that store files.
The `hi-ml` package assumes that your datasets live in one of those containers, and each top level folder corresponds
to one dataset.
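For example (the account, container, and folder names below are purely illustrative), a storage layout could look like this, with each top level folder of the container becoming one dataset:
```text
mystorageaccount
└── datasets             (container)
    ├── my_folder        (dataset "my_folder")
    │   ├── image_0.png
    │   └── image_1.png
    └── my_other_folder  (dataset "my_other_folder")
        └── labels.csv
```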
### AzureML Data Stores
Secondly, there are data stores. This is a concept coming from Azure Machine Learning, described
[here](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-access-data). Data stores provide access to
one blob storage account. They exist so that the credentials to access blob storage do not have to be passed around
@@ -24,6 +26,7 @@ One of these data stores is designated as the default data store.
### AzureML Datasets
Thirdly, there are datasets. Again, this is a concept coming from Azure Machine Learning. A dataset is defined by
* A data store
* A set of files accessed through that data store
@@ -31,7 +34,9 @@ You can view all datasets in your AzureML workspace by clicking on one of the ic
navigation bar of the AzureML studio.
### Preparing data
To simplify usage, the `hi-ml` package creates AzureML datasets for you. All you need to do is to
* Create a blob storage account for your data, and within it, a container for your data.
* Create a data store that points to that storage account, and store the credentials for the blob storage account in it
@@ -42,14 +47,18 @@ just reference the name of the folder, and the package will create a dataset for
The simplest way of specifying that your script uses a folder of data from blob storage is as follows: Add the
`input_datasets` argument to your call of `submit_to_azure_if_needed` like this:
```python
from health_azure import submit_to_azure_if_needed
run_info = submit_to_azure_if_needed(...,
                                     input_datasets=["my_folder"],
                                     default_datastore="my_datastore",
)
input_folder = run_info.input_datasets[0]
```
What will happen under the hood?
* The toolbox will check if there is already an AzureML dataset called "my_folder". If so, it will use that. If there
is no dataset of that name, it will create one from all the files in blob storage in folder "my_folder". The dataset
will be created using the data store provided, "my_datastore".
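For illustration, the check-or-create behaviour described above corresponds roughly to the following AzureML SDK v1 calls. This is only a sketch, not the toolbox's actual implementation; it assumes a workspace config file is available and re-uses the example names "my_folder" and "my_datastore":
```python
from azureml.core import Dataset, Workspace

workspace = Workspace.from_config()  # assumes a workspace config file on disk

try:
    # Re-use the dataset if one is already registered under this name.
    dataset = Dataset.get_by_name(workspace, name="my_folder")
except Exception:
    # Otherwise, build a FileDataset from the folder in the datastore and register it.
    datastore = workspace.datastores["my_datastore"]
    dataset = Dataset.File.from_files(path=(datastore, "my_folder"))
    dataset = dataset.register(workspace, name="my_folder")
```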
@@ -66,20 +75,24 @@ dataset (or even more than one).
Output datasets are helpful if you would like to run, for example, a script that transforms one dataset into another.
You can do that via the `output_datasets` argument:
```python
from health_azure import submit_to_azure_if_needed
run_info = submit_to_azure_if_needed(...,
                                     input_datasets=["my_folder"],
                                     output_datasets=["new_dataset"],
                                     default_datastore="my_datastore",
)
input_folder = run_info.input_datasets[0]
output_folder = run_info.output_datasets[0]
```
Your script can now read files from `input_folder`, transform them, and write them to `output_folder`. The latter
will be a folder on the temp file system of the machine. At the end of the script, the contents of that temp folder
will be uploaded to blob storage, and registered as a dataset.
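As a minimal, purely illustrative example (assuming the input dataset contains text files), such a transformation script could look like this:
```python
from pathlib import Path

# input_folder and output_folder come from run_info as shown above.
# Copy every .txt file from the input dataset to the output dataset, in upper case.
for source_file in Path(input_folder).glob("*.txt"):
    target_file = Path(output_folder) / source_file.name
    target_file.write_text(source_file.read_text().upper())
```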
### Mounting and downloading
An input dataset can be downloaded before the start of the actual script run, or it can be mounted. When mounted,
the files are accessed over the network as they are needed - this is very helpful for large datasets, where a full
download would create a long wait before the job starts.
@@ -98,29 +111,35 @@ input_dataset = DatasetConfig(name="my_folder", datastore="my_datastore", use_mo
output_dataset = DatasetConfig(name="new_dataset", datastore="my_datastore", use_mounting=True)
run_info = submit_to_azure_if_needed(...,
                                     input_datasets=[input_dataset],
                                     output_datasets=[output_dataset],
)
input_folder = run_info.input_datasets[0]
output_folder = run_info.output_datasets[0]
```
### Local execution
For debugging, it is essential to have the ability to run a script on a local machine, outside of AzureML.
Clearly, your script needs to be able to access data in those runs too.
There are two ways of achieving that: Firstly, you can specify an equivalent local folder in the
`DatasetConfig` objects:
```python
from pathlib import Path
from health_azure import DatasetConfig, submit_to_azure_if_needed
input_dataset = DatasetConfig(name="my_folder",
                              datastore="my_datastore",
                              local_folder=Path("/datasets/my_folder_local"),
)
run_info = submit_to_azure_if_needed(...,
                                     input_datasets=[input_dataset],
)
input_folder = run_info.input_datasets[0]
```
Secondly, if `local_folder` is not specified, then the dataset will either be downloaded or mounted to a temporary folder locally, depending on the `use_mounting` flag. The path to it will be available in `run_info` as above.
```python
input_folder = run_info.input_datasets[0]
```
@@ -132,30 +151,56 @@ Note that mounting the dataset locally is only supported on Linux because it req
Occasionally, scripts expect the input dataset at a fixed location, for example, data is always read from `/tmp/mnist`.
AzureML has the capability to download/mount a dataset to such a fixed location. With the `hi-ml` package, you can
trigger that behaviour via an additional option in the `DatasetConfig` objects:
```python
from health_azure import DatasetConfig, submit_to_azure_if_needed
input_dataset = DatasetConfig(name="my_folder",
                              datastore="my_datastore",
                              use_mounting=True,
                              target_folder="/tmp/mnist",
)
run_info = submit_to_azure_if_needed(...,
                                     input_datasets=[input_dataset],
)
# Input_folder will now be "/tmp/mnist"
input_folder = run_info.input_datasets[0]
```
This is also true when running locally - if `local_folder` is not specified and an AzureML workspace can be found, then the dataset will be downloaded or mounted to the `target_folder`.
### Overwriting existing output datasets
When creating an output dataset with the same name as an existing dataset, the default behaviour of `hi-ml` is to overwrite the existing dataset. This is because, if a run fails during the upload stage, corrupt files may be created; allowing overwriting means that such corrupt datasets will not cause errors. If you wish to disable this behaviour, it can be controlled via the `overwrite_existing` parameter (only available in SDK v1, hence setting `strictly_aml_v1=True`):
```python
from health_azure import DatasetConfig, submit_to_azure_if_needed
output_dataset = DatasetConfig(name="my_folder",
                               datastore="my_datastore",
                               overwrite_existing=False,
)
# fails if output dataset already exists:
run_info = submit_to_azure_if_needed(...,
                                     output_datasets=[output_dataset],
                                     strictly_aml_v1=True,
)
```
### Dataset versions
AzureML datasets can have versions, starting at 1. You can view the different versions of a dataset in the AzureML
workspace. The `hi-ml` toolbox always uses the latest version of a dataset unless a specific version is requested.
If you do need a specific version, use the `version` argument in the `DatasetConfig` objects:
```python
from health_azure import DatasetConfig, submit_to_azure_if_needed
input_dataset = DatasetConfig(name="my_folder",
                              datastore="my_datastore",
                              version=7,
)
run_info = submit_to_azure_if_needed(...,
                                     input_datasets=[input_dataset],
)
input_folder = run_info.input_datasets[0]
```


@@ -301,6 +301,7 @@ class DatasetConfig:
self,
name: str,
datastore: str = "",
overwrite_existing: bool = True,
version: Optional[int] = None,
use_mounting: Optional[bool] = None,
target_folder: Optional[PathOrString] = None,
@@ -311,6 +312,8 @@
this will be the name given to the newly created dataset.
:param datastore: The name of the AzureML datastore that holds the dataset. This can be empty if the AzureML
workspace has only a single datastore, or if the default datastore should be used.
:param overwrite_existing: Only applies to uploading datasets. If True, the dataset will be overwritten if it
already exists. If False, the dataset creation will fail if the dataset already exists.
:param version: The version of the dataset that should be used. This is only used for input datasets.
If the version is not specified, the latest version will be used.
:param use_mounting: If True, the dataset will be "mounted", that is, individual files will be read
@@ -331,6 +334,7 @@
raise ValueError("The name of the dataset must be a non-empty string.")
self.name = name
self.datastore = datastore
self.overwrite_existing = overwrite_existing
self.version = version
self.use_mounting = use_mounting
# If target_folder is "" then convert to None
@@ -447,7 +451,7 @@ class DatasetConfig:
:param workspace: The AzureML workspace to read from.
:param dataset_index: Suffix for using datasets as named inputs, the dataset will be marked OUTPUT_{index}
:return: An AzureML OutputFileDatasetConfig object, representing the output dataset.
"""
status = f"Output dataset {self.name} (index {dataset_index}) will be "
datastore = get_datastore(workspace, self.datastore)
@@ -464,7 +468,7 @@
result = dataset.as_mount()
else:
status += "uploaded when the job completes."
result = dataset.as_upload(overwrite=self.overwrite_existing)
logging.info(status)
return result