# Data Movement and Batch Shipyard
The focus of this article is to explain how to move data between on premises, Azure Storage, and Batch compute nodes in the context of Batch Shipyard. Please refer to the [installation doc](01-batch-shipyard-installation.md) for information regarding the software required to take advantage of the data movement features of Batch Shipyard.

## Overview
Most HPC applications and simulations require some form of input data. For example, a simulation may require various input parameters and definitions of physical systems in order to produce meaningful output. Such input data may not be scriptable and could be binary data files. In another scenario, deep learning frameworks may need millions of images and labels to perform training. Typical on premises cluster systems have a shared file system that the nodes of the cluster can access to retrieve programs and data.

In the cloud, the *data movement* problem is a universal issue, especially for the many workloads that are amenable to batch-style processing. How do I get my data, which exists somewhere, perhaps on premises, into the cloud such that my virtual machines have access to it? This problem is even more acute for services which deliver resources on demand, such as Azure Batch. How do I get my data to nodes which may be ephemeral? And on the flip side, how do I get my data off the nodes after completion of a task?

Before diving into ingress and egress support specifics, let's examine a high-level overview of the data movement support provided by Batch Shipyard.

```
                            +--------------------------------------------------------+
   +--(I)------------------>|                                                        |
   |  +--(E)----------------|              Azure Storage (Blob and File)             |
   |  |                     +---+--+------------------------------+--+---------------+
   |  |                     (I) |  ^ (E)                      (I) |  ^ (E)
   |  |                         v  |                              v  |
   |  |                   +-----+--+--------------------+   +-----+--+--------------------+
   |  |                   |    Azure Batch Pool "A"     |   |    Azure Batch Pool "B"     |
   |  v                   | Compute Node 0 [GlusterFS]  |(I)| Compute Node 0              |
+--+--+---------+         | Compute Node 1 [GlusterFS]  +-->| Compute Node 1              |
| Local Machine |<--(E)---+ Compute Node 2 [GlusterFS]  |   |                             |
|               |--(I)--->| Compute Node 3 [GlusterFS]  |   +-----------------------------+
+---------------+         +---+-------------------------+
                              ^ (I)
        +---------------------+-+
        | Shared File System(s) |
        +-----------------------+
```

Arrows marked `(I)` are ingress actions with respect to the destination, and arrows marked `(E)` are egress actions with respect to the source. For the purposes of this document, the left-hand side of the diagram above, containing the `Local Machine` and `Shared File System(s)`, is considered on premises. "On premises" is where you would be invoking `shipyard.py` actions; the machine that invokes `shipyard.py` has access to your local machine file system and can also access local shared file systems.

From on premises, you can ingress your data to both Azure Storage and Azure Batch pools directly. You can also egress data from Azure Batch pools and Azure Storage (via [blobxfer](https://github.com/Azure/blobxfer) or [AzCopy](https://azure.microsoft.com/en-us/documentation/articles/storage-use-azcopy/)) back to your local machine. Within Azure, you can ingress data from Azure Storage to Azure Batch pools, as well as egress data back out to Azure Storage when tasks complete. You may also ingress data generated by other Azure Batch tasks, which may or may not be in the same pool, as input for subsequent Azure Batch tasks. The rest of this document explains each ingress and egress data movement operation in detail.

## Data Ingress
Batch Shipyard provides multiple paths for ingressing data to ultimately be used by your jobs and tasks. Data ingress can be invoked automatically through certain configuration options or explicitly through the `data ingress` command of `shipyard.py`.

### From On Premises
You'll need to decide if all or part of your data is long-lived, i.e., it will be required for many jobs and tasks in the future, or short-lived, i.e., it will only be used for one job and task. Even if your data is long-lived, is it small enough such that data ingress is not a large burden for each job and task that references it?

If your answer to these questions is that the data is either short-lived or small enough such that data ingress is not a large burden, then you may opt for Batch Shipyard's direct to GlusterFS or compute node data ingress capability. With this feature, files accessible on premises are ingressed directly to compute node(s), bypassing the hops to and from Azure Storage. You can define these files under the global configuration file property `global_resources`:`files`. The `destination` member should include a property named `shared_data_volume` which references your GlusterFS volume. Alternatively, if your pool only contains one compute node, you should not define a GlusterFS volume and instead only define a `relative_destination_path` property, which will ingress data directly to that path on the compute node. Any files in the `source` path (matching the optional `include` and `exclude` filters) will then be ingressed into the compute nodes using the `data ingress` command or by specifying `transfer_files_on_pool_creation` as `true` in the pool configuration file. There are many configuration options under the `data_transfer` member which may help optimize for your particular scenario. The following transfer methods from on premises are available:

* `scp`: secure copy to a single node in the pool
* `multinode_scp`: secure copy to multiple nodes simultaneously in the pool
* `rsync+ssh`: rsync over ssh to a single node in the pool
* `multinode_rsync+ssh`: rsync over ssh to multiple nodes simultaneously in the pool

In the case where your data is long-lived or is too large to be repeatedly transferred for each job and task that requires it, you may be better off ingressing the data to Azure Storage first. By doing so, you pay for the "long hop" once and can then leverage the potentially lower-latency and higher-bandwidth intra-datacenter transfers for subsequent jobs and tasks that require the data. This is a two-step process with Batch Shipyard. First, define the files required under the `global_resources`:`files` property, where the `destination` member includes a property named `storage_account_settings` whose value is a link to a storage account from your credentials config file. Within the `data_transfer` property, you will need to specify either `container` or `file_share` depending upon whether you want to send your files to Azure Blob Storage or Azure File Storage, respectively. Please note that there can be a significant difference in performance between the two; please visit [this page](https://azure.microsoft.com/en-us/documentation/articles/storage-scalability-targets/) for more information. The second step is outlined in the next section.

Note that `files` is an array; therefore, Batch Shipyard accepts any number of source/destination pairings and even mixed GlusterFS and Azure Storage ingress objects.
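To illustrate, below is a minimal sketch of a `global_resources`:`files` excerpt from the global configuration file with one entry destined directly for a GlusterFS shared data volume and another destined for Azure Storage. All paths and names here (`glustervol`, `mystorageaccount`, `trainingdata`) are placeholders, and the exact set of supported `data_transfer` properties may vary by Batch Shipyard version:

```json
{
  "global_resources": {
    "files": [
      {
        "source": {
          "path": "/path/to/simulation/inputs",
          "include": ["*.dat"],
          "exclude": ["*.tmp"]
        },
        "destination": {
          "shared_data_volume": "glustervol",
          "data_transfer": {
            "method": "multinode_scp"
          }
        }
      },
      {
        "source": {
          "path": "/path/to/training/images"
        },
        "destination": {
          "storage_account_settings": "mystorageaccount",
          "data_transfer": {
            "container": "trainingdata"
          }
        }
      }
    ]
  }
}
```

With a configuration along these lines, the `data ingress` command would copy the first entry straight to the pool's GlusterFS volume using `multinode_scp` and upload the second to the `trainingdata` container of the storage account linked as `mystorageaccount` in the credentials configuration.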
### From Azure Storage (Blob and File)
Data from Azure Storage can be ingressed to compute nodes in many different ways with Batch Shipyard. The recommended method is to take advantage of the `input_data` property available on configuration objects at the pool, job, and task levels.

For pool-level ingress, the `input_data` property is specified under `pool_specification` in the pool configuration file. To transfer from Azure Storage, you would specify the `azure_storage` property under `input_data`; multiple Azure Storage sources can be specified within `azure_storage`. At the pool level, `container` or `file_share` data is ingressed to all compute nodes at the specified destination location as part of pool creation if the `transfer_files_on_pool_creation` property is `true`. Note that the pool must be ready in order for the `data ingress` command to work. Additionally, although you can combine on premises ingress to Azure Storage with subsequent ingress to compute nodes, if there is a possibility of overlap, it is recommended to separate these two processes: ingress data from `files` with the `data ingress` command first; after that action completes, create the pool with `transfer_files_on_pool_creation` set to `false` (so the data that was ingressed with `data ingress` is not ingressed again) and specify `input_data` with the appropriate properties referencing the data that was just ingressed.

`input_data` for each job in `job_specifications` will ingress data to any compute node running the specified job once and only once. Any `input_data` defined for the job will be downloaded for this job, which can be run on any number of compute nodes depending upon the number of constituent tasks and repeat invocations. However, `input_data` is only downloaded once per job invocation on a compute node. For example, if `job-1`:`task-1` is run on compute node A and then `job-1`:`task-2` is run on compute node B, then this `input_data` is ingressed to both compute node A and compute node B. However, if `job-1`:`task-3` is then run on compute node A after `job-1`:`task-1`, then the `input_data` is not transferred again.

`input_data` for each task in the task array of each job will ingress data to the compute node running the specified task. Note that for task-level `input_data`, the `destination` property is optional; if not specified, data will be ingressed to `$AZ_BATCH_TASK_WORKING_DIR` by default. For multi-instance tasks, the download only applies to the compute node running the application command. Data is not ingressed with the coordination command, which is run on all nodes.

There is one additional method of ingressing data from Azure Storage: Azure Batch resource files. In the jobs config file, you can specify resource files through the `resource_files` property within each task. However, this can be quite restrictive: (1) HTTP/HTTPS endpoints only, (2) resource files must be manually defined one at a time, and (3) there is a limit on the number of resource files that can be defined for each task. It is recommended to use `input_data` if you have many files that you need to download from a container or file share.

### From Azure Batch Tasks
Data from previously run Azure Batch tasks, including those run by Batch Shipyard, can be ingressed to compute nodes with the `input_data` property at the pool, job, and task levels. Note that the compute node where the task was run must still be active and must not have been removed or deleted.

To transfer from a previously completed Azure Batch task, you would specify the `azure_batch` property under `input_data`. As with Azure Storage in the previous section, you can specify multiple entries within the `azure_batch` array to reference multiple different tasks. `job_id` and `task_id` identify the job and the task whose generated files should be ingressed. The `include` and `exclude` filters are optional but can be specified to reduce the scope of the transfer. The `destination` property is required unless this is a task-level `input_data` property; if omitted at the task level, the default is `$AZ_BATCH_TASK_WORKING_DIR`, similar to the default for Azure Storage ingress with `input_data` above. The behavior at the pool, job, and task levels is nearly identical to Azure Storage ingress with `input_data` above, except that instead of transferring from Azure Storage, data is transferred from another compute node that has run the specified task. Note that `azure_batch` and `azure_storage` may be specified within the same `input_data` properties if required.
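To make these semantics concrete, here is an abbreviated, hypothetical task entry from `job_specifications` in the jobs configuration file that pulls both the output of a previously completed task and reference data from Azure Storage into the task's working directory. The job and task ids, image, command, storage account link (`mystorageaccount`), and container name are all placeholders, and unrelated required task properties are omitted:

```json
{
  "job_specifications": [
    {
      "id": "postprocess",
      "tasks": [
        {
          "image": "myrepo/myapp:latest",
          "command": "/opt/app/analyze.sh",
          "input_data": {
            "azure_batch": [
              {
                "job_id": "simulation",
                "task_id": "task-00000",
                "include": ["wd/results/*.bin"]
              }
            ],
            "azure_storage": [
              {
                "storage_account_settings": "mystorageaccount",
                "container": "referencedata",
                "destination": "$AZ_BATCH_TASK_WORKING_DIR/ref"
              }
            ]
          }
        }
      ]
    }
  ]
}
```

Since this is task-level `input_data`, the `azure_batch` entry omits `destination` and the previous task's files would default to `$AZ_BATCH_TASK_WORKING_DIR`, while the `azure_storage` entry overrides its destination explicitly.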
## Data Egress
Batch Shipyard provides the ability to egress data from compute nodes to various locations.

### To On Premises
`shipyard.py` provides actions to help egress data off a compute node back to on premises. These actions are:

1. `data getfile`: get a single file or all files from a compute node using a job id, task id, and a file name.
2. `data getfilenode`: get a single file or all files from a compute node using a node id and file path.
3. `data stream`: stream a text file (decoded as UTF-8) back to the local console or stream a file back to local disk. This is particularly useful for progress monitoring via a file or tailing output.

### To Azure Storage
If you need to egress data from a compute node and persist it to Azure Storage, Batch Shipyard provides the `output_data` property on tasks of a job in `job_specifications`. `source` defines which directory within the task directory to egress data from; if nothing is specified for `source`, then the default is `$AZ_BATCH_TASK_DIR` (which contains files such as `stdout.txt` and `stderr.txt`). Note that you can specify a source originating from the shared directory here as well, e.g., `$AZ_BATCH_NODE_SHARED_DIR`; there is no restriction that limits you to just the task directory. `include` defines an optional include filter to be applied across the `source` files. Finally, define the `container` or `file_share` property to egress to Azure Blob or File Storage, respectively.
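For instance, a hypothetical task could persist only its result files to blob storage with an `output_data` property such as the following sketch; again, `mystorageaccount` and the container name are placeholders, and the exact property set may vary by Batch Shipyard version:

```json
{
  "output_data": {
    "azure_storage": [
      {
        "storage_account_settings": "mystorageaccount",
        "container": "simoutput",
        "source": "$AZ_BATCH_TASK_DIR/wd/results",
        "include": ["*.dat"]
      }
    ]
  }
}
```

Upon task completion, files matching `*.dat` under the specified `source` directory would be egressed to the `simoutput` container; omitting `source` would egress from the entire task directory instead, subject to the `include` filter.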
## Configuration and Usage Documentation
Please see [this page](10-batch-shipyard-configuration.md) for a full explanation of each data movement configuration option. Please see [this page](20-batch-shipyard-usage.md) for documentation on `data` command usage.