Data Movement and Batch Shipyard

The focus of this article is to explain how to move data between on premises, Azure Storage, and Batch compute nodes in the context of Batch Shipyard. Please refer to the installation doc for information regarding the software required to take advantage of the data movement features of Batch Shipyard.

Overview

Most HPC applications and simulations require some form of input data. For example, a simulation may require various input parameters and definitions of physical systems in order to produce meaningful output. Such input data may not be scriptable and could be binary data files. In another scenario, deep learning frameworks may need millions of images and labels to perform training. Typical on-premises cluster systems have a shared file system that the nodes of a cluster can access to retrieve programs and data.

In the cloud, the data movement problem is a universal issue, especially for many workloads that are amenable to batch-style processing. How do I get my data, which exists somewhere, perhaps on premises, into the cloud such that my virtual machines have access to it? This problem is even more acute for services which deliver resources on-demand such as Azure Batch. How do I get my data to nodes which may be ephemeral? And on the flip-side, how do I get my data from nodes after completion of a task?

Before diving into ingress and egress support specifics, let's examine a high level overview of the data movement support provided by Batch Shipyard.

                      (I)           +-------------------------------------------------------------------------+
        +-------------------------> |                                                                         |
        |                           |                      Azure Storage (Blob and File)                      |
        |        +------------------+                                                                         |
        |        |                  +------------+-------------------------------------------+----------------+
        |        |                               |                                           |
        |        | (E)                           | (I)         ^                             | (I)    ^
        |        |                               v             | (E)                         v        | (E)
        |        |                                             |                                      |
        |        v                  +--------------------------+-------------+         +--------------+-------+
        |                           |                                        |         |                      |
    +---+-------------+      (E)    |          Azure Batch Pool "A"          |         | Azure Batch Pool "B" |
    |                 | <-----------+                                        |         |                      |
    |                 |             | +----------------+  +----------------+ |         |  +----------------+  |
    |  Local Machine  |             | |                |  |                | |    (I)  |  |                |  |
    |                 |    (I)      | | Compute Node 0 |  | Compute Node 1 | | <-------+  | Compute Node 0 |  |
    |                 +-----------> | |  [GlusterFS]   |  |  [GlusterFS]   | |         |  |                |  |
    +---+-------------+             | |                |  |                | |         |  +----------------+  |
        |                           | +----------------+  +----------------+ |         |                      |
        |         ^                 |                                        |         |  +----------------+  |
        v         |                 | +----------------+  +----------------+ |  (I)    |  |                |  |
                  |                 | |                |  |                | +-------> |  | Compute Node 1 |  |
+-----------------+-------+         | | Compute Node 2 |  | Compute Node 3 | |         |  |                |  |
|                         |         | |  [GlusterFS]   |  |  [GlusterFS]   | |         |  +----------------+  |
|  Shared File System(s)  |         | |                |  |                | |         |                      |
|                         |         | +----------------+  +----------------+ |         +----------------------+
+-------------------------+         |                                        |
                                    +----------------------------------------+

Arrows marked (I) are ingress actions with respect to the destination, and arrows marked (E) are egress actions with respect to the source.

For the purposes of this document, the left-hand side of the diagram above, containing the Local Machine and Shared File System(s), is considered on premises. "On premises" is where you would invoke shipyard.py actions. The machine that invokes shipyard.py has access to your local machine's file system and can also access local shared file systems.

From on premises, you can ingress your data to both Azure Storage and Azure Batch Pools directly. You can also egress data from Azure Batch Pools and Azure Storage (via blobxfer or AzCopy) back to your local machine.

Within Azure, you can ingress data from Azure Storage to Azure Batch pools, as well as egress data back out to Azure Storage when tasks complete. You may also ingress data from other Azure Batch Tasks which may or may not be in the same pool as input data for subsequent Azure Batch Tasks.

The rest of the document will explain each ingress and egress data movement operation in detail.

Data Ingress

Batch Shipyard provides multiple paths for ingressing data to ultimately be used by your jobs and tasks. Data ingress can be invoked automatically through certain configuration options or explicitly by passing the data ingress command to shipyard.py.

From On Premises

You'll need to decide if all or part of your data is long-lived, i.e., will it be required for many jobs and tasks in the future, or if it is short-lived, i.e., it will only be used for this one job and task. Even if your data is long-lived, is it small enough such that data ingress is not a large burden for each job and task that references it?

If your answer to these questions is that the data is either short-lived or is small enough such that data ingress is not a large burden, then you may opt for Batch Shipyard's direct to GlusterFS or compute node data ingress capability. With this feature, your files accessible on premises will be directly ingressed to compute node(s), bypassing the hops to and from Azure Storage. You can define these files under the global configuration file property global_resources:files. The destination member should include a property named shared_data_volume which references your GlusterFS volume. Alternatively, if your pool only contains one compute node, you should not define a GlusterFS volume and instead only define a relative_destination_path property which will ingress data directly to that path on the compute node. Any files in the source path (matching the optional include and exclude filters) will then be ingressed into the compute nodes using the data ingress command or by specifying transfer_files_on_pool_creation as true in the pool configuration file. There are many configuration options under the data_transfer member which may help optimize for your particular scenario. The following transfer methods from on premises are available:

  • scp: secure copy to a single node in the pool
  • multinode_scp: secure copy to multiple nodes simultaneously in the pool
  • rsync+ssh: rsync over ssh to a single node in the pool
  • multinode_rsync+ssh: rsync over ssh to multiple nodes simultaneously in the pool
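As a rough illustration, a direct-ingress entry of the kind described above might be assembled as follows. This is only a sketch, written in Python that emits JSON. The property names global_resources, files, destination, shared_data_volume, relative_destination_path, include, exclude and data_transfer come from the text above; the exact nesting, the "source"/"path" and "method" keys, and every value (paths, volume name) are assumptions made for illustration, so check them against the configuration documentation before use.

```python
import json

# Transfer methods listed above for ingress from on premises.
TRANSFER_METHODS = {"scp", "multinode_scp", "rsync+ssh", "multinode_rsync+ssh"}


def direct_ingress_entry(path, method, shared_data_volume=None,
                         relative_destination_path=None,
                         include=None, exclude=None):
    """Build one hypothetical entry for global_resources:files."""
    if method not in TRANSFER_METHODS:
        raise ValueError(f"unknown transfer method: {method}")
    # The "method" key name inside data_transfer is an assumption.
    destination = {"data_transfer": {"method": method}}
    if shared_data_volume is not None:
        destination["shared_data_volume"] = shared_data_volume
    if relative_destination_path is not None:
        destination["relative_destination_path"] = relative_destination_path
    # The "source"/"path" key names are assumptions.
    source = {"path": path}
    if include:
        source["include"] = list(include)
    if exclude:
        source["exclude"] = list(exclude)
    return {"source": source, "destination": destination}


# Multi-node pool: ingress into a GlusterFS volume (placeholder name).
gluster_entry = direct_ingress_entry(
    "/data/sim-inputs", "multinode_scp",
    shared_data_volume="glustervol", include=["*.dat"])

# Single-node pool: no GlusterFS volume, only a relative destination path.
single_node_entry = direct_ingress_entry(
    "/data/sim-inputs", "scp", relative_destination_path="inputs")

print(json.dumps({"global_resources": {"files": [gluster_entry]}}, indent=2))
```

The helper rejects any method outside the four listed above, which catches typos before a configuration file is written.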

In the case where your data is long-lived or is too large to be repeatedly transferred for each job and task that requires it, you may be better off ingressing this data to Azure Storage first. By doing so, you pay for the "long hop" once and can then leverage the potentially lower-latency and higher-bandwidth intra-datacenter transfer for subsequent jobs and tasks that require this data. This is a two-step process with Batch Shipyard. First, define the files required under the global_resources:files property, where the destination member includes a property named storage_account_settings whose value is a link to a storage account from your credentials config file. Within the data_transfer property, you will need to specify either container or file_share depending upon whether you want to send your files to Azure Blob or File Storage, respectively. Please note that there can be a significant difference in performance between the two; please visit this page for more information. The second step is outlined in the next section.

Note that files is an array; therefore, Batch Shipyard accepts any number of source/destination pairings and even mixed GlusterFS and Azure Storage ingress objects.
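To make the first step and the mixed-array case concrete, here is a hedged sketch in Python of a files array holding two Azure Storage entries and one GlusterFS entry. The names storage_account_settings, data_transfer, container, file_share and shared_data_volume are taken from the text; the "source"/"path" and "method" keys, the nesting, and all values (account link, container and share names, paths) are hypothetical placeholders.

```python
import json


def storage_ingress_entry(path, storage_account_settings,
                          container=None, file_share=None):
    """Build one hypothetical files entry targeting Azure Blob or File Storage."""
    # The text asks for either container or file_share, so require exactly one.
    if (container is None) == (file_share is None):
        raise ValueError("specify exactly one of container or file_share")
    if container is not None:
        data_transfer = {"container": container}
    else:
        data_transfer = {"file_share": file_share}
    return {
        "source": {"path": path},  # "source"/"path" key names are assumed
        "destination": {
            "storage_account_settings": storage_account_settings,
            "data_transfer": data_transfer,
        },
    }


files = [
    # Long-lived data sent to Azure Blob Storage.
    storage_ingress_entry("/data/training-images", "mystorageaccount",
                          container="images"),
    # Long-lived data sent to Azure File Storage.
    storage_ingress_entry("/data/labels", "mystorageaccount",
                          file_share="labels"),
    # Short-lived data ingressed directly to a GlusterFS volume.
    {
        "source": {"path": "/data/params"},
        "destination": {
            "shared_data_volume": "glustervol",
            "data_transfer": {"method": "multinode_scp"},  # key name assumed
        },
    },
]

print(json.dumps({"global_resources": {"files": files}}, indent=2))
```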

From Azure Storage (Blob and File)

Data from Azure Storage can be ingressed to compute nodes in many different ways with Batch Shipyard. The recommended method when using Batch Shipyard is to take advantage of the input_data property for configuration objects at the pool, job, and task-level.

For pool-level ingress, the input_data property is specified in the pool property under pool_specification. To transfer from Azure Storage, you would specify the azure_storage property under input_data; multiple Azure Storage sources can be specified within azure_storage. At the pool-level, container or file_share data is ingressed to all compute nodes to the specified destination location as part of pool creation if the transfer_files_on_pool_creation property is true. Note that the pool must be ready in order for the data ingress command to work. Additionally, although you can combine on premises ingress to Azure Storage with subsequent ingress to compute nodes, if there is a possibility of overlap, it is recommended to separate these two processes by ingressing data from files with the data ingress command first. After that action completes, create the pool with transfer_files_on_pool_creation set to false (so the data that was ingressed with data ingress is not ingressed again) and specify input_data with the appropriate properties for the data that was just ingressed with the data ingress command.
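The pool-creation half of the recommended flow above could look roughly like the following Python sketch, which emits a JSON fragment. The properties pool_specification, transfer_files_on_pool_creation, input_data, azure_storage, container, file_share and destination appear in the text; placing storage_account_settings inside each source, the nesting shown, and all values are assumptions for illustration only.

```python
import json

# Hypothetical pool configuration used after the on-premises data has already
# been sent to Azure Storage with the data ingress command.
pool_config = {
    "pool_specification": {
        # False so that the files ingressed earlier are not transferred again.
        "transfer_files_on_pool_creation": False,
        "input_data": {
            # azure_storage holds a list, so several sources can be given.
            "azure_storage": [
                {
                    "storage_account_settings": "mystorageaccount",  # assumed
                    "container": "images",
                    "destination": "$AZ_BATCH_NODE_SHARED_DIR/images",
                },
                {
                    "storage_account_settings": "mystorageaccount",  # assumed
                    "file_share": "labels",
                    "destination": "$AZ_BATCH_NODE_SHARED_DIR/labels",
                },
            ]
        },
    }
}

print(json.dumps(pool_config, indent=2))
```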

input_data for each job in job_specifications will ingress data to any compute node running the specified job once and only once. Any input_data defined for the job will be downloaded for this job which can be run on any number of compute nodes depending upon the number of constituent tasks and repeat invocations. However, input_data is only downloaded once per job invocation on a compute node. For example, if job-1:task-1 is run on compute node A and then job-1:task-2 is run on compute node B, then this input_data is ingressed to both compute node A and B. However, if job-1:task-3 is then run on compute node A after job-1:task-1, then the input_data is not transferred again.
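The once-per-node behavior described above can be modeled with a few lines of Python. This is a toy simulation of the semantics as stated in the text, not Batch Shipyard's implementation; the function name and the schedule format are invented for the example.

```python
def simulate_job_input_data(schedule):
    """Report, per task, whether job-level input_data would be transferred.

    schedule is a list of (job_id, task_id, node_id) tuples in execution order.
    """
    seen = set()  # (job_id, node_id) pairs that already hold the job's data
    results = []
    for job_id, task_id, node_id in schedule:
        key = (job_id, node_id)
        transferred = key not in seen  # first task of this job on this node?
        seen.add(key)
        results.append((f"{job_id}:{task_id}", node_id, transferred))
    return results


# The example from the text: task-1 on node A, task-2 on node B, task-3 on A.
schedule = [
    ("job-1", "task-1", "A"),
    ("job-1", "task-2", "B"),
    ("job-1", "task-3", "A"),
]
for task, node, transferred in simulate_job_input_data(schedule):
    print(task, "on node", node, "->", "ingress" if transferred else "skip")
# job-1:task-1 on node A -> ingress
# job-1:task-2 on node B -> ingress
# job-1:task-3 on node A -> skip
```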

input_data for each task in the task array for each job will ingress data to the compute node running the specified task. Note that for task-level input_data, the destination property is optional. If not specified, data will be ingressed to the $AZ_BATCH_TASK_WORKING_DIR by default. For multi-instance tasks, the download only applies to the compute node running the application command. Data is not ingressed with the coordination command which is run on all nodes.

There is one additional method of ingressing data from Azure Storage, which is through Azure Batch resource files. In the jobs config file, you can specify resource files through the resource_files property within each task. However, this can be quite restrictive: (1) only HTTP/HTTPS endpoints are supported, (2) resource files must be manually defined one at a time, and (3) there is a limit on the number of resource files that can be defined for each task. It is recommended to use input_data if you have many files that you need to download from a container or file share.

From Azure Batch Tasks

Data from previously run Azure Batch tasks, including those run by Batch Shipyard, can be ingressed into compute nodes with the input_data property for configuration objects at the pool, job, and task-level. Note that the compute node where the task was run must still be active and must not have been removed or deleted.

To transfer from a previous Azure Batch Task that has completed, you would specify the azure_batch property under input_data. As with Azure Storage in the previous section, you can specify multiple properties within the azure_batch array to reference multiple different tasks.

job_id is the job id of the task and task_id is the id of that task for which to ingress files generated by the task. include and exclude filters are optional but can be specified to reduce the scope of the transfer. The destination property is needed if this is not a task-level input_data property. If omitted at the task-level, the default will be $AZ_BATCH_TASK_WORKING_DIR, similar to the default for Azure Storage ingress with input_data above.
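The destination rule above can be sketched as a small Python helper that builds one azure_batch source. The property names job_id, task_id, include, exclude and destination are from the text; the helper itself, the level argument, and the example ids and patterns are hypothetical. For clarity the sketch writes the task-level default explicitly, whereas the text indicates the property may simply be omitted at that level.

```python
TASK_DEFAULT_DESTINATION = "$AZ_BATCH_TASK_WORKING_DIR"


def azure_batch_source(level, job_id, task_id, destination=None,
                       include=None, exclude=None):
    """Build one hypothetical entry for the azure_batch array of input_data."""
    if level not in ("pool", "job", "task"):
        raise ValueError(f"unknown level: {level}")
    if destination is None:
        # Only task-level input_data may leave destination unspecified.
        if level != "task":
            raise ValueError("destination is required at the pool and job level")
        destination = TASK_DEFAULT_DESTINATION
    entry = {"job_id": job_id, "task_id": task_id, "destination": destination}
    # The filters are optional and only narrow the scope of the transfer.
    if include:
        entry["include"] = list(include)
    if exclude:
        entry["exclude"] = list(exclude)
    return entry


# Task-level: destination falls back to the task working directory.
task_entry = azure_batch_source("task", "preprocess-job", "task-0",
                                include=["*.dat"])

# Job-level: destination must be given explicitly.
job_entry = azure_batch_source("job", "preprocess-job", "task-0",
                               destination="$AZ_BATCH_NODE_SHARED_DIR/prev")

input_data = {"azure_batch": [task_entry]}
print(input_data)
```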

The behavior at the pool-level, job-level and task-level is nearly identical to that of Azure Storage ingress with input_data above, except that instead of transferring from Azure Storage, data is transferred from another compute node that has run the specified task.

Note that azure_batch and azure_storage may be specified within the same input_data properties if required.

Data Egress

Batch Shipyard provides the ability to egress from compute nodes to various locations.

To On Premises

shipyard.py provides actions to help egress data off a compute node back to on premises. These actions are:

  1. data getfile: get a single file or all files from a compute node using a job id, task id and a file name.
  2. data getfilenode: get a single file or all files from a compute node using a node id and file path.
  3. data stream: stream a text file (decoded as UTF-8) back to the local console or stream a file back to local disk. This is particularly useful for progress monitoring via a file or tailing an output.

To Azure Storage

If you need to egress data from a compute node and persist it to Azure Storage, Batch Shipyard provides the output_data property on tasks of a job in job_specifications. source defines which directory within the task directory to egress data from; if nothing is specified for source then the default is $AZ_BATCH_TASK_DIR (which contains files like stdout.txt and stderr.txt). Note that you can specify a source originating from the shared directory here as well, e.g., $AZ_BATCH_NODE_SHARED_DIR; there is no restriction that limits you to just the task directory. include defines an optional include filter to be applied across the source files. Finally, define the container or file_share property to egress to Azure Blob or File Storage, respectively.
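A task-level output_data fragment along these lines might be built as in the Python sketch below. The names job_specifications, output_data, source, include, container and file_share come from the text; the azure_storage and storage_account_settings nesting, the "id" and "tasks" keys, and all values are assumptions made for illustration.

```python
import json


def output_data_entry(storage_account_settings, container=None,
                      file_share=None, source=None, include=None):
    """Build one hypothetical egress target for a task's output_data."""
    if (container is None) == (file_share is None):
        raise ValueError("specify exactly one of container or file_share")
    entry = {"storage_account_settings": storage_account_settings}  # assumed
    if container is not None:
        entry["container"] = container
    else:
        entry["file_share"] = file_share
    # Leaving source out keeps the documented default of $AZ_BATCH_TASK_DIR.
    if source is not None:
        entry["source"] = source
    if include:
        entry["include"] = list(include)
    return entry


jobs_config = {
    "job_specifications": [
        {
            "id": "job-1",  # "id" and "tasks" key names are assumed
            "tasks": [
                {
                    "output_data": {
                        "azure_storage": [  # nesting is assumed
                            # Default source: the task directory.
                            output_data_entry("mystorageaccount",
                                              container="results",
                                              include=["*.out"]),
                            # Explicit source from the node shared directory.
                            output_data_entry(
                                "mystorageaccount", file_share="shared-results",
                                source="$AZ_BATCH_NODE_SHARED_DIR/results"),
                        ]
                    }
                }
            ],
        }
    ]
}

print(json.dumps(jobs_config, indent=2))
```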

Configuration and Usage Documentation

Please see this page for a full explanation of each data movement configuration option. Please see this page for documentation on data command usage.