Task factories

Task factories provide a way for a job and all its tasks to be created in one command instead of calling azure batch task create for each task.

Note: If the CLI loses connectivity while tasks are being added, the operation will not complete and the job will continue with only a partial set of tasks. The remaining tasks must then be added manually using azure batch task create.

There are currently three kinds of task factories:

  • Task Collection - tasks are explicitly defined as a part of the job
  • Parametric Sweep - a set of tasks is created by substituting a range or sequence of values into a template
  • Per File - a template task is replicated for each available input file

See below for details.

Task collection

In a task collection task factory, each task is specified individually according to the Batch API schema. Of the three kinds, it most closely mirrors the Batch task creation API.

An example of a task collection task factory:

  "job": {
    "id": "my-ffmpeg-job",
    "constraints": {
      "maxWallClockTime": "PT5H",
      "maxTaskRetryCount": 3
    },
    "poolInfo": {
      "poolId": "my-ffmpeg-pool"
    },
    "jobPreparationTask": {
      "commandLine" : "sudo apt-get install ffmpeg -y",
      "runElevated": true,
      "waitForSuccess": true
    },
    "taskFactory": {
      "type": "taskCollection",
      "tasks": [
        {
          "id" : "mytask1",
          "commandLine": "ffmpeg -i sampleVideo1.mkv -vcodec copy -acodec copy output.mp4 -y"
        },
        {
          "id" : "mytask2",
          "commandLine": "ffmpeg -i sampleVideo2.mkv -vcodec copy -acodec copy output.mp4 -y"
        }
      ]
    }
  }

Samples

The following samples use the task collection task factory:

Parametric sweep

The parametric sweep task factory creates a set of tasks by substituting a range or sequence of values into a template. Substitutions can be made in most attributes of the task, but are most commonly made in the commandLine attribute or resourceFile collection.

Currently the following task attributes are not supported in a parametric sweep task factory:

  • id: The ID of a task will be automatically generated.
  • dependsOn: Dependencies between tasks within a factory, or on tasks created by other means, are not yet supported.

An example:

  "job": {
    "id": "my-ffmpeg-job",
    "poolInfo": {
      "poolId": "my-ffmpeg-pool"
    },
    "taskFactory": {
      "type": "parametricSweep",
      "parameterSets": [
          {
              "start": 1,
              "end": 500,
              "step": 1
          }
      ],
      "repeatTask": {
          "commandLine": "ffmpeg -i sampleVideo{0}.mkv -vcodec copy -acodec copy output{0}.mp4 -y"
      }
    }
  }

The range of values used to create the tasks is set in parameterSets. The start field gives the value used for the first task, and the end field gives the last value that could potentially be used. Whether a task is actually created for the end value depends on the chosen increment size, the value of step. For example, a parametric sweep with a start of 5, an end of 10 and a step of 3 will produce two tasks, using the values 5 and 8.
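The range rule above can be sketched in a few lines of Python. This is only an illustration, not the extension's own code: the function name expand_parameter_set is hypothetical, and the sketch covers only a positive step.

```python
def expand_parameter_set(start, end, step=1):
    """Return the values one parameter set produces, treating end as inclusive."""
    if step <= 0:
        # Limitation of this sketch only; it says nothing about what the CLI accepts.
        raise ValueError("this sketch only covers a positive step")
    # range() excludes its stop value, so add one to make end inclusive.
    return list(range(start, end + 1, step))


print(expand_parameter_set(5, 10, 3))        # [5, 8]
print(len(expand_parameter_set(1, 500, 1)))  # 500
```

The value 11 would follow 8, but it lies past end, so no third task is produced.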

Multiple parameterSets can be defined to produce multi-dimensional parametric sweeps.

The task template into which the parameter or parameters will be substituted is defined in repeatTask. Substitutions are achieved through the use of placeholders. A placeholder for parameter substitutions is represented by {0}. The number 0 here represents the index of the parameter set to be substituted. Where a literal { or } character is required, it can be escaped by duplicating it: {{ or }}. The parameter can also be padded with zeros to a maximum length of 9 characters by using the format {0:4} where the number 0 represents the index of the parameter set and the parameter will be zero-padded to 4 characters, e.g.: 0001.

The above task factory would be expanded into the following tasks:

  "tasks": [
    {
      "id" : "0",
      "commandLine": "ffmpeg -i sampleVideo1.mkv -vcodec copy -acodec copy output1.mp4 -y"
    },
    {
      "id" : "1",
      "commandLine": "ffmpeg -i sampleVideo2.mkv -vcodec copy -acodec copy output2.mp4 -y"
    },
    {
      ...
    },
    {
      "id" : "499",
      "commandLine": "ffmpeg -i sampleVideo500.mkv -vcodec copy -acodec copy output500.mp4 -y"
    }
  ]
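The placeholder rules described earlier ({0}, zero-padding such as {0:4}, and doubled braces for literals) can be approximated with the following Python sketch. The substitute function and its regular expression are assumptions made for illustration; the extension's real parser may treat edge cases differently.

```python
import re

# Matches an escaped brace ("{{" or "}}") or a placeholder such as {0} or {0:4}.
# The width is a single digit, in line with the stated maximum of 9 characters.
_TOKEN = re.compile(r"\{\{|\}\}|\{(\d+)(?::(\d))?\}")


def substitute(template, values):
    """Replace {N} and {N:W} placeholders in template with values[N]."""
    def replace(match):
        token = match.group(0)
        if token == "{{":
            return "{"
        if token == "}}":
            return "}"
        value = str(values[int(match.group(1))])
        width = match.group(2)
        # zfill pads on the left with zeros and never truncates.
        return value.zfill(int(width)) if width else value

    return _TOKEN.sub(replace, template)


print(substitute("ffmpeg -i sampleVideo{0}.mkv output{0}.mp4", [1]))
# ffmpeg -i sampleVideo1.mkv output1.mp4
print(substitute("frame_{0:4}.png", [1]))  # frame_0001.png
print(substitute("echo {{{0}}}", [7]))     # echo {7}
```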

An example of a task factory with a two-dimensional sweep with zero-padding:

  "job": {
    "id": "my-ffmpeg-job",
    "poolInfo": {
      "poolId": "my-ffmpeg-pool"
    },
    "taskFactory": {
      "type": "parametricSweep",
      "parameterSets": [
        {
          "start": 1,
          "end": 500,
          "step": 1
        },
        {
          "start": 500,
          "end": 1000,
          "step": 500
        }
      ],
      "repeatTask": {
        "commandLine": "ffmpeg -i sampleVideo_{0:3}.mkv -vcodec copy -acodec copy scale={1}:{1} output_x{1}_{0:3}.mp4 -y"
      }
    }
  }

This would create the following tasks:

  "tasks": [
    {
      "id" : "0",
      "commandLine": "ffmpeg -i sampleVideo_001.mkv -vcodec copy -acodec copy scale=500:500 output_x500_001.mp4 -y"
    },
    {
      "id" : "1",
      "commandLine": "ffmpeg -i sampleVideo_001.mkv -vcodec copy -acodec copy scale=1000:1000 output_x1000_001.mp4 -y"
    },
    {
      "id" : "2",
      "commandLine": "ffmpeg -i sampleVideo_002.mkv -vcodec copy -acodec copy scale=500:500 output_x500_002.mp4 -y"
    },
    {
      "id" : "3",
      "commandLine": "ffmpeg -i sampleVideo_002.mkv -vcodec copy -acodec copy scale=1000:1000 output_x1000_002.mp4 -y"
    },
    {
      ...
    },
    {
      "id" : "998",
      "commandLine": "ffmpeg -i sampleVideo_500.mkv -vcodec copy -acodec copy scale=500:500 output_x500_500.mp4 -y"
    },
    {
      "id" : "999",
      "commandLine": "ffmpeg -i sampleVideo_500.mkv -vcodec copy -acodec copy scale=1000:1000 output_x1000_500.mp4 -y"
    }
  ]
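The ordering of the generated tasks in the listing above, where the last parameter set varies fastest, matches a Cartesian product of the individual ranges. A minimal Python sketch follows, assuming positive steps and an inclusive end; sweep_combinations is a hypothetical helper, not part of the extension.

```python
from itertools import product


def sweep_combinations(parameter_sets):
    """Return (task_id, values) pairs for every combination of the parameter sets."""
    ranges = [range(s["start"], s["end"] + 1, s["step"]) for s in parameter_sets]
    # product() varies the last range fastest, which is the order shown above.
    return [(str(index), values) for index, values in enumerate(product(*ranges))]


parameter_sets = [
    {"start": 1, "end": 500, "step": 1},
    {"start": 500, "end": 1000, "step": 500},
]
tasks = sweep_combinations(parameter_sets)
print(len(tasks))  # 1000
print(tasks[:4])   # [('0', (1, 500)), ('1', (1, 1000)), ('2', (2, 500)), ('3', (2, 1000))]
print(tasks[-1])   # ('999', (500, 1000))
```

The first set yields 500 values and the second yields 2, which is why the task IDs run from 0 to 999.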

A mergeTask may also be specified to run after all tasks in the parametric sweep factory have completed. This allows for map-reduce patterns, where one task parses or combines the output of all the other tasks in the task factory. As with repeatTask, the id and dependsOn properties are not allowed: the id is automatically set to 'merge' and dependsOn is automatically populated with the dependent tasks.

Note: It is not advised to use file groups as inputs to a mergeTask. File groups are expanded at the time a task is added to the job and so will not contain output generated by the tasks the mergeTask depends on. To pass output data from those tasks to the mergeTask, use either the autoStorageContainerName or containerUrl REST API properties instead.

A basic example of using mergeTask:

    "job": {
            "type": "Microsoft.Batch/batchAccounts/jobs",
            "apiVersion": "2018-12-01",
            "properties": {
                "id": "mergetask",
                "poolInfo": {
                  "poolId": "my-mergetask-pool"
                },
                "taskFactory": {
                    "type": "parametricSweep",
                    "parameterSets": [
                        {
                            "start": 1,
                            "end": 500,
                            "step": 1
                        }
                    ],
                    "repeatTask": {
                        "commandLine": "/bin/bash -c 'echo {0}'",
                        "outputFiles": [
                            {
                                "filePattern": "**/stdout.txt",
                                "destination": {
                                    "autoStorage": {
                                        "path": "output-{0}",
                                        "fileGroup": "outputData"
                                    }
                                },
                                "uploadOptions": {
                                    "uploadCondition": "TaskSuccess"
                                }
                            }
                        ]
                    },
                    "mergeTask" : {
                        "displayName": "myMergeTask",
                        "commandLine": "/bin/bash -c 'ls'",
                        "resourceFiles": [
                            {
                                "autoStorageContainerName": "fgrp-outputData"
                            }
                        ]
                    }
                }
        }
    }
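As a rough picture of how a mergeTask relates to the generated tasks, the following Python sketch appends a task with the id 'merge' that depends on every other task. The helper add_merge_task is hypothetical, and expressing the dependency as a taskIds list is an assumption; the extension may build dependsOn differently, for example with an ID range.

```python
def add_merge_task(tasks, merge_template):
    """Return tasks plus a merge task that depends on all of them."""
    if "id" in merge_template or "dependsOn" in merge_template:
        raise ValueError("mergeTask must not set 'id' or 'dependsOn'")
    merge = dict(merge_template)
    merge["id"] = "merge"
    # Assumption: dependencies are listed by task ID.
    merge["dependsOn"] = {"taskIds": [task["id"] for task in tasks]}
    return tasks + [merge]


generated = [
    {"id": str(i), "commandLine": "/bin/bash -c 'echo %d'" % (i + 1)}
    for i in range(3)
]
all_tasks = add_merge_task(generated, {"commandLine": "/bin/bash -c 'ls'"})
print(all_tasks[-1]["id"])         # merge
print(all_tasks[-1]["dependsOn"])  # {'taskIds': ['0', '1', '2']}
```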

Samples

The following samples use the parametric sweep task factory:

Task per file

The task per file task factory generates a task per input file from a file group or Azure storage container. Substitutions can be made in most attributes of the task, but are most commonly made in the commandLine attribute, resourceFile collection or taskOutput attribute.

Currently the following task attributes are not supported in a task per file task factory:

  • id: The ID of a task will be automatically generated.
  • dependsOn: Dependencies between tasks within a factory, or on tasks created by other means, are not yet supported.

An example:

  "job": {
    "id": "my-ffmpeg-job",
    "poolInfo": {
      "poolId": "my-ffmpeg-pool"
    },
    "taskFactory": {
      "type": "taskPerFile",
      "source": {
        "fileGroup": "raw-images"
      },    
      "repeatTask": {
        "commandLine": "ffmpeg -i {fileName} -vcodec copy -acodec copy {fileNameWithoutExtension}.mp4 -y",
        "resourceFiles": [
          {
            "httpUrl": "{url}",
            "filePath" : "{fileName}" 
          }
        ]
      }
    }
  }

The list of files used to create the tasks is set in source. As with the new ResourceFiles property, there are two ways to specify the file container in Azure Storage.

  1. Specify the name of a file group to reference data stored in a linked storage account.
  2. Specify the full container URL, including a SAS key that has List and Read permissions. For example:
  "source": {
      "containerUrl": "https://storage.blob.core.windows.net/container?sv=2015-04-05&sig=tAp0r3I3SV5PbjpZ5CIjvuo1jdUs5xW"
  }

The files can be further filtered by including a prefix, which can be a partial file name or a subdirectory. If prefix is not specified, all the files in the container will be used to create tasks. An example using prefix:

  "source": {
	  "fileGroup": "raw-images",
	  "prefix": "first_pass/img_"
  }    
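The effect of prefix can be pictured as a simple starts-with match against each file's path within the container. The sketch below is illustrative only: filter_by_prefix and the blob names are made up for the example.

```python
def filter_by_prefix(blob_names, prefix=None):
    """Keep only the blobs whose path starts with prefix (all of them if unset)."""
    if not prefix:
        return list(blob_names)
    return [name for name in blob_names if name.startswith(prefix)]


# Hypothetical contents of the raw-images file group.
blobs = [
    "first_pass/img_001.png",
    "first_pass/img_002.png",
    "first_pass/notes.txt",
    "second_pass/img_001.png",
]
print(filter_by_prefix(blobs, "first_pass/img_"))
# ['first_pass/img_001.png', 'first_pass/img_002.png']
print(len(filter_by_prefix(blobs)))  # 4
```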

The task template into which the file URL or name will be substituted is defined in repeatTask. Substitutions are achieved through the use of placeholders. A placeholder for name substitutions is represented by {keyword}, where the keyword identifies which part of the file URL is substituted. The supported keywords are:

| Keyword | Note | Example |
| --- | --- | --- |
| {url} | The full URL of file location | http://account.blob.azure.com/container/path/blob.ext?sasToken |
| {filePath} | The file name including the path (virtual directory) | path/blob.ext |
| {fileName} | The file name only, without path | blob.ext |
| {fileNameWithoutExtension} | The file name without last extension | blob |

Where a literal { or } character is required, it can be escaped by duplicating it: {{ or }}.

For example, suppose the files in the file group are:

raw-images/first_pass/mov_processing/1.mkv
raw-images/first_pass/mov_processing/2.mkv
raw-images/first_pass/alpha.mkv

The above task factory would be expanded into the following tasks:

  "tasks": [
    {
      "id" : "0",
      "commandLine": "ffmpeg -i 1.mkv -vcodec copy -acodec copy 1.mp4 -y",
      "resourceFiles": [
        {
          "httpUrl": "http://account.blob.azure.com/raw-images/first_pass/mov_processing/1.mkv?sasToken",
          "filePath" : "1.mkv" 
        }
      ]
    },
    {
      "id" : "1",
      "commandLine": "ffmpeg -i 2.mkv -vcodec copy -acodec copy 2.mp4 -y",
      "resourceFiles": [
        {
          "httpUrl": "http://account.blob.azure.com/raw-images/first_pass/mov_processing/2.mkv?sasToken",
          "filePath" : "2.mkv" 
        }
      ]
    },
    {
      "id" : "2",
      "commandLine": "ffmpeg -i alpha.mkv -vcodec copy -acodec copy alpha.mp4 -y",
      "resourceFiles": [
        {
          "httpUrl": "http://account.blob.azure.com/raw-images/first_pass/alpha.mkv?sasToken",
          "filePath" : "alpha.mkv" 
        }
      ]
    }
  ]
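The keyword values used in this expansion can be derived from each file's URL roughly as follows. This Python sketch is illustrative only: file_keywords and fill are hypothetical helpers, and the URL is the placeholder one from the example above.

```python
import posixpath
import re
from urllib.parse import unquote, urlsplit

# Matches an escaped brace ("{{" or "}}") or a {keyword} placeholder.
_TOKEN = re.compile(r"\{\{|\}\}|\{(\w+)\}")


def file_keywords(url):
    """Derive the four task-per-file keyword values from a blob URL."""
    # The URL path looks like "/<container>/<virtual dir>/<blob>"; drop the container.
    path = unquote(urlsplit(url).path).lstrip("/")
    _container, _, file_path = path.partition("/")
    file_name = posixpath.basename(file_path)
    stem, _ext = posixpath.splitext(file_name)  # strips only the last extension
    return {
        "url": url,
        "filePath": file_path,
        "fileName": file_name,
        "fileNameWithoutExtension": stem,
    }


def fill(template, keywords):
    """Substitute {keyword} placeholders; unknown keywords are left untouched."""
    def replace(match):
        token = match.group(0)
        if token == "{{":
            return "{"
        if token == "}}":
            return "}"
        return keywords.get(match.group(1), token)

    return _TOKEN.sub(replace, template)


kw = file_keywords(
    "http://account.blob.azure.com/raw-images/first_pass/mov_processing/1.mkv?sasToken"
)
print(kw["filePath"])  # first_pass/mov_processing/1.mkv
print(fill("ffmpeg -i {fileName} -vcodec copy -acodec copy {fileNameWithoutExtension}.mp4 -y", kw))
# ffmpeg -i 1.mkv -vcodec copy -acodec copy 1.mp4 -y
```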

A mergeTask may also be specified to run after all tasks in the task per file factory have completed. This allows for map-reduce patterns, where one task parses or combines the output of all the other tasks in the task factory. As with repeatTask, the id and dependsOn properties are not allowed: the id is automatically set to 'merge' and dependsOn is automatically populated with the dependent tasks.

Note: It is not advised to use file groups as inputs to a mergeTask. File groups are expanded at the time a task is added to the job and so will not contain output generated by the tasks the mergeTask depends on. To pass output data from those tasks to the mergeTask, use either the autoStorageContainerName or containerUrl REST API properties instead.

A basic example of using mergeTask:

    "job": {
            "type": "Microsoft.Batch/batchAccounts/jobs",
            "apiVersion": "2018-12-01",
            "properties": {
                "id": "mergetask",
                "poolInfo": {
                  "poolId": "my-mergetask-pool"
                },
                "taskFactory": {
                    "type": "taskPerFile",
                    "source" : {
                        "fileGroup" : "inputData"
                    },
                    "repeatTask": {
                        "commandLine": "/bin/bash -c 'cat {fileName}'",
                        "resourceFiles": [
                            {
                                "httpUrl" : "{url}",
                                "filePath" : "{fileName}"
                            }
                        ],
                        "outputFiles": [
                            {
                                "filePattern": "**/stdout.txt",
                                "destination": {
                                    "autoStorage": {
                                        "path": "output-{fileName}",
                                        "fileGroup": "outputData"
                                    }
                                },
                                "uploadOptions": {
                                    "uploadCondition": "TaskSuccess"
                                }
                            }
                        ]
                    },
                    "mergeTask" : {
                        "displayName": "myMergeTask",
                        "commandLine": "/bin/bash -c 'ls'",
                        "resourceFiles": [
                            {
                                "autoStorageContainerName": "fgrp-outputData"
                            }
                        ]
                    }
                }
            }
    }

Samples

The following samples use the task per file task factory: