# Batch Shipyard Usage
This page contains in-depth details on how to use the Batch Shipyard tool.
Please see the Batch Shipyard Docker Image CLI section below for
information regarding how to use the `alfpark/batch-shipyard:cli-latest`
Docker image if not invoking the Python script directly.
## Batch Shipyard Invocation
If you installed Batch Shipyard using the `install.sh` script, then you can
invoke as:

```shell
# Change directory to batch-shipyard installed directory
./shipyard
```
You can also invoke `shipyard` from any directory if given the full path
to the script.
If you are on Windows and installed using the `install.cmd` script, then
you can invoke as:

```shell
shipyard.cmd
```
If you installed manually (i.e., did not use the installer scripts), then
you will need to invoke the Python interpreter and pass the script as an
argument. For example:

```shell
python3 shipyard.py
```
The `-h` or `--help` option will list the available options, which are
explained below.
## Note about interoperability with Azure Tooling and Azure Batch APIs
Nearly all REST calls or commands issued against the normal Azure Batch APIs and tooling, such as the Azure Portal or Azure CLI, will work against resources created by Batch Shipyard. However, there are some notable exceptions:
- All pools must be created with Batch Shipyard if you intend to use any
  Batch Shipyard functionality.
- Please note all of the current limitations for other actions.
- Batch Shipyard pools that are deleted outside of Batch Shipyard will not
  have their associated metadata (in Azure Storage) cleaned up. Please use
  the `pool del` command instead. You can use the `storage` command to clean
  up orphaned data if you accidentally deleted Batch Shipyard pools outside
  of Batch Shipyard.
## Commands and Sub-commands
`shipyard` (and `shipyard.py`) is invoked with a command and sub-command as
positional arguments, i.e.:

```shell
shipyard <command> <subcommand> <options>
```
For instance:

```shell
shipyard pool add --configdir config
```

would create a pool on the Batch account as specified in the config files
found in the `config` directory. Please note that `<options>` must be
specified after the command and subcommand.
You can issue the `-h` or `--help` option at every level to view all
available options for that level and additional help text. For example:

```shell
shipyard -h
shipyard pool -h
shipyard pool add -h
```
## Shared Options
There is a set of shared options which are used by most sub-commands. These options must be specified after the command and sub-command. These are:
```
  -y, --yes                       Assume yes for all confirmation prompts
  --show-config                   Show configuration
  -v, --verbose                   Verbose output
  --configdir TEXT                Configuration directory where all
                                  configuration files can be found. Each json
                                  config file must be named exactly the same
                                  as the regular switch option, e.g.,
                                  pool.json for --pool. Individually specified
                                  config options take precedence over this
                                  option.
  --credentials TEXT              Credentials json config file
  --config TEXT                   Global json config file
  --fs TEXT                       Filesystem json config file
  --pool TEXT                     Pool json config file
  --jobs TEXT                     Jobs json config file
  --subscription-id TEXT          Azure Subscription ID
  --keyvault-uri TEXT             Azure KeyVault URI
  --keyvault-credentials-secret-id TEXT
                                  Azure KeyVault credentials secret id
  --aad-endpoint TEXT             Azure Active Directory endpoint
  --aad-directory-id TEXT         Azure Active Directory directory (tenant) id
  --aad-application-id TEXT       Azure Active Directory application (client)
                                  id
  --aad-auth-key TEXT             Azure Active Directory authentication key
  --aad-user TEXT                 Azure Active Directory user
  --aad-password TEXT             Azure Active Directory password
  --aad-cert-private-key TEXT     Azure Active Directory private key for X.509
                                  certificate
  --aad-cert-thumbprint TEXT      Azure Active Directory certificate SHA1
                                  thumbprint
```
- `-y` or `--yes` is to assume yes for all confirmation prompts
- `--show-config` will output the merged configuration prior to execution
- `-v` or `--verbose` is for verbose output
- `--configdir path` can be used instead of the individual config switches
  below if all configuration json files are in one directory and named after
  their switch. For example, if you have a directory named `config` and under
  that directory you have the files `credentials.json`, `config.json`,
  `pool.json` and `jobs.json`, then you can use this argument instead of the
  following:
- `--credentials path/to/credentials.json` is required for all actions
  except for a select few `keyvault` commands.
- `--config path/to/config.json` is required for all actions.
- `--pool path/to/pool.json` is required for most actions.
- `--jobs path/to/jobs.json` is required for job-related actions.
- `--fs path/to/fs.json` is required for fs-related actions and some pool
  actions.
- `--subscription-id` is the Azure Subscription Id associated with the
  Batch account or Remote file system resources. This is only required for
  creating pools with a virtual network specification or with `fs` commands.
- `--keyvault-uri` is required for all `keyvault` commands.
- `--keyvault-credentials-secret-id` is required if utilizing a credentials
  json stored in Azure KeyVault.
- `--aad-endpoint` is the Active Directory endpoint for the resource. Note
  that this can cause conflicts for actions that require multiple endpoints
  for different resources. It is better to specify endpoints explicitly in
  the credentials file.
- `--aad-directory-id` is the Active Directory Directory Id (or Tenant Id).
- `--aad-application-id` is the Active Directory Application Id (or Client
  Id).
- `--aad-auth-key` is the authentication key for the application (or
  client).
- `--aad-user` is the Azure Active Directory user.
- `--aad-password` is the Azure Active Directory password for the user.
- `--aad-cert-private-key` is the Azure Active Directory Service Principal
  RSA private key corresponding to the X.509 certificate for
  certificate-based auth.
- `--aad-cert-thumbprint` is the X.509 certificate thumbprint for Azure
  Active Directory certificate-based auth.
Note that only one of Active Directory Service Principal or User/Password can
be specified at once, i.e., `--aad-auth-key`, `--aad-password`, and
`--aad-cert-private-key` are mutually exclusive.
Note that the following options can be specified as environment variables instead:
- `SHIPYARD_CONFIGDIR` in lieu of `--configdir`
- `SHIPYARD_CREDENTIALS_JSON` in lieu of `--credentials`
- `SHIPYARD_CONFIG_JSON` in lieu of `--config`
- `SHIPYARD_POOL_JSON` in lieu of `--pool`
- `SHIPYARD_JOBS_JSON` in lieu of `--jobs`
- `SHIPYARD_FS_JSON` in lieu of `--fs`
- `SHIPYARD_SUBSCRIPTION_ID` in lieu of `--subscription-id`
- `SHIPYARD_KEYVAULT_URI` in lieu of `--keyvault-uri`
- `SHIPYARD_KEYVAULT_CREDENTIALS_SECRET_ID` in lieu of `--keyvault-credentials-secret-id`
- `SHIPYARD_AAD_ENDPOINT` in lieu of `--aad-endpoint`
- `SHIPYARD_AAD_DIRECTORY_ID` in lieu of `--aad-directory-id`
- `SHIPYARD_AAD_APPLICATION_ID` in lieu of `--aad-application-id`
- `SHIPYARD_AAD_AUTH_KEY` in lieu of `--aad-auth-key`
- `SHIPYARD_AAD_USER` in lieu of `--aad-user`
- `SHIPYARD_AAD_PASSWORD` in lieu of `--aad-password`
- `SHIPYARD_AAD_CERT_PRIVATE_KEY` in lieu of `--aad-cert-private-key`
- `SHIPYARD_AAD_CERT_THUMBPRINT` in lieu of `--aad-cert-thumbprint`
## Commands
The `shipyard` (and `shipyard.py`) script contains the following top-level
commands:

```
  cert      Certificate actions
  data      Data actions
  fs        Filesystem in Azure actions
  jobs      Jobs actions
  keyvault  KeyVault actions
  misc      Miscellaneous actions
  pool      Pool actions
  storage   Storage actions
```
- `cert` commands deal with certificates to be used with Azure Batch
- `data` commands deal with data ingress and egress from Azure
- `fs` commands deal with Batch Shipyard provisioned remote filesystems in
  Azure
- `jobs` commands deal with Azure Batch jobs and tasks
- `keyvault` commands deal with Azure KeyVault secrets for use with Batch
  Shipyard
- `misc` commands are miscellaneous commands that don't fall into other
  categories
- `pool` commands deal with Azure Batch pools
- `storage` commands deal with Batch Shipyard metadata on Azure Storage
### `cert` Command
The `cert` command has the following sub-commands:

```
  add     Add a certificate to a Batch account
  create  Create a certificate to use with a Batch...
  del     Deletes a certificate from the Batch account
  list    List all certificates in a Batch account
```
- `add` will add a certificate to the Batch account
- `create` will create a certificate locally for use with the Batch
  account. You must edit your `config.json` to incorporate the generated
  certificate and then invoke the `cert add` command. Please see the
  credential encryption guide for more information.
- `del` will delete a certificate from the Batch account
- `list` will list certificates in the Batch account
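For example, a typical flow (a minimal sketch; `--configdir config` assumes
your json config files reside in a directory named `config`):

```shell
# generate a certificate locally, then edit config.json to reference it
shipyard cert create --configdir config
# add the generated certificate to the Batch account
shipyard cert add --configdir config
```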
### `data` Command
The `data` command has the following sub-commands:

```
  getfile      Retrieve file(s) from a job/task
  getfilenode  Retrieve file(s) from a compute node
  ingress      Ingress data into Azure
  listfiles    List files for tasks in jobs
  stream       Stream a text file to the local console
```
- `getfile` will retrieve a file with job, task, filename semantics
    - `--all --filespec <jobid>,<taskid>,<include pattern>` can be given to
      download all files for the job and task with an optional include
      pattern
    - `--filespec <jobid>,<taskid>,<filename>` can be given to download one
      specific file from the job and task. If `<taskid>` is set to
      `@FIRSTRUNNING`, then the first running task within the job of
      `<jobid>` will be used to locate the `<filename>`.
- `getfilenode` will retrieve a file with node id and filename semantics
    - `--all --filespec <nodeid>,<include pattern>` can be given to download
      all files from the compute node with the optional include pattern
    - `--filespec <nodeid>,<filename>` can be given to download one specific
      file from the compute node
- `ingress` will ingress data as specified in configuration files
    - `--to-fs <STORAGE_CLUSTER_ID>` transfers data as specified in
      configuration files to the specified remote file system storage
      cluster instead of Azure Storage
- `listfiles` will list files for all tasks in jobs
    - `--jobid` force scope to just this job id
    - `--taskid` force scope to just this task id
- `stream` will stream a file as text (UTF-8 decoded) to the local console
  or binary if streamed to disk
    - `--disk` will write the streamed data as binary to disk instead of
      output to local console
    - `--filespec <jobid>,<taskid>,<filename>` can be given to stream a
      specific file. If `<taskid>` is set to `@FIRSTRUNNING`, then the first
      running task within the job of `<jobid>` will be used to locate the
      `<filename>`.
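For example, two sketches (the job id `job1` and node id `mynodeid` are
hypothetical):

```shell
# stream stdout.txt from the first running task in job "job1"
shipyard data stream --configdir config --filespec job1,@FIRSTRUNNING,stdout.txt
# download all text files from a specific compute node
shipyard data getfilenode --configdir config --all --filespec "mynodeid,*.txt"
```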
### `fs` Command
The `fs` command has the following sub-commands which work on two different
parts of a remote filesystem:

```
  cluster  Filesystem storage cluster in Azure actions
  disks    Managed disk actions
```
#### `fs cluster` Command
The `fs cluster` command has the following sub-commands:

```
  add      Create a filesystem storage cluster in Azure
  del      Delete a filesystem storage cluster in Azure
  expand   Expand a filesystem storage cluster in Azure
  resize   Resize a filesystem storage cluster in Azure.
  ssh      Interactively login via SSH to a filesystem...
  start    Starts a previously suspended filesystem...
  status   Query status of a filesystem storage cluster...
  suspend  Suspend a filesystem storage cluster in Azure
```
As the `fs.json` configuration file can contain multiple storage cluster
definitions, all `fs cluster` commands require a `STORAGE_CLUSTER_ID`
argument, specified after any options, that targets the storage cluster to
perform the action against.
- `add` will create a remote fs cluster as defined in the fs config file
- `del` will delete a remote fs cluster as defined in the fs config file
    - `--delete-resource-group` will delete the entire resource group that
      contains the server. Please take care when using this option as any
      resource in the resource group is deleted, which may include other
      resources that are not Batch Shipyard related.
    - `--delete-data-disks` will delete attached data disks
    - `--delete-virtual-network` will delete the virtual network and all of
      its subnets
    - `--generate-from-prefix` will attempt to generate all resource names
      using the naming conventions used at creation. This is helpful when
      there was an issue with cluster deletion and the original virtual
      machine resources can no longer be enumerated. Note that OS disks and
      data disks cannot be deleted with this option. Please use
      `fs disks del` to delete disks that may have been used in the storage
      cluster.
    - `--no-wait` does not wait for deletion completion. It is not
      recommended to use this parameter.
- `expand` expands the number of disks used by the underlying filesystems
  on the file server. By default, data and metadata are rebalanced among
  the disks for better data spread and performance after the disks are
  added to the array.
    - `--no-rebalance` skips this rebalancing step
- `resize` resizes the storage cluster with additional virtual machines as
  specified in the configuration. This is an experimental feature.
- `ssh` will interactively log into a virtual machine in the storage
  cluster. If neither `--cardinal` nor `--hostname` is specified,
  `--cardinal 0` is assumed. `COMMAND` is an optional argument to specify
  the command to run. If your command has switches, preface `COMMAND` with
  a double dash as per POSIX convention, e.g.,
  `fs cluster ssh mycluster -- df -h`.
    - `--cardinal` is the zero-based cardinal number of the virtual machine
      in the storage cluster to connect to
    - `--hostname` is the hostname of the virtual machine in the storage
      cluster to connect to
    - `--tty` allocates a pseudo-terminal
- `start` will start a previously suspended storage cluster
    - `--no-wait` does not wait for the restart to complete. It is not
      recommended to use this parameter.
- `status` displays the status of the storage cluster
    - `--detail` reports in-depth details about each virtual machine in the
      storage cluster
    - `--hosts` will output the public IP to hosts mapping for mounting a
      `glusterfs` based remote filesystem locally. `glusterfs` must be
      allowed in the network security rules for this to work properly.
- `suspend` suspends a storage cluster
    - `--no-wait` does not wait for the suspension to complete. It is not
      recommended to use this parameter.
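For example, assuming a hypothetical storage cluster id `mystoragecluster`
defined in `fs.json`:

```shell
# query detailed status of the storage cluster
shipyard fs cluster status --configdir config --detail mystoragecluster
# SSH into the first virtual machine and check filesystem usage
shipyard fs cluster ssh --configdir config mystoragecluster -- df -h
```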
#### `fs disks` Command
The `fs disks` command has the following sub-commands:

```
  add   Create managed disks in Azure
  del   Delete managed disks in Azure
  list  List managed disks in resource group
```
- `add` creates managed disks as specified in the fs config file
- `del` deletes managed disks as specified in the fs config file
    - `--all` deletes all managed disks found in a specified resource group
    - `--name` deletes a specific named disk in a resource group
    - `--no-wait` does not wait for disk deletion to complete. It is not
      recommended to use this parameter.
    - `--resource-group` deletes one or more managed disks in this resource
      group
- `list` lists managed disks found in a resource group
    - `--resource-group` lists disks in this resource group only
    - `--restrict-scope` lists disks only if found in the fs config file
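For instance, a minimal sketch (the resource group name `myrg` is
hypothetical):

```shell
# list managed disks in a resource group, restricted to those in fs.json
shipyard fs disks list --configdir config --resource-group myrg --restrict-scope
```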
### `jobs` Command
The `jobs` command has the following sub-commands:

```
  add        Add jobs
  cmi        Cleanup multi-instance jobs
  del        Delete jobs and job schedules
  deltasks   Delete specified tasks in jobs
  disable    Disable jobs and job schedules
  enable     Enable jobs and job schedules
  list       List jobs
  listtasks  List tasks within jobs
  migrate    Migrate jobs or job schedules to another pool
  stats      Get statistics about jobs
  term       Terminate jobs and job schedules
  termtasks  Terminate specified tasks in jobs
```
- `add` will add all jobs and tasks defined in the jobs configuration file
  to the Batch pool
    - `--recreate` will recreate any completed jobs with the same id
    - `--tail` will tail the specified file of the last job and task added
      with this command invocation
- `cmi` will cleanup any stale multi-instance tasks and jobs. Note that
  this sub-command is typically not required if
  `multi_instance_auto_complete` is set to `true` in the job specification
  for the job.
    - `--delete` will delete any stale cleanup jobs
- `del` will delete jobs and job schedules specified in the jobs
  configuration file. If an autopool is specified for all jobs and a jobid
  option is not specified, the storage associated with the autopool will be
  cleaned up.
    - `--all-jobs` will delete all jobs found in the Batch account
    - `--all-jobschedules` will delete all job schedules found in the Batch
      account
    - `--jobid` force deletion scope to just this job id
    - `--jobscheduleid` force deletion scope to just this job schedule id
    - `--termtasks` will manually terminate tasks prior to deletion.
      Termination of running tasks requires a valid SSH user.
    - `--wait` will wait for deletion to complete
- `deltasks` will delete tasks within jobs specified in the jobs
  configuration file. Active or running tasks will be terminated first.
    - `--jobid` force deletion scope to just this job id
    - `--taskid` force deletion scope to just this task id
    - `--wait` will wait for deletion to complete
- `disable` will disable jobs or job schedules
    - `--jobid` force disable scope to just this job id
    - `--jobscheduleid` force disable scope to just this job schedule id
    - `--requeue` requeue running tasks
    - `--terminate` terminate running tasks
    - `--wait` wait for running tasks to complete
- `enable` will enable jobs or job schedules
    - `--jobid` force enable scope to just this job id
    - `--jobscheduleid` force enable scope to just this job schedule id
- `list` will list all jobs in the Batch account
- `listtasks` will list tasks from jobs specified in the jobs configuration
  file
    - `--all` list all tasks in all jobs in the account
    - `--jobid` force scope to just this job id
    - `--poll-until-tasks-complete` will poll until all tasks have completed
- `migrate` will migrate jobs or job schedules to another pool
    - `--jobid` force migration scope to just this job id
    - `--jobscheduleid` force migration scope to just this job schedule id
    - `--poolid` force migration to this specified pool id
    - `--requeue` requeue running tasks
    - `--terminate` terminate running tasks
    - `--wait` wait for running tasks to complete
- `stats` will generate a statistics summary of a job or jobs
    - `--jobid` will query the specified job instead of all jobs
- `term` will terminate jobs and job schedules found in the jobs
  configuration file. If an autopool is specified for all jobs and a jobid
  option is not specified, the storage associated with the autopool will be
  cleaned up.
    - `--all-jobs` will terminate all jobs found in the Batch account
    - `--all-jobschedules` will terminate all job schedules found in the
      Batch account
    - `--jobid` force termination scope to just this job id
    - `--jobscheduleid` force termination scope to just this job schedule id
    - `--termtasks` will manually terminate tasks prior to termination.
      Termination of running tasks requires a valid SSH user.
    - `--wait` will wait for termination to complete
- `termtasks` will terminate tasks within jobs specified in the jobs
  configuration file. Termination of running tasks requires a valid SSH
  user.
    - `--force` force send docker kill signal regardless of task state
    - `--jobid` force termination scope to just this job id
    - `--taskid` force termination scope to just this task id
    - `--wait` will wait for termination to complete
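For example, a common job lifecycle (a sketch; the job id `job1` is
hypothetical):

```shell
# add jobs and tail stdout of the last added task
shipyard jobs add --configdir config --tail stdout.txt
# later, terminate and then delete the job
shipyard jobs term --configdir config --jobid job1 --wait
shipyard jobs del --configdir config --jobid job1 --wait
```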
### `keyvault` Command
The `keyvault` command has the following sub-commands:

```
  add   Add a credentials json as a secret to Azure...
  del   Delete a secret from Azure KeyVault
  list  List secret ids and metadata in an Azure...
```
The following sub-commands require `--keyvault-*` and `--aad-*` options in
order to work. Alternatively, you can specify these in the
`credentials.json` file, but these options are mutually exclusive of other
properties. Please refer to the Azure KeyVault and Batch Shipyard guide for
more information.
- `add` will add the specified credentials json as a secret to an Azure
  KeyVault. A valid credentials json must be specified as an option.
    - `NAME` argument is required, which is the name of the secret
      associated with the credentials json to store in the KeyVault
- `del` will delete a secret from the Azure KeyVault
    - `NAME` argument is required, which is the name of the secret to
      delete from the KeyVault
- `list` will list all secret ids and metadata in an Azure KeyVault
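For example, a minimal sketch (the KeyVault URI and secret name are
hypothetical):

```shell
# store credentials.json as a secret named "mycreds" in the vault
shipyard keyvault add --credentials credentials.json \
    --keyvault-uri https://myvault.vault.azure.net/ mycreds
```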
### `misc` Command
The `misc` command has the following sub-commands:

```
  tensorboard  Create a tunnel to a Tensorboard instance for...
```
- `tensorboard` will create a tunnel to the compute node that is running or
  has run the specified task
    - `--jobid` specifies the job id to use. If this is not specified, the
      first and only jobspec is used from `jobs.json`.
    - `--taskid` specifies the task id to use. If this is not specified,
      the last run or running task for the job is used.
    - `--logdir` specifies the TensorFlow logs directory generated by
      summary operations
    - `--image` specifies an alternate TensorFlow image to use for
      Tensorboard. The `tensorboard.py` file must be in the expected
      location in the Docker image, as with stock TensorFlow images. If not
      specified, Batch Shipyard will attempt to find a suitable TensorFlow
      image from Docker images in the global resource list or will acquire
      one on demand for this command.
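For example (a sketch; the job id, task id, and log directory are
hypothetical):

```shell
# tunnel to a Tensorboard instance for a specific task's summary logs
shipyard misc tensorboard --configdir config --jobid job1 \
    --taskid task-00000 --logdir /tmp/tensorboard_logs
```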
### `pool` Command
The `pool` command has the following sub-commands:

```
  add         Add a pool to the Batch account
  asu         Add an SSH user to all nodes in pool
  autoscale   Pool autoscale actions
  del         Delete a pool from the Batch account
  delnode     Delete a node from a pool
  dsu         Delete an SSH user from all nodes in pool
  grls        Get remote login settings for all nodes in...
  list        List all pools in the Batch account
  listimages  List Docker images in a pool
  listnodes   List nodes in pool
  listskus    List available VM configurations available to...
  rebootnode  Reboot a node or nodes in a pool
  resize      Resize a pool
  ssh         Interactively login via SSH to a node in a...
  stats       Get statistics about a pool
  udi         Update Docker images in a pool
```
- `add` will add the pool defined in the pool configuration file to the
  Batch account
- `asu` will add the SSH user defined in the pool configuration file to
  all nodes in the specified pool
- `autoscale` will invoke the autoscale sub-command. The autoscale
  sub-command has 4 sub-commands:
    - `disable` will disable autoscale on the pool
    - `enable` will enable autoscale on the pool
    - `evaluate` will evaluate the autoscale formula in the pool
      configuration file
    - `lastexec` will query the last execution information for autoscale
- `del` will delete the pool defined in the pool configuration file from
  the Batch account along with associated metadata in Azure Storage used by
  Batch Shipyard. It is recommended to use this command instead of deleting
  a pool directly from the Azure Portal, Batch Labs, or other tools as this
  action can conveniently remove all associated Batch Shipyard metadata on
  Azure Storage.
    - `--poolid` will delete the specified pool instead of the pool from
      the pool configuration file
    - `--wait` will wait for deletion to complete
- `delnode` will delete the specified node from the pool
    - `--all-start-task-failed` will delete all nodes in the start task
      failed state
    - `--all-unusable` will delete all nodes in the unusable state
    - `--nodeid` is the node id to delete
- `dsu` will delete the SSH user defined in the pool configuration file
  from all nodes in the specified pool
- `grls` will retrieve all of the remote login settings for every node in
  the specified pool
- `list` will list all pools in the Batch account
- `listimages` will query the nodes in the pool for Docker images. Common
  and mismatched images will be listed. Requires a provisioned SSH user and
  private key.
- `listnodes` will list all nodes in the specified pool
- `rebootnode` will reboot a specified node in the pool
    - `--all-start-task-failed` will reboot all nodes in the start task
      failed state
    - `--nodeid` is the node id to reboot
- `resize` will resize the pool to the `vm_count` specified in the pool
  configuration file
    - `--wait` will wait for the resize to complete
- `ssh` will interactively log into a compute node via SSH. If neither
  `--cardinal` nor `--nodeid` is specified, `--cardinal 0` is assumed.
  `COMMAND` is an optional argument to specify the command to run. If your
  command has switches, preface `COMMAND` with a double dash as per POSIX
  convention, e.g., `pool ssh -- sudo docker ps -a`.
    - `--cardinal` is the zero-based cardinal number of the compute node in
      the pool to connect to as listed by `grls`
    - `--nodeid` is the node id to connect to in the pool
    - `--tty` allocates a pseudo-terminal
- `stats` will generate a statistics summary of the pool
    - `--poolid` will query the specified pool instead of the pool from the
      pool configuration file
- `udi` will update Docker images on all compute nodes of the pool. This
  command requires a valid SSH user.
    - `--image` will restrict the update to just the image or image:tag
    - `--digest` will restrict the update to just the image or image:tag
      and a specific digest
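For example (a sketch; the image name is hypothetical):

```shell
# SSH into the first compute node and list running containers
shipyard pool ssh --configdir config --cardinal 0 -- sudo docker ps -a
# refresh a single Docker image across all compute nodes
shipyard pool udi --configdir config --image alfpark/myimage:latest
```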
### `storage` Command
The `storage` command has the following sub-commands:

```
  clear  Clear Azure Storage containers used by Batch...
  del    Delete Azure Storage containers used by Batch...
```
- `clear` will clear the Azure Storage containers used by Batch Shipyard
  for metadata purposes
    - `--poolid` will target a specific pool id rather than the pool from
      configuration
- `del` will delete the Azure Storage containers used by Batch Shipyard
  for metadata purposes
    - `--clear-tables` will clear tables instead of deleting them
    - `--poolid` will target a specific pool id
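For example, to clean up metadata left behind by a pool deleted outside of
Batch Shipyard (a sketch; the pool id `mypool` is hypothetical):

```shell
# delete all Batch Shipyard metadata containers for the orphaned pool
shipyard storage del --configdir config --poolid mypool
```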
## Example Invocations
```shell
shipyard pool add --credentials credentials.json --config config.json --pool pool.json
# ... or if all config files are in the current working directory named as above ...
shipyard pool add --configdir .
# ... or use environment variables instead
SHIPYARD_CONFIGDIR=. shipyard pool add
```
The above invocation will add the pool specified to the Batch account. Notice that the options and shared options are given after the command and sub-command and not before.
```shell
shipyard jobs add --configdir .
# ... or use environment variables instead
SHIPYARD_CONFIGDIR=. shipyard jobs add
```

The above invocation will add the jobs specified in the `jobs.json` file to
the designated pool.
```shell
shipyard data stream --configdir . --filespec job1,task-00000,stdout.txt
# ... or use environment variables instead
SHIPYARD_CONFIGDIR=. shipyard data stream --filespec job1,task-00000,stdout.txt
```

The above invocation will stream the `stdout.txt` file of task
`task-00000` in the job `job1` from a live compute node. Because all
portions of the `--filespec` option are specified, the tool will not prompt
for any input.
## Batch Shipyard Docker Image CLI Invocation
If using the `alfpark/batch-shipyard:cli-latest` Docker image, then you
would invoke the tool as:

```shell
docker run --rm -it alfpark/batch-shipyard:cli-latest <command> <subcommand> <options...>
```

where `<command> <subcommand>` is the command and subcommand as described
above and `<options...>` are any additional options to pass to the
`<subcommand>`.
Invariably, you will need to pass config files to the tool which reside on
the host and not in the container by default. Please use the `-v` volume
mount option with `docker run` to mount host directories inside the
container. For example, if your Batch Shipyard configs are stored in the
host path `/home/user/batch-shipyard-configs` you could modify the
`docker run` command as:

```shell
docker run --rm -it -v /home/user/batch-shipyard-configs:/configs -e SHIPYARD_CONFIGDIR=/configs alfpark/batch-shipyard:cli-latest <command> <subcommand> <options...>
```
Notice that we specified a Docker environment variable via
`-e SHIPYARD_CONFIGDIR` to match the container path of the volume mount.
Additionally, if you wish to ingress data from locally accessible file systems using Batch Shipyard, then you will need to map additional volume mounts as appropriate from the host to the container.
Batch Shipyard may generate files with some actions, such as adding an SSH
user or creating a pool with an SSH user. In this case, you will need to
create a volume mount with the `-v` option and also ensure that the pool
specification `ssh` object has a `generated_file_export_path` property set
to the volume mount path. This will ensure that generated files are written
to the host and persist after the Docker container exits. Otherwise, the
generated files will only reside within the Docker container and will not
be available for use on the host (e.g., SSH into a compute node with the
generated RSA private key or use the generated SSH docker tunnel script).
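For example, a sketch that persists generated SSH keys to the host
(assuming the `generated_file_export_path` property in `pool.json` is set
to `/configs`):

```shell
# add an SSH user; generated key files land in the mounted host directory
docker run --rm -it -v /home/user/batch-shipyard-configs:/configs \
    -e SHIPYARD_CONFIGDIR=/configs \
    alfpark/batch-shipyard:cli-latest pool asu
```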
## Remote Filesystem Support
For more information regarding remote filesystems and Batch Shipyard, please see this page.
## Data Movement
For more information regarding data movement with respect to Batch Shipyard, please see this page.
## Multi-Instance Tasks
For more information regarding Multi-Instance Tasks and/or MPI jobs using Batch Shipyard, please see this page.
## Current Limitations
Please see this page for current limitations.
## Explore Recipes and Samples
Visit the recipes directory for different sample Docker workloads using Azure Batch and Batch Shipyard.
## Need Help?
Open an issue on the GitHub project page.