# Change Log
## [Unreleased]
## [3.9.1] - 2019-12-13
### Added
- Support `--no-wait` on pool creation to allow the command to skip
waiting for the pool to become idle
- Allow ignoring GPU warnings; please see the pool configuration doc
- Ubuntu 18.04 SR-IOV IB/RDMA Packer script
### Changed
- **Breaking Change:** improved per-job autoscratch setup. As part of this
change the `auto_scratch` property in the jobs configuration has changed.
    - Provide ability to set up via dependency or blocking behavior
    - Allow specifying the number of VMs to span
    - Allow specifying the per-job autoscratch task id
- Allow multiple multi-instance tasks per job in non-`native` mode
- Update GlusterFS version to 7
### Fixed
- Fix merge task regression with enhanced autogenerated task id support
- Fix job schedule submission regression
([#329](https://github.com/Azure/batch-shipyard/issues/329))
- Fix per-job autoscratch provisioning due to upstream dependency changes
## [3.9.0] - 2019-11-15 (SC19 Edition)
### Added
- Support for [Encrypted Singularity Containers](docs/50-batch-shipyard-encrypted-containers.md)
- Enhanced support for autogenerated task id schemas
([#324](https://github.com/Azure/batch-shipyard/issues/324))
- SR-IOV IB/RDMA support for Ubuntu-based images
- Support for CentOS/CentOS-HPC 7.7
- Support Hyper-V Gen2 boot
- Ubuntu 16.04 SR-IOV IB/RDMA Packer script
### Changed
- **Breaking Change:** the `singularity_images` property in the global
configuration has been modified to accommodate encrypted container images.
Please see the global configuration doc for more information.
- **Breaking Change:** non-`native` pools using `STANDARD_NC24rs_v3` will now
default to using SR-IOV IB/RDMA settings. `native` pools using this VM size
will rely on the Azure Batch runtime to bind the correct container settings.
Please see the
[Azure Batch Guidance issue](https://github.com/Azure/Batch/issues/73)
for more information.
- Improve Azure blob/fileshare mount logic and add retries
- Set the proper `FI_PROVIDER` env var for `intel-ofi` MPI runtime selection
- Updated Docker CE to 19.03.5
- Updated Singularity to 3.5.0
- Updated NC/ND driver to 418.87.01, NV driver to 430.46
- Updated LIS to 4.3.4
- Updated blobxfer to 1.9.4
- Updated Python to 3.7.5 for pre-built binaries
- Updated dependencies to latest, where applicable
- Update OSUMicroBenchmarks recipe to MVAPICH-2.3.2
### Fixed
- Fix task `output_data` to correctly honor virtual directories in remote
paths for native pools
([#313](https://github.com/Azure/batch-shipyard/issues/313))
- Fix blobfuse not properly remounting on reboot
([#320](https://github.com/Azure/batch-shipyard/issues/320))
- Fix services/images table overflow
([#327](https://github.com/Azure/batch-shipyard/issues/327))
- Fix `pre_execution_command` in non-native mode to not be wrapped in
a shell
- Fix potential race in `--tail` between task completion and streaming last
portion of file
- Fix ephemeral device detection
- Allow further retries on package manager contention
- Fix RemoteFS issues
    - Non-Samba enabled clusters
    - Single disk servers
- Fix Slurm issues
    - Slurm cluster provisioning without Public IP address ([#322](https://github.com/Azure/batch-shipyard/pull/322))
    - Priority tier check for partitions ([#323](https://github.com/Azure/batch-shipyard/pull/323))
- Various documentation updates
([#314](https://github.com/Azure/batch-shipyard/issues/314),
[#315](https://github.com/Azure/batch-shipyard/issues/315),
[#319](https://github.com/Azure/batch-shipyard/issues/319))
### Removed
- Support for CentOS 7.5, CentOS-HPC 7.1/7.3, and WindowsServerSemiAnnual
Datacenter-Core-1709-with-Containers-smalldisk/Datacenter-Core-1803-with-Containers-smalldisk
## [3.8.2] - 2019-09-12
### Changed
- `blobxfer` program output for handling `input_data` and `output_data` on
non-native pools is now captured to separate files named
`blobxfer-download.log` and `blobxfer-upload.log`, respectively. This
prevents pollution of the stdout/stderr streams by the data transfer
phases.
- Updated Docker CE to 19.03.2
- Updated NC/ND driver to 418.87.00
- Updated blobxfer to 1.9.2
- Updated dependencies
### Fixed
- Fix prefix filter not being applied on task factory `remote_path`
([#303](https://github.com/Azure/batch-shipyard/issues/303))
- Fix non-string pickling in recurring job definitions
([#306](https://github.com/Azure/batch-shipyard/issues/306))
- Fix potential null values on node error collections and node agent info
on preempted nodes
([#307](https://github.com/Azure/batch-shipyard/issues/307),
[#309](https://github.com/Azure/batch-shipyard/issues/309))
- Fix task termination for infinite retry tasks and in non-native mode over
SSH ([#308](https://github.com/Azure/batch-shipyard/issues/308))
- Fix non-native data transfer sequence coupling
([#310](https://github.com/Azure/batch-shipyard/issues/310))
- Prevent job submission on pools without task runner
([#312](https://github.com/Azure/batch-shipyard/issues/312))
- Fix task `output_data` with include filters for native pools
([#313](https://github.com/Azure/batch-shipyard/issues/313))
- Fix downloading of cascade logs on start task failure
- Update documentation regarding AAD and subscription id requirements
along with better error messages
([#305](https://github.com/Azure/batch-shipyard/issues/305))
- Update documentation regarding Windows vs Linux environment
variables
([#311](https://github.com/Azure/batch-shipyard/issues/311))
## [3.8.1] - 2019-08-19
### Changed
- Updated blobxfer to 1.9.1
### Fixed
- Task runner regressions for non-native mode pools including `input_data`,
`output_data` and `pre_execution_command` for native mode pools
([#301](https://github.com/Azure/batch-shipyard/issues/301))
- Provisioning Network Direct RDMA VM sizes (A8/A9/NC24rX/H16r/H16mr) resulted
in start task failures
([#299](https://github.com/Azure/batch-shipyard/issues/299))
## [3.8.0] - 2019-08-13
### Added
- Revamped Singularity support, including support for Singularity 3,
SIF images, and pull support from ACR registries for SIF images via ORAS.
Please see the global and jobs configuration docs for more information.
([#146](https://github.com/Azure/batch-shipyard/issues/146))
- New MPI interface in jobs configuration for seamless multi-instance task
executions with automatic configuration for SR-IOV RDMA VM sizes with support
for popular MPI runtimes including OpenMPI, MPICH, Intel MPI, and MVAPICH
([#287](https://github.com/Azure/batch-shipyard/issues/287))
- Support for Hb/Hc SR-IOV RDMA VM sizes
([#277](https://github.com/Azure/batch-shipyard/issues/277))
- Support for NC/NV/H Promo VM sizes
- Support for user-specified job preparation and release tasks on the host
([#202](https://github.com/Azure/batch-shipyard/issues/202))
- Support for conditional output data
([#230](https://github.com/Azure/batch-shipyard/issues/230))
- Support for bring your own public IP addresses on Batch pools.
Please see the pool configuration doc and the
[Virtual Networks and Public IPs guide](docs/64-batch-shipyard-byovnet.md)
for more information.
- Support for Shared Image Gallery for custom images
- Support for CentOS HPC 7.6 native conversion
- Additional Slurm configuration options
- New recipes: mpiBench across various configurations,
OpenFOAM-Infiniband-OpenMPI, OSUMicroBenchmarks-Infiniband-MVAPICH
### Changed
- **Breaking Change:** jobs cannot be submitted against pre-`3.8.0` pools.
Pools must be re-created with `3.8.0` or later.
- **Breaking Change:** the `singularity_images` property in the global
configuration has been modified to accommodate Singularity 3 support.
Please see the global configuration doc for more information.
([#146](https://github.com/Azure/batch-shipyard/issues/146))
- **Breaking Change:** the `gpu` property in the jobs configuration has
been changed to `gpus` to accommodate the new native GPU execution
support in Docker 19.03. Please see the jobs configuration doc for
more information.
([#293](https://github.com/Azure/batch-shipyard/issues/293))
- `pool images` commands now support Singularity
- Non-native task execution is now proxied via script
([#235](https://github.com/Azure/batch-shipyard/issues/235))
- Batch Shipyard images have been migrated to the Microsoft Container Registry
([#278](https://github.com/Azure/batch-shipyard/issues/278))
- Updated Docker CE to 19.03.1
- Updated blobxfer to 1.9.0
- Updated LIS to 4.3.3
- Updated NC/ND driver to 418.67, NV driver to 430.30
- Updated Batch Insights to 1.3.0
- Updated dependencies to latest, where applicable
- Updated Python to 3.7.4 for pre-built binaries
- Updated Docker images to use Alpine 3.10
- Various recipe updates to showcase the new MPI schema, HPLinpack and HPCG
updates to SR-IOV RDMA VM sizes
### Fixed
- Cargo Batch service client update missed
([#274](https://github.com/Azure/batch-shipyard/issues/274), [#296](https://github.com/Azure/batch-shipyard/issues/296))
- Premium File Shares were not enumerating correctly with AAD
([#294](https://github.com/Azure/batch-shipyard/issues/294))
- Per-job autoscratch setup failing for more than 2 nodes
### Removed
- Peer-to-peer image distribution support
- Python 3.4 support
## [3.7.1] - 2019-07-23
### Fixed
- Detection of graph root was broken with new version of Docker client (CLI)
on GPU pools ([#291](https://github.com/Azure/batch-shipyard/issues/291))
## [3.7.0] - 2019-02-28
### Added
- Slurm on Batch support: provision Slurm clusters with elastic cloud bursting
on Azure Batch pools. Please see the
[Slurm on Batch guide](https://github.com/Azure/batch-shipyard/blob/master/docs/69-batch-shipyard-slurm.md).
- Batch Insights integration ([#259](https://github.com/Azure/batch-shipyard/issues/259)),
please see the pool and credentials configuration docs.
- Support environment variables on additional node prep commands ([#253](https://github.com/Azure/batch-shipyard/pull/253))
- Support CentOS 7.6
- `pool exists` command
- `--recreate` flag for `pool add` to allow existing pools to be recreated
- `fs cluster orchestrate` command
- Sample Windows container recipes ([#246](https://github.com/Azure/batch-shipyard/issues/246))
### Changed
- **Breaking Change:** the `additional_node_prep_commands` property has
been migrated under the new `additional_node_prep` property as
`commands` ([#252](https://github.com/Azure/batch-shipyard/issues/252))
- Performance improvements to speed up job submission with large task
factories or a large number of tasks. Verbosity of task generation progress
has been increased which can be modified with `-v`.
- Updated blobxfer to 1.7.0 ([#255](https://github.com/Azure/batch-shipyard/issues/255))
- Updated LIS, NV driver to 410.92 and NC/ND driver to 410.104
- Updated other dependencies to latest
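
Configurations written before this release can apply the `additional_node_prep_commands` migration noted above mechanically. The following is a minimal sketch, assuming the relevant configuration section has already been loaded into a Python dictionary. The helper name `migrate_node_prep` and the sample commands are hypothetical and are not part of Batch Shipyard.

```python
import copy


def migrate_node_prep(section):
    """Return a copy of a config section with the pre-3.7.0 property
    `additional_node_prep_commands` moved to `additional_node_prep.commands`.

    `section` is assumed to be the dictionary that directly holds the
    property. This helper is illustrative and not part of Batch Shipyard.
    """
    migrated = copy.deepcopy(section)
    old_value = migrated.pop("additional_node_prep_commands", None)
    if old_value is not None:
        # Keep any keys already present under the new property.
        node_prep = migrated.setdefault("additional_node_prep", {})
        node_prep["commands"] = old_value
    return migrated


# Hypothetical sample using the `pre`/`post` layout introduced in 3.2.0.
old_section = {
    "additional_node_prep_commands": {
        "pre": ["echo before startup"],
        "post": ["echo after startup"],
    }
}
new_section = migrate_node_prep(old_section)
print(new_section)
```

The helper copies its input, so the original dictionary is left unchanged and can be compared against the migrated result.
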
### Fixed
- Some commands were incorrectly failing due to nodeid conflicts with
supplied parameters ([#249](https://github.com/Azure/batch-shipyard/issues/249))
- Azure Function extension installation failure ([#260](https://github.com/Azure/batch-shipyard/issues/260))
- Block job submission on non-active pools ([#251](https://github.com/Azure/batch-shipyard/issues/251))
- Missing files included in binary distributions ([#258](https://github.com/Azure/batch-shipyard/issues/258))
- Pools with accelerated networking would fail provisioning sometimes due to
infiniband devices being present for non-RDMA VM sizes
### Security
- Updated Docker CE to 18.09.2 to address the runc CVE-2019-5736
- Updated Singularity to 2.6.1 to address the shared mount propagation
vulnerability CVE-2018-19295
## [3.6.1] - 2018-12-03
### Added
- `force_enable_task_dependencies` property in jobs configuration to turn
on task dependencies on a job even when no task dependencies are present
initially. This is useful when tasks are added at a later time that may
have dependencies. Please consult the jobs documentation for more information.
- Windows Server 2019 support
- Genomics and Bioinformatics recipes: BLAST and RNASeq
- PyTorch recipes
### Changed
- Updated Docker CE to 18.09.0
- Updated blobxfer to 1.5.5
- Updated NC/ND driver to 410.79
- Updated NV driver to 410.71 with CUDA10 support
- Updated other dependencies to latest
### Fixed
- `--tail` console output occasionally repeating characters
- NV provisioning regressions
- Windows node prep issue
- `fs cluster status` issue
- Retry MSI provisioning for discrete VM resources
## [3.6.0] - 2018-11-06 (SC18 Edition)
### Added
- Kata containers support: run containers on Linux compute nodes with a higher
level of isolation through lightweight VMs. Please see the pool doc for more
information.
- Per-job distributed scratch space support: create on-demand scratch
space shared between tasks of a job which can be particularly useful for MPI
and multi-instance tasks without having to manage a GlusterFS-on-compute
shared data volume. Please see both the pool doc and jobs doc for more
information.
- Add `restrict_default_bind_mounts` option to jobs specifications. This
will restrict automatic host directory bindings to the container filesystem
only to `$AZ_BATCH_TASK_DIR`. This is particularly useful in combination with
container runtimes enforcing VM-level isolation such as Kata containers.
- Allow installation and selection of multiple container runtimes along with
a default container runtime for Docker invocations. Please see the pool doc
for more information under `container_runtimes`.
- Support for Standard SSD and Ultra SSD managed disks for RemoteFS clusters.
In conjunction with this change, Availability Zone support has been added
for managed disks and storage cluster VMs. Please see the relevant
documentation for more information.
### Changed
- **Breaking Change:** the `premium` property under `remote_fs`:`managed_disks`
has been replaced with `sku`. Please see the RemoteFS configuration doc for
more information.
- **Breaking Change:** the Singularity container runtime is no longer installed
by default, please see the pool doc to configure pools to install Singularity
as needed under `container_runtimes`:`install`.
- Renamed MADL recipe to HPMLA
- Updated NC/ND driver to 410.72 with CUDA 10 support
- Updated blobxfer to 1.5.4
- Updated LIS, Prometheus, and Grafana
- Updated other dependencies to latest
- Updated binary builds and Windows Docker images to Python 3.7.1
### Fixed
- `input_data` utilizing Azure File shares ([#243](https://github.com/Azure/batch-shipyard/issues/243))
- New NV driver location ([#244](https://github.com/Azure/batch-shipyard/issues/244))
- Fixed non-public Azure region AAD login issues
- Fixed Singularity image download issues
- Fixed Grafana update regression with default Batch Shipyard Dashboard
- Fixed SSH login to monitoring resource after federation feature merge
- Enable Singularity on Ubuntu 18.04
### Removed
- Debian 8 host support
## [3.6.0b1] - 2018-09-20
### Added
- Task and node count commands: `jobs tasks counts` and `pool nodes counts`
respectively ([#228](https://github.com/Azure/batch-shipyard/issues/228)).
Please see the usage doc for more information.
- Enhance blocked action tracking for federations. Please see the usage
doc for `fed jobs list` for more information.
- Support for Ubuntu 18.04
- Support for CentOS 7.5 in both non-native and native mode
- MacOS binary for the CLI
### Changed
- Updated Docker to 18.06.1
- Updated Singularity to 2.6.0
- Updated blobxfer to 1.5.0
- Updated Nvidia driver for NC/ND-series to 396.44
- Update various other dependencies to latest
- Windows binary is now signed
### Fixed
- Batch Shipyard site extension on nuget.org has been restored ([#224](https://github.com/Azure/batch-shipyard/issues/224))
- Pool auto-scaling beyond low priority limit ([#239](https://github.com/Azure/batch-shipyard/issues/239))
- Fix `jobs tasks term` command without pool SSH info
- Fix task id generator for federations
## [3.6.0a1] - 2018-08-06
### Added
- Federation support. Please see the
[federation guide](https://github.com/Azure/batch-shipyard/blob/master/docs/68-batch-shipyard-federation.md)
for more information.
- `monitor status` command with `--raw` support
### Changed
- Updated dependencies
## [3.5.3] - 2018-07-31
### Added
- Support Docker image preload delay for Linux native container pools.
Please see the global configuration docs for more information.
### Changed
- Improve registry login robustness with retries
### Fixed
- Docker Hub private registry login failures
- Environment variable issues ([#234](https://github.com/Azure/batch-shipyard/issues/234))
## [3.5.2] - 2018-07-20
### Fixed
- Non-native pool allocation on N-series VMs failing due to unpinned
dependent package for nvidia-docker2 ([#231](https://github.com/Azure/batch-shipyard/issues/231))
## [3.5.1] - 2018-07-17
### Changed
- Update GlusterFS on Compute on CentOS to 4.1
- Updated NC/ND Nvidia driver to 396.37
- Updated NV Nvidia driver to 390.75
- Updated LIS, Prometheus and Grafana
- Updated dependencies
### Fixed
- Properly terminate image pull without fallback on failure
- Fix pool metadata dump check logic
- Fix storage cluster provisioning with Node Exporter options
## [3.5.0] - 2018-06-29
### Added
- CentOS 7.5 and Microsoft Windows Server semi-annual
`datacenter-core-1803-with-containers-smalldisk` host support. Please see
the platform image support doc for more information.
- `fallback_registry` to improve robustness during provisioning when Docker
Hub has an outage or is degraded
([#215](https://github.com/Azure/batch-shipyard/issues/215), [#217](https://github.com/Azure/batch-shipyard/issues/217))
- `misc mirror-images` command to help mirror Batch Shipyard system
images to the designated fallback registry
- Support for XFS filesystem in storage clusters ([#218](https://github.com/Azure/batch-shipyard/issues/219))
- Experimental support for disk array RAID expansion for mdadm-based devices
via `fs cluster expand`
- Option to auto-upload Batch compute node service logs on unusable ([#216](https://github.com/Azure/batch-shipyard/issues/216))
- Microsoft Azure Distributed Linear Learner recipe ([#195](https://github.com/Azure/batch-shipyard/pull/195))
### Changed
- `pool nodes list` can now filter nodes with start task failed and/or
unusable states
- `diag logs upload` command can generate a read only SAS for the target
container via `--generate-sas`
- `storage clear` and `storage del` now allow multiple `--poolid` arguments
along with `--diagnostics-logs` to clear/delete diagnostics logs containers
- `storage sas create` now allows container and file share level SAS creation
along with `--list` permission now available as an option
- Pools failing to allocate with unusable or start task failed nodes will now
dump a listing of problematic nodes detailing the error
- Updated RemoteFS storage clusters using GlusterFS and Ubuntu/Debian-based
GlusterFS-on-compute to 4.1
- Updated blobxfer to 1.3.1
- Updated Singularity to 2.5.1
- Updated dependencies
### Fixed
- GlusterFS on compute provisioning ([#220](https://github.com/Azure/batch-shipyard/issues/220))
- Regression in KeyVault credential conf loading ([#214](https://github.com/Azure/batch-shipyard/issues/214))
- Task file mover command arg left unpopulated ([#29](https://github.com/Azure/batch-shipyard/issues/29))
- Recurring job manager failing to unpickle tasks with dependencies ([#221](https://github.com/Azure/batch-shipyard/issues/221))
- Task add regression when collections are too large from individual slices
### Removed
- CentOS 7.3 host support
## [3.5.0b3] - 2018-06-13
### Changed
- All supported platform images support blobfuse, including native mode
### Fixed
- blobfuse check preventing valid pool provisioning ([#213](https://github.com/Azure/batch-shipyard/issues/213))
- Pool resize not adding SSH users if keys are specified
## [3.5.0b2] - 2018-06-12
### Added
- Support for Prometheus monitoring and Grafana visualization
([#205](https://github.com/Azure/batch-shipyard/issues/205)). Please see the
monitoring doc and
[guide](https://github.com/Azure/batch-shipyard/blob/master/docs/66-batch-shipyard-resource-monitoring.md)
for more information.
- Support for specifying a maximum increment per autoscale evaluation and
the ability to define weekdays and workhours ([#210](https://github.com/Azure/batch-shipyard/issues/210))
- Support for native container support Marketplace platform images. Please
see the platform image support doc for more information. ([#204](https://github.com/Azure/batch-shipyard/issues/204))
- Allow configuration to enable SSH users to access Docker daemon ([#206](https://github.com/Azure/batch-shipyard/issues/206))
- Support for GPUs on CentOS 7.4 ([#199](https://github.com/Azure/batch-shipyard/issues/199))
- Support for CentOS-HPC 7.4 ([#184](https://github.com/Azure/batch-shipyard/issues/184))
### Changed
- **Breaking Change:** You can no longer specify both an `account_key`
and `aad` with the `batch` section of the credentials config. The prior
behavior was that `account_key` would take precedence over `aad`. Now
these options are mutually exclusive. This will now break configurations
that specified `aad` at the global level while having a shared `account_key`
at the `batch` level. ([#197](https://github.com/Azure/batch-shipyard/issues/197))
- **Breaking Change:** `install.sh` now installs into a virtual env by
default. Use the `-u` switch to retain the old (non-recommended)
default behavior. ([#200](https://github.com/Azure/batch-shipyard/issues/200))
- GlusterFS for RemoteFS and gluster on compute updated to 4.0.
- Update NC driver to 396.26 supporting CUDA 9.2
- blobxfer updated to 1.2.1
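
The mutual exclusivity of `account_key` and `aad` described above can be checked before submitting a configuration. The following is a minimal sketch, assuming the `batch` section of the credentials configuration is available as a Python dictionary. The function name `check_batch_credentials` is hypothetical, the sample values are placeholders, and the sketch does not model how global-level `aad` settings are merged into the `batch` section.

```python
def check_batch_credentials(batch_section):
    """Return which auth option is set in a `batch` credentials section.

    Raises ValueError when both `account_key` and `aad` are present,
    mirroring the rule that the two options are mutually exclusive.
    Illustrative only; this is not Batch Shipyard's own validation code.
    """
    has_key = bool(batch_section.get("account_key"))
    has_aad = bool(batch_section.get("aad"))
    if has_key and has_aad:
        raise ValueError(
            "batch credentials: specify either account_key or aad, not both")
    if has_key:
        return "account_key"
    if has_aad:
        return "aad"
    return None


# Placeholder values; real sections carry additional properties.
print(check_batch_credentials({"account_key": "placeholder-key"}))
try:
    check_batch_credentials(
        {"account_key": "placeholder-key", "aad": {"placeholder": True}})
except ValueError as exc:
    print("rejected:", exc)
```
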
### Fixed
- Errant credentials check for configuration from commandline which affected
config load from KeyVault
- Blobxfer extra options regression
- Cache container/file share creations for data egress ([#211](https://github.com/Azure/batch-shipyard/issues/211))
## [3.5.0b1] - 2018-05-02
### Added
- Output to JSON for a subset of commands via `--raw` command line switch.
JSON output is directed to stdout. Please see the usage doc for which commands
are supported and important information regarding output stability. ([#177](https://github.com/Azure/batch-shipyard/issues/177))
- Allow AAD on the `storage` section in the credentials configuration.
Please see the credential doc for more information. ([#179](https://github.com/Azure/batch-shipyard/issues/179))
- Boot diagnostics are now enabled for all VMs provisioned for RemoteFS
clusters. This also enables serial console support in the portal. ([#193](https://github.com/Azure/batch-shipyard/issues/193))
- `product_iterables` task factory support. Please see the task factory
doc for more information. ([#187](https://github.com/Azure/batch-shipyard/issues/187))
- `default_working_dir` option at the job and task level to set the
default working directory when a container executes as a task. Please
see the jobs doc for more information. ([#190](https://github.com/Azure/batch-shipyard/issues/190))
- `--no-generate-tunnel-script` option to `pool nodes grls`
### Changed
- Greatly improve throughput speed of many commands that internally iterated
sequences of actions ([#188](https://github.com/Azure/batch-shipyard/issues/188))
- RemoteFS clusters provisioned using Ubuntu 18.04-LTS
([#161](https://github.com/Azure/batch-shipyard/issues/161), [#185](https://github.com/Azure/batch-shipyard/issues/185))
- Update Nvidia NC driver to 390.46 supporting CUDA 9.1
- Update Nvidia NV driver to 390.42.
- Singularity updated to 2.5.0
- blobxfer updated to 1.2.0
- Docker CE updated to 18.03.1
- Update dependencies to latest
- Unify nodeprep scripts ([#176](https://github.com/Azure/batch-shipyard/issues/176))
- Integrate shellcheck ([#177](https://github.com/Azure/batch-shipyard/issues/177))
- Extend retry policy for all clients
- Add Windows file version info and icon to CLI binary
### Fixed
- Kernel unattended upgrades cause GPU jobs to fail on reboot ([#174](https://github.com/Azure/batch-shipyard/issues/174))
- Task submission speed regression when using task factory or large task
arrays with unnamed tasks ([#183](https://github.com/Azure/batch-shipyard/issues/183))
- Fix determinism in cardinality of results from `pool nodes grls`
- Env var export for tasks without env vars
- Ensure Nvidia persistence mode on reboots
- Pin nvidia-docker2 installations
- Site extension broken install issues (and fallback manual recovery)
### Removed
- Ubuntu 14.04 host support ([#164](https://github.com/Azure/batch-shipyard/issues/164))
## [3.4.0] - 2018-03-26
### Added
- Support for adding network access rules to the remote access port (SSH or
RDP). Please see the pool configuration guide for more details.
- Support for adding certificate references to a pool. Please see the
pool configuration guide for more details. Also please see below for
improvements to the `cert` command.
- Support for
[NCv3 VM sizes](https://azure.microsoft.com/blog/ncv3-vms-generally-available-other-gpus-expanding-regions/).
Note that ND/NCv2/NCv3 all require separate quota approval; please raise a
ticket through the Azure Portal.
- Support for uploading Batch compute node service logs to the specified
Azure storage account used by Batch Shipyard. Please see the
`diag logs upload` command in the usage docs.
- Support for fine-tuning `/etc/exports` when creating NFS file servers via
`server_options` and `nfs`. Please see the remote FS configuration doc
for more information.
- Support for job-level default task exit condition options. These options
can be overridden on a per-task basis. Please see the job configuration doc
for more information.
### Changed
- Improve `cert` commands
    - Support adding arbitrary cer, pem and pfx certificates to a Batch
    account via command line options
    - Support deleting arbitrary certificates by thumbprint, including
    multiple at once; also ask for confirmation before deleting
    - Support creating pem/pfx pairs with `cert create` without having
    to define an `encryption` section in the global configuration for
    use in scenarios outside of credential encryption
- `depends_on` and `depends_on_range` now apply to tasks generated by
`task_factory` (#173). Please see the job configuration doc for more
information.
- `pool nodes del` and `pool nodes reboot` now accept multiple `--nodeid`
arguments to specify deleting and rebooting multiple nodes at the same time,
respectively
- `pool nodes prune`, `pool nodes reboot`, `pool nodes zap` will now ask
for confirmation first. `-y` flag can be specified to suppress confirmation.
- Added Batch Shipyard version to user agent for all ARM clients
- Improved node prep scripts with more timestamp detail, Docker and
Nvidia details
- CUDA 9.1 support on ND/NCv2/NCv3 with Tesla Driver 390.30
- Docker CE updated to 18.03.0
- Singularity updated to 2.4.4
- Dependencies updated
### Fixed
- Previous environment variable expansion fix applied to multi-instance tasks
- `jobs tasks list` command with undefined job action but with dependency
actions
- `job_action` for task default exit condition was being overwritten
incorrectly in certain scenarios
## [3.3.0] - 2018-03-01
### Added
- Support for specifying default task exit conditions (i.e., non-zero exit
codes). Please see the jobs configuration doc for more information.
- New commands (please see usage doc for more information):
    - `pool nodes prune` will remove all unused Docker-related data on all
    nodes in the pool (requires an SSH user)
    - `pool nodes ps` performs a `docker ps -a` on all nodes in the pool
    (requires an SSH user)
    - `pool nodes zap` will kill (and optionally remove) **all** running
    Docker containers on nodes in a pool (requires an SSH user). Note that
    `jobs tasks term` is the preferred command to control individual
    (or grouped) task termination or `jobs term --termtasks` to terminate
    at the job level.
- `storage sas create` command added as a utility helper function to
create SAS tokens for given storage accounts in credentials
- Support for activating
[Azure Hybrid Use Benefit](https://azure.microsoft.com/pricing/hybrid-benefit/)
for Windows pools
### Changed
- Greatly expand pool, node, job, and task details for `list` sub-commands
- Expand error detail key/value pairs if present
- Name resources for RemoteFS with 3 digits (e.g., 000 instead of 0) for
improved alpha ordering
- Move site extension to nuget.org (#172)
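
The effect of the 3-digit RemoteFS resource naming noted above can be seen with a plain string sort. The following is a minimal sketch. The `vm` prefix and the helper name `resource_name` are hypothetical and do not reflect the actual resource names Batch Shipyard generates.

```python
def resource_name(prefix, index):
    """Build a name with a zero-padded, 3-digit numeric suffix."""
    return f"{prefix}{index:03d}"


indices = (0, 2, 10)

# Without padding, lexicographic order puts 10 before 2.
unpadded = sorted(f"vm{i}" for i in indices)
# With padding, lexicographic order matches numeric order.
padded = sorted(resource_name("vm", i) for i in indices)

print(unpadded)  # ['vm0', 'vm10', 'vm2']
print(padded)    # ['vm000', 'vm002', 'vm010']
```
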
### Fixed
- Command lines in non-native mode now properly expand environment variables
with normal quoting
- RemoteFS cluster del with disks error
- RemoteFS cluster status decoding issues
- Unintended interaction between native mode and custom images
- Handle starting Docker service for all deployments
- Do not automatically mount storage cluster or custom Linux mounts on boot
due to potential race conditions with the ephemeral disk mount
- Fix TLS issue with PowerShell (#171)
- Fix conflicts when using install.sh with Anaconda environments
- Update packer scripts and fix typos
- Minor doc updates
## [3.2.0] - 2018-02-21
### Added
- Custom Linux Mount support for `shared_data_volumes`. Please see the
global configuration doc for more information.
- New commands (please see usage doc for more information):
    - `account` command added with the following sub-commands (requires
    AAD auth):
        - `info` provides information about a Batch account (including account
        level quotas)
        - `list` provides information about all (or a resource group subset)
        of accounts within the subscription specified in credentials
        - `quota` provides service level quota information for the
        subscription for a given location
    - `pool rdp` sub-command added, please see usage doc for more information.
    Requires Batch Shipyard executing on Windows with target Windows
    containers pools.
- `pool images update` command now supports updating Docker images
in native container support pools via SSH
- Ability to specify an AAD authority URL via the `aad`:`authority_url`
credential configuration, `--aad-authority-url` command line option or
`SHIPYARD_AAD_AUTHORITY_URL` environment variable. Please see relevant
documentation for credentials and usage.
- Support for CentOS 7.4 and Debian 9 compute node hosts. CentOS 7.4
on GPU nodes is currently unsupported; CentOS 7.3 will continue to work on
N-series.
- Support for publisher `MicrosoftWindowsServer`, offer
`WindowsServerSemiAnnual`, and sku
`Datacenter-Core-1709-with-Containers-smalldisk`
- `--delete-resource-group` option added to `fs disks del` command
- CentOS-HPC 7.1, CentOS 7.3 GPU, and CentOS 7.4 packer scripts added to
contrib area
- Add documentation for which `platform_image`s are supported
### Changed
- **Breaking Change:** `additional_node_prep_commands` is now a dictionary
of `pre` and `post` properties which are executed either before or after the
Batch Shipyard startup task. Please see the pool configuration doc for more
information.
- Allow provisioning of OpenLogic CentOS-HPC 7.1
- Default management endpoint for public Azure cloud updated
- Improve some error messages/handling
- Update dependencies to latest
- Linux pre-built binary is no longer gzipped
- Update packer scripts in contrib area
### Fixed
- AAD auth for ARM endpoints in non-public Azure cloud regions
- Custom image + native mode deployment for Linux pools
- Potential command launch problems in native mode
- Minor schema validation updates
- AAD check logic for different points in pool allocation
- `--ssh` parameter for `pool images update` was not correctly set as a flag
- `--jobs` was not properly being merged with `--configdir` (#163)
- Fix regression in `pool images update` that would not login to
registries in multi-instance mode
- Fix `pool images` commands to more reliably work with SSH
- Fix `output_data` with windows containers pools (#165)
## [3.1.0] - 2018-01-30
### Added
- Configuration validation. Validator supports both YAML and JSON
configuration, please see special note in the Removed section below (#145)
- Support for Azure Blob storage container mounting via blobfuse (#159)
- Support for merge tasks which depend on all tasks specified in the
`tasks` array. Please see the jobs configuration guide for more
information (#149).
- Support for accelerated networking in RemoteFS storage clusters (#158)
### Changed
- Update Docker CE to 17.12.0 for Ubuntu/CentOS
- Update nvidia-docker 1.0.1 to nvidia-docker2
- Update blobxfer to 1.1.1
- Updated dependencies to latest
### Fixed
- Disabling `remove_container_after_exit`, `gpu`, `infiniband` at the
task-level was not being honored properly
### Removed
- Integration of the schema validator has now removed or enforced strict
behavior for the following previously deprecated configuration properties:
- `credentials`:`batch`:`account` has been removed
- `pool_specification`:`vm_count` must be a map of `dedicated` and
`low_priority` VM counts
- `pool_specification`:`vm_configuration` must be specified instead of
directly specifying `publisher`, `offer`, `sku` on `pool_specification`
- `global_resources`:`docker_volumes` is no longer valid and must be
replaced with `global_resources`:`volumes`
- `job_specifications`:`tasks`:`image` is no longer valid and must be
replaced with `job_specifications`:`tasks`:`docker_image`
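
The removals above amount to a mechanical rewrite of older configuration files. The following Python sketch shows one way such a migration could look on already-parsed configuration dictionaries. The function names are hypothetical and not part of Batch Shipyard, and the sketch assumes that a bare integer `vm_count` meant dedicated nodes only.

```python
import copy


def migrate_pool_specification(pool_spec: dict) -> dict:
    """Return a copy of an old-style pool_specification in the stricter layout."""
    spec = copy.deepcopy(pool_spec)
    # Assumption: a bare integer vm_count referred to dedicated nodes only.
    if isinstance(spec.get("vm_count"), int):
        spec["vm_count"] = {"dedicated": spec["vm_count"], "low_priority": 0}
    # Move publisher/offer/sku under vm_configuration:platform_image.
    image = {k: spec.pop(k) for k in ("publisher", "offer", "sku") if k in spec}
    if image:
        spec.setdefault("vm_configuration", {})["platform_image"] = image
    return spec


def migrate_global_resources(global_resources: dict) -> dict:
    """Rename docker_volumes to volumes."""
    res = copy.deepcopy(global_resources)
    if "docker_volumes" in res:
        res["volumes"] = res.pop("docker_volumes")
    return res


def migrate_task(task: dict) -> dict:
    """Rename the task-level image property to docker_image."""
    new_task = copy.deepcopy(task)
    if "image" in new_task:
        new_task["docker_image"] = new_task.pop("image")
    return new_task


old_pool = {
    "id": "mypool",
    "vm_count": 4,
    "publisher": "Canonical",
    "offer": "UbuntuServer",
    "sku": "16.04-LTS",
}
new_pool = migrate_pool_specification(old_pool)
print(new_pool)
```

The helpers copy their input rather than mutating it, so the original configuration stays available for comparison.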
## [3.0.3] - 2018-01-22
### Security
- Update NV driver to 384.111 to work with updated Linux kernels with
speculative execution side channel vulnerability patches (#154)
## [3.0.2] - 2018-01-12
### Fixed
- Errant bind option being propagated to volume name
- Clarify error path on attempting `infiniband` on non-supported images
- Fix quickstart recipe links from Read the Docs (#150)
### Security
- Update NC driver to 384.111 to work with updated Linux kernels with
speculative execution side channel vulnerability patches
## [3.0.1] - 2017-11-22
### Fixed
- Fix on-disk file naming for Docker images pulled with Singularity
- Data movement regressions
- Public IP configs in RemoteFS recipes
- Support more than 16 data disks per VM for RemoteFS servers
- Documentation and other typos
## [3.0.0] - 2017-11-13 (SC17 Edition)
### Added
- CLI Singularity image (#135)
### Changed
- Start LUN numbering for remote fs disks at 0
- Allow path to `python.exe` to be specified in `install.cmd`
- Ensure persistence daemon/mode is enabled for GPUs
- Update dependencies to latest
### Fixed
- Non-Ubuntu/CentOS cascade failures from non-existent Singularity
- Default Singularity tagged image names on disk
- Circular dependency in `task_factory` and `settings`
- `misc tensorboard` command broken from latest TF image
- Update NV driver
## [3.0.0rc1] - 2017-11-08
### Changed
- Update VM size support
- SSH private key filemode check no longer results in an exception if it
fails. Instead a warning is issued - this is to allow SSH invocations on WSL.
- Install scripts now uninstall `azure-storage` first due to conflicts with
the Azure Storage Python split library.
- Updated to blobxfer 1.0.0
### Fixed
- Job submission on custom image pools with 4.0 SDK changes
- Empty coordination command issue for Docker tasks
- Singularity registries with passwords in keyvault
## [3.0.0b1] - 2017-11-05
### Added
- Singularity support (#135)
- Preliminary Windows server support (#7)
- Pre-built binaries for CLI for some Linux distributions and Windows (#131)
- Windows Docker image for CLI
- Singularity HPCG and TensorFlow-GPU recipes
### Changed
- **Breaking Change:** Many commands have been placed under more
appropriate hierarchies. Please see the major version migration guide for
more information.
### Fixed
- Mount `/opt/intel` into Singularity containers
- Retry image configuration error pulls from Docker registries
- AAD MFA token cache on Python2
- Non-native coordination command fix, if not specified
- Include min node counts in autoscale scenarios (#139)
- `jobs tasks list` when there is a failed task (#142)
## [3.0.0a2] - 2017-10-27
### Added
- Major version migration guide (#134)
- Support for mounting multiple Azure File shares as `shared_data_volumes`
to a pool (#123)
- `bind_options` support for `data_volumes` and `shared_data_volumes`
- More packer samples for custom images
- Singularity HPLinpack recipe
### Changed
- **Breaking Change:** `global_resources`:`docker_volumes` is now named
`global_resources`:`volumes`. Although backward compatibility is maintained
for this property, it is recommended to migrate as volumes are now shared
between Docker and Singularity containers.
- Azure Files (with `volume_driver` of `azurefile`) specified under
`shared_data_volumes` are now mounted directly to the host (#123)
- The internal root mount point for all `shared_data_volumes` is now under
`$AZ_BATCH_NODE_ROOT_DIR/mounts` to reduce clutter/confusion under the
old root mount point of `$AZ_BATCH_NODE_SHARED_DIR`. The container mount
points (i.e., `container_path`) are unaffected.
- Canonical UbuntuServer 16.04-LTS is no longer pinned to a specific
release. Please avoid using the version `16.04.201709190`.
- Update to blobxfer 1.0.0rc3
- Updated custom image guide
### Fixed
- Multi-instance Docker-based application command was not being launched
under a user identity if specified
- Allow min node allocation with `bias_last_sample` without required
sample percentage (#138)
## [3.0.0a1] - 2017-10-04
### Added
- Support for deploying compute nodes to an ARM Virtual Network with Batch
Service Batch accounts (#126)
- Support for deploying custom image compute nodes from an ARM Image resource
(#126)
- Support for multiple public and private container registries (#127)
- YAML configuration support. JSON formatted configuration files will continue
to be supported, however, note the breaking change with the corresponding
environment variable names for specifying individual config files from the
commandline. (#122)
- Option to automatically attempt recovery of unusable nodes during
pool allocation or resize. See the `attempt_recovery_on_unusable` option in
the pool configuration doc.
- Virtual Network guide
### Changed
- **Breaking Change:** Docker image tag for the CLI has been renamed to
`alfpark/batch-shipyard:<version>-cli` where `<version>` is the release
version or `latest` for whatever is in `master`. (#130)
- **Breaking Change:** Fully qualified Docker image names are now required
under both the global config `global_resources`:`docker_images` and jobs
`task` array `docker_image` (or `image`). The `docker_registry` property
in the global config file is no longer valid. (#106)
- **Breaking Change:** Docker private registries backed to Azure Storage blobs
are no longer supported. This is not to be confused with the Classic Azure
Container Registries which are still supported. (#44)
- **Breaking Change:** `docker_registry` property in the global config is
no longer required. An `additional_registries` option is available for any
additional registries that are not present from the `docker_images`
array in `global_resources` but require a valid login. (#106)
- **Breaking Change:** Data ingress/egress from/to Azure Storage along with
`task_factory`:`file` has changed to accommodate `blobxfer 1.0.0` commandline
and options. There are new expanded options available, including multiple
`include` and `exclude` along with `remote_path` explicit specifications
(instead of general `container` or `file_share`). Please see the appropriate
global config, pool or job configuration docs for more information. (#47)
- **Breaking Change:** `image_uris` in the `vm_configuration`:`custom_image`
property of the pool configuration has been replaced with `arm_image_id`
which is a reference to an ARM Image resource. Please see the custom image
guide for more information. (#126)
- **Breaking Change:** environment variables `SHIPYARD_CREDENTIALS_JSON`,
`SHIPYARD_CONFIG_JSON`, `SHIPYARD_POOL_JSON`, `SHIPYARD_JOBS_JSON`, and
`SHIPYARD_FS_JSON` have been renamed to `SHIPYARD_CREDENTIALS_CONF`,
`SHIPYARD_CONFIG_CONF`, `SHIPYARD_POOL_CONF`, `SHIPYARD_JOBS_CONF`, and
`SHIPYARD_FS_CONF` respectively. (#122)
- `--configdir` or `SHIPYARD_CONFIGDIR` now defaults to the current working
directory (i.e., `.`) if no other conf file options are specified.
- `aad` can be specified at a "global" level in the credentials configuration
file, which is then applied to `batch`, `keyvault` and/or `management`
section. Please see the credentials configuration guide for more information.
- `docker_image` is now preferred over the deprecated `image` property in
the `task` array in the jobs configuration file
- `gpu` and `infiniband` under the jobs configuration are now optional. GPU
and/or RDMA capable compute nodes will be autodetected and the proper
devices and other settings will automatically be applied to tasks running
on these compute nodes. You can force disable GPU and/or RDMA by setting
`gpu` and `infiniband` properties to `false`. (#124)
- Update Docker CE to 17.09.0
- Update NC driver to 384.81 (CUDA 9.0 support)
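
Scripts that still export the old `SHIPYARD_*_JSON` variables can be bridged to the renamed `SHIPYARD_*_CONF` variables listed above. A small Python sketch, assuming a hypothetical `translate_env` helper that is not part of Batch Shipyard and that works on a plain dictionary copy of the environment:

```python
# Old name -> new name, exactly as listed in the breaking change above.
RENAMED_ENV_VARS = {
    "SHIPYARD_CREDENTIALS_JSON": "SHIPYARD_CREDENTIALS_CONF",
    "SHIPYARD_CONFIG_JSON": "SHIPYARD_CONFIG_CONF",
    "SHIPYARD_POOL_JSON": "SHIPYARD_POOL_CONF",
    "SHIPYARD_JOBS_JSON": "SHIPYARD_JOBS_CONF",
    "SHIPYARD_FS_JSON": "SHIPYARD_FS_CONF",
}


def translate_env(environ: dict) -> dict:
    """Return a copy of environ with legacy variable names moved to new names.

    If a new-style name is already set, its value wins and the legacy
    entry is simply dropped (a design choice of this sketch).
    """
    result = dict(environ)
    for old, new in RENAMED_ENV_VARS.items():
        if old in result:
            value = result.pop(old)
            result.setdefault(new, value)
    return result


legacy = {"SHIPYARD_POOL_JSON": "pool.json", "HOME": "/home/user"}
print(translate_env(legacy))
```

In practice the input would typically be `dict(os.environ)`, with the result passed as the `env` argument when launching the CLI from a wrapper script.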
## [2.9.6] - 2017-10-03
### Added
- Migrate to Read the Docs for [documentation](https://batch-shipyard.readthedocs.io/en/latest/)
### Fixed
- RemoteFS disk attach fixes
- Nvidia docker volume mount check
## [2.9.5] - 2017-09-24
### Added
- Optional `version` support for `platform_image`. This property can be
used to set a host OS version to prevent possible issues that occur with
`latest` image versions.
- `--all-starting` option for `pool delnode` which will delete all nodes
in starting state
### Changed
- Prevent invalid configuration of HPC offers with non-RDMA VM sizes
- Expanded network tuning exemptions for new Dv3 and Ev3 sizes
- Temporarily override Canonical UbuntuServer 16.04-LTS latest version to
a prior version due to recent linux-azure kernel issues
### Fixed
- NV driver updates
- Various OS updates and Docker issues
- CentOS 7.3 to 7.4 Nvidia driver breakage
- Regression in `pool ssh` on Windows
- Exception in unusable nodes with pool stats on allocation
- Handle package manager db locks during conflicts for local package installs
## [2.9.4] - 2017-09-12
### Changed
- Update dependencies to latest available
- Improve Docker builds
### Fixed
- Missing `join_by` function in blobxfer helper script (#115)
- Fix `clear()` for `pool udi` with Python 2.7 (#118)
## [2.9.3] - 2017-08-29
### Fixed
- Ignore `resize_timeout` for autoscale-enabled pools
- Present a warning for `jobs migrate` indicating Docker image requirements
- Various doc typos and updates
## [2.9.2] - 2017-08-16
### Added
- Deep learning Jupyter notebooks (thanks to @msalvaris and @thdeltei)
- Automatic site-extensions NuGet package updates with tagged releases via
AppVeyor builds
- Caffe2-CPU and Caffe2-GPU Recipes
### Changed
- Python 3.3 is no longer supported (due to `cryptography` dropping support
for 3.3).
- Use multi-stage build for Cascade to improve build times and reduce
Docker image size
### Fixed
- Provide more helpful feedback for invalid clients
- Fix provisioning clusters with disks larger than 2TB
- RemoteFS issues in resize and expand
- Various site extension issues, will now properly install/upgrade to the
associated tagged version
## [2.9.0rc1] - 2017-08-09
### Added
- Recurring job support (job schedules). Please see jobs configuration doc
for more information.
- `custom` task factory. See the task factory guide for more information.
- `--all-jobschedules` option for `jobs term` and `jobs del`
- `--jobscheduleid` option for `jobs disable`, `jobs enable` and
`jobs migrate`
- `--all` option for `jobs listtasks`
### Changed
- `autogenerated_task_id_prefix` configuration setting is now named
`autogenerated_task_id` and is a complex property. It has member properties
named `prefix` and `zfill_width` to control how autogenerated task ids are
named.
- `jobs list` will now output job schedules in addition to jobs
- `--all` parameter for `jobs term` and `jobs del` renamed to `--all-jobs`
- list subcommands now output in a more human readable format
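
The `prefix` and `zfill_width` members of `autogenerated_task_id` described above suggest ids built from a prefix followed by a zero-filled counter. A minimal Python sketch of that idea follows. The `make_task_id` helper is hypothetical, the exact id format used by Batch Shipyard is not stated here, and the default width of 5 is an assumption (the `task-` default prefix is taken from the 2.9.0b1 notes).

```python
def make_task_id(index: int, prefix: str = "task-", zfill_width: int = 5) -> str:
    """Compose a task id from a prefix and a zero-filled counter (sketch only)."""
    return f"{prefix}{str(index).zfill(zfill_width)}"


print(make_task_id(7))
print(make_task_id(42, prefix="sim-", zfill_width=3))
```

A fixed width keeps task listings in submission order when sorted as strings, as long as the counter does not outgrow the width.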
## [2.9.0b2] - 2017-08-04
### Added
- `random` and `file` task factories. See the task factory guide for more
information.
- Summary statistics: `pool stats` and `jobs stats`. See the usage doc for
more information.
- Delete unusable nodes from pool with `--all-unusable` option for
`pool delnode`
- CentOS-HPC 7.3 support
- CNTK-GPU-Infiniband-IntelMPI recipe
### Changed
- `remove_container_after_exit` now defaults to `true`
- `input_data`:`azure_storage` files with an include filter that does not
include wildcards (i.e., targets a single file) will now be placed at
the `destination` directly as specified.
- Nvidia Tesla driver updated to 384.59
- TensorFlow recipes updated for 1.2.1. TensorFlow-Distributed `launcher.sh`
script is now generalized to take a script as the first parameter and
relocated to `/shipyard/launcher.sh`.
- CNTK recipes updated for 2.1. `run_cntk.sh` script now takes in CNTK
Python scripts for execution.
### Fixed
- Task termination with force failing due to new task generators
- pool udi over SSH terminal mangling
## [2.9.0b1] - 2017-07-31
### Added
- Autoscale support. Please see the Autoscale guide for more information.
- Autopool support
- Task factory and parametric sweep support. Please see the task factory
guide for more information.
- Job priority support
- Job migration support with new command `jobs migrate`
- Compute node fill type support
- New commands: `jobs enable` and `jobs disable`. Please see the usage doc
for more information.
- From Scratch: Step-by-Step guide
- Azure Cloud Shell information
### Changed
- Auto-generated task prefix is now `task-`. This can now be overridden with
the `autogenerated_task_id_prefix` property in the global configuration.
- Update dependencies to latest except for azure-mgmt-compute due to
breaking changes
### Fixed
- RemoteFS regressions
- Pool deletion with poolid argument cleanup
## [2.8.0] - 2017-07-06
### Added
- Support for CentOS 7.3 NC/NV gpu pools
- `--all-start-task-failed` parameter for `pool delnode`
### Changed
- Improve robustness of docker image pulls within node prep scripts
- Restrict node list queries until pool allocation state emerges from resizing
### Fixed
- Remove nvidia gpu driver property from FFmpeg recipe
- Further improve retry logic for docker image pulls in cascade
## [2.8.0rc2] - 2017-06-30
### Added
- Support Mac OS X and Windows Subsystem for Linux installations via
`install.sh` (#101)
- Guide for Windows Subsystem for Linux installations
- Automated Nvidia driver install for NV-series
### Changed
- Drop unsupported designations for Mac OS X and Windows
- Update Docker engine to 17.06 for Ubuntu, Debian, CentOS and 17.04 for
OpenSUSE
### Fixed
- Regression in private registry image pulls
## [2.8.0rc1] - 2017-06-27
### Added
- Version metadata added to pools and jobs with warnings generated for
mismatches (#89)
- Cloud shell installation support
### Changed
- Update Docker images to Alpine 3.6 (#65)
- Improve robustness of package downloads
- Add retries for docker pull within cascade context
- Download cascade.log on start up failures
### Fixed
- Patch job for auto completion (#97)
- Tensorboard command with custom images
- conda-forge detection in installation scripts (#100)
## [2.8.0b1] - 2017-06-07
### Added
- Custom image support, please see the pool configuration doc and custom
image guide for more information. (#94)
- `contrib` area with `packer` scripts
### Changed
- **Breaking Change:** `publisher`, `offer`, `sku` is now part of a complex
property named `vm_configuration`:`platform_image`. This change is to
accommodate custom images. The old configuration schema is now deprecated and
will be removed in a future release.
- Updated NVIDIA Tesla driver to 375.66
### Fixed
- Improved pool resize/allocation logic to fail early with low priority core
quota reached with no dedicated nodes
## [2.7.0] - 2017-05-31
### Added
- `--poolid` parameter for `pool del` to specify a specific pool to delete
### Changed
- Prompt for confirmation for `jobs cmi`
- Updated to latest dependencies
- Split low-priority considerations into separate doc
### Fixed
- Remote FS allocation issue with `vm_count` deprecation check
- Better handling of package index refresh errors
- `pool udi` over SSH issues (#92)
- Duplicate volume checks between job and task definitions
## [2.7.0rc1] - 2017-05-24
### Added
- `pool listimages` command which will list all common Docker images on
all nodes and provide warning for mismatched images amongst compute nodes.
This functionality requires a provisioned SSH user and private key.
- `max_wall_time` option for both jobs and tasks. Please consult the
documentation for the difference when specifying this option at either the
job or task level.
- `--poll-until-tasks-complete` option for `jobs listtasks` to block the CLI
from exiting until all tasks under jobs for which the command is run have
completed
- `--tty` option for `pool ssh` and `fs cluster ssh` to enable allocation
of a pseudo-tty for the SSH session
### Changed
- `remove_container_after_exit`, `retention_time`, `shm_size`, `infiniband`,
`gpu` can now be specified at the job-level and overridden at the task-level
in the jobs configuration
- `data_volumes` and `shared_data_volumes` can now be specified at the
job-level and any volumes specified at the task level will be *merged* with
the job-level volumes to be exposed for the container
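
The two rules above (task-level values override job-level values, while volumes are merged) can be sketched in Python as follows. The `effective_task_settings` helper is hypothetical, and the order-preserving, de-duplicating merge is an assumption of this sketch rather than documented Batch Shipyard behavior.

```python
# Properties where a task-level value replaces the job-level value.
OVERRIDABLE = (
    "remove_container_after_exit",
    "retention_time",
    "shm_size",
    "infiniband",
    "gpu",
)
# Properties where task-level entries are merged with job-level entries.
MERGED = ("data_volumes", "shared_data_volumes")


def effective_task_settings(job: dict, task: dict) -> dict:
    """Resolve the settings a task would run with (illustrative sketch)."""
    settings = {}
    for key in OVERRIDABLE:
        # Test for presence, not truthiness, so an explicit False is honored.
        if key in task:
            settings[key] = task[key]
        elif key in job:
            settings[key] = job[key]
    for key in MERGED:
        merged = list(job.get(key, []))
        for volume in task.get(key, []):
            if volume not in merged:
                merged.append(volume)
        if merged:
            settings[key] = merged
    return settings


job = {"gpu": True, "shm_size": "256m", "data_volumes": ["scratch"]}
task = {"gpu": False, "data_volumes": ["scratch", "results"]}
print(effective_task_settings(job, task))
```

Checking `key in task` instead of the value itself lets a task explicitly disable a setting such as `gpu` that the job enabled.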
### Fixed
- Add missing deprecation path for `pool_specification_vm_count` for
multi-instance tasks. Please upgrade your jobs configuration to explicitly
use either `pool_specification_vm_count_dedicated` or
`pool_specification_vm_count_low_priority`.
- Speed up task collection additions by caching last task id
- Issues with pool resize and wait logic with low priority
## [2.7.0b2] - 2017-05-18
### Changed
- Allow the prior `vm_count` behavior, but provide a deprecation warning. The
old `vm_count` behavior will be removed in a future release. (#84)
- Add tasks via collection (#86)
- Log if node is dedicated in `pool listnodes`
- Updated all recipes with new `vm_count` changes
### Fixed
- Improve pool resize wait logic for pools with mixed node types
- Do not override workdir if specified (#87)
- Prevent container scanning for data ingress from Azure Storage if include
filter contains no wildcards (#88)
## [2.7.0b1] - 2017-05-12
### Added
- Support for [Low Priority Batch Compute Nodes](https://docs.microsoft.com/azure/batch/batch-low-pri-vms)
- `resize_timeout` can now be specified on the pool specification
- `--clear-tables` option to `storage del` command which will delete
blob containers and queues but clear table entries
- `--ssh` option to `pool udi` command which will force the update Docker
images command to update over SSH instead of through a Batch job. This is
useful if you want to perform an out-of-band update of Docker image(s), e.g.,
your pool is currently busy processing tasks and would not be able to
accommodate another task.
### Changed
- **Breaking Change:** `vm_count` in the pool specification is now a
complex property consisting of the properties `dedicated` and `low_priority`
- Updated all dependencies to latest
### Fixed
- Improve node startup time for GPU NC-series by removing extraneous
dependencies
- `fs cluster ssh` storage cluster id and command argument ordering was
inverted. This has been corrected to be as intended where the command
is the last argument, e.g., `fs cluster ssh mynfs -- df -h`
## [2.6.2] - 2017-05-05
### Added
- Docker image build for `develop` branch
### Changed
- Allow NVIDIA license agreement to be auto-confirmed via `-y` option
- Use requests for file downloading since it is already being installed
as a dependency
- Update dependencies to latest versions
### Fixed
- TensorFlow image not being set if no suitable image is found for
`misc tensorboard` command
- Authentication for running images not present in global config sourced
from a private registry
## [2.6.1] - 2017-05-01
### Added
- `misc tensorboard` command added which automatically instantiates a
Tensorboard instance on the compute node which is running or has run a
task that has generated TensorFlow summary operation compatible logs. An
SSH tunnel is then created so you can view Tensorboard locally on the
machine running Batch Shipyard. This requires a valid SSH user that has been
provisioned via Batch Shipyard with private keys available. This command
will work on Windows if `ssh.exe` is available in `%PATH%` or the current
working directory. Please see the usage guide for more information about
this command.
- Pool-level `resource_files` support
### Changed
- Added optional `COMMAND` argument to `pool ssh` and `fs cluster ssh`
commands. If `COMMAND` is specified, the command is run non-interactively
with SSH on the target node.
- Added some additional sanity checks in the node prep script
- Updated TensorFlow-CPU and TensorFlow-GPU recipes to 1.1.0. Removed
specialized Docker build for TensorFlow-GPU. Added `jobs-tb.json` files
to TensorFlow-CPU and TensorFlow-GPU recipes as Tensorboard samples.
- Optimize some Batch calls
### Fixed
- Site extension issues
- SSH user add exception on Windows
- `jobs del --termtasks` will now disable the job prior to running task
termination to prevent active tasks in job from running while tasks are
being terminated
- `jobs listtasks` and `data listfiles` will now accept a `--jobid` that
does not have to be in `jobs.json`
- Data ingress on pool create issue with single node
## [2.6.0] - 2017-04-20
### Changed
- Update to latest dependencies
### Fixed
- Checks that prevented ssh/scp/openssl interaction on Windows
- SSH private key regression in data ingress direct to compute node
## [2.6.0rc1] - 2017-04-14
### Added
- Richer SSH options with new `ssh_public_key_data` and `ssh_private_key`
properties in `ssh` configuration blocks (for both `pool.json` and
`fs.json`).
- `ssh_public_key_data` allows direct embedding of SSH public keys in
OpenSSH format into the config files.
- `ssh_private_key` specifies where the private key is located with
respect to pre-created public keys (either `ssh_public_key` or
`ssh_public_key_data`). This allows transparent `pool ssh` or
`fs cluster ssh` commands with pre-created keys.
- RemoteFS-GlusterFS+BatchPool recipe
### Changed
- Docker installations are now pinned to a specific Docker version which
should reduce sudden breaking changes introduced upstream by Docker and/or
the distribution
- Fault domains for multi-vm storage clusters are now set to 2 by default but
can be configured using the `fault_domains` property. This was lowered from
the prior default of 3 due to managed disks and availability set restrictions
as some regions do not support 3 fault domains with this combination.
- Updated NC-series Tesla driver to 375.51
### Fixed
- Broken Docker installations due to gpgkey changes
- Possible race condition between disk setup and glusterfs volume create
- Forbid SSH username to be the same as the samba username
- Allow smbd.service to auto-restart with delay
- Data ingress to glusterfs on compute with no remotefs settings
### Removed
- Host support for OpenSUSE 13.2 and SLES 12
## [2.6.0b3] - 2017-04-03
### Added
- Created [Azure App Service Site Extension](https://www.siteextensions.net/packages/batch-shipyard).
You can now one-click install Batch Shipyard as a site extension (after you
have Python installed) and use Batch Shipyard from an Azure Function trigger.
- Samba support on storage cluster servers
- Add sample RemoteFS recipes for NFS and GlusterFS
- `install.cmd` installer for Windows. `install_conda_windows.cmd` has been
replaced by `install.cmd`, please see the install doc for more information.
### Changed
- **Breaking Change:** `multi_instance_auto_complete` under
`job_specifications` is now named `auto_complete`. This property will apply
to all types of jobs and not just multi-instance tasks. The default is now
`false` (instead of `true` for the old `multi_instance_auto_complete`).
- **Breaking Change:** `static_public_ip` has been replaced with a `public_ip`
complex property. This is to accommodate situations where public IP for
RemoteFS is disabled. Please see the Remote FS configuration doc for more
info.
- `install.sh` now handles Anaconda Python environments
- `--cardinal 0` is now implicit if no `--hostname` or `--nodeid` is specified
for `fs cluster ssh` or `pool ssh` commands, respectively
- Allow `docker_images` in `global_resources` to be empty. Note that it is
always recommended to pre-load images on to pools for consistent scheduling
latencies from pool idle.
### Fixed
- Removed requirement of a `batch` credential section for pure `fs` operations
- Multi-instance auto complete setting not being properly read
- `install.sh` virtual environment issues
- Fix pool ingress data calls with remotefs (#62)
- Move additional node prep commands to last set of commands to execute in
start task (#63)
- `glusterfs_on_compute` shared data volume issues
- future and pathlib compat issues
- Python2 unicode/str issues with management libraries
## [2.6.0b2] - 2017-03-22
### Added
- Added virtual environment install option for `install.sh` which is now
the recommended way to install Batch Shipyard. Please see the install
guide for more information. (#55)
### Changed
- Force SSD optimizations for btrfs with premium storage
### Fixed
- Incorrect FS server options parsing at script time
- KeyVault client not initialized in `fs` contexts (#57)
- Check pool current node count prior to executing `pool udi` task (#58)
- Initialization with KeyVault uri on commandline (#59)
## [2.6.0b1] - 2017-03-16
### Added
- Support for provisioning storage clusters via the `fs cluster` command
- Support for NFS (single VM, scale up)
- Support for GlusterFS (multi VM, scale up and out)
- Support for provisioning managed disks via the `fs disks` command
- Support for data ingress to provisioned storage clusters
- Support for
[UserSubscription Batch accounts](https://docs.microsoft.com/azure/batch/batch-account-create-portal#user-subscription-mode)
- Azure Active Directory authentication support for Batch accounts
- Support for specifying a virtual network to use with a compute pool
- `allow_run_on_missing_image` option to jobs that allows tasks to execute
under jobs with Docker images that have not been pre-loaded via the
`global_resources`:`docker_images` setting in config.json. Note that, if
possible, you should attempt to specify all Docker images that you intend
to run in the `global_resources`:`docker_images` property in the global
configuration to minimize scheduling to task execution latency.
- Support for running containers as a different user identity (uid/gid)
- Support for Canonical/UbuntuServer/16.04-LTS. 16.04-LTS should be used over
the old 16.04.0-LTS sku, which is affected by
[issue #31](https://github.com/Azure/batch-shipyard/issues/31) and is no
longer receiving updates.
### Changed
- **Breaking Change:** `glusterfs` `volume_driver` for `shared_data_volumes`
should now be named `glusterfs_on_compute`. This distinguishes co-located
GlusterFS on compute nodes from a standalone GlusterFS `storage_cluster`
remote mounted distributed file system.
- Logging now has less verbose details (call origin) by default. Prior
behavior can be restored with the `-v` option.
- Pool existence is now checked prior to job submission, and job submission
can now proceed without an active pool.
- Batch `account` (name) is now an optional property in the credentials config
- Configuration doc broken up into multiple pages
- Update all recipes using Canonical/UbuntuServer/16.04.0-LTS to use
Canonical/UbuntuServer/16.04-LTS instead
- Configuration is no longer shown with `-v`. Use `--show-config` to dump
the complete configuration being used for the command.
- Precompile Python files during build for Docker images
- All dependencies updated to latest versions
- Update Batch API call compatibility for `azure-batch 2.0.0`
### Fixed
- Logging time format and incorrect Zulu time designation.
- `scp` and `multinode_scp` data movement capability is now supported in
Windows given `ssh.exe` and `scp.exe` can be found in `%PATH%` or the current
working directory. `rsync` methods are not supported on Windows.
- Credential encryption is now supported in Windows given `openssl.exe` can
be found in `%PATH%` or the current working directory.
## [2.5.4] - 2017-03-08
### Changed
- Downloaded files are now verified via SHA256 instead of MD5
- Updated NC-series Tesla driver to 375.39
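
The switch from MD5 to SHA256 verification can be illustrated with a minimal sketch. The function names `sha256_of_file` and `verify_download` are hypothetical and are not taken from the Batch Shipyard code base; only the standard library `hashlib` module is used.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large downloads are not loaded at once
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Return True if the file's digest matches the expected hex digest."""
    return sha256_of_file(path) == expected_sha256.lower()
```

A caller would compare the result against a digest published alongside the file and discard the download on a mismatch.
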
### Fixed
- `nvidia-docker` updated to 1.0.1 for compatibility with Docker CE
## [2.5.3] - 2017-03-01
### Added
- `pool rebootnode` command added which allows single node reboot control.
Additionally, the option `--all-start-task-failed` will reboot all nodes in
the specified pool with the start task failed state.
- `jobs del` and `jobs term` now provide a `--termtasks` option to
allow the logic of `jobs termtasks` to precede the delete or terminate
action to the job. This option requires a valid SSH user to the remote nodes
as specified in the `ssh` configuration property in `pool.json`. This new
option is normally not needed if all tasks within the jobs have completed.
### Changed
- The Docker image used for blobxfer is now tied to the specific Batch
Shipyard release
- Default SSH user expiry time if not specified is now 30 days
- All recipes now have the default config.json storage account set to the
link as named in the provided credentials.json file. Now, only the credentials
file needs to be modified to run a recipe.
## [2.5.2] - 2017-02-23
### Added
- Chainer-CPU and Chainer-GPU recipes
- [Troubleshooting guide](docs/96-troubleshooting-guide.md)
### Changed
- Perform automatic container path substitution with host path for
GlusterFS data ingress/egress from/to Azure Storage (#37)
- Allow NAMD-TCP recipe to be run on a single node
### Fixed
- CNTK-GPU-OpenMPI run script fixed to allow multinode+singlegpu executions
- TensorFlow recipes updated for 1.0.0 release
- blobxfer data ingress on Windows (#39)
- Minor delete job and terminate tasks fixes
## [2.5.1] - 2017-02-01
### Added
- Support for max task retries (#23). See configuration doc for more
information.
- Support for task data retention time (#30). See configuration doc for
more information.
### Changed
- **Breaking Change:** `environment_variables_secret_id` was erroneously
named and has been renamed to `environment_variables_keyvault_secret_id` to
follow the other properties with similar behavior.
- Include Python 3.6 Travis CI target
### Fixed
- Automatically assigned task ids are now in the format `dockertask-NNNNN`
and will increment properly past 99999 but will not be padded after that (#27)
- Defect in list tasks for tasks that have not run (#28)
- Docker temporary directory not being set properly
- SLES-HPC will now install all Intel MPI related rpms
- Defect in task file mover for unencrypted credentials (#29)
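
The `dockertask-NNNNN` id format described in the fix for #27 can be sketched as follows. This is an illustration of the padding behavior only; the helper name `generate_task_id` and its parameters are hypothetical, not the project's actual function.

```python
def generate_task_id(n, prefix="dockertask-", width=5):
    """Format an automatically assigned task id.

    The number is zero-padded to `width` digits. Numbers with more
    digits than `width` are emitted as-is, without further padding.
    """
    return f"{prefix}{n:0{width}d}"


print(generate_task_id(7))       # dockertask-00007
print(generate_task_id(99999))   # dockertask-99999
print(generate_task_id(100000))  # dockertask-100000
```

Zero-padded ids sort lexicographically in numeric order up to 99999; beyond that point the ids remain unique but no longer sort that way.
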
## [2.5.0] - 2017-01-19
### Added
- Support for
[Task Dependency Id Ranges](https://docs.microsoft.com/azure/batch/batch-task-dependencies#task-id-range)
with the `depends_on_range` property under each task json property in `tasks`
in the jobs configuration file. Please see the configuration doc for more
information.
- Support for `environment_variables_secret_id` in job and task definitions.
Specifying these properties will fetch manually added secrets (in the form of
a string representation of a json key-value dictionary) from the specified
KeyVault using AAD credentials. Please see the configuration doc for more
information.
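
As a rough sketch of the secret format described for `environment_variables_secret_id`, the snippet below parses a secret value stored as a JSON key-value dictionary string and merges it into existing environment variables. The function name `merge_secret_env` is hypothetical, the KeyVault fetch itself is omitted, and the precedence shown (secret values override existing ones) is an assumption rather than documented behavior.

```python
import json


def merge_secret_env(env, secret_value):
    """Merge environment variables stored as a JSON dictionary string.

    `env` holds the variables already defined for the job or task;
    `secret_value` is the raw secret string fetched from the vault.
    """
    secret_env = json.loads(secret_value)
    if not isinstance(secret_env, dict):
        raise ValueError("secret must be a JSON key-value dictionary")
    merged = dict(env)  # copy so the caller's dictionary is not mutated
    # Assumption: values from the secret take precedence on key conflicts
    merged.update(secret_env)
    return merged


secret = '{"DB_PASSWORD": "example", "MODE": "prod"}'
print(merge_secret_env({"MODE": "dev", "THREADS": "4"}, secret))
```
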
### Fixed
- Remove extraneous import (#12)
- Defect in handling per key secret ids (#13)
- Defect in environment variable dict merge (#17)
- Update Nvidia Docker to 1.0.0 (#21)
## [2.4.0] - 2017-01-11
### Added
- Support for credentials stored in Azure KeyVault
- `keyvault` command added. Please see the usage doc for more information.
- `*_keyvault_secret_id` properties added for keys and passwords in
credentials json. Please see the configuration doc for more information.
- Using Azure KeyVault with Batch Shipyard guide
### Changed
- Updated NC-series Tesla driver to 375.20
## [2.3.1] - 2017-01-03
### Added
- Add support for nvidia-docker with ssh docker tunnel
### Fixed
- Fix multi-job bug with jpcmd
## [2.3.0] - 2016-12-15
### Added
- `pool ssh` command. Please see the usage doc for more information.
- `shm_size` json property added to the json object within the `tasks` array
of a job. Please see the configuration doc for more information.
- SSH, Interactive Sessions and Docker SSH Tunnel guide
### Changed
- Improve usability of the generated SSH docker tunnel script
## [2.2.0] - 2016-12-09
### Added
- CNTK-CPU-Infiniband-IntelMPI recipe
### Changed
- `/opt/intel` is now automatically mounted once again for infiniband-enabled
containers on SUSE SLES-HPC hosts.
### Fixed
- Fix masked KeyErrors on `input_data` and `output_data`
- Fix SAS key generation for data movement
- Typo in ssh public key check on Windows prevented pool add actions
- Pin version of tfm docker image on data transfers
## [2.1.0] - 2016-11-30
### Added
- Allow `--configdir`, `--credentials`, `--config`, `--jobs`, `--pool` config
options to be specified as environment variables. Please see the usage doc
for more information.
- Added subcommand `listskus` to the `pool` command to list available
VM configurations (publisher, offer, sku) for the Batch account
### Changed
- Nodeprep now references cascade and tfm docker images by version instead
of latest to prevent breaking changes affecting older versions. Docker builds
of cascade and tfm based on latest commits are now disabled.
### Fixed
- Cascade docker image run not propagating exit code
## [2.0.0] - 2016-11-23
### Added
- Support for any Internet accessible container registry, including
[Azure Container Registry](https://azure.microsoft.com/services/container-registry/).
Please see the configuration doc for information on how to integrate with
a private container registry.
### Changed
- GPU driver for `STANDARD_NC` instances defined in the
`gpu`:`nvidia_driver`:`source` property is no longer required. If omitted,
an NVIDIA driver will be downloaded automatically with an NVIDIA License
agreement prompt. For `STANDARD_NV` instances, a driver URL is still required.
- Docker container name auto-tagging now prepends the job id in order to
prevent conflicts in case of un-named simultaneous tasks from multiple jobs
- Update CNTK docker images to 2.0beta4 and optimize GPU images for use
with NVIDIA K80/M60
- Update Caffe docker image, default to using OpenBLAS over ATLAS, and
optimize GPU images for use with NVIDIA K80/M60
- Update MXNet GPU docker image optimized for use with NVIDIA K80/M60
- Update TensorFlow docker images to 0.11.0 and optimize GPU images for use
with NVIDIA K80/M60
### Fixed
- Cascade thread exceptions will terminate with non-zero exit code
- Some improvements with node prep and reboots
- Task termination will only issue `docker rm` if the container exists
## [2.0.0rc3] - 2016-11-14 (SC16 Edition)
### Added
- `install_conda_windows.cmd` helper script for installing Batch Shipyard
under Anaconda for Windows
- Added `relative_destination_path` json property for `files` ingress into
node destinations. This allows arbitrary specification of where ingressed
files should be placed relative to the destination path.
- Added ability to ingress directly into the host without the requirement
of GlusterFS for pools with one compute node. A GlusterFS shared volume is
required for pools with more than one compute node for direct to pool data
ingress.
- New commands and options:
- `pool udi`: Update docker images on all compute nodes in a pool. `--image`
and `--digest` options can restrict the scope of the update.
- `data stream`: `--disk` will stream the file as binary to disk instead
of as text to the local console
- `data listfiles`: `--jobid` and `--taskid` allow scoping of the list
files action
- `jobs listtasks`: `--jobid` allows scoping of list tasks to a specific job
- `jobs add`: `--tail` allows tailing the specified file for the last job
and task added
- Keras+Theano-CPU and Keras+Theano-GPU recipes
- Keras+Theano-CPU added as an option in the quickstart guide
### Changed
- **Breaking Change:** Properties of `docker_registry` have changed
significantly to support eventual integration with the Azure Container
Registry service. Credentials for docker logins have moved to the credentials
json file. Please see the configuration doc for more information.
- `files` data ingress no longer creates a directory where files to
be uploaded exist. For example if uploading from a path `/a/b/c`, the
directory `c` is no longer created at the destination. Instead all files
found in `/a/b/c` will be immediately placed directly at the destination
path with sub-directories preserved. This behavior can be modified with
the `relative_destination_path` property.
- `CUDA_CACHE_*` variables are now set for GPU jobs such that compiled targets
pass-through to the host. This allows subsequent container invocations within
the same node the ability to reuse cached PTX JIT targets.
- `batch_shipyard`:`storage_entity_prefix` is now optional and defaults to
`shipyard` if not specified.
- Major internal configuration/settings refactor
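
The changed `files` ingress layout can be sketched as a path mapping. The helper `destination_for` is hypothetical, and treating `relative_destination_path` as a sub-path appended to the destination is an assumption based on the description in this release, not a statement of the actual implementation.

```python
import posixpath


def destination_for(source_root, source_file, destination,
                    relative_destination_path=None):
    """Compute where a file found under source_root lands at the destination.

    The last component of source_root is not recreated; only the path
    of the file relative to source_root (its sub-directories) is kept.
    """
    rel = posixpath.relpath(source_file, source_root)
    base = destination
    if relative_destination_path:
        # Assumption: the property is a sub-path under the destination
        base = posixpath.join(destination, relative_destination_path)
    return posixpath.join(base, rel)


# Uploading from /a/b/c no longer creates a "c" directory
print(destination_for("/a/b/c", "/a/b/c/d/x.txt", "/mnt/data"))
# -> /mnt/data/d/x.txt

# The previous layout can be reproduced by naming the directory explicitly
print(destination_for("/a/b/c", "/a/b/c/d/x.txt", "/mnt/data", "c"))
# -> /mnt/data/c/d/x.txt
```
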
### Fixed
- Pool resize down with wait
- More Python2/3 compatibility issues
- Ensure pools that deploy GlusterFS volumes have more than 1 node
## [2.0.0rc2] - 2016-11-02
### Added
- `install.sh` install/setup helper script
- `shipyard` execution helper script created via `install.sh`
- `generated_sas_expiry_days` json property to config json for the ability to
override the default number of days generated SAS keys are valid for.
- New options on commands/subcommands:
- `jobs add`: `--recreate` recreate any jobs which have completed and use
the same id
- `jobs termtasks`: `--force` force docker kill to tasks even if they are
in completed state
- `pool resize`: `--wait` wait for completion of resize
- HPCG-Infiniband-IntelMPI and HPLinpack-Infiniband-IntelMPI recipes
### Changed
- Default SAS expiry time used for resource files and data movement changed
from 7 to 30 days.
- Pools failing to start will now automatically retrieve stdout.txt and
stderr.txt to the current working directory under
`poolid/<node ids>/std{out,err}.txt`. These files can be inspected
locally and submitted as context for GitHub issues if pertinent.
- Pool resizing will now attempt to add an SSH user on the new nodes if
an SSH public key is referenced or found in the invocation directory
- Improve installation doc
### Fixed
- Improve Python2/3 compatibility
- Unicode literals warning with Click
- Config file loading issue in some contexts
- Documentation typos
## [2.0.0rc1] - 2016-10-28
### Added
- Comprehensive data movement support. Please see the data movement guide
and configuration doc for more information.
- Ingress from local machine with `files` in global configuration
- To GlusterFS shared volume
- To Azure Blob Storage
- To Azure File Storage
- Ingress from Azure Blob Storage, Azure File Storage, or another Azure
Batch Task with `input_data` in pool and jobs configuration
- Pool-level: to compute nodes
- Job-level: to compute nodes prior to running the specified job
- Task-level: to compute nodes prior to running a task of a job
- Egress to local machine as actions
- Single file from compute node
- Entire task-level directories from compute node
- Entire node-level directories from compute node
- Egress to Azure Blob or File Storage with `output_data` in jobs
configuration
- Task-level: to Azure Blob or File Storage on successful completion of a
task
- Credential encryption support. Please see the credential encryption guide
and configuration doc for more information.
- Experimental support for OpenSSH with HPN patches on Ubuntu
- Support pool resize up with GlusterFS
- Support GlusterFS volume options
- Configurable path to place files generated by `pool add` or `pool asu`
commands
- MXNet-CPU and Torch-CPU as options in the quickstart guide
- Update CNTK recipes for 1.7.2 and switch multinode/multigpu samples to
MNIST
- MXNet-CPU and MXNet-GPU recipes
### Changed
- **Breaking Change:** All new CLI experience with proper multilevel commands.
Please see usage doc for more information.
- Added new commands: `cert`, `data`
- Added many new convenience subcommands
- `--filespec` is now delimited by `,` instead of `:`
- **Breaking Change:** `ssh_docker_tunnel` in the `pool_specification` has
been replaced by the `ssh` property. `generate_tunnel_script` has been renamed
to `generate_docker_tunnel_script`. Please see the configuration doc for
more information.
- The `name` property of a task json object in the jobs specification is no
longer required for multi-instance tasks. If not specified, `name` defaults
to `id` for all task types.
- `data stream` no longer has an arbitrary max streaming time; the action will
stream the file indefinitely until the task completes
- Validate container with `storage_entity_prefix` for length issues
- `pool del` action now cleans up and deletes some storage containers
immediately afterwards (with confirmation prompts)
- `/opt/intel` is no longer automatically mounted for infiniband-enabled
containers on SUSE SLES-HPC hosts. Please see the configuration doc
on how to manually map this directory if required. OpenLogic CentOS-HPC
hosts remain unchanged.
- Modularized code base
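
A comma-delimited `--filespec` value could be split as in the sketch below. The three-field layout (job id, task id, file name) and the helper name `parse_filespec` are assumptions for illustration; please see the usage doc for the authoritative format.

```python
def parse_filespec(filespec):
    """Split a comma-delimited filespec into (job_id, task_id, filename).

    Assumption: the value carries exactly three fields. The split is
    limited to two cuts so any further commas stay in the file name.
    """
    parts = filespec.split(",", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"invalid filespec: {filespec!r}")
    return tuple(parts)


print(parse_filespec("myjob,dockertask-00000,stdout.txt"))
```
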
### Fixed
- GlusterFS mount ownership/permissions fixed such that SSH users can
read/write
- Azure File shared volume setup when invoked from Windows
- Python2 compatibility issues with file encoding
- Allow shipyard.py to be invoked outside of the root of the GitHub cloned
base directory
- TensorFlow-Distributed recipe issues
## [1.1.0] - 2016-10-05
### Added
- Transparent Infiniband assist for SUSE SLES-HPC 12-SP1 image
- Add version for shipyard.py script
- NAMD-GPU, OpenFOAM-Infiniband-IntelMPI, Torch-CPU, Torch-GPU recipes
### Changed
- GlusterFS mountpoint is now within `$AZ_BATCH_NODE_SHARED_DIR` so files can
be viewed/downloaded with Batch APIs
- NAMD-Infiniband-IntelMPI recipe now contains a real Docker image link
### Fixed
- GlusterFS not properly starting on Ubuntu
## [1.0.0] - 2016-09-22
### Added
- Automated GlusterFS support
- Added `configdir` argument for convenience in loading configuration files,
please see the usage documentation for more details
- Ability to retrieve files from live compute nodes in addition to streaming
- Added `filespec` argument for non-interactive `streamfile` and `gettaskfile`
actions
- Added .gitattributes to designate Unix line-endings for text files
- Sample configuration files for each recipe
- Caffe-CPU, OpenFOAM-TCP-OpenMPI, TensorFlow-CPU, TensorFlow-Distributed
recipes
### Changed
- Updated configuration docs to detail which properties are required vs. those
that are optional
- SSH tunnel user is now added with a default expiry time of 7 days which can
be modified through the pool configuration file
- Configuration is not output to console by default, `-v` flag added for
verbose output
- Deterministic remote login settings output (node, ip, port) that can be
easily parsed
- Update Azurefile Docker Volume Driver plugin to 0.5.1
### Fixed
- Cascade (container-only) start issue with no private registry
- Non-shipyard docker image node prep with new azure-storage package
- Inter-node communication not specified key error on addpool
- Cross-platform fixes:
- Temp file creation used for environment variables
- SSH tunnel creation disabled on Windows if public key is not supplied
- Batch Shipyard Docker container not getting cleaned up if peer-to-peer is
disabled
### Removed
- `gpu`:`nvidia_driver`:`version` property removed from pool configuration
and is no longer required as the version is now automatically detected
## [0.2.0] - 2016-09-08
### Added
- Transparent GPU support for Azure N-Series VMs
- New recipes added: Caffe-GPU, CNTK-CPU-OpenMPI, CNTK-GPU-OpenMPI,
FFmpeg-GPU, NAMD-Infiniband-IntelMPI, NAMD-TCP, TensorFlow-GPU
### Changed
- Multi-instance tasks now automatically complete their job by default. This
removes the need to run the `cleanmijobs` action in the shipyard tool.
Please refer to the
[multi-instance documentation](docs/80-batch-shipyard-multi-instance-tasks.md)
for more information and limitations.
- Dumb back-off policy for DHT router convergence
- Optimized Docker image storage location for Azure VMs
- Prompts added for destructive operations in the shipyard tool
### Fixed
- Incorrect file location of node prep finished
- Blocking wait for global resource on pool can now be disabled
- Incorrect process call to query for docker image size when peer-to-peer
transfer is disabled
- Use azure-storage 0.33.0 to fix Edm.Int64 overflow issue
## [0.1.0] - 2016-09-01
### Added
- Initial release
[Unreleased]: https://github.com/Azure/batch-shipyard/compare/3.9.1...HEAD
[3.9.1]: https://github.com/Azure/batch-shipyard/compare/3.9.0...3.9.1
[3.9.0]: https://github.com/Azure/batch-shipyard/compare/3.8.2...3.9.0
[3.8.2]: https://github.com/Azure/batch-shipyard/compare/3.8.1...3.8.2
[3.8.1]: https://github.com/Azure/batch-shipyard/compare/3.8.0...3.8.1
[3.8.0]: https://github.com/Azure/batch-shipyard/compare/3.7.1...3.8.0
[3.7.1]: https://github.com/Azure/batch-shipyard/compare/3.7.0...3.7.1
[3.7.0]: https://github.com/Azure/batch-shipyard/compare/3.6.1...3.7.0
[3.6.1]: https://github.com/Azure/batch-shipyard/compare/3.6.0...3.6.1
[3.6.0]: https://github.com/Azure/batch-shipyard/compare/3.6.0b1...3.6.0
[3.6.0b1]: https://github.com/Azure/batch-shipyard/compare/3.6.0a1...3.6.0b1
[3.6.0a1]: https://github.com/Azure/batch-shipyard/compare/3.5.3...3.6.0a1
[3.5.3]: https://github.com/Azure/batch-shipyard/compare/3.5.2...3.5.3
[3.5.2]: https://github.com/Azure/batch-shipyard/compare/3.5.1...3.5.2
[3.5.1]: https://github.com/Azure/batch-shipyard/compare/3.5.0...3.5.1
[3.5.0]: https://github.com/Azure/batch-shipyard/compare/3.5.0b3...3.5.0
[3.5.0b3]: https://github.com/Azure/batch-shipyard/compare/3.5.0b2...3.5.0b3
[3.5.0b2]: https://github.com/Azure/batch-shipyard/compare/3.5.0b1...3.5.0b2
[3.5.0b1]: https://github.com/Azure/batch-shipyard/compare/3.4.0...3.5.0b1
[3.4.0]: https://github.com/Azure/batch-shipyard/compare/3.3.0...3.4.0
[3.3.0]: https://github.com/Azure/batch-shipyard/compare/3.2.0...3.3.0
[3.2.0]: https://github.com/Azure/batch-shipyard/compare/3.1.0...3.2.0
[3.1.0]: https://github.com/Azure/batch-shipyard/compare/3.0.3...3.1.0
[3.0.3]: https://github.com/Azure/batch-shipyard/compare/3.0.2...3.0.3
[3.0.2]: https://github.com/Azure/batch-shipyard/compare/3.0.1...3.0.2
[3.0.1]: https://github.com/Azure/batch-shipyard/compare/3.0.0...3.0.1
[3.0.0]: https://github.com/Azure/batch-shipyard/compare/3.0.0rc1...3.0.0
[3.0.0rc1]: https://github.com/Azure/batch-shipyard/compare/3.0.0b1...3.0.0rc1
[3.0.0b1]: https://github.com/Azure/batch-shipyard/compare/3.0.0a2...3.0.0b1
[3.0.0a2]: https://github.com/Azure/batch-shipyard/compare/3.0.0a1...3.0.0a2
[3.0.0a1]: https://github.com/Azure/batch-shipyard/compare/2.9.6...3.0.0a1
[2.9.6]: https://github.com/Azure/batch-shipyard/compare/2.9.5...2.9.6
[2.9.5]: https://github.com/Azure/batch-shipyard/compare/2.9.4...2.9.5
[2.9.4]: https://github.com/Azure/batch-shipyard/compare/2.9.3...2.9.4
[2.9.3]: https://github.com/Azure/batch-shipyard/compare/2.9.2...2.9.3
[2.9.2]: https://github.com/Azure/batch-shipyard/compare/2.9.0rc1...2.9.2
[2.9.0rc1]: https://github.com/Azure/batch-shipyard/compare/2.9.0b2...2.9.0rc1
[2.9.0b2]: https://github.com/Azure/batch-shipyard/compare/2.9.0b1...2.9.0b2
[2.9.0b1]: https://github.com/Azure/batch-shipyard/compare/2.8.0...2.9.0b1
[2.8.0]: https://github.com/Azure/batch-shipyard/compare/2.8.0rc2...2.8.0
[2.8.0rc2]: https://github.com/Azure/batch-shipyard/compare/2.8.0rc1...2.8.0rc2
[2.8.0rc1]: https://github.com/Azure/batch-shipyard/compare/2.8.0b1...2.8.0rc1
[2.8.0b1]: https://github.com/Azure/batch-shipyard/compare/2.7.0...2.8.0b1
[2.7.0]: https://github.com/Azure/batch-shipyard/compare/2.7.0rc1...2.7.0
[2.7.0rc1]: https://github.com/Azure/batch-shipyard/compare/2.7.0b2...2.7.0rc1
[2.7.0b2]: https://github.com/Azure/batch-shipyard/compare/2.7.0b1...2.7.0b2
[2.7.0b1]: https://github.com/Azure/batch-shipyard/compare/2.6.2...2.7.0b1
[2.6.2]: https://github.com/Azure/batch-shipyard/compare/2.6.1...2.6.2
[2.6.1]: https://github.com/Azure/batch-shipyard/compare/2.6.0...2.6.1
[2.6.0]: https://github.com/Azure/batch-shipyard/compare/2.6.0rc1...2.6.0
[2.6.0rc1]: https://github.com/Azure/batch-shipyard/compare/2.6.0b3...2.6.0rc1
[2.6.0b3]: https://github.com/Azure/batch-shipyard/compare/2.6.0b2...2.6.0b3
[2.6.0b2]: https://github.com/Azure/batch-shipyard/compare/2.6.0b1...2.6.0b2
[2.6.0b1]: https://github.com/Azure/batch-shipyard/compare/2.5.4...2.6.0b1
[2.5.4]: https://github.com/Azure/batch-shipyard/compare/2.5.3...2.5.4
[2.5.3]: https://github.com/Azure/batch-shipyard/compare/2.5.2...2.5.3
[2.5.2]: https://github.com/Azure/batch-shipyard/compare/2.5.1...2.5.2
[2.5.1]: https://github.com/Azure/batch-shipyard/compare/2.5.0...2.5.1
[2.5.0]: https://github.com/Azure/batch-shipyard/compare/2.4.0...2.5.0
[2.4.0]: https://github.com/Azure/batch-shipyard/compare/2.3.1...2.4.0
[2.3.1]: https://github.com/Azure/batch-shipyard/compare/2.3.0...2.3.1
[2.3.0]: https://github.com/Azure/batch-shipyard/compare/2.2.0...2.3.0
[2.2.0]: https://github.com/Azure/batch-shipyard/compare/2.1.0...2.2.0
[2.1.0]: https://github.com/Azure/batch-shipyard/compare/2.0.0...2.1.0
[2.0.0]: https://github.com/Azure/batch-shipyard/compare/2.0.0rc3...2.0.0
[2.0.0rc3]: https://github.com/Azure/batch-shipyard/compare/2.0.0rc2...2.0.0rc3
[2.0.0rc2]: https://github.com/Azure/batch-shipyard/compare/2.0.0rc1...2.0.0rc2
[2.0.0rc1]: https://github.com/Azure/batch-shipyard/compare/1.1.0...2.0.0rc1
[1.1.0]: https://github.com/Azure/batch-shipyard/compare/1.0.0...1.1.0
[1.0.0]: https://github.com/Azure/batch-shipyard/compare/0.2.0...1.0.0
[0.2.0]: https://github.com/Azure/batch-shipyard/compare/0.1.0...0.2.0
[0.1.0]: https://github.com/Azure/batch-shipyard/compare/ab1fa4d...0.1.0