Commit Graph

537 Commits

Author SHA1 Message Date
Fred Park cbf137422e
Fix task termination for infinite retry tasks
- Resolves #308
2019-09-03 15:12:10 +00:00
Fred Park 03046aa692
Fix possible null from node error value collection
- Resolves #309
2019-08-30 21:13:22 +00:00
Fred Park 0d5850c8c9
Fix task termination in non-native mode
- SSH side-channel docker kill signal was not being sent as Docker tasks
were not being detected properly
- Also fix issue with pool images update not executing if block on
images is false
- Resolves #308
2019-08-30 20:59:51 +00:00
Fred Park 0e773c5158
Update docs regarding AAD and subscription id
- Provide a better error message in the case of a missing subscription id
- Resolves #305
2019-08-30 15:53:19 +00:00
Fred Park 3a91511e50
Fix possible null node agent info on list nodes
- Resolves #307
2019-08-30 15:30:49 +00:00
Fred Park 88c3cdf8be
Fix prefix filter on task factory remote_path
- Resolves #303
2019-08-29 16:53:47 +00:00
Fred Park d77d8a8cce
Fix download cascade outputs on start task failure 2019-08-23 15:42:35 +00:00
Fred Park 0fc3522c50
Tag for 3.8.1 release 2019-08-19 16:24:33 +00:00
Fred Park 8b7b17f465
Fix Task Runner regressions
- Input/output data phases not correctly triggered for multi-instance
and MPI jobs
- Output data was not triggered at all
- Pre-exec triggering on native
- Resolves #301
2019-08-16 17:07:44 +00:00
Fred Park ff49d187a4
Tag for 3.8.0 release 2019-08-14 03:23:09 +00:00
Fred Park 826c46afe2
Bring your own Public IP support 2019-08-14 03:23:09 +00:00
Fred Park e9130f83f4
MCR migration
- Migrate images to Microsoft Container Registry
- Fix Shellcheck issues
- Related to #278
2019-08-14 03:23:03 +00:00
Fred Park 290209381e
Update Dependencies
- Update NVIDIA compute driver to 418.67
- Update NVIDIA grid driver to 430.30
- Update Batch Insights to 1.3.0
- Update blobxfer to 1.9.0
- Update Python dependencies
- Drop Python 3.4 support
2019-08-12 20:42:32 +00:00
Fred Park 3052e98c8b
Add MVAPICH support
- More changes for #287
- Automatically source environment modules if it exists
- Fix some typos
2019-08-12 01:58:39 +00:00
Fred Park be52a9c3b0
Various updates
- Fail VM provisioning if expected IB card is not present
- Update platform image native support
2019-08-12 01:58:28 +00:00
Fred Park b6044b3489
Update GPU support
- Update to Docker CE 19.03.1
- Use "native" Docker/containerd GPU support
- Breaking change in jobs configuration to allow arbitrary configuration
- Update docs
- Resolves #293
2019-08-08 20:36:41 +00:00
Fred Park e6709409a2
Update to Singularity 3.3.0
- Check for expected ephemeral mount point
2019-08-07 21:13:30 +00:00
Fred Park caec6b566f
Allow Premium File Shares via AAD
- Documentation clarification around the main storage account
- Resolves #294
2019-08-05 18:29:27 +00:00
Fred Park 7ae3cb9e50
Merge branch 'master' into singularity3 2019-08-05 18:28:19 +00:00
Vincent Labonté b64c3cb324 Add Infiniband support with Open MPI and MPICH (#297)
* Add Infiniband support with Open MPI

* Add mpiBench-Infiniband-OpenMPI recipe

* Add setup script for OpenFOAM-Infiniband-OpenMPI recipe

* Update setup script for OpenFOAM-Infiniband-OpenMPI recipe

* Add OpenFOAM-Infiniband-OpenMPI recipe

* Add documentation for recipes

* Add Infiniband support with MPICH

* Add mpiBench-Infiniband-MPICH recipe
2019-08-05 10:39:08 -04:00
Fred Park d3fccd613d
Tag for 3.7.1 release 2019-07-24 02:55:52 +00:00
Fred Park 3c376224a3
Fix GPU node provisioning
- Start task failures due to docker-ce-cli info changing output
- Pin docker-ce-cli
- Make docker root dir parsing more robust
- Fix LIS and CentOS 7.6 GPU provisioning
- Resolves #291
2019-07-24 02:55:35 +00:00
Fred Park 4d69c96d79
Merge branch 'sriov-merge' into singularity3 2019-07-23 21:02:52 +00:00
Vincent Labonté cc42916cba Fixes and update of recipes (#290)
* Fix multi-instance tasks that are not an MPI task

* Add setup task script for CNTK-CPU-Infiniband-IntelMPI

* Update CNTK-CPU-Infiniband-IntelMPI recipe

* Add MPI executable path option

* Update CNTK-CPU-OpenMPI recipe

* Change the default MPI executable_path to mpirun

* Modify CNTK-CPU-Infiniband-IntelMPI recipe

* Add setup task script for CNTK-GPU-Infiniband-IntelMPI

* Update CNTK-GPU-Infiniband-IntelMPI recipe

* Add setup task script for CNTK-GPU-OpenMPI

* Add setup task script for NAMD-Infiniband-IntelMPI

* Update NAMD-Infiniband-IntelMPI recipe

* Add setup task script for OpenFOAM-Infiniband-IntelMPI

* Update OpenFOAM-Infiniband-IntelMPI recipe

* Update TensorFlow-GPU Singularity recipe

* Add setup task script for OpenFOAM-TCP-OpenMPI

* Update OpenFOAM-TCP-OpenMPI recipe

* Add support for arbitrary commands with the MPI processes_per_node option

* Fix MPI with native images

* Modify CNTK-CPU-Infiniband-IntelMPI recipe

* Modify CNTK-GPU-Infiniband-IntelMPI recipe

* Modify NAMD-Infiniband-IntelMPI recipe

* Update processes_per_node documentation

* Fix `pool images list` with Singularity images

* Modify OpenFOAM-Infiniband-IntelMPI set up script

* Add check for mpi setting with Windows

* Add auto scratch support with OpenFOAM-Infiniband-IntelMPI recipe

* Modify OpenFOAM-TCP-OpenMPI set up script

* Add auto scratch support with OpenFOAM-TCP-OpenMPI recipe

* Add mpiBench-IntelMPI recipe

* Add mpiBench-MPICH recipe

* Add mpiBench-OpenMPI recipe

* Resolve PR comments

* Resolve PR comments
2019-07-17 18:57:06 -07:00
Fred Park ce0caaa24d
Add promo VM size (NC/NV/H) support 2019-07-16 16:07:03 +00:00
Fred Park 25fec92273
Support Hc/Hb
- Support RDMA bifurcation
- Update platform docs for CentOS-HPC 7.6
2019-07-15 03:32:04 +00:00
Fred Park 559463cd12
Merge branch 'develop' into sriov-merge 2019-07-09 21:45:31 +00:00
Vincent Labonté 442a22bd28 Improve MPI Interface for Singularity and Docker (#289)
* Add MPI config support for MPICH

* Add MPI config support for Docker containers

* Resolve PR comments

* Make use of the script runner with MPI and Docker

* Minor fixes

* Resolve PR comments
2019-07-09 13:46:12 -07:00
Vincent Labonté e6e60048a7 Improve MPI Interface for Intel MPI and Open MPI with Singularity images (#288)
* Add MPI config support for IntelMPI

* Separate prologue command into user and system

* Add MpiSettings

* Add MPI config support for Open MPI

* Fix MPI config support for IntelMPI

* Workaround for Open MPI btl tcp

* Correct documentation

* Fix non-MPI multi-instance execution

* Resolve PR comments

* Resolve PR comments

* Partially address #287
2019-07-03 12:40:54 -07:00
Fred Park 4b9a004f1a
Update to Batch 7.0.0 SDK
- Breaking change: pool listskus -> account images
- Support setting working directory for native mode
- Resolves #286
2019-06-27 20:08:49 +00:00
Fred Park b93f60213d
Support conditional output data
- Resolves #230
2019-06-24 18:03:43 +00:00
Fred Park 7b138e785a
Support user-specified job prep/release tasks
- Host mode only
- Resolves #202
2019-06-24 16:02:30 +00:00
Vincent Labonté 9f58ad0042 Fixes for Singularity 3 support (#285)
* Fix credentials when running task with Singularity docker:// images

* Fix Singularity cache directory's ownership

* Fix images update command

* Fix running cascade with use_shipyard_docker_image

* Remove envfile dump in task runner
2019-06-21 10:36:28 -07:00
Fred Park d6e939d58e
Use singularity env var over param for tasks
- Doc fixups
2019-06-19 20:11:09 +00:00
Fred Park 824f6de415
Merge branch 'develop' into singularity3
- Move username/password run options to settings for singularity
2019-06-18 20:49:38 +00:00
Fred Park bc4be6dbc3
Proxy non-native task execution via script
- Resolves #235
2019-06-18 20:11:12 +00:00
Vincent Labonté 6a0f90d509 Singularity list images and run ORAS images (#284)
* Remove unused directories

* Augment pool images list to support Singularity images

* Fix specific image update with private ORAS registries

* Add support to run ORAS image from a private registry

* Only log in used registries

* Fix checks

* Update documentation

* Resolve PR comments

* Resolve PR comments
2019-06-17 08:29:25 -07:00
Vincent Labonté 8293a20be3 Support multiple Singularity registries (#283)
* Add support for multiple singularity registries (docker:// and oras://)

* Resolve PR comments

* Resolve PR comments
2019-06-13 10:22:18 -07:00
Vincent Labonté 5307f1779d Fix image update command (#281)
* Create one log file per container mode

* Make singularity 3 work

* Minor fixes

* Fix cascade with docker image and singularity image

* Add capability to pull from library://

* Add singularity signed images to config file

* Add singularity signed images to the global resource table

* Pull and verify signed singularity images

* Put the singularity sypgp directory in the mount directory

* Add ability to provide key file to verify a singularity image

* Resolve PR comments

* Fix Singularity registry credentials

* Extract cascade logic from nodeprep

* Re-run cascade if the image update command has no specified image

* Fix prefix errors when using shipyard docker image

* Make sure that the cascade log files are not overridden

* Fix wrong parameter name

* Clarify error message when trying to update images on Windows

* Update documentation

* Fix checks

* Resolve PR comments
2019-06-10 10:35:28 -07:00
Vincent Labonté 305d376cdc Support Singularity signed image verification (#280)
* Create one log file per container mode

* Make singularity 3 work

* Minor fixes

* Fix cascade with docker image and singularity image

* Add capability to pull from library://

* Add singularity signed images to config file

* Add singularity signed images to the global resource table

* Pull and verify signed singularity images

* Put the singularity sypgp directory in the mount directory

* Add ability to provide key file to verify a singularity image

* Resolve PR comments

* Fix Singularity registry credentials
2019-06-05 11:14:37 -07:00
Vincent Labonté f9d0ad9a7f Initial support for Singularity 3 and SIF (#279)
* Create one log file per container mode

* Make singularity 3 work

* Minor fixes

* Fix cascade with docker image and singularity image

* Add capability to pull from library://
2019-05-29 14:15:22 -07:00
Vincent Labonté a68579c095 Prepare for Singularity3 work (#276)
* Remove torrent functionality

* Remove torrent storage

* Fix singularity permissions

* Add container mode in cascade.py

* Fix errors

* Fix PR comments

* Fix flake8 errors
2019-05-22 14:46:44 -07:00
Fred Park c39a919bf7
Add more Slurm partition settings
- Partition preemption settings
- Partition other options
2019-04-02 15:59:36 -07:00
Fred Park 97dac7d5aa
Update blobxfer to 1.7.1
- Update some docs
2019-03-05 08:05:02 -08:00
Fred Park 2301e20cfc
Tag for 3.7.0 release 2019-02-28 14:03:35 -08:00
Fred Park 163f1d0cb6
Suspend and restart support for Slurm clusters 2019-02-28 13:48:20 -08:00
Fred Park ed5a21d416
Doc updates
- Add support for CentOS 7.6 native
2019-02-28 13:48:16 -08:00
Fred Park 6e8d2a119f
Component updates
- Update blobxfer to 1.7.0
- Update Batch Insights to 1.2.0
- Update LIS
- Update NV driver to 410.92
- Update NC/ND driver to 410.104
2019-02-28 12:11:19 -08:00
Fred Park 4fa60af37a
Fix accelerated networking provisioning
- Add pool exists command
- Add recreate option to pool add
2019-02-28 12:11:18 -08:00
Fred Park 314037f76f
Slurm on Batch feature
- Package and use Slurm 18.08 instead of default from distro repo
- Slurm "master" contains separate controller and login nodes
- Integrate RemoteFS shared file system into Slurm cluster
- Auto feature tagging on Slurm nodes
- Support CentOS 7, Ubuntu 16.04, Ubuntu 18.04 Batch pools as Slurm
  node targets
- Unify login and Batch pools on cluster user based on login user
- Auto provision passwordless SSH user on compute nodes with login user
  context
- Add slurm cluster commands, including orchestrate command
- Add separate SSH for controller, login, nodes
- Add Slurm configuration doc
- Add Slurm guide
- Add Slurm recipe
- Update usage doc
- Remove deprecated MSI VM extension from monitoring and federation
- Fix pool nodes count on non-existent pool
- Refactor SSH info to allow offsets
- Add fs cluster orchestrate command
2019-02-28 12:11:10 -08:00