Commit Graph

220 Commits

Author SHA1 Message Date
Fred Park 2d27034cc7
Add FI_PROVIDER to Intel MPI ofi fabrics 2019-10-25 17:55:33 +00:00
Fred Park 8d1fc8e46e
Warn/error when mixing auto_scratch with autoscale
- Resolves #319
2019-10-22 16:58:38 +00:00
Fred Park acb7c6f40c
Fix streaming race in task termination 2019-10-17 16:32:07 +00:00
Fred Park 0cf8b48f3e
Fix pre-exec in non-native
- Command should not be wrapped in a shell
- Fix user add error message if ssh key is not present
2019-10-10 23:42:49 +00:00
Fred Park 89dff6b201
Prevent job submission on older pools
- Resolves #312
2019-09-11 17:44:56 +00:00
Fred Park 1706902959
Fix non-native data transfer sequence coupling
- Non-native input_data or output_data of azure_storage type with
sequences greater than 1 would have each individual action depend upon
the success of the prior action
- Resolves #310
2019-09-05 19:29:33 +00:00
Fred Park cbf137422e
Fix task termination for infinite retry tasks
- Resolves #308
2019-09-03 15:12:10 +00:00
Fred Park 03046aa692
Fix possible null from node error value collection
- Resolves #309
2019-08-30 21:13:22 +00:00
Fred Park 0d5850c8c9
Fix task termination in non-native mode
- SSH side-channel docker kill signal was not being sent as Docker tasks
were not being detected properly
- Also fix issue with pool images update not executing if block on
images is false
- Resolves #308
2019-08-30 20:59:51 +00:00
Fred Park 3a91511e50
Fix possible null node agent info on list nodes
- Resolves #307
2019-08-30 15:30:49 +00:00
Fred Park d77d8a8cce
Fix download cascade outputs on start task failure 2019-08-23 15:42:35 +00:00
Fred Park 8b7b17f465
Fix Task Runner regressions
- Input/output data phases not correctly triggered for multi-instance
and MPI jobs
- Output data was not triggered at all
- Pre-exec triggering on native
- Resolves #301
2019-08-16 17:07:44 +00:00
Fred Park e9130f83f4
MCR migration
- Migrate images to Microsoft Container Registry
- Fix Shellcheck issues
- Related to #278
2019-08-14 03:23:03 +00:00
Fred Park 3052e98c8b
Add MVAPICH support
- More changes for #287
- Automatically source environment modules if it exists
- Fix some typos
2019-08-12 01:58:39 +00:00
Fred Park be52a9c3b0
Various updates
- Fail VM provisioning if expected IB card is not present
- Update platform image native support
2019-08-12 01:58:28 +00:00
Vincent Labonté b64c3cb324 Add Infiniband support with Open MPI and MPICH (#297)
* Add Infiniband support with Open MPI

* Add mpiBench-Infiniband-OpenMPI recipe

* Add setup script for OpenFOAM-Infiniband-OpenMPI recipe

* Update setup script for OpenFOAM-Infiniband-OpenMPI recipe

* Add OpenFOAM-Infiniband-OpenMPI recipe

* Add documentation for recipes

* Add Infiniband support with MPICH

* Add mpiBench-Infiniband-MPICH recipe
2019-08-05 10:39:08 -04:00
Fred Park 4d69c96d79
Merge branch 'sriov-merge' into singularity3 2019-07-23 21:02:52 +00:00
Vincent Labonté cc42916cba Fixes and update of recipes (#290)
* Fix multi-instance tasks that are not MPI tasks

* Add setup task script for CNTK-CPU-Infiniband-IntelMPI

* Update CNTK-CPU-Infiniband-IntelMPI recipe

* Add MPI executable path option

* Update CNTK-CPU-OpenMPI recipe

* Change the default MPI executable_path to mpirun

* Modify CNTK-CPU-Infiniband-IntelMPI recipe

* Add setup task script for CNTK-GPU-Infiniband-IntelMPI

* Update CNTK-GPU-Infiniband-IntelMPI recipe

* Add setup task script for CNTK-GPU-OpenMPI

* Add setup task script for NAMD-Infiniband-IntelMPI

* Update NAMD-Infiniband-IntelMPI recipe

* Add setup task script for OpenFOAM-Infiniband-IntelMPI

* Update OpenFOAM-Infiniband-IntelMPI recipe

* Update TensorFlow-GPU Singularity recipe

* Add setup task script for OpenFOAM-TCP-OpenMPI

* Update OpenFOAM-TCP-OpenMPI recipe

* Add support for arbitrary commands with the MPI processes_per_node option

* Fix MPI with native images

* Modify CNTK-CPU-Infiniband-IntelMPI recipe

* Modify CNTK-GPU-Infiniband-IntelMPI recipe

* Modify NAMD-Infiniband-IntelMPI recipe

* Update processes_per_node documentation

* Fix `pool images list` with Singularity images

* Modify OpenFOAM-Infiniband-IntelMPI set up script

* Add check for mpi setting with Windows

* Add auto scratch support with OpenFOAM-Infiniband-IntelMPI recipe

* Modify OpenFOAM-TCP-OpenMPI set up script

* Add auto scratch support with OpenFOAM-TCP-OpenMPI recipe

* Add mpiBench-IntelMPI recipe

* Add mpiBench-MPICH recipe

* Add mpiBench-OpenMPI recipe

* Resolve PR comments

* Resolve PR comments
2019-07-17 18:57:06 -07:00
Fred Park 25fec92273
Support Hc/Hb
- Support RDMA bifurcation
- Update platform docs for CentOS-HPC 7.6
2019-07-15 03:32:04 +00:00
Fred Park 559463cd12
Merge branch 'develop' into sriov-merge 2019-07-09 21:45:31 +00:00
Vincent Labonté 442a22bd28 Improve MPI Interface for Singularity and Docker (#289)
* Add MPI config support for MPICH

* Add MPI config support for Docker containers

* Resolve PR comments

* Make use of the script runner with MPI and Docker

* Minor fixes

* Resolve PR comments
2019-07-09 13:46:12 -07:00
Vincent Labonté e6e60048a7 Improve MPI Interface for Intel MPI and Open MPI with Singularity images (#288)
* Add MPI config support for IntelMPI

* Separate prologue command into user and system

* Add MpiSettings

* Add MPI config support for Open MPI

* Fix MPI config support for IntelMPI

* Workaround for Open MPI btl tcp

* Correct documentation

* Fix non-MPI multi-instance execution

* Resolve PR comments

* Resolve PR comments

* Partially address #287
2019-07-03 12:40:54 -07:00
Fred Park 4b9a004f1a
Update to Batch 7.0.0 SDK
- Breaking change: pool listskus -> account images
- Support setting working directory for native mode
- Resolves #286
2019-06-27 20:08:49 +00:00
Fred Park 7b138e785a
Support user-specified job prep/release tasks
- Host mode only
- Resolves #202
2019-06-24 16:02:30 +00:00
Fred Park 824f6de415
Merge branch 'develop' into singularity3
- Move username/password run options to settings for singularity
2019-06-18 20:49:38 +00:00
Fred Park bc4be6dbc3
Proxy non-native task execution via script
- Resolves #235
2019-06-18 20:11:12 +00:00
Vincent Labonté 6a0f90d509 Singularity list images and run ORAS images (#284)
* Remove unused directories

* Augment pool images list to support Singularity images

* Fix specific image update with private ORAS registries

* Add support to run ORAS image from a private registry

* Only log in used registries

* Fix checks

* Update documentation

* Resolve PR comments

* Resolve PR comments
2019-06-17 08:29:25 -07:00
Vincent Labonté 305d376cdc Support Singularity signed image verification (#280)
* Create one log file per container mode

* Make singularity 3 work

* Minor fixes

* Fix cascade with docker image and singularity image

* Add capability to pull from library://

* Add singularity signed images to config file

* Add singularity signed images to the global resource table

* Pull and verify signed singularity images

* Put the singularity sypgp directory in the mount directory

* Add ability to provide key file to verify a singularity image

* Resolve PR comments

* Fix Singularity registry credentials
2019-06-05 11:14:37 -07:00
Fred Park 4fa60af37a
Fix accelerated networking provisioning
- Add pool exists command
- Add recreate option to pool add
2019-02-28 12:11:18 -08:00
Fred Park 314037f76f
Slurm on Batch feature
- Package and use Slurm 18.08 instead of default from distro repo
- Slurm "master" contains separate controller and login nodes
- Integrate RemoteFS shared file system into Slurm cluster
- Auto feature tagging on Slurm nodes
- Support CentOS 7, Ubuntu 16.04, Ubuntu 18.04 Batch pools as Slurm
  node targets
- Unify login and Batch pools on cluster user based on login user
- Auto provision passwordless SSH user on compute nodes with login user
  context
- Add slurm cluster commands, including orchestrate command
- Add separate SSH for controller, login, nodes
- Add Slurm configuration doc
- Add Slurm guide
- Add Slurm recipe
- Update usage doc
- Remove deprecated MSI VM extension from monitoring and federation
- Fix pool nodes count on non-existent pool
- Refactor SSH info to allow offsets
- Add fs cluster orchestrate command
2019-02-28 12:11:10 -08:00
Fred Park a30cb674ca
Migrate to Azure Batch Python SDK 6.0.0
- Fix breaking changes
- Update dependencies
- Gate some debug messages behind the verbose flag
2019-01-16 13:03:30 -08:00
Fred Park 458c69e6a9
Improve task factory generation/submission speed
- Amortize copy cost over single deep copy and pop
- Add more feedback during generation/submission for large task
  factories and task sets
2019-01-16 13:03:29 -08:00
Fred Park f53eee7bbd
Block job submission on non-active pools
- Resolves #251
2019-01-10 13:47:08 -08:00
Fred Park 70532fa4ae
Add Genomics recipes
- BLAST and RNASeq pipelines
- Fix adding tasks to an existing job with existing merge tasks
- Add support for force_enable_task_dependencies at the job level
- Fix doc typos
2018-11-29 08:58:01 -08:00
Fred Park 387fd14d54
Fix --tail console output
- Fix occurrences where the stream would occasionally repeat characters
- Allow incremental unicode decoding of the stream
2018-11-29 08:58:01 -08:00
Fred Park 54c78b7c49
Auto scratch support 2018-11-05 11:24:22 -08:00
Fred Park 96c220df34
Update to Azure Batch 5.1.0 SDK
- Accommodate breaking changes
- Add compute node agent info
2018-09-18 13:56:26 -07:00
Fred Park 1a4ad686ef
Fix federation task id generator
- Fix list issue with empty addition timestamps or uids
- Expedite generating task ids for federation bound tasks with autogenerated
  task ids
2018-08-23 13:37:03 -07:00
Fred Park c1bbd5131d
Add count commands
- jobs tasks count and pool nodes count commands with --raw support
- Update usage doc
- Resolves #228
2018-08-09 13:07:06 -07:00
Fred Park 6e1409c16f
Fix jobs tasks term command without pool ssh info 2018-08-08 15:48:36 -07:00
Fred Park acdea94722
Tag for 3.6.0a1 release 2018-08-06 10:35:31 -07:00
Fred Park 52628d27cf
Federation support
- Federation proxy lifecycle management
- Federation lifecycle management
- Federation job submission and management
- Mount Azure File share for auto-rotated log persistence
- FIFO within job support
- Constraint matching
- Federations can be created in "unique job id" mode requiring all
  submitted jobs via fed jobs add be unique across the entire federation
- Supports nearly 15K actions per job (in non-unique job id mode)
- Task dependency rewrite engine for federated jobs
  - Verify dependencies only within task group
  - Uniquely identify task dependencies
- Allow tuning of scheduling behavior options
- Package federation logic on proxy into Docker container
- Full guide/walkthrough for federation feature
- Refactor common code between monitor/fed proxy into resource
- Other doc updates
2018-08-06 09:30:36 -07:00
Fred Park 3c76f9e1d9
Fix various environment variable issues
- Remove environment variable file create/upload and use task environment
  variable and dump instead for non-native pools
- Remove unnecessary Windows env var dumps for non-native mode, which is
  an impossible combination
- Clean-up job schedule job manager task env vars
- Simplify multi-instance env var handling
- Resolves #234
2018-07-28 16:15:28 -07:00
Fred Park d659c34272
Fix metadata dump with correct check logic 2018-07-02 20:05:45 -07:00
Fred Park 5952fab4a9
Fix non-retry of task add collection
- For cases where the slice is too big
- This is a regression from #188
2018-06-29 11:16:30 -07:00
Fred Park 166adfa47a
Add filters on pool nodes list
- Allow filtering of start task failed and unusable states
- On pool allocation use this filtering to scope down "bad" nodes
2018-06-29 10:32:00 -07:00
Fred Park 0c3d492f1a
Various fixes and updates
- Dump node listing on unusable
- Update Gluster to 4.1 on Ubuntu/Debian/RemoteFS
- Update Python to 3.6.6 in Windows Docker images
- Update dependencies
- Minor doc updates
- Fix appveyor sha256 artifact upload
2018-06-29 09:32:10 -07:00
Fred Park 03aad7dd94
Fix tfm and rjm issues
- Regression in task file mover task command generator from Windows
  support refactor not properly replacing keyword arg
- Resolves #29
- Unpickling of tasks with dependencies for recurring job manager fails due
  to an unintended pickling of an object from yaml that was not represented
  as a Python list
- Resolves #221
2018-06-27 20:33:03 -07:00
Fred Park ea27e9e8bd
Support XFS filesystems in storage clusters
- Allow mdadm-based RAID-0 arrays to expand (experimental)
- Greatly expand remote fs guide with configuration/usage explanations
- Fix blocking bug with fs commands
- Resolves #219
2018-06-27 08:57:27 -07:00
Fred Park 534318e2a7
Auto upload Batch node logs on unusable
- Resolves #216
- Add generate sas option for diag logs upload command
- Allow multiple poolid for storage clear and del
- Add diagnostics log option to storage clear and del
- Allow storage sas create command to create container and share level
  SAS tokens
2018-06-25 13:04:03 -07:00