Commit graph

57 commits

Author SHA1 Message Date
Fred Park a54f872326
Add ignore GPU warnings option 2019-12-12 20:42:47 +00:00
Fred Park 990345f84f
Allow specifying the autoscratch task id
- Update docs regarding pool id naming requirements
2019-11-27 18:33:29 +00:00
Fred Park d57e567512
Improve per-job autoscratch setup
- Provide an option for setup type
- Allow explicit setting of the number of instances for the
autoscratch volume
- Fix autopool and autoscratch interaction
2019-11-26 21:02:56 +00:00
Fred Park bc4a47d88d
Enhanced support for autogen task ids
- Add support to override global default at a per-job level and at a per
task factory level
- Resolves #324
2019-11-15 19:09:38 +00:00
Fred Park 134262158b
Support Singularity image encryption
- Modify singularity_images global resources to support encryption
options
- Automatically bind certificates to encrypted containers when a task
executes
2019-11-13 03:15:05 +00:00
Fred Park 826c46afe2
Bring your own Public IP support 2019-08-14 03:23:09 +00:00
Fred Park 3052e98c8b
Add MVAPICH support
- More changes for #287
- Automatically source environment modules if it exists
- Fix some typos
2019-08-12 01:58:39 +00:00
Fred Park b6044b3489
Update GPU support
- Update to Docker CE 19.03.1
- Use "native" Docker/containerd GPU support
- Breaking change in jobs configuration to allow arbitrary configuration
- Update docs
- Resolves #293
2019-08-08 20:36:41 +00:00
Fred Park 4d69c96d79
Merge branch 'sriov-merge' into singularity3 2019-07-23 21:02:52 +00:00
Vincent Labonté cc42916cba
Fixes and update of recipes (#290)
* Fix multi-instance tasks that are not an MPI task

* Add setup task script for CNTK-CPU-Infiniband-IntelMPI

* Update CNTK-CPU-Infiniband-IntelMPI recipe

* Add MPI executable path option

* Update CNTK-CPU-OpenMPI recipe

* Change the default MPI executable_path to mpirun

* Modify CNTK-CPU-Infiniband-IntelMPI recipe

* Add setup task script for CNTK-GPU-Infiniband-IntelMPI

* Update CNTK-GPU-Infiniband-IntelMPI recipe

* Add setup task script for CNTK-GPU-OpenMPI

* Add setup task script for NAMD-Infiniband-IntelMPI

* Update NAMD-Infiniband-IntelMPI recipe

* Add setup task script for OpenFOAM-Infiniband-IntelMPI

* Update OpenFOAM-Infiniband-IntelMPI recipe

* Update TensorFlow-GPU Singularity recipe

* Add setup task script for OpenFOAM-TCP-OpenMPI

* Update OpenFOAM-TCP-OpenMPI recipe

* Add support for arbitrary commands with the MPI processes_per_node option

* Fix MPI with native images

* Modify CNTK-CPU-Infiniband-IntelMPI recipe

* Modify CNTK-GPU-Infiniband-IntelMPI recipe

* Modify NAMD-Infiniband-IntelMPI recipe

* Update processes_per_node documentation

* Fix `pool images list` with Singularity images

* Modify OpenFOAM-Infiniband-IntelMPI set up script

* Add check for mpi setting with Windows

* Add auto scratch support with OpenFOAM-Infiniband-IntelMPI recipe

* Modify OpenFOAM-TCP-OpenMPI set up script

* Add auto scratch support with OpenFOAM-TCP-OpenMPI recipe

* Add mpiBench-IntelMPI recipe

* Add mpiBench-MPICH recipe

* Add mpiBench-OpenMPI recipe

* Resolve PR comments

* Resolve PR comments
2019-07-17 18:57:06 -07:00
Fred Park 25fec92273
Support Hc/Hb
- Support RDMA bifurcation
- Update platform docs for CentOS-HPC 7.6
2019-07-15 03:32:04 +00:00
Fred Park 559463cd12
Merge branch 'develop' into sriov-merge 2019-07-09 21:45:31 +00:00
Vincent Labonté 442a22bd28
Improve MPI Interface for Singularity and Docker (#289)
* Add MPI config support for MPICH

* Add MPI config support for Docker containers

* Resolve PR comments

* Make use of the script runner with MPI and Docker

* Minor fixes

* Resolve PR comments
2019-07-09 13:46:12 -07:00
Vincent Labonté e6e60048a7
Improve MPI Interface for Intel MPI and Open MPI with Singularity images (#288)
* Add MPI config support for IntelMPI

* Separate prologue command into user and system

* Add MpiSettings

* Add MPI config support for Open MPI

* Fix MPI config support for IntelMPI

* Workaround for Open MPI btl tcp

* Correct documentation

* Fix non-MPI multi-instance execution

* Resolve PR comments

* Resolve PR comments

* Partially address #287
2019-07-03 12:40:54 -07:00
Fred Park b93f60213d
Support conditional output data
- Resolves #230
2019-06-24 18:03:43 +00:00
Fred Park 7b138e785a
Support user-specified job prep/release tasks
- Host mode only
- Resolves #202
2019-06-24 16:02:30 +00:00
Vincent Labonté 305d376cdc
Support Singularity signed image verification (#280)
* Create one log file per container mode

* Make singularity 3 work

* Minor fixes

* Fix cascade with docker image and singularity image

* Add capability to pull from library://

* Add singularity signed images to config file

* Add singularity signed images to the global resource table

* Pull and verify signed singularity images

* Put the singularity sypgp directory in the mount directory

* Add ability to provide key file to verify a singularity image

* Resolve PR comments

* Fix Singularity registry credentials
2019-06-05 11:14:37 -07:00
Vincent Labonté a68579c095
Prepare for Singularity3 work (#276)
* Remove torrent functionality

* Remove torrent storage

* Fix singularity permissions

* Add container mode in cascade.py

* Fix errors

* Fix PR comments

* Fix flake8 errors
2019-05-22 14:46:44 -07:00
Fred Park c39a919bf7
Add more Slurm partition settings
- Partition preemption settings
- Partition other options
2019-04-02 15:59:36 -07:00
Fred Park 314037f76f
Slurm on Batch feature
- Package and use Slurm 18.08 instead of default from distro repo
- Slurm "master" contains separate controller and login nodes
- Integrate RemoteFS shared file system into Slurm cluster
- Auto feature tagging on Slurm nodes
- Support CentOS 7, Ubuntu 16.04, Ubuntu 18.04 Batch pools as Slurm
  node targets
- Unify login and Batch pools on cluster user based on login user
- Auto provision passwordless SSH user on compute nodes with login user
  context
- Add slurm cluster commands, including orchestrate command
- Add separate SSH for controller, login, nodes
- Add Slurm configuration doc
- Add Slurm guide
- Add Slurm recipe
- Update usage doc
- Remove deprecated MSI VM extension from monitoring and federation
- Fix pool nodes count on non-existent pool
- Refactor SSH info to allow offsets
- Add fs cluster orchestrate command
2019-02-28 12:11:10 -08:00
Fred Park 775f602e5a
Add Batch Insights integration
- Resolves #259
2019-01-16 13:03:29 -08:00
Fred Park e210e6032f
Re-organize environment variables on node prep
- Resolve #252
2019-01-16 13:03:28 -08:00
jackpimbert 30ec7ad03b
Add custom environment variable options to pool configuration (#253)
* Add env vars option to pool configuration

- Add the option to include custom environment variables in the pool
  configuration
- This allows users to set up Batch environment variables for the start
  task
- This in turn could allow for the ability to use batch insights

* Add env var spec to pool schema for validation

- CHANGELOG updated with pool env var feature

* Add a keyvault option for pool env vars

- Added a keyvault option to the pool configuration
- Updated docs, schemas and templates accordingly
- Changed schema type from `str` to `text` for env vars

* Add keyvault client check

- Add a keyvault client check to ensure a valid client is used when
  using keyvault to add env vars to the start task.
2019-01-16 13:03:28 -08:00
Fred Park 70532fa4ae
Add Genomics recipes
- BLAST and RNASeq pipelines
- Fix adding tasks to an existing job with existing merge tasks
- Add support for force_enable_task_dependencies at the job level
- Fix doc typos
2018-11-29 08:58:01 -08:00
Fred Park 95dec309fe
Add support for standard and ultra SSDs
- Breaking change on premium property in managed disks
- Add availability zone support
2018-11-06 10:00:25 -08:00
Fred Park 54c78b7c49
Auto scratch support 2018-11-05 11:24:22 -08:00
Fred Park 02c6e110d7
Kata containers support
- Make Singularity runtime install optional
- Add `restrict_default_bind_mounts` option to jobs spec
- Provide a default container runtime option
2018-11-05 11:24:17 -08:00
Fred Park 52628d27cf
Federation support
- Federation proxy lifecycle management
- Federation lifecycle management
- Federation job submission and management
- Mount Azure File share for auto-rotated log persistence
- FIFO within job support
- Constraint matching
- Federations can be created in "unique job id" mode requiring all
  submitted jobs via fed jobs add be unique across the entire federation
- Supports nearly 15K actions per job (in non-unique job id mode)
- Task dependency rewrite engine for federated jobs
  - Verify dependencies only within task group
  - Uniquely identify task dependencies
- Allow tuning of scheduling behavior options
- Package federation logic on proxy into Docker container
- Full guide/walkthrough for federation feature
- Refactor common code between monitor/fed proxy into resource
- Other doc updates
2018-08-06 09:30:36 -07:00
Fred Park e069e72564
Support Docker image preload delay
- This option is only available for Linux non-native pools. All other
  pool types ignore this option.
2018-07-28 18:44:57 -07:00
Fred Park ea27e9e8bd
Support XFS filesystems in storage clusters
- Allow mdadm-based RAID-0 arrays to expand (experimental)
- Greatly expand remote fs guide with configuration/usage explanations
- Fix blocking bug with fs commands
- Resolves #219
2018-06-27 08:57:27 -07:00
Fred Park 3f30ba8d07
Support a fallback registry for system images
- Resolves #217
- Add misc mirror-images command
- Pass Singularity version to bootstrap
- Fix GlusterFS on compute provisioning, resolves #220
2018-06-26 12:22:09 -07:00
Fred Park 534318e2a7
Auto upload Batch node logs on unusable
- Resolves #216
- Add generate sas option for diag logs upload command
- Allow multiple poolid for storage clear and del
- Add diagnostics log option to storage clear and del
- Allow storage sas create command to create container and share level
  SAS tokens
2018-06-25 13:04:03 -07:00
Fred Park 7d48249864
Allow docker access with Batch SSH users
- Allow Docker daemon access with Batch SSH users with configuration
  enablement
- Resolves #206
2018-06-11 14:27:26 -07:00
Fred Park 612e2a50e5
Move Prometheus/Grafana config to separate file
- Move grafana admin login info to credentials
- Update documentation for Prometheus/Grafana integration
- Resolves #205
2018-06-08 16:42:52 -07:00
Fred Park cf0797790f
Support max increment of VMs in scenario autoscale
- Allow definition of weekdays/workhours
- Resolves #210
2018-06-08 07:20:41 -07:00
Fred Park 9f61db12c3
Autoprovision Grafana Dashboard
- Add default dashboard
- Allow arbitrary provisioning of additional dashboards
- Add monitor list command
- Add RemoteFS monitoring support
- Compact cadvisor
2018-06-07 10:50:37 -07:00
Fred Park b77a147766
Continue Prometheus integration support
- Add nginx reverse proxy and letsencrypt cert support
- Add let's encrypt options
- Add picket into compose
- crontab cert renewal
- Add inbound rule management for temporary ACME challenge on port 80
- Update to node exporter 0.16
- Fixup various issues
2018-06-04 09:04:30 -07:00
Fred Park ae86b92be2
Start Prometheus monitoring integration
- Refactor package uploader for pool
- Auto install node exporter and cadvisor for prom enabled pools
- Add configuration
- Create monitoring resource
- Start work on picket monitor
2018-06-04 08:56:44 -07:00
Fred Park 69235ba2c1
Add default_working_dir option
- Clarify where a container runs by default in the jobs config doc
- Resolves #190
2018-04-25 08:23:17 -07:00
Fred Park 72484b510b
Add product_iterables support for task factory
- Resolves #187
2018-04-20 11:21:34 -07:00
Fred Park 97d9ca09ce
Relax zip iterable type
- Partially addresses #187
2018-04-19 14:34:57 -07:00
Fred Park 350f1185d9
Allow AAD on storage credentials
- Resolves #179
2018-04-18 08:09:13 -07:00
Fred Park e1ae45b9d2
Add support for task exit conditions at job-level
- Fix incorrectly overwriting the job action setting
2018-03-16 09:25:35 -07:00
Fred Park 2a17ef0690
Add certificate reference support on pools 2018-03-14 14:01:29 -07:00
Fred Park b9a55efa02
Add remote access control settings 2018-03-14 12:21:13 -07:00
Fred Park 088f2d5e34
Add support for arbitrary exports config for NFS
- Add user agent for ARM clients
- Update README
2018-03-13 15:03:31 -07:00
Fred Park e79baee8f9
Add support for AHUB 2018-02-28 15:08:28 -08:00
Fred Park 5c0eb71156
Support default task exit conditions
- Add more info for list operations
2018-02-27 08:15:25 -08:00
Fred Park 850b54936f
Fix AAD support for non-public Azure cloud regions
- Add authority_url option
- Update default management endpoint
2018-02-16 09:21:00 -08:00
Fred Park ae20b27643
Add Custom Linux Mount support
- Add pre/post support for additional node prep commands
2018-02-12 11:20:34 -08:00