Commit Graph

29 Commits

Author SHA1 Message Date
Fred Park 1b741a74c5
Update dependencies 2019-11-15 23:23:03 +00:00
Fred Park 43fda94278
Update drivers and dependencies
- Docker CE 19.03.2
- blobxfer to 1.9.2
- NC/ND driver to 418.87.00
2019-09-11 17:51:13 +00:00
Fred Park 826c46afe2
Bring your own Public IP support 2019-08-14 03:23:09 +00:00
Fred Park e9130f83f4
MCR migration
- Migrate images to Microsoft Container Registry
- Fix Shellcheck issues
- Related to #278
2019-08-14 03:23:03 +00:00
Fred Park 290209381e
Update Dependencies
- Update NVIDIA compute driver to 418.67
- Update NVIDIA grid driver to 430.30
- Update Batch Insights to 1.3.0
- Update blobxfer to 1.9.0
- Update Python dependencies
- Drop Python 3.4 support
2019-08-12 20:42:32 +00:00
Fred Park 00f1c95b1d
Update Dockerfiles to Alpine 3.10 2019-08-08 20:11:33 +00:00
Fred Park 4b9a004f1a
Update to Batch 7.0.0 SDK
- Breaking change: pool listskus -> account images
- Support setting working directory for native mode
- Resolves #286
2019-06-27 20:08:49 +00:00
Fred Park ec7af5b7c1
Various updates
- Update docs
- Update azure-batch dependency
- Set Slurm scheduling option defer mode
2019-03-22 14:50:26 -07:00
Fred Park 81a260f0bd
Update to Alpine 3.9
- Fix some slurm deps
2019-02-28 12:11:19 -08:00
Fred Park 314037f76f
Slurm on Batch feature
- Package and use Slurm 18.08 instead of default from distro repo
- Slurm "master" contains separate controller and login nodes
- Integrate RemoteFS shared file system into Slurm cluster
- Auto feature tagging on Slurm nodes
- Support CentOS 7, Ubuntu 16.04, Ubuntu 18.04 Batch pools as Slurm
  node targets
- Unify login and Batch pools on cluster user based on login user
- Auto provision passwordless SSH user on compute nodes with login user
  context
- Add slurm cluster commands, including orchestrate command
- Add separate SSH for controller, login, nodes
- Add Slurm configuration doc
- Add Slurm guide
- Add Slurm recipe
- Update usage doc
- Remove deprecated MSI VM extension from monitoring and federation
- Fix pool nodes count on non-existent pool
- Refactor SSH info to allow offsets
- Add fs cluster orchestrate command
2019-02-28 12:11:10 -08:00
Fred Park a30cb674ca
Migrate to Azure Batch Python SDK 6.0.0
- Fix breaking changes
- Update dependencies
- Gate some debug messages behind the verbose flag
2019-01-16 13:03:30 -08:00
Fred Park 70f2c80de0
Update Dockerfiles 2018-12-03 10:59:38 -08:00
Fred Park eea4286724
Update dependencies
- NC/ND driver to 410.79
- NV Grid driver to 410.71 with CUDA10 support
- LIS
2018-12-03 09:03:49 -08:00
Fred Park 2519f3cedd
Update dependencies 2018-11-19 11:20:08 -08:00
Fred Park 342b7fc2e2
Fix various issues
- Monitoring SSH login
- Grafana update regression with Batch Shipyard Dashboard
- Federation job submission
2018-11-06 14:21:03 -08:00
Fred Park 62e8ebcac1
Update dependencies
- Update blobxfer to 1.5.4
- Resolves #243
2018-11-05 11:24:08 -08:00
Fred Park ab9cc70828
Update build to Python 3.7.1
- Update Windows Docker images to Python 3.7.1
- Fix flake8 errors
- Fix shellcheck errors
- Various build updates and fixes
2018-10-30 14:24:31 -07:00
Fred Park 584dada9f8
Update Singularity and Alpine
- Update to 3.8, rebuild 3.7 due to CVE
- Update Singularity to 2.6.0
2018-09-20 08:48:15 -07:00
Fred Park 06ab86c655
Update various components
- Update Nvidia Tesla driver to 396.44 for NC
- Update LIS to 4.2.6
- Update prometheus and grafana
2018-09-18 13:56:27 -07:00
Fred Park 96c220df34
Update to Azure Batch 5.1.0 SDK
- Accommodate breaking changes
- Add compute node agent info
2018-09-18 13:56:26 -07:00
Fred Park 1d666ae6aa
Update dependencies
- Update blobxfer to 1.5.0
2018-09-18 13:56:26 -07:00
Fred Park acdea94722
Tag for 3.6.0a1 release 2018-08-06 10:35:31 -07:00
Fred Park 52628d27cf
Federation support
- Federation proxy lifecycle management
- Federation lifecycle management
- Federation job submission and management
- Mount Azure File share for auto-rotated log persistence
- FIFO within job support
- Constraint matching
- Federations can be created in "unique job id" mode requiring all
  submitted jobs via fed jobs add be unique across the entire federation
- Supports nearly 15K actions per job (in non-unique job id mode)
- Task dependency rewrite engine for federated jobs
  - Verify dependencies only within task group
  - Uniquely identify task dependencies
- Allow tuning of scheduling behavior options
- Package federation logic on proxy into Docker container
- Full guide/walkthrough for federation feature
- Refactor common code between monitor/fed proxy into resource
- Other doc updates
2018-08-06 09:30:36 -07:00
Fred Park 7060366213
Update dependencies 2018-07-17 11:18:20 -07:00
Fred Park ea1341c1bd
Update drivers and monitoring components
- NC/ND update to 396.37
- NV update to 390.75
- LIS update to 4.2.5-2
- Prometheus to 2.3.2
- Grafana to 5.2.1
2018-07-17 11:02:39 -07:00
Fred Park 66e77ac397
Update dependencies
- Fixes in scripts for cascade and monitor cert renewal
2018-06-27 15:32:22 -07:00
Fred Park c98cf320ec
Fix various Let's Encrypt non-staging issues
- Prod ACME challenges were failing due to improper stage check cleanup
- Fix nginx and compose configuration errors for prod certs
- Migrate Docker install on Ubuntu 18.04 to stable channel
- Update LIS
2018-06-27 11:20:50 -07:00
Fred Park 2336a63990
Add GitHub issue templates
- Update AppVeyor build
- Update requirements
- Minor doc updates
2018-06-18 10:12:23 -07:00
Fred Park 9f61db12c3
Autoprovision Grafana Dashboard
- Add default dashboard
- Allow arbitrary provisioning of additional dashboards
- Add monitor list command
- Add RemoteFS monitoring support
- Compact cadvisor
2018-06-07 10:50:37 -07:00