Commit Graph

495 Commits

Author SHA1 Message Date
Fred Park 1a17391105 Remove Ubuntu 16.04-LTS latest redirect
- Update to blobxfer 1.0.0rc2
2017-10-16 13:03:54 -07:00
Fred Park 607bfd252e Migrate to storage split library
- Remove queue deletion code
- Resolves #133
2017-10-05 21:40:50 -07:00
Fred Park 7c87b043a0 Various updates
- Normalize gluster on compute mount path
- Fix gluster on compute data ingress
- Lock blobxfer version to 1.0.0rc1
2017-10-05 13:02:30 -07:00
Fred Park 298b00d946 Mount Azure file shares to host (#123)
- Allow multiple file shares per pool
- Move root mount point for all shared data volumes
2017-10-04 17:59:30 -07:00
Fred Park fc459e7333 Doc fixes
- Raise exception when docker_registry is detected in the global config
2017-10-04 13:31:48 -07:00
Fred Park 01995e97b6 Tag for 3.0.0a1 release 2017-10-04 09:26:01 -07:00
Fred Park 49a374416f Add unusable node recovery option 2017-10-04 09:25:57 -07:00
Fred Park 796a5e33b4 Combine rjm/tfm to cargo (#125) 2017-10-03 18:24:50 -07:00
Fred Park 6315be3a6b Transition to blobxfer 1.x command structure
- Data ingress/egress changes
- Task factory file changes
- Resolves #47
2017-10-03 18:24:49 -07:00
Fred Park e783744e00 Container registry logic overhaul
- Remove private registry back to Azure storage blob support (#44)
- Require fully qualified Docker image names (#106)
- Support multiple public/private registries on a single pool (#127)
2017-10-03 18:24:42 -07:00
Fred Park dcddb04150 Fixes and log more info in start task 2017-10-03 10:05:17 -07:00
Fred Park cbddcdfbff Use docker_image in favor of image in tasks 2017-10-03 10:05:17 -07:00
Fred Park 60b4fc446f Support ARM Images for custom images (#126) 2017-10-03 10:05:17 -07:00
Fred Park 238982db77 Add ARM VNet support in Batch service mode (#126)
- Support "global" aad property in credentials
- Add Virtual Network guide
2017-10-03 10:05:17 -07:00
Fred Park a45f3b9a66 Multi-instance fixes for native 2017-10-03 10:04:03 -07:00
Fred Park afdad167a8 Add native support to custom images
- Update TSG doc for native container support
2017-10-03 10:04:03 -07:00
Fred Park 6aac272e9e Autoenable ib/gpu on tasks if settings allow
- Resolves #124
2017-10-03 10:04:03 -07:00
Fred Park e2ddf3b750 Add YAML configuration support
- Resolves #122
2017-10-03 10:04:03 -07:00
Fred Park 9c700dfbd5 Fix jrtask and suppress CUDA vars for native
- Doc updates
- Suppress coordination/task commands that are empty
2017-10-03 10:03:20 -07:00
Fred Park f1775c3ad7 Native container job schedule support 2017-10-03 10:03:20 -07:00
Fred Park cccc5fa3b8 output_data to storage blob support for native 2017-10-03 10:03:20 -07:00
Fred Park 0a9689f5a8 Native container support
- Allow pool conversion with native flag
2017-10-03 10:03:20 -07:00
Fred Park aed9f8145b Add test cluster support 2017-10-03 10:03:20 -07:00
Fred Park 75157e91f6 Add Read the Docs build
- Tag for 2.9.6 release (mainly to generate a 2.9.x RTD version)
2017-10-03 09:53:32 -07:00
Fred Park bb3b4d6a41 Fix RemoteFS disk model create option type 2017-09-25 12:45:23 -07:00
Fred Park 53477a720a Tag for 2.9.5 release
- Add --all-starting for pool delnode
2017-09-24 19:50:14 -07:00
Fred Park c13793dd57 Support version in platform image
- Override UbuntuServer 16.04-LTS latest to prior version due to
  linux-azure kernel issues
2017-09-22 19:41:33 -07:00
Fred Park 093cfdbc83 Prevent invalid mix of HPC offer and non-RDMA VM
- Fix unusable nodes on allocation exception in pool stats
- Expand network tuning exemptions
2017-09-22 08:27:16 -07:00
Fred Park 260e1609ee Fix various OS, Docker and nvidia issues
- Update Docker CE versions for Ubuntu and CentOS
- Update NC driver
- Add special nvidia install path for CentOS 7.3 during 7.4 rollout
2017-09-21 22:13:33 -07:00
Fred Park 5621ec5e3e Fix regression in ssh private key check on Windows 2017-09-21 13:11:26 -07:00
Fred Park 7e1c4c7e75 NV-series driver updates
- Resolves #119
2017-09-13 09:19:06 -07:00
Fred Park 9602608871 Tag for 2.9.4 release 2017-09-12 08:56:03 -07:00
Fred Park 1b0e1c449d Refactor SSH to common path
- Check for SSH private key filemode
- Resolves #116
2017-09-07 08:02:40 -07:00
hieuhc 28bfda3278 Replace clear() method when invoking pool udi with ssh (#118) 2017-09-07 09:18:46 +01:00
Fred Park 1d5cdcbbdf Tag for 2.9.3 release 2017-08-29 08:05:34 -07:00
Fred Park 010b160e43 Provide warning and note on job migration 2017-08-17 13:08:31 -07:00
Fred Park 5f393beb1d Disallow resize_timeout in AS-enabled pools 2017-08-17 07:06:03 -07:00
Fred Park 9d66d75b62 Tag for 2.9.2 release
- Attempt another fix at site extension upgrade (#113)
2017-08-16 08:59:53 -07:00
Fred Park f91107e89d Tag for 2.9.1 release 2017-08-16 08:33:19 -07:00
Fred Park 9e3308ff2b Fix issues in RemoteFS
- Public ip not being assigned for resize command
- Need to wait for block device to show up for attached disks (expand
  command)
- Add more logging
2017-08-16 08:14:08 -07:00
Fred Park 71706dec6e Fail faster for storage issues in remotefs 2017-08-15 19:40:41 -07:00
Fred Park 3f244d0e4b Fix ssh private key issue in RemoteFS 2017-08-15 16:39:52 -07:00
Fred Park b16685348d Tag for 2.9.0 release 2017-08-15 13:28:55 -07:00
Fred Park 466c7d4a3b Perform client checks 2017-08-15 13:26:59 -07:00
Fred Park e434b83cb3 Fix truncated P50 provisioning
- Support "s_v3" suffixed premium VM SKUs
2017-08-15 13:26:31 -07:00
Fred Park 7a815dff6f Minor updates and fixes 2017-08-14 09:07:07 -07:00
Fred Park 1e4cd777be Fix division by zero in pool stats
- Fix flake8 issues
2017-08-10 15:08:39 -07:00
Fred Park 745082029f Misc doc updates
- Update requests
- Check task id length
- Drop Python 3.3 support due to cryptography
2017-08-10 08:40:07 -07:00
Fred Park 284c4d9c23 Tag for 2.9.0rc1 release 2017-08-09 08:57:33 -07:00
Fred Park 4573180293 Validate and prompt certain job schedule adds 2017-08-09 08:39:26 -07:00
Fred Park 44a1f14b31 Add monitor_task_completion for recurring jobs 2017-08-09 07:57:56 -07:00
Fred Park 5ae9001716 Add job schedule support to commands
- Resolves #19
2017-08-08 15:02:53 -07:00
Fred Park 9add2444ec Change autogen task id property to complex
- Update job recurrence docs
2017-08-08 08:45:15 -07:00
Fred Park be530e63c0 Job recurrence support 2017-08-07 19:42:09 -07:00
Fred Park 99e72c0c3f Add custom task factory support (#93) 2017-08-07 10:38:08 -07:00
Fred Park 754b5ee5a6 Tag for 2.9.0b2 release 2017-08-04 14:51:27 -07:00
Fred Park 8a396f0e18 Add pool and jobs stats
- Resolves #110
2017-08-04 14:47:16 -07:00
Fred Park c5fa85adcb Add file task factory (#93)
- Split out task factory settings into separate file
- Change uniform to be a, b instead of min, max
- Update blobxfer script for single target ingress to place file
  directly to destination
2017-08-04 11:02:33 -07:00
Fred Park 1650ce4a95 Add random task factory (#93) 2017-08-03 20:10:56 -07:00
Fred Park e5ffd492ab Update CNTK CPU infiniband recipe to 2.1 2017-08-03 16:28:23 -07:00
Fred Park 4d09a09a80 Add --all-unused to pool delnode 2017-08-03 10:41:38 -07:00
Fred Park d539e2923a Fix pool udi terminal mangling 2017-08-03 08:52:20 -07:00
Fred Park 9eb8fd4c55 Support CentOS-HPC 7.3
- Update misc tensorboard to latest
- Fix term tasks in disable jobs
- Update NVIDIA driver
- Doc updates
2017-08-02 15:12:45 -07:00
Fred Park bab8628ed5 Tag for 2.9.0b1 release 2017-07-31 15:05:58 -07:00
Fred Park ed8ca2d225 Add autogen task id setting 2017-07-31 13:40:18 -07:00
Fred Park 196a36336e Add rebalance based on preempted node count 2017-07-31 13:40:15 -07:00
Fred Park 4105acc2f8 Add task factory (parameter sweep) support
- Resolves #93
2017-07-28 14:36:42 -07:00
Fred Park 23a753a110 RemoteFS fixes 2017-07-27 08:12:34 -07:00
Fred Park 7a9177b16b Fix pool deletion with poolid arg 2017-07-26 10:38:29 -07:00
Fred Park 5fef683af4 Universally increase SAS expiry time 2017-07-21 13:28:56 -07:00
Fred Park e32fc4d93e Add Autopool support
- Resolves #33
- Add --poolid to storage clear and storage del
- jobs del and jobs term now cleanup storage data if autopool is
  detected
2017-07-21 11:10:03 -07:00
Fred Park 30ea8c280f Add autoscale guide 2017-07-21 11:10:03 -07:00
Fred Park 7ba85e7496 Add job migration support
- Add enable/disable job support too
- Resolves #108
2017-07-21 11:10:03 -07:00
Fred Park 3b65ba684f Support job priorities
- Resolves #109
2017-07-21 11:10:03 -07:00
Fred Park 23e9584852 Add compute node fill type support
- Resolves #107
2017-07-21 11:10:03 -07:00
Fred Park 82a46a615a Basic Autoscale functionality
- Allow pools to be added with zero target nodes
- Add pool autoscale commands
2017-07-21 11:10:03 -07:00
Fred Park 5291ff1130 Move to blob leasing for download ticketing
- Greatly increase resource file SAS expiry timedelta
- Make concurrent_source_downloads generic, remove non-p2p option
- Update Dockerfiles
- Update to latest azure-storage
2017-07-21 11:10:03 -07:00
Fred Park d197c9be28 Minor fixups 2017-07-07 09:07:40 -07:00
Fred Park 03fe791171 Tag for 2.8.0 release 2017-07-06 11:12:24 -07:00
Fred Park 8eb2197d23 Allow CentOS 7.3 on NC/NV 2017-07-06 11:12:05 -07:00
Fred Park de45b18a67 Add backoff to cascade docker image pull retries 2017-07-01 01:25:30 -07:00
Fred Park 2a48885da1 More improvements for scale out robustness
- Add --all-start-task-failed to delnode
- Reduce node output on pool allocation wait with number of nodes > 10
2017-06-30 23:50:21 -07:00
Fred Park 06188c1944 Tag for 2.8.0rc2 release
- Fix regression with private docker image pulls
- Resolves #103
- Resolves #105
2017-06-30 11:45:26 -07:00
Fred Park ade6a27b60 Tag for 2.8.0rc1 release 2017-06-27 11:38:36 -07:00
Fred Park 54422ce2eb Add retry handling for cascade docker pull
- Add cascade.log download for start up failures
2017-06-27 09:28:03 -07:00
Fred Park cefa72e443 Add version metadata to pool and jobs
- Resolves #89
2017-06-26 13:20:49 -07:00
Fred Park a61449ec9c Fix tensorboard command with custom image changes
- Fix ref during exception handling for invalid platform image
- Remove max size note for remote fs managed disks
2017-06-26 10:51:53 -07:00
Fred Park 35c9779d68 Fix job auto_complete overwrite of job properties
- Resolves #97
2017-06-09 11:34:26 -07:00
Fred Park 887c597fab Tag for 2.8.0b1 release 2017-06-07 08:30:04 -07:00
Fred Park a41713c5ee Add custom image guide
- Update recipes for vm_configuration
- Fix some issues with platform pools with new changes
2017-06-06 12:41:42 -07:00
Fred Park 8397b411c5 Initial custom image support 2017-06-06 08:43:33 -07:00
Fred Park 549f50aac5 Tag for 2.7.0 release 2017-05-31 07:43:14 -07:00
Fred Park 004413e36e Fix pool udi with no logins/encryption over SSH
- Resolves #92
2017-05-28 15:18:32 -07:00
Fred Park ec986323fb Duplicate volume checks between job and task 2017-05-28 15:18:32 -07:00
Fred Park 3d5958195c Remove print statements 2017-05-24 13:43:20 -07:00
Fred Park 3a5fb452d5 Various fixes
- Add poolid param for pool del
- Fix vm_count deprecation check on fs actions
- Improve robustness of package index updates
- Prompt for jobs cmi action
- Update to latest dependencies
2017-05-24 09:54:09 -07:00
Fred Park 06fd4f8e62 Tag for 2.7.0rc1 release 2017-05-24 07:43:19 -07:00
Fred Park 4514920b6a Add pool listimages command
- Resolves #60
- Fix some resize/wait issues
2017-05-23 14:14:29 -07:00
Fred Park d80d938063 More inheritable job to task properties
- Add max_wall_time property
- Resolves #69
2017-05-23 09:29:00 -07:00
Fred Park 5a82cc79f8 Add --tty option to ssh commands 2017-05-22 20:03:35 -07:00
Fred Park a623e5b26f Add list tasks poll option
- Resolves #77
- Add deprecation path for multi-instance pool spec vm count
- Fix outdated recipe doc for multi-instance pool spec
- Cache last task id to speed up task collection adds
2017-05-22 19:45:25 -07:00
Fred Park 199ac70e22 Tag for 2.7.0b2 release 2017-05-18 09:18:43 -07:00
Fred Park d2b066bf6d Add tasks via collection
- Resolves #86
2017-05-17 18:48:16 -07:00
Fred Park 644e86ddb6 Prevent glusterfs on compute and max tasks > 1 2017-05-17 15:03:30 -07:00
Fred Park c3a72fa4e3 Allow workdir to be set
- Resolves #87
2017-05-17 09:28:19 -07:00
Fred Park 11c5dc700b Improve pool resize logic with mixed nodes 2017-05-16 09:49:01 -07:00
Fred Park d9304794bc Add deprecation path for vm_count change
- Resolves #84
2017-05-16 09:43:58 -07:00
Fred Park b09a37f22f Add preempted state checking 2017-05-15 08:08:30 -07:00
Fred Park fd6c45505e Update recipes for vm_count change 2017-05-12 19:23:29 -07:00
Fred Park 7ed7429a24 Add Low Priority Batch VM support
- Resolves #82
- Resolves #83
2017-05-12 14:42:55 -07:00
Fred Park a17d6b64c9 Update dependencies
- Fix breaking changes in keyvault library
- Fix inverted order for fs cluster ssh and optional command
2017-05-11 09:21:20 -07:00
Fred Park 70f7317c13 Add --clear-tables option to storage del
- Update limitations doc
- Resolves #80
2017-05-08 09:07:09 -07:00
Fred Park 1050c5da0e Tag for 2.6.2 release 2017-05-05 08:42:53 -07:00
Fred Park 983a7eed45 Node prep script improvements
- Blacklist nouveau universally on GPU VMs
- Change URL retrieval to requests
- Update requirements to latest
2017-05-05 08:42:19 -07:00
Fred Park 2d53e411e4 Add develop branch Dockerfile for hub
- Allow NVIDIA license agreement to be auto-confirmed with -y
2017-05-04 09:03:09 -07:00
Fred Park b3f959801c Fix docker login for missing images
- Resolves #66
2017-05-02 20:11:24 -07:00
Fred Park 2e62f12729 Fix misc tensorboard default image issue 2017-05-02 13:16:57 -07:00
Fred Park ad423c0e3d Tag for 2.6.1 release
- Optimize some Batch calls
2017-05-01 22:17:43 -07:00
Fred Park 3d2c8cc191 Fix termtasks with disable 2017-05-01 18:47:41 -07:00
Fred Park f9912b7a52 Pool-level resource file support 2017-05-01 10:17:09 -07:00
Fred Park b559ba3fb5 Fix data ingress to single node on pool add 2017-05-01 08:38:45 -07:00
Fred Park aee7c2018b Batch exception handling fixes
- Add more tensorboard log switches for autodetection
2017-04-30 20:08:15 -07:00
Fred Park c8f7521196 Add COMMAND arg for ssh commands 2017-04-30 14:22:13 -07:00
Fred Park d638fe10a5 Tensorboard docker image auto-detect
- Fix jobs del --termtasks to disable job first
- Fix jobs listtasks and data listfiles to accept jobid not in jobs
  config
- Perform double port remapping to avoid conflicts in tensorboard
2017-04-29 23:56:53 -07:00
Fred Park fb978a4788 Update TensorFlow recipes to r1.1
- Remove custom build of TF for non-distributed mode
2017-04-28 23:13:04 -07:00
Fred Park 4c29dd22e6 Fix streaming off by one 2017-04-28 12:53:27 -07:00
Fred Park 0394cd9848 Add misc tensorboard command 2017-04-28 12:53:19 -07:00
Fred Park 6612150352 Catch ssh user add exception in pool add
- Update various docs
2017-04-26 11:09:56 -07:00
Fred Park e1452fc7c9 Update Azure site extension docs 2017-04-21 19:42:49 -07:00
Fred Park 12216930fe Tag for 2.6.0 release 2017-04-20 13:09:26 -07:00
Fred Park e997441169 Fix regression in data ingress 2017-04-19 08:32:28 -07:00
Fred Park b8d36a065a Refactor pool ssh settings 2017-04-18 18:46:57 -07:00
Fred Park 0161169daa Remove Windows checks for scp/ssh/openssl
- Update docs
- Remove unused vars in nodeprep script
2017-04-18 13:51:05 -07:00
Fred Park 77db9dbd82 Fix quotes in parameters
- Disallow newline character in smb password
2017-04-14 22:39:27 -07:00
Fred Park 7b99cf0b85 Modify glusterfs race fix with iptables
- Restrict smb account password from containing certain characters due
  to echo reinterpret issues
- Fix some more ssh/pathlib issues
2017-04-14 14:56:35 -07:00
Fred Park 469e5cb56f Tag for 2.6.0rc1 release
- Fix Docker setup issues
- Pin Docker release version
- Update NC nvidia driver
2017-04-14 12:54:43 -07:00
Fred Park 741a0bdd85 Add fault_domains property
- Add RemoteFS-GlusterFS+BatchPool recipe
- Various fixes
2017-04-14 08:14:13 -07:00
Fred Park 0d974fa0aa Add additional SSH options
- Fix samba to auto-restart
2017-04-13 09:31:35 -07:00
Fred Park 9b30c60b10 Tag for 2.6.0b3 release 2017-04-03 14:20:24 -07:00
Fred Park 96395fa68a Allow docker_images to be empty 2017-04-03 14:20:20 -07:00
Fred Park f61f91423e multi_instance_auto_complete -> auto_complete
- Resolves #61
2017-04-03 10:48:54 -07:00
Fred Park b01b835fa2 pool udi with 0 nodes returns warning
- Resolves #64
2017-04-03 08:50:39 -07:00
Fred Park 3ded07634e Fix some Python2 issues in remotefs
- Properly map the gluster volume mountpath and not the brick for SMB
2017-03-31 13:21:22 -07:00
Fred Park 5088c8f26e Fix pathlib and future compatibility issues 2017-03-31 09:55:34 -07:00
Fred Park d6e72ca22f Fix glusterfs_on_compute issues 2017-03-31 08:07:23 -07:00
Fred Park b426ce9c39 Add Samba NSG rules and stat 2017-03-30 19:48:17 -07:00
Fred Park 667d273c09 Move additional node prep commands as last set
- Resolves #63
2017-03-30 15:16:36 -07:00
Fred Park 130401af75 Add samba support on storage cluster nodes 2017-03-30 15:03:11 -07:00
Artem Sobolev bb14d2224e Fix ingress data call (#62) 2017-03-30 13:16:03 -07:00
Fred Park db16e4cb7e Allow public IP to be disabled
- Fix fs cluster status --detail
- Expand non-retry on async ops to include all 400-level status codes
2017-03-28 20:49:09 -07:00
Fred Park f4b08a9f77 Fix for multi-instance auto complete
- Also only read credentials json if valid
2017-03-24 14:54:19 -07:00
Fred Park 5879e48586 Remove batch credential for fs ops
- Add RemoteFS recipes
- Replace Batch Explorer with Batch Labs
2017-03-23 15:06:42 -07:00
Fred Park dd1b9f3de5 Tag for 2.6.0b2 release
- Resolves #59
2017-03-22 10:05:10 -07:00
Fred Park f952605e58 Fix server options arg parsing 2017-03-17 19:32:44 -07:00
Fred Park 8871b8697d Move storage container creation/deletion for fs
- Move storage container actions closer to create/delete for the cluster
  to reduce chance of storage container/blob orphaning
2017-03-17 15:48:20 -07:00
Fred Park 0f742a3cf3 Do not run udi when there are no nodes
- Resolves #58
2017-03-17 10:35:04 -07:00
Fred Park c15ea84840 Return helpful text about sc id not found 2017-03-16 19:20:35 -07:00
Fred Park 791c5726e0 Tag for 2.6.0b1 release 2017-03-16 17:53:15 -07:00
Fred Park b269ea7f06 Add multi-volume/server support 2017-03-16 15:18:29 -07:00
Fred Park 38ac358d9d Add pre-existing checks
- Switch to hostname peering in add brick (resize)
- Update docs regarding max_tasks_per_node
2017-03-16 08:58:53 -07:00
Fred Park 9dfccad392 Cap vm extension install to 1 attempt
- Fail async op at first chance to preserve traceback
2017-03-15 13:22:26 -07:00
Fred Park a95bc22e52 Add automatic retries for async ops in fs commands
- Single-source resource name generation
- Add --generate-from-prefix option for fs cluster del command
2017-03-15 11:30:01 -07:00
Fred Park 5325395522 Add glusterfs local mount option and NSG rule
- Add --hosts option for fs cluster status to print required hosts
  changes on the local machine to mount the remote fs
2017-03-14 22:07:51 -07:00
Fred Park ca2f9d73ab Add support for docker run uid/gid
- Resolves #54
2017-03-14 08:52:09 -07:00
Fred Park 071ec86831 Minor fixes and updates 2017-03-13 08:16:53 -07:00
Fred Park 89b722df54 Populate the fs config doc
- Update base README
- Rename disk_ids to disk_names in fs.json
2017-03-12 13:07:56 -07:00
Fred Park d9966e645d Experimental support for gluster cluster resize
- Add more robustness to gluster provisioning
- Stat script fixes
- Add --detail option for stat
- More disks del/list options
2017-03-12 11:18:41 -07:00
Fred Park 9dc4673530 Enable data ingress to remote storage clusters
- Refactor some constants to function access from the proper locations
2017-03-11 19:15:14 -08:00
Fred Park cb7b42a231 Support glusterfs <-> pool autolinking
- Support glusterfs expand (additional disks)
- Provide `mount_options` for `file_server` which applies to local mount
  on the file server of the disks
- Allow gluster volume name to be specified
- Provide stronger cross-checking between pool virtual network and
  storage cluster virtual network
- Increase ud/fd in AS to maximums
- Install acl tools for nfsv4 and glusterfs
2017-03-11 15:23:55 -08:00
Fred Park 0ed28d96fc Allow scp dm and credential encryption on Windows
- Rename old glusterfs scripts to be less confusing with remote
  glusterfs support
2017-03-11 09:21:33 -08:00
Fred Park 675c6c37f8 Glusterfs support for add/suspend/start
- Simple logging by default
- Fix logging format
2017-03-10 22:54:16 -08:00
Fred Park 3f47fda0b9 Checkpoint multi-vm glusterfs support
- Allow resource_group overrides in managed_disks and storage_cluster
- Add server_options to file_server
- Add named resource group support to disk deletion
- Fix Batch and ARM client issues in non-AAD mode
2017-03-10 15:10:31 -08:00
Fred Park 33291504c2 Support missing image tasks, pool check
- Break out configs into separate pages
- Update all configs using 16.04.0-LTS to 16.04-LTS
- Remove Batch `account` from recipe credentials
2017-03-09 15:07:37 -08:00
Fred Park e349a004cd Support pool <-> storage cluster auto-linkage
- Update to latest batch management client library supporting
  UserSubscription
- Begin breakout of config doc into multiple pages
2017-03-09 09:40:16 -08:00
Fred Park 5fcddad7ea Pool <-> storage cluster linkage checkpoint 2017-03-08 23:43:16 -08:00
Fred Park 91403de98f Add pool vnet spec
- Refactor vnet/subnet creation so pool creation can use it
- Allow read of fs.json for pool add
- Rename "glusterfs" volume_driver to "glusterfs_on_compute"
2017-03-08 20:23:05 -08:00
Fred Park 66d90dde90 Prep for add pool with vnet changes
- Centralize various client creation logic
2017-03-08 14:56:39 -08:00
Fred Park 8f7aee3a2f Support AAD auth for Batch accounts 2017-03-08 11:13:09 -08:00
Fred Park c118b7e2d9 Allow custom inbound network security rules 2017-03-08 09:52:21 -08:00
Fred Park 587ab7faa4 Fix suspend/start issues with software raid
- Disallow expand action with mdadm-based arrays on RAID-0
- Change "remotefs" to "fs" for commands
2017-03-08 09:52:21 -08:00
Fred Park 748cf64bfb Refactor and unify AAD settings across commands
- Allow KeyVault AAD endpoints to be specified
2017-03-08 09:52:21 -08:00
Fred Park a6a672a82e Begin expand functionality
- Fix issues with ext4 + mdadm
2017-03-08 09:52:21 -08:00
Fred Park f8e3fa52ed Add stat script
- Better organize some remotefs json settings
- Reduce redundant lookups in ssh path
- Create --output-config option to separate from --verbose
2017-03-08 09:52:21 -08:00
Fred Park 0b172eccce Add remotefs bootstrap script 2017-03-08 09:52:21 -08:00
Fred Park 82398360ff Add cluster status and ssh commands
- Start integration with CustomScript extension
2017-03-08 09:52:21 -08:00
Fred Park 69335d7287 Add cluster suspend and start commands
- Begin work on status command
2017-03-08 09:52:21 -08:00
Fred Park 94bde5b076 Add first version of cluster add and del commands
- Modify remotefs json for more properties
2017-03-08 09:52:21 -08:00
Fred Park acbce84fa1 Add disk del and list commands 2017-03-08 09:52:21 -08:00
Fred Park cba7086511 Add disk add command
- Add first iteration of remotefs.json
- Modify set of TCP no tune VMs
2017-03-08 09:52:21 -08:00
Fred Park cd2cb4352a Scaffold base changes for remotefs 2017-03-08 09:52:21 -08:00
Fred Park 48834c1a49 Fix certificate vis for encrypted credentials 2017-03-08 09:36:43 -08:00
Fred Park 9cd39534cb Update for azure-batch 2.0.0
- Dependency refresh
2017-03-08 09:36:43 -08:00
Fred Park 00302b75db Tag for 2.5.4 release
- Update nvidia-docker for docker ce
- Update tesla nc driver
- Use SHA256 checksums instead of MD5 for downloads
2017-03-08 08:39:51 -08:00
Fred Park f9782878f1 Tag for 2.5.3 release 2017-03-01 07:39:00 -08:00
Fred Park b8a94c378d Add rebootnode command
- Update recipes that download files to use resource files instead
2017-02-28 19:37:57 -08:00
Fred Park 6ce173ca05 Pin blobxfer version and add termtasks option
- Clarify docs for usage scope between tooling/APIs
2017-02-28 09:45:24 -08:00
Fred Park ace0dde416 Tag for 2.5.2 release 2017-02-23 08:06:27 -08:00
Fred Park 7dd02b3c27 Automatic path sub for Gluster/AzStorage xfer
- Resolves #37
2017-02-23 07:51:49 -08:00
Fred Park af1e03cfa8 Add troubleshooting guide
- Minor convoy.batch fixes
2017-02-22 20:34:20 -08:00
Christian 65576235a1 Wrap local command in a OS specific shell (#39)
* Wrap local command in a OS specific shell

* _ON_WIN const instead of os.name
2017-02-21 07:31:03 -08:00
Fred Park 4eea944bb3 Tag for 2.5.1 release 2017-02-01 11:06:26 -08:00
Fred Park 78fad1c3e3 Add support for task retention time
- Resolves #30
2017-01-31 09:40:16 -08:00
Fred Park 25dcc983ef Fix unencrypted task file mover delimiter issue
- Resolves #29
2017-01-30 15:06:32 -08:00
Fred Park 0fe858c14e Add FAQ and fix autogen task id rollover
- Resolves #27
2017-01-26 08:30:53 -08:00
Andrea Dotti c6176d01ce Fix issue with listtasks failing on active status tasks (#28)
* Fix issue with listtasks failing on active status tasks

Fix issue where tasks that are not in the completed or running state cause jobs listtasks to fail
and shipyard to crash.

* Fix formatting issue

Fix spaces and too long line

* Fix code notations
2017-01-26 08:22:02 -08:00
Fred Park 270ef0c7b1 Fix Docker tmpdir
- Fix typo with ev secret id ref to keyvault
- Add travis py36 env
2017-01-24 14:43:44 -08:00
Derrick Liu 5fabd07fef Add max_task_retry_count to job and task definitions (#23)
* Add `max_task_retry_count` to json template as reference

* Add job-level and task-level max_task_retry_count properties

If set, we create an `azure.batch.models.JobConstraints` or `azure.batch.models.TaskConstraints` object, and pass it into the call to `JobAddParameter` or `TaskAddParameter` as a constraints argument.

* Update configuration documentation to include `max_task_retry_count`

* Fixed various minor issues and linting

Squashed commit:

[d794908] No retry means retry_count is 0, not 1

[29de812] Forgot to define these earlier

[8336700] Don't check for empty since it's an int

[c59d52a] Fix flake8 linting line length (+2 squashed commit)

Squashed commit:

[8336700] Don't check for empty since it's an int

[c59d52a] Fix flake8 linting line length

* Rename `max_task_retry_count` to `max_task_retries` and fix other PR comments
2017-01-24 07:46:52 -08:00
Fred Park fa4e1f847c Add env var secret id support
- Tag for 2.5.0 release
- Resolves #12
- Partially resolves #15
2017-01-19 10:16:42 -08:00
Fred Park 040a068265 Various fixes
- This resolves #13 and resolves #16
2017-01-19 10:16:42 -08:00
Gonzalo 520db51019 nvidia-docker updated to v1.0.0 2017-01-19 09:48:13 -08:00
Derrick Liu 829258de2b Remove extra azure.mgmt import 2017-01-18 20:54:43 -08:00
Fred Park 9b6dbef19f Add task dependency id range support 2017-01-12 09:30:41 -08:00
Fred Park c95520eaea Tag for 2.4.0 release
- Update KeyVault docs with Azure CLI 2.0 commands
- Resolves #10
2017-01-11 09:22:45 -08:00
Fred Park 348ceebc65 Add AAD X.509 cert auth support (#10)
- AAD/Keyvault credential support in credentials.json
2017-01-10 11:48:39 -08:00
Fred Park be04d89410 Update docs for KeyVault support (#10) 2017-01-06 08:03:26 -08:00
Fred Park 2048d8b289 Add KeyVault support (#10) 2017-01-05 10:20:13 -08:00
Fred Park ae7e5df410 Tag for 2.3.1 release
- Update some docs
2017-01-03 08:51:19 -08:00
Fred Park b69334de58 Fix multi-job jpcmd bug 2016-12-15 15:44:18 -08:00
Fred Park d9bf6c92da Add nvidia-docker support to ssh tunnel 2016-12-15 11:05:52 -08:00
Fred Park 57b47b353f Add pool ssh command, resolves #9
- Make the ssh docker tunnel script much easier to use
- Add an ssh guide to docs
2016-12-15 07:39:04 -08:00
Fred Park 38ba61245d Add /dev/shm option, resolves #8 2016-12-14 08:36:51 -08:00
Fred Park 7f37f81e93 Increment version to 2.2.0 2016-12-12 08:13:37 -08:00
Fred Park f00f222877 Infiniband settings changes
- Add CNTK ib recipe
- Update READMEs to remove GPU preview notes
- Tag for 2.2.0 release
2016-12-09 11:34:21 -08:00
Fred Park 5baf61d8f4 Fix SAS key and KeyError masking in data movement 2016-11-30 14:41:39 -08:00
Fred Park c603e8f6f5 Fix tfm docker image latest reference 2016-11-30 13:44:44 -08:00
jasper-schneider 6f7d474874 Fix ssh_public_key typo when using Windows (#6) 2016-11-30 13:40:29 -08:00
Fred Park 8f0fa2f446 Tag for 2.1.0 release
- Pass version to nodeprep and pull backend docker images by version
2016-11-30 08:27:46 -08:00
Fred Park 28732f2aea Add listskus subcommand
- Update docs for envvars
2016-11-29 15:31:23 -08:00
Fred Park 8577232349 Tag for 2.0.0 release 2016-11-23 09:06:37 -08:00
Fred Park fe0403de9a Update MXNet GPU docker image 2016-11-22 14:27:33 -08:00
Fred Park c047b522c3 Update CNTK docker images to 2.0beta4
- Fix termtasks for multi-instance tasks with named containers
2016-11-22 00:16:16 -08:00
Fred Park 44080c123a Prepend job id to Docker container names
- Update TensorFlow to 0.11.0 and custom compile to add compute/sm 3.7
2016-11-20 14:57:45 -08:00
Fred Park 8cb2ba9583 Allow GPU property to be optional for NC VMs
- Update all GPU compute recipes to omit gpu driver
2016-11-20 09:00:19 -08:00
Fred Park 453ae98a65 Terminate cascade on thread failures 2016-11-19 10:39:58 -08:00
Fred Park c7744f95bf Support for internet accessible private registries 2016-11-19 09:00:01 -08:00
Fred Park 4399dbf4db Tag for 2.0.0rc3 release
- Fix flake8 issues
2016-11-14 11:10:59 -08:00
Fred Park 62b532d233 Fix encoding issue for env file write 2016-11-13 23:26:48 -08:00
Fred Park 0ae05f2d84 Add CUDA_CACHE_ vars for GPU tasks 2016-11-13 11:42:56 -08:00
Fred Park b8fcdede8f Add --tail option for jobs add
- Simplify quickstart with --tail
2016-11-12 22:35:36 -08:00
Fred Park 4f41d95e32 Finish settings refactor
- Change recipes to use current_dedicated for multi-instance count
2016-11-12 22:13:55 -08:00
Fred Park e6593281eb Refactor direct config access out of data/storage 2016-11-12 12:35:56 -08:00
Fred Park 6e20e1b512 Refactor direct config accesses in crypto
- Refactor os path calls to pathlib
2016-11-12 09:13:09 -08:00
Fred Park 285f86ae9b Fleet no longer directly accesses config json 2016-11-11 23:19:28 -08:00
Fred Park cb4a077776 Fleet add pool no longer directly accesses config 2016-11-11 21:51:11 -08:00
Fred Park 03ced70c38 Continue settings refactor
- Credentials
- Some of global config
2016-11-11 21:08:58 -08:00
Fred Park 392af0bd55 Start pool settings refactor 2016-11-11 19:23:16 -08:00
Fred Park bff72f4d04 Fix removal of shared path from glusterfs ingress 2016-11-11 09:45:36 -08:00
Fred Park e700ee05b7 Add docker login prior to image update
- Move docker hub creds to credentials json
- Begin refactor of configuration settings retrieval
2016-11-11 09:30:14 -08:00
Fred Park da573524de Preliminary steps for ACR support
- Fix update docker images with private registry
- Automatically clean dangling image refs on update
- Remove private registry file/image id support
- Refactor fleet initialization steps to one entry point
- Simplify shipyard context init
2016-11-10 09:48:00 -08:00
Fred Park 092c2a22d1 Fix single node transfer with single file 2016-11-09 19:06:06 -08:00