454 Commits

Author SHA1 Message Date
chezhang 0731b03962 Fix a bug that heartbeat thread may be stuck due to deadlock 2020-07-27 16:48:13 +08:00
chezhang 6130db6eb7 Revise some log messages 2020-07-27 16:28:25 +08:00
chezhang 772b2f4012 Update version to 2.5.0.0 2020-07-17 21:51:02 +08:00
zclok010 cce1f68b9f Improve GPU instance name readability in metric info by adding GPU name 2019-08-30 14:18:33 +08:00
zclok010 9695b14042 Fix an issue that node with FQDN host name may not be recognized by scheduler 2019-08-20 11:44:21 +08:00
zclok010 3dbf5d342d Fix build issue 2019-08-20 11:38:18 +08:00
zclok010 b369a14ff3 Move built-in execution filters to https://github.com/Azure-Samples/hpcpack-samples 2019-08-01 23:35:55 +08:00
zclok010 5ae6582161 add a missing built-in execution filter in OnTaskStart.sh 2019-07-31 15:33:59 +08:00
zclok010 1a575bccb3 suppress frequent warning log message when counter file of a Mellanox network driver is invalid 2019-07-31 15:32:31 +08:00
zclok010 dc73a1fb4d git ignore VMExtension/*.zip 2019-07-26 00:04:41 +08:00
zclok010 ba4ca371d7 git ignore VMExtension\*.zip 2019-07-26 00:01:58 +08:00
zclok010 f91b5bae05 update config sample 2019-07-25 23:58:02 +08:00
Sunbin Zhu 6abe5e32d4 Merge branch 'v2' of https://github.com/Azure/hpcpack-linux-agent into v2 2019-07-25 23:37:09 +08:00
Sunbin Zhu a544594e1e Update hpcnodemanager.py (add firewall rule to allow port 40002) 2019-07-25 23:37:02 +08:00
zclok010 302720eb73 revise some version info 2019-07-23 12:03:32 +08:00
zclok010 0de36a6ff9 mitigate scheduler pressure when connection is poor by decreasing HTTP reporter retry frequency 2019-07-23 12:03:32 +08:00
zclok010 19fcca1a97 Support multiple instances monitoring with instance filter; separate network usage monitoring from total usage to usage of individual network instances 2019-07-23 12:03:32 +08:00
Sunbin Zhu c7c071ffa0 Include VM extension code 2019-07-19 15:03:46 +08:00
zclok010 a2ff32a012 add comments of known issue of run-away processes 2019-07-04 00:14:05 +08:00
zclok010 5760fb798b Merge branch 'dockerTask' into v2 2019-07-03 14:49:55 +08:00
zclok010 16c5aaaecb HpcData client location change in execution filter 2019-06-19 15:11:19 +08:00
zclok010 183c4d8ef2 update version info 2019-06-17 16:01:59 +08:00
zclok010 9d827bcac1 Fix node manager crash issue due to out-of-bound array writing when constructing monitoring packet with too many data values 2019-06-13 19:32:41 +08:00
zclok010 9ff4b44466 docker task improvement 2019-06-13 15:31:44 +08:00
zclok010 a1dd952348 Add IB network usage factor 2019-05-31 16:59:55 +08:00
zclok010 ae73ce7624 Merge branch 'config' into v2 2019-05-27 16:24:51 +08:00
zclok010 29b2baa192 Merge branch 'ibNetwork' into v2 2019-05-27 16:22:24 +08:00
zclok010 43bfbca0c7 update sample config file 2019-05-23 23:42:45 +08:00
zclok010 1d3a59d886 monitor IB network usage 2019-05-23 18:26:14 +08:00
FAREAST\chezhang 9322ff980d fix code defect 2019-05-21 17:54:38 +08:00
FAREAST\chezhang 332d7f9f3d Merge branch 'dataIO' into v2 2019-05-15 14:21:04 +08:00
FAREAST\chezhang c6da4b4edf Merge branch 'su' into v2 2019-05-15 14:20:55 +08:00
FAREAST\chezhang 68c9e0cd57 Merge branch 'mpiFilter' into v2 2019-05-15 14:20:48 +08:00
FAREAST\chezhang 71d2753c4f Support downloading input files and uploading output files for task with HpcData service 2019-05-13 14:52:45 +08:00
FAREAST\chezhang 418ae90cab replace mpi command in the whole command line instead of just the beginning 2019-05-13 14:39:04 +08:00
FAREAST\chezhang 371138d162 Fix a code defect when CCP_SWITCH_USER is set 2019-05-10 19:52:42 +08:00
zclok010 6a7ef893c6 git ignore .vscode 2019-05-10 12:16:27 +08:00
zclok010 b130b4fd7f Fix test case error introduced in commit b41e682 (execution filter change) 2019-05-09 15:15:48 +08:00
zclok010 d98eec6f15 Add CcpVersion and CustomProperties in register info 2019-05-07 10:43:33 +08:00
zclok010 0d44a14c3e refine code 2019-05-06 20:04:10 +08:00
zclok010 d7a8e4a371 tidy log when cleaning up zombie tasks 2019-05-05 14:57:32 +08:00
zclok010 05138060e7 Fix a bug that processes in task are not actually terminated after task cancellation if cgroup is not enabled 2019-04-30 20:26:31 +08:00
zclok010 844bb394b9 Fix a bug that zombie task cleanup would fail when nodemanager starts 2019-04-30 20:24:27 +08:00
zclok010 42b467d5af tidy some code 2019-04-30 15:12:30 +08:00
FAREAST\chezhang 6483b16b0f Update tasks' statistics before sending heartbeat to enable showing statistics when task is running 2019-04-26 10:19:23 +08:00
zclok010 b41e682ab3 Merge branch 'executionfilter' into v2 2019-04-16 15:42:04 +08:00
zclok010 2750349ba1 Add built-in execution filters to adjust task affinity in terms of core distribution in NUMA nodes and to modify command for preparation of mpi task 2019-04-16 15:29:26 +08:00
zclok010 8deeffa1f0 get home dir by tilde expansion 2019-04-08 15:26:17 +08:00
zclok010 2fa7452e49 fix a bug that memory is limited to the first NUMA node when using cgroup 2019-03-28 17:26:35 +08:00
zclok010 4e0e4a2d1e change the default working directory to home 2019-03-28 17:02:54 +08:00