Commit Graph

45 Commits

Author SHA1 Message Date
Yuqi Wang d602cdcbab
Keep podGracefulDeletionTimeoutSec as Nullable (#74) 2022-02-23 18:38:16 +08:00
Yuqi Wang e6589162fb
Upgrade CRD to apiextensions.k8s.io/v1 to support k8s >= 1.22 (#72) 2022-01-17 11:08:28 +08:00
Yuqi Wang 4b5707f53e
Expose Task History (#62) 2020-10-19 21:39:28 +08:00
Yuqi Wang 959722c429
Treat invalid Pod caused by network error as PodCreationUnknownError (#61) 2020-08-31 20:57:51 +08:00
Yuqi Wang a220bd321f
Expose and increase default sync concurrency (#60) 2020-08-28 19:58:15 +08:00
Yuqi Wang 29e115373a
Pause zero scale Framework instead of completing it (#59) 2020-08-11 19:46:34 +08:00
Yuqi Wang d67fc76595
Support Create ExecutionType: Just create without start (#58) 2020-08-10 12:20:02 +08:00
Yuqi Wang c61269671d
Support Framework ScaleUp/ScaleDown with Strong Safety Guarantee (#56) 2020-08-03 14:00:49 +08:00
Di Xu 40fb74d1d5
add FC_TASK_INDEX to label so can select pod uniquely (#53) 2020-02-14 18:33:00 -08:00
Yuqi Wang c4be168117
Enrich PodSpecError to early fail Pod (#52) 2020-01-17 16:32:53 +08:00
Yuqi Wang 7789e3e73f
Aware UID change during Update event and Sync (#51) 2020-01-13 13:57:28 +08:00
Yuqi Wang 285ade0ea8
Remove deprecated Initializers in planning (#50) 2019-12-10 11:12:32 +08:00
Yuqi Wang 429fa5498e
Fix invalid json in log caused by fmt (MISSING) (#49) 2019-11-12 14:17:11 +08:00
Yuqi Wang 707b7a9c97
Add PodNodeName to help track failures on node before PodIP is available (#45) 2019-10-29 17:13:04 +08:00
Yuqi Wang 8e4145176c
Support large scale Framework by LargeFrameworkCompression (#44) 2019-10-23 15:43:59 +08:00
Yuqi Wang 77ec4abbdc
Support PodGracefulDeletionTimeoutSec to tune Framework Consistency vs Availability (#43) 2019-09-19 17:54:25 +08:00
Yuqi Wang 42373169ca
Refine PodFailureSpec (#42) 2019-09-16 13:44:41 +08:00
Yuqi Wang df63d60c53
Support PodFailureSpec to classify and summarize Pod failures (#41) 2019-09-02 18:54:50 +08:00
Yuqi Wang 54a4554b69
Refine RetryDelaySec (#39) 2019-08-16 17:46:05 +08:00
Yuqi Wang e13b822eff
Remove unnecessary recoverFrameworkWorkItems (#38) 2019-08-12 13:42:49 +08:00
Yuqi Wang 4a771cd6c4
Support FrameworkCompletedRetainSec (#37) 2019-08-09 19:04:40 +08:00
Yuqi Wang 9298ab677c
Redefine FrameworkAttemptRunning and Record attempt running start time (#35)
This helps to measure pure running duration
2019-08-08 11:21:06 +08:00
Yuqi Wang d432b57875
Fill object TypeMeta/GroupVersionKind in case it is missed in history snapshot (#33) 2019-08-01 14:44:26 +08:00
Yuqi Wang 1aa6e612e1
Support LogObjectSnapshot: Expose Framework and Pod History (#31) 2019-07-31 15:04:08 +08:00
Yuqi Wang 48f601bb39
Switch to klog (#30) 2019-07-26 20:09:05 +08:00
Yuqi Wang 2caad5b969
Upgrade to golang 1.12.6 (#29) 2019-07-18 15:58:23 +08:00
Yuqi Wang 243996c2c0
Refine Framework golang iteration (#28) 2019-07-17 15:37:44 +08:00
Yuqi Wang 157c3bfe0f
Still sync Task after FrameworkAttemptCompleted (#27) 2019-07-17 15:35:32 +08:00
Yuqi Wang 20f38add58
Revise TaskStatus after FrameworkAttemptCompleted (#26) 2019-07-17 15:32:31 +08:00
Yuqi Wang 9cf3c8881e
Make AttemptCompleted state is not necessary to have an associated instance (#25) 2019-07-17 15:27:38 +08:00
Yuqi Wang dbb98da159
Support Stop Framework (#24) 2019-07-17 15:24:19 +08:00
Yuqi Wang b4b3695cea
Revise Internal and External CompletionTypeAttribute to User and Platform (#22) 2019-07-17 15:18:57 +08:00
Yuqi Wang 0a7b9851a0
Support Pod Template Placeholders (#21) 2019-07-17 15:16:08 +08:00
Yuqi Wang a4b9eeb690
Consolidate slice append (#20) 2019-07-17 15:12:56 +08:00
Yuqi Wang 1fb9e251e7
Refine updateRemoteFrameworkStatus (#19) 2019-07-17 15:09:52 +08:00
Yuqi Wang 63422e0227
Fix fExpectedStatusInfos map race condition (#18) 2019-07-17 15:03:33 +08:00
Yuqi Wang 9ab3eadace
Fix LogLines (#17) 2019-07-17 14:59:07 +08:00
Yuqi Wang 3654ac11ce
Upgrade to kubernetes-1.14.2 (#16) 2019-07-17 14:47:01 +08:00
Yuqi Wang 8adcef25f6
Add FrameworkAttemptPreparing State (#12) 2019-02-19 19:28:02 +08:00
Yuqi Wang 0c93ab3733
Refine CompletionPolicy comment and log (#11) 2019-02-15 13:58:46 +08:00
Yuqi Wang 7e3eaa0c21
Fix TaskComplete may transition to TaskAttemptCompleted (#10) 2019-02-14 19:06:27 +08:00
Yuqi Wang 3420ae0e67
[BREAKING CHANGE]: Refine AnnotationKey, LabelKey and EnvName (#6)
1. Change "POD_NAMESPACE" to "FRAMEWORK_NAMESPACE"
2. Prefix "FC_" for all FrameworkController Predefined AnnotationKeys, LabelKeys and EnvNames
3. Prefix "FB_" and uppercase TaskRoleName for all FrameworkBarrier EnvNames
2019-01-17 17:41:46 +08:00
Yuqi Wang 07c2a6c058
Refine Doc and Example (#3)
Refine Doc and Example
2018-12-17 21:32:22 +08:00
Yuqi Wang 94a1680339
Support FrameworkBarrier for GangExecution and Add Distributed TensorFlow Training Example (#2)
1. Support FrameworkBarrier for GangExecution
2. Add Distributed TensorFlow Training Example
2018-11-23 14:53:04 +08:00
Yuqi Wang 75dea76860
Initial FrameworkController: General-Purpose Kubernetes Pod Controller 2018-10-22 08:34:54 +00:00