feat: enable pytorch elastic training fashion based on torch elastic #267
Conversation
// EnableElastic decides whether torch elastic is enabled for job.
// +optional
EnableElastic bool `json:"enableElastic"`
add omitempty json tag
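With the suggested tag, the field would look roughly like this (a sketch of the reviewer's suggestion, not the final PR code):

// EnableElastic decides whether torch elastic is enabled for job.
// +optional
EnableElastic bool `json:"enableElastic,omitempty"`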
MaxReplicas *int32 `json:"maxReplicas,omitempty"`

RDZVBackend string `json:"rdzvBackend,omitempty"`
RdzvEndpoint string `json:"rdzvEndpoint"`
Is rdzvEndpoint required, and therefore must it not have omitempty?
Yes, RDZVBackend and RdzvEndpoint are both required when EnableElastic is true.
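For illustration, a check along these lines could enforce that requirement (a hypothetical sketch; the function name and where the fields live on the spec are assumptions, not code from this PR):

// Hypothetical check: elastic training requires a rendezvous backend and endpoint.
func validateElasticPolicy(spec *training.PyTorchJobSpec) error {
	if spec.EnableElastic {
		if spec.RDZVBackend == "" || spec.RdzvEndpoint == "" {
			return fmt.Errorf("rdzvBackend and rdzvEndpoint are required when enableElastic is true")
		}
	}
	return nil
}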
@wanziyu Hi wanziyu, thanks for your contribution! Before merging your PR to the master branch, you should sign off your commits first.
pkg/job_controller/api/v1/types.go
Outdated
LastReplicas int32 `json:"lastReplicas,omitempty"`

// Continue represents whether the job needs to continue scaling.
Continue bool `json:"continue,omitempty"`
Will the user amend this continue field manually? If not, who will pause the scaling progress?
The torchelastic controller will amend this continue field. When continue is set to false in the last control loop, the controller will pause the scaling progress.
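In other words, the next reconcile gates scaling on the flag written by the previous loop, roughly like this (an illustrative fragment with assumed surrounding code, not the PR implementation):

// Sketch: skip further scaling when the last control loop set Continue to false.
if status, ok := jobStatus.ElasticStatus[rtype]; ok && !status.Continue {
	// Scaling was paused in the previous loop; do nothing this round.
	return nil
}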
@@ -162,11 +162,28 @@ func (jc *JobController) FilterServicesForReplicaType(services []*v1.Service, re
	return result, nil
}

// calculateServiceSliceSize compare max pod index with desired replicas and return larger size
func calculateServiceSliceSize(services []*v1.Service, replicas int) int {
Is it necessary? In the GetServiceSlices function, serviceSlices will be re-allocated when index > size.
If calculateServiceSliceSize is not used, an "index out of range" error occurs when scaling in workers and services (i.e., updating worker replicas from 2 to 1).
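Based on the doc comment in the diff, the helper can be sketched as below (illustrative; the replica-index label key and the exact implementation are assumptions):

// calculateServiceSliceSize returns the larger of the desired replicas and the
// highest replica index (+1) found on the existing services, so the slice stays
// large enough while scaling in. The "replica-index" label key is assumed.
func calculateServiceSliceSize(services []*v1.Service, replicas int) int {
	size := replicas
	for _, svc := range services {
		index, err := strconv.Atoi(svc.Labels["replica-index"])
		if err != nil {
			continue
		}
		if index+1 > size {
			size = index + 1
		}
	}
	return size
}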
…-elastic Signed-off-by: wanziyu <[email protected]>
force-pushed from 1dc3aa9 to 90832ba
Codecov Report
@@ Coverage Diff @@
## master #267 +/- ##
==========================================
+ Coverage 28.93% 29.12% +0.18%
==========================================
Files 88 89 +1
Lines 5985 6260 +275
==========================================
+ Hits 1732 1823 +91
- Misses 4000 4174 +174
- Partials 253 263 +10
Flags with carried forward coverage won't be shown.
	ctrl "sigs.k8s.io/controller-runtime"
)

func SetupWithManager(mgr ctrl.Manager) error {
Could the torchelasticcontroller implementation be a sub-module of /pytorch? The directory structure would look cleaner.
	jobStatus.ElasticStatus[rtype].ElasticCondition = apiv1.ElasticStop
}

func updateElasticStatusForContinueJob(pytorchJob *training.PyTorchJob, currentReplicas, newReplicas int32, rtype apiv1.ReplicaType) {
updateElasticStatusForContinueJob and updateElasticStatusForMaxMetricJob have some code duplication that could be further abstracted.
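One possible refactoring is a single shared status-update helper that both functions call with different arguments (a hypothetical sketch; only LastReplicas, Continue and ElasticCondition appear in the diff, the other field names are assumptions):

// Hypothetical shared helper both update functions could delegate to.
func updateElasticStatus(job *training.PyTorchJob, lastReplicas, newReplicas int32,
	rtype apiv1.ReplicaType, cond apiv1.ElasticCondition, cont bool) {
	status := job.Status.ElasticStatus[rtype]
	status.LastReplicas = lastReplicas
	status.CurrentReplicas = newReplicas // assumed field name
	status.ElasticCondition = cond
	status.Continue = cont
}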
	return true, nil
}

func (ts *TorchElasticController) restartWorkerInKruiseProtocol(job *trainingv1alpha1.PyTorchJob, pod *corev1.Pod, expectedWorldSize, expectedGeneration string) (completed bool, err error) {
Reuse restartWorkerInKruiseProtocol under the /pytorch package.
Signed-off-by: wanziyu <[email protected]>
I. Describe what this PR does
The PR designs elastic training APIs, adds a torch-elastic controller, and implements the elastic training control flow in the torch-elastic controller and the pytorch controller. Currently, the scaling algorithm is based on the real-time batch training latency collected from running pod logs.
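As a rough illustration of the kind of decision that description implies (not the PR's actual algorithm), scaling out could continue only while adding replicas keeps reducing the observed batch latency:

// Illustrative only: all names and the stopping rule are assumptions.
// latencyByReplicas maps a replica count to the batch latency parsed from pod logs.
func nextReplicas(latencyByReplicas map[int32]float64, current, max int32) (int32, bool) {
	if current >= max {
		return current, false
	}
	cur, okCur := latencyByReplicas[current]
	prev, okPrev := latencyByReplicas[current-1]
	// Stop once the latest scale-out no longer improved batch latency.
	if okCur && okPrev && cur >= prev {
		return current, false
	}
	return current + 1, true
}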
II. Does this pull request fix one issue?
#251