-
Notifications
You must be signed in to change notification settings - Fork 167
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
IL: LC supports to recover job when restart #152
Conversation
pkg/localcontroller/db/db.go
Outdated
|
||
queryResult := dbClient.Where("name = ?", name).First(&r) | ||
if queryResult.RowsAffected == 0 { | ||
klog.Warningf("resource(name=%s) not in db", name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- leave the log to the callers, because some callers may expect non-existence
- no Error return?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix it
please update your commit meesage, we need to describe this pr to make reviewer understand what you want to do. |
Upated pr message. thanks for your suggestion. |
2a20b2a
to
ca3a27a
Compare
return nil | ||
} | ||
|
||
// updateAnnotations updates job annotations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inconsistent: updateAnnotations
VS updateJobToDB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updateJobToDB
=> saveJobToDB
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix it
@@ -322,8 +338,7 @@ func (im *Manager) startJob(name string) { | |||
} | |||
|
|||
if err != nil { | |||
klog.Errorf("job(name=%s) complete the %s task failed, error: %v", | |||
jobConfig.UniqueIdentifier, jobStage, err) | |||
klog.Errorf("job(name=%s) complete the %s task failed, error: %v", name, jobStage, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
klog.Errorf("job(name=%s) complete the %s task failed, error: %v", name, jobStage, err) | |
klog.Errorf("job(name=%s) failed to complete the %s task: %v", name, jobStage, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
if err := job.updateDeployModel(deployModel, trainedModel); err != nil { | ||
return nil, err | ||
if err := job.JobConfig.Storage.CopyFile(trainedModel, deployModel); err != nil { | ||
return fmt.Errorf("copy model(url=%s) to the model(url=%s) failed, error: %+v", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return fmt.Errorf("copy model(url=%s) to the model(url=%s) failed, error: %+v", | |
return fmt.Errorf("failed to copy trained model(url=%s) to the deploy model(url=%s): %w", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a nit: error wrap using %w.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
} | ||
|
||
klog.Infof("job(name=%s) deploys model(url=%s) successfully", jobConfig.UniqueIdentifier, trainedModel) | ||
klog.V(4).Infof("copy model(url=%s) to the model(url=%s) successfully", trainedModel, deployModel) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
klog.V(4).Infof("copy model(url=%s) to the model(url=%s) successfully", trainedModel, deployModel) | |
klog.V(4).Infof("copy trained model(url=%s) to the deploy model(url=%s) successfully", trainedModel, deployModel) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
deployModel := jobConfig.EvalModels[1].URL | ||
if job.Storage.IsLocalStorage { | ||
// updateModel updates deploy model | ||
func (im *Manager) updateModel(job *Job, trainedModel string, deployModel string) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about this?
func (im *Manager) updateModel(job *Job, trainedModel string, deployModel string) error { | |
func (im *Manager) updateDeployModelFile(job *Job, trainedModel string, deployModel string) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
@@ -687,7 +804,7 @@ func (job *Job) createOutputDir(jobConfig *JobConfig) error { | |||
|
|||
dirNames := []string{"data/train", "data/eval", "train", "eval"} | |||
|
|||
if job.Storage.IsLocalStorage { | |||
if job.JobConfig.Storage.IsLocalStorage { | |||
if err := util.CreateFolder(outputDir); err != nil { | |||
klog.Errorf("job(name=%s) create fold %s failed", jobConfig.UniqueIdentifier, outputDir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
klog.Errorf("job(name=%s) create fold %s failed", jobConfig.UniqueIdentifier, outputDir) | |
klog.Errorf("job(name=%s) failed to create folder %s: %v", jobConfig.UniqueIdentifier, outputDir, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
return err | ||
} | ||
if err != nil { | ||
klog.Errorf("job(name=%s) complete the %sing phase triggering task failed, error: %v", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
klog.Errorf("job(name=%s) complete the %sing phase triggering task failed, error: %v", | |
klog.Errorf("job(name=%s) failed to trigger the %sing phase: %v", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
credential := job.ObjectMeta.Annotations[runtime.SecretAnnotationKey] | ||
if credential != "" { | ||
if err := job.JobConfig.Storage.SetCredential(credential); err != nil { | ||
return fmt.Errorf("failed to set job(name=%s)'s storage credential, error: %+v", name, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return fmt.Errorf("failed to set job(name=%s)'s storage credential, error: %+v", name, err) | |
return fmt.Errorf("failed to set storage credential: %w", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- no job name in returned error.
- wrap error using %w.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix it
err = im.saveJobToDB(job) | ||
if err != nil { | ||
klog.Errorf("job(name=%s) failed to update job to db, error: %v", | ||
jobConfig.UniqueIdentifier, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no return
here instead of continuing since an error has been happend and logged? if intentional, add a comment here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
like // continue anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix
return err | ||
err = im.loadDataset(job) | ||
if err != nil || jobConfig.Dataset == nil || jobConfig.Dataset.DataSource == nil { | ||
return fmt.Errorf("job(name=%s) failed to load dataset, and waiting it: %v", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a nic: error wrap using %w
instead of %v
, see https://blog.golang.org/go1.13-errors#TOC_3.3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix
return nil | ||
} | ||
|
||
switch jobStage { | ||
case sednav1.ILJobTrain: | ||
return &Model{Format: models[1].Format, URL: models[1].URL} | ||
// return two models. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// two models for deploy stage:
// the first model is the output model of train stage, which was evaluated as better than the second model in eval stage .
// the second model is the serving model used in the inference worker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok ,fix it
Signed-off-by: JimmyYang20 <[email protected]>
/lgtm |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: llhuii The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Upgrade to v0.4.0 Created-by: Vittorio Cozzolino 00609018 Author-id: 553076 MR-id: 12361350 Commit-by: Vittorio Cozzolino;KubeEdge Bot;JoeyHwong;JimmyYang20;ShiXiaohou;DanLiu;HenryChou;Yutong Wang;EnfangCui;llhuii;XinYao1994;Jie Pu;wei.ji Merged-by: Vittorio Cozzolino 00609018 E2E-issues: Description: fix incremental_learning bug - Add docs and code comment - fix bugs: epoch always be 1, inference result not saved, s3 upload fail Signed-off-by: JoeyHwong <[email protected]>, Interface Improvement: 1. The algorithm of HardExampleMining should be seleted by the developer. Signed-off-by: JoeyHwong <[email protected]>, install.sh: fix LC_BIND_PORT bug rename the variable LC_BIND_PORT to SENDA_LC_BIND_PORT. Signed-off-by: llhuii <[email protected]>, Merge pull request kubeedge#121 from llhuii/fix-install-script-bug install.sh: fix LC_BIND_PORT bug, Update interface.py fix env missing bug Signed-off-by: JoeyHwong <[email protected]>, fix bug: aggregation of weights should occur in the AggServer Signed-off-by: JoeyHwong <[email protected]>, fix bug: Cloud worker not exiting Signed-off-by: JoeyHwong <[email protected]>, gm: refactor all features into independent dir All controllers are placed into globalmanager/controllers: 1. each feature has the independent subdirectory 2. upstream/downstream are kept as top level. Commom types/utils/worker.go are placed into globalmanager/runtime. Signed-off-by: llhuii <[email protected]>, fix PR comment - clean useless code - catch server exception in threads Signed-off-by: JoeyHwong <[email protected]>, gm: refactor upstream controller Split upstream controller, merge each feature CR logic code into its controller. Signed-off-by: llhuii <[email protected]>, gm: share client/Informer with all controllers Make all controllers sharing with: 1. kubernetes client, and informerFactory with random resync period. 2. sedna crd client, and informerFactory with random resync period. This can reduce code and improve slim performance. Signed-off-by: llhuii <[email protected]>, gm: add dataset controller Only handle dataset update from edge. Signed-off-by: llhuii <[email protected]>, Merge pull request kubeedge#106 from JoeyHwong-gk/federated fix federated learning bugs, gm: split all upstream logic into separate file Signed-off-by: llhuii <[email protected]>, gm: split all downstream logic into separate file Since all CR watch actions are placed into corresponding controller, controllers/downstream.go is unnecessary. Signed-off-by: llhuii <[email protected]>, LC: fix nil pointer dereference bug It happened in evalTask of incremental job when deploy model hasn't been synced to LC. evalTask should return error instead of logging error. And it doesn't need job id info into error, same as trainTask. Signed-off-by: JimmyYang20 <[email protected]>, Merge pull request kubeedge#139 from JimmyYang20/fixbug LC: fix nil pointer dereference bug, LC: send dataset update to GM only when changed number of samples has been sent to GM only when adding new data. Signed-off-by: JimmyYang20 <[email protected]>, Merge pull request kubeedge#137 from JimmyYang20/main LC: send dataset update to GM only when changed, lifelong learning s3 support - fix file_ops method - fix kb save bug Signed-off-by: JoeyHwong <[email protected]>, Add object search and tracking docs to docs/proposals/ Add object search and tracking crd samples to build/crd-samples/sedna/ Add object search and tracking type.go files to pkg/apis/sedna/v1alpha1/ Signed-off-by: EnfangCui <[email protected]>, Merge pull request kubeedge#100 from EnfangCui/add-multi-edge-inference-PR Add object search and tracking proposals, gm: more code clean after initial refactor done 1. remove the feature redundant name in all feature controllers(e.g. 'federatedlearningJob' to 'job'), since it has already own independent package, no need the feature extra name 2. upstream interface optimizaztion 3. fix empty Kind of all CR in downstream 4. add extra doc string 5. fix code style Signed-off-by: llhuii <[email protected]>, Fix the problem that kbimage cannot be compiled in Makefile Signed-off-by: wei.ji <[email protected]>, improve lifelong learning docs 1. improve the atc example words 2. fix the broken links in lifelong proposal Signed-off-by: JoeyHwong <[email protected]>, Merge pull request kubeedge#146 from Jw-Jm/main Fix the problem that kbimage cannot be compiled in Makefile, make the hard_example_mining alg to be a common interface Signed-off-by: JoeyHwong <[email protected]>, Merge pull request kubeedge#134 from llhuii/refactor-gm gm: decouple all features into independent package, Merge pull request kubeedge#107 from JoeyHwong-gk/incremental [incremental learning] example:keep all results whether is hardExample or not, fixed the issue of using s3 to save model, Merge pull request kubeedge#143 from JoeyHwong-gk/lldoc improve lifelong learning docs, fix example bug: save result which get from cloud if is hard example fix message when http connect fail Signed-off-by: JoeyHwong <[email protected]>, fix pr comment - make the hard_example_mining alg to be a common interface - fix get_hem_from_config: raise exception when value is unexpected Signed-off-by: JoeyHwong <[email protected]>, lc: decouple all features into independent package Signed-off-by: JimmyYang20 <[email protected]>, Merge pull request kubeedge#117 from JoeyHwong-gk/joint joint_inference: bug fix and interface reconstruction, Merge pull request kubeedge#149 from JimmyYang20/refector-lc lc: decouple all features into independent package, fix lifelong issue - Reduce parameters for initial - show all interfaces of lifelong learning in example - fix bugs from deep learning framework Signed-off-by: JoeyHwong <[email protected]>, fix il doc Signed-off-by: JimmyYang20 <[email protected]>, Merge pull request kubeedge#153 from JimmyYang20/fix-doc Fix rendering issue of example doc in readthedocs, Merge pull request kubeedge#142 from JoeyHwong-gk/lls3 lifelong learning: issue-driven interface-adjustment and bug fix, fix the lifelong example problem from backend and constant - fix sklearn backend: support args in train/eval/infer - fix lifelong constant Signed-off-by: JoeyHwong <[email protected]>, Automatic push images when publishing a release A github action is added for pushing image when a new release is created: 1. login docker hub. 2. checkout the project, and run `make push-all`. Signed-off-by: llhuii <[email protected]>, Merge pull request kubeedge#154 from JoeyHwong-gk/lifelong [Lifelong example]: fix the problem from backend and constant, docs: update install guide 1. add GM/LC links 2. add GM/LC deploy form Signed-off-by: llhuii <[email protected]>, Merge pull request kubeedge#155 from llhuii/add-image-push-gh-action Push images automatically when a new release is created, Merge pull request kubeedge#156 from llhuii/update-install-doc docs: update install guide, IL: LC supports to recover job when restart Signed-off-by: JimmyYang20 <[email protected]>, Merge pull request kubeedge#152 from JimmyYang20/fixbug IL: LC supports to recover job when restart, Fix IMAGE_REPO in github image-publish action Using the env 'GITHUB_REPOSITORY' instead of 'GITHUB_ACTOR' to get the right image repo name i.e. `IMAGE_REPO` in Makefile. Signed-off-by: llhuii <[email protected]>, add lib doc Signed-off-by: JoeyHwong <[email protected]>, Improve the docs Signed-off-by: JoeyHwong <[email protected]>, fix syntax and information in the docs Signed-off-by: JoeyHwong <[email protected]>, update lib doc Signed-off-by: JoeyHwong <[email protected]>, Update s3 example docs of IL&JI Signed-off-by: JimmyYang20 <[email protected]>, Support websocket reconnection Signed-off-by: JoeyHwong <[email protected]>, Lib support hot model update Signed-off-by: JoeyHwong <[email protected]>, Adjusting the Log of IncrementalLearning example Signed-off-by: JoeyHwong <[email protected]>, Fix codegen verify checker Note the codegen verify checker should report error Signed-off-by: llhuii <[email protected]>, Add the missing gencode for objectsearch/tracking Signed-off-by: llhuii <[email protected]>, fix job_kind value in LC_report Signed-off-by: JoeyHwong <[email protected]>, Update dependency of server request in lib - replace `retry==1.3.3` with `tenacity==8.0.1` because of `retry` no longer maintained. Signed-off-by: JoeyHwong <[email protected]>, Merge pull request kubeedge#158 from llhuii/fix-imagerepo-of-image-publish-action Fix IMAGE_REPO in github image-publish action, Merge pull request kubeedge#164 from JoeyHwong-gk/federated Support websocket reconnection when the server status is abnormal, Merge pull request kubeedge#166 from llhuii/fix-verify-checker Fix codegen verify checker, Add contributing docs Signed-off-by: llhuii <[email protected]>, Merge pull request kubeedge#159 from JoeyHwong-gk/libdoc update lib doc, Merge pull request kubeedge#148 from llhuii/add-contributing-docs Add contributing docs, Merge pull request kubeedge#160 from JimmyYang20/doc-s3 Update s3 example docs of IL&JI, fix access exceptions when rendering with sphinx Signed-off-by: JoeyHwong <[email protected]>, Merge pull request kubeedge#150 from JoeyHwong-gk/docs docs improvement, GM&LC: IL supports model hot updates Signed-off-by: JimmyYang20 <[email protected]>, fix pr comment Signed-off-by: JoeyHwong <[email protected]>, Merge pull request kubeedge#138 from JimmyYang20/modelhotupdate GM&LC: IL supports model hot updates, Fix s3 example docs of IL&JI Signed-off-by: JimmyYang20 <[email protected]>, Merge pull request kubeedge#174 from JimmyYang20/doc-s3 Fix s3 example docs of IL&JI, Merge pull request kubeedge#157 from JoeyHwong-gk/hot_model [Lib Support] hot model update, Upgrade gorilla/websocket from v1.4.0 to v1.4.2 This upgrade fixes a potential DoS vector bug in gorilla/websocket 1.4.0, see GHSA-jf24-p9p9-4rjh Signed-off-by: llhuii <[email protected]>, Merge pull request kubeedge#182 from llhuii/upgrade-websocket Upgrade gorilla/websocket from v1.4.0 to v1.4.2, fix lib/requirements - This upgrade fixes a CSRF error in FastAPI version earlier than 0.65.2, see GHSA-8h2j-cgx8-6xv7 Signed-off-by: JoeyHwong <[email protected]>, Merge pull request kubeedge#183 See merge request butterfly/sedna!9
What this PR does / why we need it:
When these are situations,such as the node being restarted,the LC will be restarted, then LC should recover incremental learning job.
Which issue(s) this PR fixes:
fix issue: #140