Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IL: LC supports to recover job when restart #152

Merged
merged 1 commit into from
Aug 13, 2021

Conversation

JimmyYang20
Copy link

@JimmyYang20 JimmyYang20 commented Aug 9, 2021

What this PR does / why we need it:
When these are situations,such as the node being restarted,the LC will be restarted, then LC should recover incremental learning job.

Which issue(s) this PR fixes:
fix issue: #140

@kubeedge-bot kubeedge-bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Aug 9, 2021

queryResult := dbClient.Where("name = ?", name).First(&r)
if queryResult.RowsAffected == 0 {
klog.Warningf("resource(name=%s) not in db", name)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. leave the log to the callers, because some callers may expect non-existence
  2. no Error return?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix it

@JimmyYang20 JimmyYang20 changed the title IL: lc supports to recover job when restart IL: LC supports to recover job when restart Aug 10, 2021
@ugvddm
Copy link

ugvddm commented Aug 12, 2021

please update your commit meesage, we need to describe this pr to make reviewer understand what you want to do.

@JimmyYang20
Copy link
Author

please update your commit meesage, we need to describe this pr to make reviewer underst

Upated pr message. thanks for your suggestion.

@kubeedge-bot kubeedge-bot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Aug 13, 2021
@JimmyYang20 JimmyYang20 force-pushed the fixbug branch 3 times, most recently from 2a20b2a to ca3a27a Compare August 13, 2021 06:39
@kubeedge-bot kubeedge-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Aug 13, 2021
return nil
}

// updateAnnotations updates job annotations
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inconsistent: updateAnnotations VS updateJobToDB

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updateJobToDB => saveJobToDB?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix it

@@ -322,8 +338,7 @@ func (im *Manager) startJob(name string) {
}

if err != nil {
klog.Errorf("job(name=%s) complete the %s task failed, error: %v",
jobConfig.UniqueIdentifier, jobStage, err)
klog.Errorf("job(name=%s) complete the %s task failed, error: %v", name, jobStage, err)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
klog.Errorf("job(name=%s) complete the %s task failed, error: %v", name, jobStage, err)
klog.Errorf("job(name=%s) failed to complete the %s task: %v", name, jobStage, err)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

if err := job.updateDeployModel(deployModel, trainedModel); err != nil {
return nil, err
if err := job.JobConfig.Storage.CopyFile(trainedModel, deployModel); err != nil {
return fmt.Errorf("copy model(url=%s) to the model(url=%s) failed, error: %+v",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return fmt.Errorf("copy model(url=%s) to the model(url=%s) failed, error: %+v",
return fmt.Errorf("failed to copy trained model(url=%s) to the deploy model(url=%s): %w",

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a nit: error wrap using %w.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

}

klog.Infof("job(name=%s) deploys model(url=%s) successfully", jobConfig.UniqueIdentifier, trainedModel)
klog.V(4).Infof("copy model(url=%s) to the model(url=%s) successfully", trainedModel, deployModel)
Copy link

@llhuii llhuii Aug 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
klog.V(4).Infof("copy model(url=%s) to the model(url=%s) successfully", trainedModel, deployModel)
klog.V(4).Infof("copy trained model(url=%s) to the deploy model(url=%s) successfully", trainedModel, deployModel)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

deployModel := jobConfig.EvalModels[1].URL
if job.Storage.IsLocalStorage {
// updateModel updates deploy model
func (im *Manager) updateModel(job *Job, trainedModel string, deployModel string) error {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about this?

Suggested change
func (im *Manager) updateModel(job *Job, trainedModel string, deployModel string) error {
func (im *Manager) updateDeployModelFile(job *Job, trainedModel string, deployModel string) error {

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@@ -687,7 +804,7 @@ func (job *Job) createOutputDir(jobConfig *JobConfig) error {

dirNames := []string{"data/train", "data/eval", "train", "eval"}

if job.Storage.IsLocalStorage {
if job.JobConfig.Storage.IsLocalStorage {
if err := util.CreateFolder(outputDir); err != nil {
klog.Errorf("job(name=%s) create fold %s failed", jobConfig.UniqueIdentifier, outputDir)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
klog.Errorf("job(name=%s) create fold %s failed", jobConfig.UniqueIdentifier, outputDir)
klog.Errorf("job(name=%s) failed to create folder %s: %v", jobConfig.UniqueIdentifier, outputDir, err)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

return err
}
if err != nil {
klog.Errorf("job(name=%s) complete the %sing phase triggering task failed, error: %v",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
klog.Errorf("job(name=%s) complete the %sing phase triggering task failed, error: %v",
klog.Errorf("job(name=%s) failed to trigger the %sing phase: %v",

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

credential := job.ObjectMeta.Annotations[runtime.SecretAnnotationKey]
if credential != "" {
if err := job.JobConfig.Storage.SetCredential(credential); err != nil {
return fmt.Errorf("failed to set job(name=%s)'s storage credential, error: %+v", name, err)
Copy link

@llhuii llhuii Aug 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return fmt.Errorf("failed to set job(name=%s)'s storage credential, error: %+v", name, err)
return fmt.Errorf("failed to set storage credential: %w", err)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. no job name in returned error.
  2. wrap error using %w.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix it

err = im.saveJobToDB(job)
if err != nil {
klog.Errorf("job(name=%s) failed to update job to db, error: %v",
jobConfig.UniqueIdentifier, err)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no return here instead of continuing since an error has been happend and logged? if intentional, add a comment here

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

like // continue anyway

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix

return err
err = im.loadDataset(job)
if err != nil || jobConfig.Dataset == nil || jobConfig.Dataset.DataSource == nil {
return fmt.Errorf("job(name=%s) failed to load dataset, and waiting it: %v",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a nic: error wrap using %w instead of %v, see https://blog.golang.org/go1.13-errors#TOC_3.3.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix

return nil
}

switch jobStage {
case sednav1.ILJobTrain:
return &Model{Format: models[1].Format, URL: models[1].URL}
// return two models.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

		// two models for deploy stage:
		// the first model is the output model of train stage, which was evaluated as better than the second model in eval stage .
		// the second model is the serving model used in the inference worker.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok ,fix it

@llhuii
Copy link

llhuii commented Aug 13, 2021

/lgtm

@kubeedge-bot kubeedge-bot added the lgtm Indicates that a PR is ready to be merged. label Aug 13, 2021
@llhuii
Copy link

llhuii commented Aug 13, 2021

/approve

@kubeedge-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: llhuii

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubeedge-bot kubeedge-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 13, 2021
@kubeedge-bot kubeedge-bot merged commit 8485c90 into kubeedge:main Aug 13, 2021
vcozzolino added a commit to vcozzolino/sedna that referenced this pull request Dec 3, 2021
Upgrade to v0.4.0

Created-by: Vittorio Cozzolino 00609018
Author-id: 553076
MR-id: 12361350
Commit-by: Vittorio Cozzolino;KubeEdge Bot;JoeyHwong;JimmyYang20;ShiXiaohou;DanLiu;HenryChou;Yutong Wang;EnfangCui;llhuii;XinYao1994;Jie Pu;wei.ji
Merged-by: Vittorio Cozzolino 00609018
E2E-issues: 
Description:
fix incremental_learning bug
- Add docs and code comment
- fix bugs: epoch always be 1, inference result not saved, s3 upload
  fail

Signed-off-by: JoeyHwong <[email protected]>,
Interface Improvement:
   1. The algorithm of HardExampleMining should be seleted by the developer.

Signed-off-by: JoeyHwong <[email protected]>,
install.sh: fix LC_BIND_PORT bug

rename the variable LC_BIND_PORT to SENDA_LC_BIND_PORT.

Signed-off-by: llhuii <[email protected]>,
Merge pull request kubeedge#121 from llhuii/fix-install-script-bug

install.sh: fix LC_BIND_PORT bug,
Update interface.py

fix env missing bug

Signed-off-by: JoeyHwong <[email protected]>,
fix bug: aggregation of weights should occur in the AggServer

Signed-off-by: JoeyHwong <[email protected]>,
fix bug: Cloud worker not exiting

Signed-off-by: JoeyHwong <[email protected]>,
gm: refactor all features into independent dir

All controllers are placed into globalmanager/controllers:
1. each feature has the independent subdirectory
2. upstream/downstream are kept as top level.

Commom types/utils/worker.go are placed into globalmanager/runtime.

Signed-off-by: llhuii <[email protected]>,
fix PR comment
- clean useless code
- catch server exception in threads

Signed-off-by: JoeyHwong <[email protected]>,
gm: refactor upstream controller

Split upstream controller, merge each feature CR logic code into its
controller.

Signed-off-by: llhuii <[email protected]>,
gm: share client/Informer with all controllers

Make all controllers sharing with:
1. kubernetes client, and informerFactory with random resync period.
2. sedna crd client, and informerFactory with random resync period.

This can reduce code and improve slim performance.

Signed-off-by: llhuii <[email protected]>,
gm: add dataset controller

Only handle dataset update from edge.

Signed-off-by: llhuii <[email protected]>,
Merge pull request kubeedge#106 from JoeyHwong-gk/federated

fix federated learning bugs,
gm: split all upstream logic into separate file

Signed-off-by: llhuii <[email protected]>,
gm: split all downstream logic into separate file

Since all CR watch actions are placed into corresponding controller,
controllers/downstream.go is unnecessary.

Signed-off-by: llhuii <[email protected]>,
LC: fix nil pointer dereference bug

It happened in evalTask of incremental job when deploy model hasn't
been synced to LC. evalTask should return error instead of logging
error. And it doesn't need job id info into error, same as trainTask.

Signed-off-by: JimmyYang20 <[email protected]>,
Merge pull request kubeedge#139 from JimmyYang20/fixbug

LC: fix nil pointer dereference bug,
LC: send dataset update to GM only when changed

number of samples has been sent to GM only when adding new data.

Signed-off-by: JimmyYang20 <[email protected]>,
Merge pull request kubeedge#137 from JimmyYang20/main

LC: send dataset update to GM only when changed,
lifelong learning s3 support
- fix file_ops method
- fix kb save bug

Signed-off-by: JoeyHwong <[email protected]>,
Add object search and tracking docs to docs/proposals/

Add object search and tracking crd samples to build/crd-samples/sedna/

Add object search and tracking type.go files to pkg/apis/sedna/v1alpha1/

Signed-off-by: EnfangCui <[email protected]>,
Merge pull request kubeedge#100 from EnfangCui/add-multi-edge-inference-PR

Add object search and tracking proposals,
gm: more code clean after initial refactor done

1. remove the feature redundant name in all feature controllers(e.g.
'federatedlearningJob' to 'job'), since it has already own independent
package, no need the feature extra name
2. upstream interface optimizaztion
3. fix empty Kind of all CR in downstream
4. add extra doc string
5. fix code style

Signed-off-by: llhuii <[email protected]>,
Fix the problem that kbimage cannot be compiled in Makefile

Signed-off-by: wei.ji <[email protected]>,
improve lifelong learning docs

1. improve the atc example words
2. fix the broken links in lifelong proposal

Signed-off-by: JoeyHwong <[email protected]>,
Merge pull request kubeedge#146 from Jw-Jm/main

Fix the problem that kbimage cannot be compiled in Makefile,
make the hard_example_mining alg to be a common interface

Signed-off-by: JoeyHwong <[email protected]>,
Merge pull request kubeedge#134 from llhuii/refactor-gm

gm: decouple all features into independent package,
Merge pull request kubeedge#107 from JoeyHwong-gk/incremental

[incremental learning] example:keep all  results whether is hardExample or not, fixed the issue of using s3 to save model,
Merge pull request kubeedge#143 from JoeyHwong-gk/lldoc

improve lifelong learning docs,
fix example bug: save result which get from cloud if is hard example

fix message when http connect fail

Signed-off-by: JoeyHwong <[email protected]>,
fix pr comment
- make the hard_example_mining alg to be a common interface
- fix get_hem_from_config: raise exception when value is unexpected

Signed-off-by: JoeyHwong <[email protected]>,
lc: decouple all features into independent package

Signed-off-by: JimmyYang20 <[email protected]>,
Merge pull request kubeedge#117 from JoeyHwong-gk/joint

joint_inference: bug fix and interface reconstruction,
Merge pull request kubeedge#149 from JimmyYang20/refector-lc

lc: decouple all features into independent package,
fix lifelong issue
- Reduce parameters for initial
- show all interfaces of lifelong learning in example
- fix bugs from deep learning framework

Signed-off-by: JoeyHwong <[email protected]>,
fix il doc

Signed-off-by: JimmyYang20 <[email protected]>,
Merge pull request kubeedge#153 from JimmyYang20/fix-doc

Fix rendering issue of example doc in readthedocs,
Merge pull request kubeedge#142 from JoeyHwong-gk/lls3

lifelong learning: issue-driven interface-adjustment and bug fix,
fix the lifelong example problem from backend and constant

- fix sklearn backend: support args in train/eval/infer
- fix lifelong constant

Signed-off-by: JoeyHwong <[email protected]>,
Automatic push images when publishing a release

A github action is added for pushing image when a new release is created:
1. login docker hub.
2. checkout the project, and run `make push-all`.

Signed-off-by: llhuii <[email protected]>,
Merge pull request kubeedge#154 from JoeyHwong-gk/lifelong

[Lifelong example]: fix the problem from backend and constant,
docs: update install guide

1. add GM/LC links
2. add GM/LC deploy form

Signed-off-by: llhuii <[email protected]>,
Merge pull request kubeedge#155 from llhuii/add-image-push-gh-action

Push images automatically when a new release is created,
Merge pull request kubeedge#156 from llhuii/update-install-doc

docs: update install guide,
IL: LC supports to recover job when restart

Signed-off-by: JimmyYang20 <[email protected]>,
Merge pull request kubeedge#152 from JimmyYang20/fixbug

IL: LC supports to recover job when restart,
Fix IMAGE_REPO in github image-publish action

Using the env 'GITHUB_REPOSITORY' instead of 'GITHUB_ACTOR' to get the right
image repo name i.e. `IMAGE_REPO` in Makefile.

Signed-off-by: llhuii <[email protected]>,
add lib doc

Signed-off-by: JoeyHwong <[email protected]>,
Improve the docs

Signed-off-by: JoeyHwong <[email protected]>,
fix syntax and information in the docs
Signed-off-by: JoeyHwong <[email protected]>,
update lib doc

Signed-off-by: JoeyHwong <[email protected]>,
Update s3 example docs of IL&JI

Signed-off-by: JimmyYang20 <[email protected]>,
Support websocket reconnection

Signed-off-by: JoeyHwong <[email protected]>,
Lib support hot model update

Signed-off-by: JoeyHwong <[email protected]>,
Adjusting the Log of IncrementalLearning example

Signed-off-by: JoeyHwong <[email protected]>,
Fix codegen verify checker

Note the codegen verify checker should report error

Signed-off-by: llhuii <[email protected]>,
Add the missing gencode for objectsearch/tracking

Signed-off-by: llhuii <[email protected]>,
fix job_kind value in LC_report

Signed-off-by: JoeyHwong <[email protected]>,
Update dependency of server request in lib
- replace `retry==1.3.3` with `tenacity==8.0.1` because of `retry` no longer maintained.

Signed-off-by: JoeyHwong <[email protected]>,
Merge pull request kubeedge#158 from llhuii/fix-imagerepo-of-image-publish-action

Fix IMAGE_REPO in github image-publish action,
Merge pull request kubeedge#164 from JoeyHwong-gk/federated

Support websocket reconnection when the server status is abnormal,
Merge pull request kubeedge#166 from llhuii/fix-verify-checker

Fix codegen verify checker,
Add contributing docs

Signed-off-by: llhuii <[email protected]>,
Merge pull request kubeedge#159 from JoeyHwong-gk/libdoc

update lib doc,
Merge pull request kubeedge#148 from llhuii/add-contributing-docs

Add contributing docs,
Merge pull request kubeedge#160 from JimmyYang20/doc-s3

Update s3 example docs of IL&JI,
fix access exceptions when rendering with sphinx

Signed-off-by: JoeyHwong <[email protected]>,
Merge pull request kubeedge#150 from JoeyHwong-gk/docs

docs improvement,
GM&LC: IL supports model hot updates

Signed-off-by: JimmyYang20 <[email protected]>,
fix pr comment

Signed-off-by: JoeyHwong <[email protected]>,
Merge pull request kubeedge#138 from JimmyYang20/modelhotupdate

GM&LC: IL supports model hot updates,
Fix s3 example docs of IL&JI

Signed-off-by: JimmyYang20 <[email protected]>,
Merge pull request kubeedge#174 from JimmyYang20/doc-s3

Fix s3 example docs of IL&JI,
Merge pull request kubeedge#157 from JoeyHwong-gk/hot_model

[Lib Support] hot model update,
Upgrade gorilla/websocket from v1.4.0 to v1.4.2

This upgrade fixes a potential DoS vector bug in gorilla/websocket 1.4.0,
see GHSA-jf24-p9p9-4rjh

Signed-off-by: llhuii <[email protected]>,
Merge pull request kubeedge#182 from llhuii/upgrade-websocket

Upgrade gorilla/websocket from v1.4.0 to v1.4.2,
fix lib/requirements
- This upgrade fixes a CSRF error in FastAPI version earlier than 0.65.2,
see GHSA-8h2j-cgx8-6xv7

Signed-off-by: JoeyHwong <[email protected]>,
Merge pull request kubeedge#183

See merge request butterfly/sedna!9
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants