New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

IL: LC supports to recover job when restart #152

Merged

kubeedge-bot merged 1 commit into kubeedge:main from JimmyYang20:fixbug

Aug 13, 2021

JimmyYang20 commented Aug 9, 2021 •

edited

Loading

What this PR does / why we need it:
When these are situations，such as the node being restarted，the LC will be restarted, then LC should recover incremental learning job.

Which issue(s) this PR fixes:
fix issue: #140

kubeedge-bot requested review from jaypume and TymonXie

August 9, 2021 02:28

kubeedge-bot added the size/XL label

JimmyYang20 force-pushed the fixbug branch from 3a6d10d to 2cfe1b9 Compare

August 9, 2021 02:30

llhuii reviewed

View reviewed changes

pkg/localcontroller/db/db.go Outdated

+              	queryResult := dbClient.Where("name = ?", name).First(&r)
+              	if queryResult.RowsAffected == 0 {
+              		klog.Warningf("resource(name=%s) not in db", name)

llhuii Aug 9, 2021

leave the log to the callers, because some callers may expect non-existence
no Error return?

Author

JimmyYang20 Aug 13, 2021

fix it

JimmyYang20 force-pushed the fixbug branch from 2cfe1b9 to 2090386 Compare

August 10, 2021 08:53

JimmyYang20 changed the title ~~IL: lc supports to recover job when restart~~ IL: LC supports to recover job when restart

ugvddm commented Aug 12, 2021

please update your commit meesage, we need to describe this pr to make reviewer understand what you want to do.

Author

JimmyYang20 commented Aug 13, 2021

please update your commit meesage, we need to describe this pr to make reviewer underst

Upated pr message. thanks for your suggestion.

JimmyYang20 force-pushed the fixbug branch from 2090386 to bdd352c Compare

August 13, 2021 01:49

kubeedge-bot added needs-rebase size/XXL and removed size/XL labels

JimmyYang20 force-pushed the fixbug branch 3 times, most recently from 2a20b2a to ca3a27a Compare

August 13, 2021 06:39

kubeedge-bot added size/XL and removed needs-rebase size/XXL labels

llhuii reviewed

View reviewed changes

pkg/localcontroller/managers/incrementallearning/incrementallearningjob.go Outdated

+              	return nil
+              }
+              // updateAnnotations updates job annotations

llhuii Aug 13, 2021

inconsistent: updateAnnotations VS updateJobToDB

llhuii Aug 13, 2021

updateJobToDB => saveJobToDB?

Author

JimmyYang20 Aug 13, 2021

fix it

llhuii reviewed

View reviewed changes

pkg/localcontroller/managers/incrementallearning/incrementallearningjob.go Outdated

@@ @@ -322,8 +338,7 @@ func (im *Manager) startJob(name string) { @@
               		}
               		if err != nil {
-              			klog.Errorf("job(name=%s) complete the %s task failed, error: %v",
-              				jobConfig.UniqueIdentifier, jobStage, err)
+              			klog.Errorf("job(name=%s) complete the %s task failed, error: %v", name, jobStage, err)

llhuii Aug 13, 2021

Suggested change

      
            			klog.Errorf("job(name=%s) complete the %s task failed, error: %v", name, jobStage, err)
          
            			klog.Errorf("job(name=%s) failed to complete the %s task: %v", name, jobStage, err)

Author

JimmyYang20 Aug 13, 2021

ok

llhuii reviewed

View reviewed changes

pkg/localcontroller/managers/incrementallearning/incrementallearningjob.go Outdated

-              	if err := job.updateDeployModel(deployModel, trainedModel); err != nil {
-              		return nil, err
+              	if err := job.JobConfig.Storage.CopyFile(trainedModel, deployModel); err != nil {
+              		return fmt.Errorf("copy model(url=%s) to the model(url=%s) failed, error: %+v",

llhuii Aug 13, 2021

Suggested change

      
            		return fmt.Errorf("copy model(url=%s) to the model(url=%s) failed, error: %+v",
          
            		return fmt.Errorf("failed to copy trained model(url=%s) to the deploy model(url=%s): %w",

llhuii Aug 13, 2021

a nit: error wrap using %w.

Author

JimmyYang20 Aug 13, 2021

ok

llhuii reviewed

View reviewed changes

pkg/localcontroller/managers/incrementallearning/incrementallearningjob.go Outdated

    
              	}

              	klog.Infof("job(name=%s) deploys model(url=%s) successfully", jobConfig.UniqueIdentifier, trainedModel)

              	klog.V(4).Infof("copy model(url=%s) to the model(url=%s) successfully", trainedModel, deployModel)

llhuii Aug 13, 2021 •

edited

Loading

Suggested change

      
            	klog.V(4).Infof("copy model(url=%s) to the model(url=%s) successfully", trainedModel, deployModel)
          
            	klog.V(4).Infof("copy trained model(url=%s) to the deploy model(url=%s) successfully", trainedModel, deployModel)

Author

JimmyYang20 Aug 13, 2021

ok

llhuii reviewed

View reviewed changes

pkg/localcontroller/managers/incrementallearning/incrementallearningjob.go Outdated

-              	deployModel := jobConfig.EvalModels[1].URL
-              	if job.Storage.IsLocalStorage {
+              // updateModel updates deploy model
+              func (im *Manager) updateModel(job *Job, trainedModel string, deployModel string) error {

llhuii Aug 13, 2021

how about this?

Suggested change

      
            func (im *Manager) updateModel(job *Job, trainedModel string, deployModel string) error {
          
            func (im *Manager) updateDeployModelFile(job *Job, trainedModel string, deployModel string) error {

Author

JimmyYang20 Aug 13, 2021

ok

llhuii reviewed

View reviewed changes

pkg/localcontroller/managers/incrementallearning/incrementallearningjob.go Outdated

               	dirNames := []string{"data/train", "data/eval", "train", "eval"}
-              	if job.Storage.IsLocalStorage {
+              	if job.JobConfig.Storage.IsLocalStorage {
               		if err := util.CreateFolder(outputDir); err != nil {
               			klog.Errorf("job(name=%s) create fold %s failed", jobConfig.UniqueIdentifier, outputDir)

llhuii Aug 13, 2021

Suggested change

      
            			klog.Errorf("job(name=%s) create fold %s failed", jobConfig.UniqueIdentifier, outputDir)
          
            			klog.Errorf("job(name=%s) failed to create folder %s: %v", jobConfig.UniqueIdentifier, outputDir, err)

Author

JimmyYang20 Aug 13, 2021

ok

llhuii reviewed

View reviewed changes

pkg/localcontroller/managers/incrementallearning/incrementallearningjob.go Outdated

-              			return err
-              		}
+              			if err != nil {
+              				klog.Errorf("job(name=%s) complete the %sing phase triggering task failed, error: %v",

llhuii Aug 13, 2021

Suggested change

      
            				klog.Errorf("job(name=%s) complete the %sing phase triggering task failed, error: %v",
          
            				klog.Errorf("job(name=%s) failed to trigger the %sing phase: %v",

Author

JimmyYang20 Aug 13, 2021

ok

llhuii reviewed

View reviewed changes

pkg/localcontroller/managers/incrementallearning/incrementallearningjob.go Outdated

+              	credential := job.ObjectMeta.Annotations[runtime.SecretAnnotationKey]
+              	if credential != "" {
+              		if err := job.JobConfig.Storage.SetCredential(credential); err != nil {
+              			return fmt.Errorf("failed to set job(name=%s)'s storage credential, error: %+v", name, err)

llhuii Aug 13, 2021 •

edited

Loading

Suggested change

      
            			return fmt.Errorf("failed to set job(name=%s)'s storage credential, error: %+v", name, err)
          
            			return fmt.Errorf("failed to set storage credential: %w", err)

llhuii Aug 13, 2021

no job name in returned error.
wrap error using %w.

Author

JimmyYang20 Aug 13, 2021

fix it

JimmyYang20 force-pushed the fixbug branch from ca3a27a to 8ea7e8b Compare

August 13, 2021 09:18

llhuii reviewed

View reviewed changes

pkg/localcontroller/managers/incrementallearning/incrementallearningjob.go

+              			err = im.saveJobToDB(job)
+              			if err != nil {
+              				klog.Errorf("job(name=%s) failed to update job to db, error: %v",
+              					jobConfig.UniqueIdentifier, err)

llhuii Aug 13, 2021

no return here instead of continuing since an error has been happend and logged? if intentional, add a comment here

llhuii Aug 13, 2021

like // continue anyway

Author

JimmyYang20 Aug 13, 2021

fix

llhuii reviewed

View reviewed changes

pkg/localcontroller/managers/incrementallearning/incrementallearningjob.go Outdated

-              			return err
+              		err = im.loadDataset(job)
+              		if err != nil || jobConfig.Dataset == nil || jobConfig.Dataset.DataSource == nil {
+              			return fmt.Errorf("job(name=%s) failed to load dataset, and waiting it: %v",

llhuii Aug 13, 2021

a nic: error wrap using %w instead of %v, see https://blog.golang.org/go1.13-errors#TOC_3.3.

Author

JimmyYang20 Aug 13, 2021

fix

llhuii reviewed

View reviewed changes

pkg/localcontroller/managers/incrementallearning/incrementallearningjob.go Outdated

               		return nil
               	}
               	switch jobStage {
               	case sednav1.ILJobTrain:
-              		return &Model{Format: models[1].Format, URL: models[1].URL}
+              		// return two models.

llhuii Aug 13, 2021

		// two models for deploy stage:
		// the first model is the output model of train stage, which was evaluated as better than the second model in eval stage .
		// the second model is the serving model used in the inference worker.

Author

JimmyYang20 Aug 13, 2021

ok ,fix it

JimmyYang20 force-pushed the fixbug branch from 8ea7e8b to 0dfd1eb Compare

August 13, 2021 10:15


          IL: LC supports to recover job when restart

bf39a7c

Signed-off-by: JimmyYang20 <[email protected]>

JimmyYang20 force-pushed the fixbug branch from 0dfd1eb to bf39a7c Compare

August 13, 2021 10:20

llhuii commented Aug 13, 2021

/lgtm

kubeedge-bot assigned llhuii

kubeedge-bot added the lgtm label

llhuii commented Aug 13, 2021

/approve

Collaborator

kubeedge-bot commented Aug 13, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: llhuii

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/localcontroller/OWNERS~~ [llhuii]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

kubeedge-bot added the approved label

kubeedge-bot merged commit 8485c90 into kubeedge:main

llhuii mentioned this pull request

LC can't recover forever when the corresponding node restarted #140

Closed

vcozzolino added a commit to vcozzolino/sedna that referenced this pull request


          merge 'latest_github' into 'main'

419fa2a

Upgrade to v0.4.0

Created-by: Vittorio Cozzolino 00609018
Author-id: 553076
MR-id: 12361350
Commit-by: Vittorio Cozzolino;KubeEdge Bot;JoeyHwong;JimmyYang20;ShiXiaohou;DanLiu;HenryChou;Yutong Wang;EnfangCui;llhuii;XinYao1994;Jie Pu;wei.ji
Merged-by: Vittorio Cozzolino 00609018
E2E-issues: 
Description:
fix incremental_learning bug
- Add docs and code comment
- fix bugs: epoch always be 1, inference result not saved, s3 upload
  fail

Signed-off-by: JoeyHwong <[email protected]>,
Interface Improvement:
   1. The algorithm of HardExampleMining should be seleted by the developer.

Signed-off-by: JoeyHwong <[email protected]>,
install.sh: fix LC_BIND_PORT bug

rename the variable LC_BIND_PORT to SENDA_LC_BIND_PORT.

Signed-off-by: llhuii <[email protected]>,
Merge pull request kubeedge#121 from llhuii/fix-install-script-bug

install.sh: fix LC_BIND_PORT bug,
Update interface.py

fix env missing bug

Signed-off-by: JoeyHwong <[email protected]>,
fix bug: aggregation of weights should occur in the AggServer

Signed-off-by: JoeyHwong <[email protected]>,
fix bug: Cloud worker not exiting

Signed-off-by: JoeyHwong <[email protected]>,
gm: refactor all features into independent dir

All controllers are placed into globalmanager/controllers:
1. each feature has the independent subdirectory
2. upstream/downstream are kept as top level.

Commom types/utils/worker.go are placed into globalmanager/runtime.

Signed-off-by: llhuii <[email protected]>,
fix PR comment
- clean useless code
- catch server exception in threads

Signed-off-by: JoeyHwong <[email protected]>,
gm: refactor upstream controller

Split upstream controller, merge each feature CR logic code into its
controller.

Signed-off-by: llhuii <[email protected]>,
gm: share client/Informer with all controllers

Make all controllers sharing with:
1. kubernetes client, and informerFactory with random resync period.
2. sedna crd client, and informerFactory with random resync period.

This can reduce code and improve slim performance.

Signed-off-by: llhuii <[email protected]>,
gm: add dataset controller

Only handle dataset update from edge.

Signed-off-by: llhuii <[email protected]>,
Merge pull request kubeedge#106 from JoeyHwong-gk/federated

fix federated learning bugs,
gm: split all upstream logic into separate file

Signed-off-by: llhuii <[email protected]>,
gm: split all downstream logic into separate file

Since all CR watch actions are placed into corresponding controller,
controllers/downstream.go is unnecessary.

Signed-off-by: llhuii <[email protected]>,
LC: fix nil pointer dereference bug

It happened in evalTask of incremental job when deploy model hasn't
been synced to LC. evalTask should return error instead of logging
error. And it doesn't need job id info into error, same as trainTask.

Signed-off-by: JimmyYang20 <[email protected]>,
Merge pull request kubeedge#139 from JimmyYang20/fixbug

LC: fix nil pointer dereference bug,
LC: send dataset update to GM only when changed

number of samples has been sent to GM only when adding new data.

Signed-off-by: JimmyYang20 <[email protected]>,
Merge pull request kubeedge#137 from JimmyYang20/main

LC: send dataset update to GM only when changed,
lifelong learning s3 support
- fix file_ops method
- fix kb save bug

Signed-off-by: JoeyHwong <[email protected]>,
Add object search and tracking docs to docs/proposals/

Add object search and tracking crd samples to build/crd-samples/sedna/

Add object search and tracking type.go files to pkg/apis/sedna/v1alpha1/

Signed-off-by: EnfangCui <[email protected]>,
Merge pull request kubeedge#100 from EnfangCui/add-multi-edge-inference-PR

Add object search and tracking proposals,
gm: more code clean after initial refactor done

1. remove the feature redundant name in all feature controllers(e.g.
'federatedlearningJob' to 'job'), since it has already own independent
package, no need the feature extra name
2. upstream interface optimizaztion
3. fix empty Kind of all CR in downstream
4. add extra doc string
5. fix code style

Signed-off-by: llhuii <[email protected]>,
Fix the problem that kbimage cannot be compiled in Makefile

Signed-off-by: wei.ji <[email protected]>,
improve lifelong learning docs

1. improve the atc example words
2. fix the broken links in lifelong proposal

Signed-off-by: JoeyHwong <[email protected]>,
Merge pull request kubeedge#146 from Jw-Jm/main

Fix the problem that kbimage cannot be compiled in Makefile,
make the hard_example_mining alg to be a common interface

Signed-off-by: JoeyHwong <[email protected]>,
Merge pull request kubeedge#134 from llhuii/refactor-gm

gm: decouple all features into independent package,
Merge pull request kubeedge#107 from JoeyHwong-gk/incremental

[incremental learning] example:keep all  results whether is hardExample or not, fixed the issue of using s3 to save model,
Merge pull request kubeedge#143 from JoeyHwong-gk/lldoc

improve lifelong learning docs,
fix example bug: save result which get from cloud if is hard example

fix message when http connect fail

Signed-off-by: JoeyHwong <[email protected]>,
fix pr comment
- make the hard_example_mining alg to be a common interface
- fix get_hem_from_config: raise exception when value is unexpected

Signed-off-by: JoeyHwong <[email protected]>,
lc: decouple all features into independent package

Signed-off-by: JimmyYang20 <[email protected]>,
Merge pull request kubeedge#117 from JoeyHwong-gk/joint

joint_inference: bug fix and interface reconstruction,
Merge pull request kubeedge#149 from JimmyYang20/refector-lc

lc: decouple all features into independent package,
fix lifelong issue
- Reduce parameters for initial
- show all interfaces of lifelong learning in example
- fix bugs from deep learning framework

Signed-off-by: JoeyHwong <[email protected]>,
fix il doc

Signed-off-by: JimmyYang20 <[email protected]>,
Merge pull request kubeedge#153 from JimmyYang20/fix-doc

Fix rendering issue of example doc in readthedocs,
Merge pull request kubeedge#142 from JoeyHwong-gk/lls3

lifelong learning: issue-driven interface-adjustment and bug fix,
fix the lifelong example problem from backend and constant

- fix sklearn backend: support args in train/eval/infer
- fix lifelong constant

Signed-off-by: JoeyHwong <[email protected]>,
Automatic push images when publishing a release

A github action is added for pushing image when a new release is created:
1. login docker hub.
2. checkout the project, and run `make push-all`.

Signed-off-by: llhuii <[email protected]>,
Merge pull request kubeedge#154 from JoeyHwong-gk/lifelong

[Lifelong example]: fix the problem from backend and constant,
docs: update install guide

1. add GM/LC links
2. add GM/LC deploy form

Signed-off-by: llhuii <[email protected]>,
Merge pull request kubeedge#155 from llhuii/add-image-push-gh-action

Push images automatically when a new release is created,
Merge pull request kubeedge#156 from llhuii/update-install-doc

docs: update install guide,
IL: LC supports to recover job when restart

Signed-off-by: JimmyYang20 <[email protected]>,
Merge pull request kubeedge#152 from JimmyYang20/fixbug

IL: LC supports to recover job when restart,
Fix IMAGE_REPO in github image-publish action

Using the env 'GITHUB_REPOSITORY' instead of 'GITHUB_ACTOR' to get the right
image repo name i.e. `IMAGE_REPO` in Makefile.

Signed-off-by: llhuii <[email protected]>,
add lib doc

Signed-off-by: JoeyHwong <[email protected]>,
Improve the docs

Signed-off-by: JoeyHwong <[email protected]>,
fix syntax and information in the docs
Signed-off-by: JoeyHwong <[email protected]>,
update lib doc

Signed-off-by: JoeyHwong <[email protected]>,
Update s3 example docs of IL&JI

Signed-off-by: JimmyYang20 <[email protected]>,
Support websocket reconnection

Signed-off-by: JoeyHwong <[email protected]>,
Lib support hot model update

Signed-off-by: JoeyHwong <[email protected]>,
Adjusting the Log of IncrementalLearning example

Signed-off-by: JoeyHwong <[email protected]>,
Fix codegen verify checker

Note the codegen verify checker should report error

Signed-off-by: llhuii <[email protected]>,
Add the missing gencode for objectsearch/tracking

Signed-off-by: llhuii <[email protected]>,
fix job_kind value in LC_report

Signed-off-by: JoeyHwong <[email protected]>,
Update dependency of server request in lib
- replace `retry==1.3.3` with `tenacity==8.0.1` because of `retry` no longer maintained.

Signed-off-by: JoeyHwong <[email protected]>,
Merge pull request kubeedge#158 from llhuii/fix-imagerepo-of-image-publish-action

Fix IMAGE_REPO in github image-publish action,
Merge pull request kubeedge#164 from JoeyHwong-gk/federated

Support websocket reconnection when the server status is abnormal,
Merge pull request kubeedge#166 from llhuii/fix-verify-checker

Fix codegen verify checker,
Add contributing docs

Signed-off-by: llhuii <[email protected]>,
Merge pull request kubeedge#159 from JoeyHwong-gk/libdoc

update lib doc,
Merge pull request kubeedge#148 from llhuii/add-contributing-docs

Add contributing docs,
Merge pull request kubeedge#160 from JimmyYang20/doc-s3

Update s3 example docs of IL&JI,
fix access exceptions when rendering with sphinx

Signed-off-by: JoeyHwong <[email protected]>,
Merge pull request kubeedge#150 from JoeyHwong-gk/docs

docs improvement,
GM&LC: IL supports model hot updates

Signed-off-by: JimmyYang20 <[email protected]>,
fix pr comment

Signed-off-by: JoeyHwong <[email protected]>,
Merge pull request kubeedge#138 from JimmyYang20/modelhotupdate

GM&LC: IL supports model hot updates,
Fix s3 example docs of IL&JI

Signed-off-by: JimmyYang20 <[email protected]>,
Merge pull request kubeedge#174 from JimmyYang20/doc-s3

Fix s3 example docs of IL&JI,
Merge pull request kubeedge#157 from JoeyHwong-gk/hot_model

[Lib Support] hot model update,
Upgrade gorilla/websocket from v1.4.0 to v1.4.2

This upgrade fixes a potential DoS vector bug in gorilla/websocket 1.4.0,
see GHSA-jf24-p9p9-4rjh

Signed-off-by: llhuii <[email protected]>,
Merge pull request kubeedge#182 from llhuii/upgrade-websocket

Upgrade gorilla/websocket from v1.4.0 to v1.4.2,
fix lib/requirements
- This upgrade fixes a CSRF error in FastAPI version earlier than 0.65.2,
see GHSA-8h2j-cgx8-6xv7

Signed-off-by: JoeyHwong <[email protected]>,
Merge pull request kubeedge#183

See merge request butterfly/sedna!9

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved lgtm size/XL