[Incremental learning] Support training/evaluation on different nodes #7

Open
llhuii opened this issue Jun 7, 2021 · 0 comments

llhuii commented Jun 7, 2021

PlantUML sequence diagram

@startuml
'https://plantuml.com/sequence-diagram

'autonumber
actor User
participant  "K8S API" as API
participant GM

participant  "LC at dataset-node" as LC0
participant  "LC at train-node" as LC1

participant  "LC at eval-node" as LC2



GM -> API: list/watch dataset / incremental job

User -> API: Create a dataset with:\n1. an S3 URL\n2. nodeName: dataset-node
API --> User:

GM -> LC0 : sync the dataset info to the LC located in dataset-node
LC0 -> LC0 : monitor the dataset and update the dataset's status

User -> API: Create an incremental job with:\n \
              1. train worker spec with train-node\n \
              2. eval worker spec with eval-node\n \
              3. infer worker spec with nodeSelector 
API --> User:

API --> GM: watched new job

GM -> API: create infer-worker

loop incremental training
  GM -> API: set the job state to train-waiting
  GM -> LC0: sync the job info
  loop train-trigger is not satisfied
    LC0 -> LC0: append any new incremental samples to the job
  end

  LC0 -> GM: train-trigger satisfied,\ntransition the job state to train-ready
  GM -> API: create train-worker
  GM -> LC1: sync the job info
  LC1 -> GM: get the message from train-worker,\ntransition the job state to eval-ready

  GM -> API: create eval-worker
  GM -> LC2: sync the job info

  LC2 -> LC2: handle the eval result
  alt deploy-trigger is satisfied
    LC2 -> LC2: update the deploy model,\ntransition the job state to deploy-ready
    GM -> API: restart the infer-worker (cold model-update)

  else deploy-trigger is not satisfied
    LC2 -> GM: transition the job state to no-deploy
  end
end

@enduml
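
The job states named in the diagram can also be read as a small state machine. Below is a minimal Python sketch of that reading, not the project's actual implementation. The state names come from the diagram. Everything else is hypothetical and only for illustration: `JobState`, `TRANSITIONS`, `advance`, and `run_round`, plus the assumption that both `deploy-ready` and `no-deploy` return to `train-waiting` at the start of the next loop iteration.

```python
from enum import Enum


class JobState(Enum):
    """Job states named in the sequence diagram."""
    TRAIN_WAITING = "train-waiting"
    TRAIN_READY = "train-ready"
    EVAL_READY = "eval-ready"
    DEPLOY_READY = "deploy-ready"
    NO_DEPLOY = "no-deploy"


# Allowed transitions as read off the diagram. The two edges back to
# TRAIN_WAITING are an assumption: the loop restarts by setting train-waiting.
TRANSITIONS = {
    JobState.TRAIN_WAITING: {JobState.TRAIN_READY},
    JobState.TRAIN_READY: {JobState.EVAL_READY},
    JobState.EVAL_READY: {JobState.DEPLOY_READY, JobState.NO_DEPLOY},
    JobState.DEPLOY_READY: {JobState.TRAIN_WAITING},
    JobState.NO_DEPLOY: {JobState.TRAIN_WAITING},
}


def advance(current: JobState, target: JobState) -> JobState:
    """Return target if the transition is allowed, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


def run_round(train_triggered: bool, deploy_triggered: bool) -> list:
    """Walk one iteration of the incremental-training loop (hypothetical helper).

    Returns the list of states visited, starting at train-waiting.
    """
    state = JobState.TRAIN_WAITING
    visited = [state]
    if not train_triggered:
        # The LC on the dataset node keeps collecting samples; no state change.
        return visited
    for target in (JobState.TRAIN_READY, JobState.EVAL_READY):
        state = advance(state, target)
        visited.append(state)
    final = JobState.DEPLOY_READY if deploy_triggered else JobState.NO_DEPLOY
    visited.append(advance(state, final))
    return visited


if __name__ == "__main__":
    print([s.value for s in run_round(train_triggered=True, deploy_triggered=False)])
```

Keeping the allowed transitions in one table makes it easy to reject out-of-order updates, for example an LC on the eval node reporting before training has finished.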