Replies: 2 comments
-
I will also add: I think we are now at an inflection point where we have accumulated enough modularity (e.g. backbones), converted weights, and trained-from-scratch weights, with all the resulting combinatorics, that it is already hard to know clearly what performance a user can expect from our models on well-known public datasets.
So it is really hard to understand what to expect from the models contributed in this library, and I think many users lack a reference baseline (performance and FLOPs) to pick one of these models and train/fine-tune it on their own production datasets.
-
Take a look also at what IREE built. It will live under OpenXLA, where other TF subcomponents are migrating:
-
I think that sooner or later we need to start discussing how to automate, for contributed network PRs, the training script launch, the Tensorboard.dev log upload and URL sharing, and eventually the retrieval of the weights/checkpoint. This will help internal members and the community stay on the same page after the review and early approval of contributed model PRs.
For now, I can suggest two solutions:
1. Hosting self-hosted GitHub Actions runners backed by our GPU/TPU resource pool on GCP.
2. Simply deploying the job on a GCP/GKE instance directly from a GitHub Action.
In both cases I suggest setting up a small GKE cluster on GCP, so that we have good flexibility to scale resources up and down while still respecting the maximum limits we have allocated.
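To make the two options concrete, here is a minimal workflow sketch. Everything in it is an illustrative assumption: the runner labels, the `train.py` entry point, the config path, and the log directory do not exist yet and are only hypothetical placeholders.

```yaml
# Hypothetical workflow sketch: runner labels, script names, and paths
# are illustrative assumptions, not an existing setup.
name: train-contributed-model

on:
  workflow_dispatch:  # triggered manually by a maintainer after early PR approval

jobs:
  train:
    # Option 1: run directly on a self-hosted runner from the GPU/TPU pool
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v3
      - name: Launch training script
        run: python train.py --config configs/contributed_model.yaml
      - name: Upload logs to Tensorboard.dev and print the shareable URL
        run: tensorboard dev upload --logdir ./logs --one_shot
```

Option 2 would instead keep the workflow lightweight: a single step submits a Kubernetes Job manifest to the GKE cluster (e.g. `kubectl apply -f training-job.yaml`) and waits for completion, so the GitHub runner itself needs no GPU.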
This is just to bootstrap the topic; please contribute any other solutions or constraints to the thread. I really hope we can discuss this in the community.
/cc @LukeWood @ianstenbit @qlzh727 @tanzhenyu @martin-gorner