Replies: 2 comments
-
I will also add: I think we are now at an inflection point where we have accumulated enough modularity (e.g. backbones), converted weights, and trained-from-scratch weights, with all the resulting combinatorics, that it is already hard to know clearly what performance a user can expect from our models on well-known public datasets.
So it is really hard to understand what to expect from the models contributed in this library, and I think many users lack a reference baseline (performance and FLOPs) to pick one of these models and train/fine-tune it on their own production datasets.
-
Take a look also at what IREE built. It will live under OpenXLA, where other TF subcomponents are migrating:
-
I think that sooner or later we need to start discussing how to automate, for contributed network PRs, the training script launch, the Tensorboard.dev log upload and URL sharing, and eventually the retrieval of the weights/checkpoint. This will help internal members and the community stay on the same page after the review and early approval of contributed model PRs.
For now, I can suggest two solutions:
1. Hosting self-hosted GitHub Actions runners backed by our GPU/TPU resource pool on GCP.
2. Simply deploying the job on a GCP/GKE instance directly from a GitHub Action.
In both cases I suggest setting up a small GKE cluster on GCP, so that we have good flexibility to scale resources up and down while still respecting the maximum limits we have allocated.
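To make the two options concrete, here is a minimal workflow sketch. Everything in it is an illustrative assumption: the runner labels, the `train.py` entry point, the config path, and the log directory do not exist yet and are only hypothetical placeholders.

```yaml
# Hypothetical workflow sketch: runner labels, script names, and paths
# are illustrative assumptions, not an existing setup.
name: train-contributed-model

on:
  workflow_dispatch:  # triggered manually by a maintainer after early PR approval

jobs:
  train:
    # Option 1: run directly on a self-hosted runner from the GPU/TPU pool
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v3
      - name: Launch training script
        run: python train.py --config configs/contributed_model.yaml
      - name: Upload logs to Tensorboard.dev and print the shareable URL
        run: tensorboard dev upload --logdir ./logs --one_shot
```

Option 2 would instead keep the workflow lightweight: a single step submits a Kubernetes Job manifest to the GKE cluster (e.g. `kubectl apply -f training-job.yaml`) and waits for completion, so the GitHub runner itself needs no GPU.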
This is just to bootstrap the topic; please contribute any other solutions or constraints to the thread. I really hope we can discuss this in the community.
/cc @LukeWood @ianstenbit @qlzh727 @tanzhenyu @martin-gorner