Training on Kubric Dataset #35

Open
TahaRazzaq opened this issue Jul 17, 2023 · 5 comments

@TahaRazzaq

I am trying to train the TAPIR model on the Kubric dataset using Google Colab, but the run keeps stopping without any errors. I am running `python ./experiment.py --config ./configs/tapir_config.py`, and the config file loads successfully. The training process then stops abruptly, with no error message. I am unable to determine the cause and would be really grateful for any help in this regard.

image

Thank You!

@cdoersch
Collaborator

Apologies for the slow response. This is most likely just compilation time: the training graph is complex and the JAX GPU compiler is slow, so compilation might take hours. It is somewhat time-consuming for us to debug this, so we haven't dug into it yet. Hopefully we will find time to do so soon.

@yangyi02
Collaborator

We attempted to reproduce the issue you reported, and here is the result. It took approximately 40 minutes to see the first training log output (also, the codebase trains on CPU by default, which is very slow). We have not yet optimized the experience of training locally (if that is what you are doing).

Also, could you check nvidia-smi to see whether your model is actually being built and trained on the GPU?

Screenshot 2023-07-24 at 11 39 39 AM
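For readers following along in Colab, the GPU check suggested above can be sketched roughly as follows. This is a minimal sketch, not part of the TAPIR codebase: the helper names `nvidia_smi_summary` and `jax_backend` are hypothetical, and it assumes that `nvidia-smi` is on the PATH when a GPU runtime is attached and that JAX is importable in the same environment that runs the training script.

```python
import shutil
import subprocess


def nvidia_smi_summary():
    """Return a one-line-per-GPU summary from nvidia-smi, or None if unavailable."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None  # nvidia-smi is not installed or no GPU runtime is attached
    result = subprocess.run(
        [exe, "--query-gpu=name,memory.used,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip() if result.returncode == 0 else None


def jax_backend():
    """Return the platform JAX will use by default, or None if JAX is missing."""
    try:
        import jax
    except ImportError:
        return None
    return jax.default_backend()


print("nvidia-smi:", nvidia_smi_summary())
print("JAX default backend:", jax_backend())
```

If the reported backend is `cpu` even though `nvidia-smi` lists a GPU, the installed JAX build probably has no GPU support, which would be consistent with the very slow CPU training described above.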

@TahaRazzaq
Author

@yangyi02 Thank you for your response. I did enable the GPU, and the model was built on the GPU as well. However, execution stops midway and training does not take place.

@yangyi02
Collaborator

@TahaRazzaq From the screenshot, I don't see that the training has stopped.

Could you verify whether the training is just hanging at that message (if so, could you wait for about an hour?) or has indeed stopped completely?

You can set batch_dim in tapir_config.py to 1 to see whether that lets you verify this slightly faster.
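One way to tell a hang from a real stop is to launch the command through a small wrapper and look at how the process ended. The sketch below is an illustration only: `run_and_report` is a hypothetical helper, not something provided by the repository. On POSIX systems a negative return code means the process was killed by a signal; a SIGKILL is often, though not always, a sign that the system ran out of memory, which is an assumption to check rather than a confirmed diagnosis for this issue.

```python
import signal
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd to completion and describe how the process ended."""
    code = subprocess.run(cmd, check=False).returncode
    if code == 0:
        return "exited normally"
    if code < 0:
        # On POSIX, a negative return code is the number of the terminating signal.
        try:
            name = signal.Signals(-code).name
        except ValueError:
            name = f"signal {-code}"
        return f"killed by {name}"
    return f"exited with status {code}"


# Example usage (not executed here):
# print(run_and_report(
#     [sys.executable, "./experiment.py", "--config", "./configs/tapir_config.py"]))
```

If the call does not return at all, the process is still running (compiling or hanging) rather than stopped.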

@TahaRazzaq
Author

@yangyi02 Execution does stop, since I'm able to run other cells afterwards. Even with batch_dim set to 1, execution stops within 3-5 minutes. The last message printed is "Initializing Parameters", after which a few warnings are displayed and the run stops.

image
