Avoid idle gpu #1939
Conversation
The issue with the regression test was fixed by simply rebasing. (It then broke another one, but I think that's just a fluke that will be fixed by rerunning.) I've updated the timings as well. Side comment: I added a slight fix to the logging, where for multiple replicas it would print …
I'll try to review this tomorrow and leave it ready for you to have a second look and merge it, in case you need to touch n3fit as discussed, @RoyStegeman. I need to check a few corner cases but afaics this seems ok.
Greetings from your nice fit 🤖!
Check the report carefully, and please buy me a ☕, or better, a GPU 😉!
I think this can be merged. #2188 can be dealt with in a separate PR.
… epoch. This avoids memory overhead by combining at most 100 steps into one epoch, and by not changing anything when using only 1 replica (i.e. on CPU).
Greetings from your nice fit 🤖!
Check the report carefully, and please buy me a ☕, or better, a GPU 😉!
I'm going to merge this since the tests are passing and it is rebased on top of master (which means it is probably fixing something that has changed since the last merge, probably the TF / np version). Worst case scenario, it can be reverted. The report in #2127 (https://vp.nnpdf.science/WbBCvsjfQV-6ncIQ3GhCVw==) was made with a PR which is on top of this one, so we have a reproduction of 4.0 with these changes that seems to do ok.
The idea
We observed large gaps between training steps in the TensorBoard profile when running on the GPU. After a lot of fiddling, these were found to be (at least partially) due to a per-epoch overhead from TensorFlow. This is reduced by redefining one actual training step as a single batch of size 1, rather than as a whole epoch as it used to be.
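To make the mechanism concrete, here is a toy Keras example of the batching trick. This is an illustration only, not the n3fit code: the model, the data, and the choice of 100 steps per epoch are stand-ins.

```python
import numpy as np
import tensorflow as tf

steps_per_epoch = 100  # combine up to 100 real training steps into one Keras epoch

model = tf.keras.Sequential([tf.keras.Input(shape=(8,)), tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")

# The single training input (batch of size 1), as in a one-batch-per-epoch setup.
x = np.random.rand(1, 8)
y = np.random.rand(1, 1)

# Repeat the identical input so that one Keras epoch runs 100 batches,
# each batch of size 1 being one "real" training step.
x_rep = np.repeat(x, steps_per_epoch, axis=0)
y_rep = np.repeat(y, steps_per_epoch, axis=0)

# 10 Keras epochs now correspond to 1000 training steps, so the per-epoch
# TensorFlow overhead is paid 10 times instead of 1000.
model.fit(x_rep, y_rep, batch_size=1, epochs=10, shuffle=False, verbose=0)
```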
Implementation-wise, this is done by copying the input up to 100 times, creating a set of 100 identical training inputs. Existing callbacks simply implement `on_step_end` instead of `on_epoch_end` and inherit from `CallbackStep`, which takes care of the conversion between steps, batches and epochs.

One extra issue is that Keras computes metrics cumulatively over an epoch; they are converted back to per-step values in `CallbackStep.correct_logs`. This is the only source of slight numerical differences, which however only appear in the logs and do not propagate at all: training results remain identical.
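For illustration, a minimal sketch of what a `CallbackStep`-style base class could look like; the names, signatures and bookkeeping below are assumptions for this sketch, not the actual n3fit implementation.

```python
import tensorflow as tf


class CallbackStep(tf.keras.callbacks.Callback):
    """Let subclasses implement on_step_end instead of on_epoch_end.

    One Keras batch corresponds to one real training step, so the epoch/batch
    counters reported by Keras are translated into a global step counter.
    """

    def __init__(self, steps_per_epoch=100):
        super().__init__()
        self.steps_per_epoch = steps_per_epoch
        self._epoch = 0
        self._running_totals = {}

    def on_epoch_begin(self, epoch, logs=None):
        self._epoch = epoch
        self._running_totals = {}  # Keras restarts its running means each epoch

    def correct_logs(self, logs, batch):
        """Convert Keras' cumulative (running-mean) metrics back to per-step values."""
        corrected = {}
        for key, running_mean in logs.items():
            total = running_mean * (batch + 1)
            corrected[key] = total - self._running_totals.get(key, 0.0)
            self._running_totals[key] = total
        return corrected

    def on_train_batch_end(self, batch, logs=None):
        step = self._epoch * self.steps_per_epoch + batch
        self.on_step_end(step, self.correct_logs(logs or {}, batch))

    def on_step_end(self, step, logs):
        """Overridden by concrete callbacks (e.g. stopping or logging)."""
        raise NotImplementedError
```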
Performance

Timings for 1000 epochs of the main runcard (NNPDF40_nnlo_as_01180_1000), on Snellius, with 100 replicas on the GPU or 1 replica on the CPU. The GPU memory used is given in brackets.
This saves 24 seconds (or 25%) per 1k epochs.
Profile
Note: slightly outdated profiles.

This branch: [TensorBoard profile screenshot]

And before this single commit, for comparison: [TensorBoard profile screenshot]