No agreement between parallel gpu and sequential cpu fits #2118
Hi @RoyStegeman and @Radonirinaunimi. I ran the provided runcard using the current branch I was using for the hyperopt experiments. I added the resulting training log below:
[INFO]: Epoch 100/17000: loss: 346479296.0000000
Validation loss after training step: 6054287.0000000.
Validation chi2s:
[INFO]: Epoch 200/17000: loss: 2397694464.0000000
Validation loss after training step: 898655.7500000.
Validation chi2s:
[INFO]: Epoch 300/17000: loss: 210753856.0000000
Validation loss after training step: 709461.3750000.
Validation chi2s:
[INFO]: Epoch 400/17000: loss: 109637336.0000000
Validation loss after training step: 1609623.7500000.
Validation chi2s:
[INFO]: Epoch 500/17000: loss: 142071760.0000000
Validation loss after training step: 2423371.0000000.
Validation chi2s:
[INFO]: Epoch 600/17000: loss: 237521776.0000000
Validation loss after training step: 356872.8125000.
Validation chi2s:
[INFO]: Epoch 700/17000: loss: 81304864.0000000
Validation loss after training step: 4219186.0000000.
Validation chi2s:
[INFO]: Epoch 800/17000: loss: 125517096.0000000
Validation loss after training step: 5745484.5000000.
Validation chi2s:
[INFO]: Epoch 900/17000: loss: 65846024.0000000
Validation loss after training step: 1957665.2500000.
Validation chi2s:
[INFO]: Epoch 1000/17000: loss: 49473056.0000000
Validation loss after training step: 1433478.5000000.
Validation chi2s:
[WARNING]: > NaN found, stopping activated
[INFO]: Stopped at epoch=1065
Just in case, I also attach here the run_fit.slurm script I am using on Snellius and the output. I am also trying to run it with
@Cmurilochem This seems really strange and indeed seems to indicate that something is definitely wrong somewhere. But now I am wondering how the hyperopts are even working. Which commit are you using exactly?
Hi @Radonirinaunimi. The last commit of my local master is
I am also interested in checking the results before #2014, and I will make a test by checking out to
Testing the pull request faadba0 with 20 replicas on the CPU does show a significant difference from the baseline on my system as well (Python 3.11, TF 2.16).
Right, so instead of focusing on recreating the results posted previously, we will look at what exactly is happening in the current master. I still suspect a TensorFlow issue, btw. I believe the old test results were produced with the tensorflow-gpu pip package, which may somehow have slightly different behavior than the official package...
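As a quick way to check which TensorFlow build a given environment actually resolved (the deprecated tensorflow-gpu wheel versus the official package), something along the lines of the following sketch could be run inside that environment; the distribution names queried are the standard PyPI ones, and this is only a generic sanity check rather than anything n3fit-specific.

```python
# Hedged sanity check: report the TensorFlow version, the visible devices,
# and which of the two PyPI distributions (tensorflow vs the deprecated
# tensorflow-gpu) is actually installed in the environment.
import importlib.metadata as md

import tensorflow as tf

print("tensorflow version:", tf.__version__)
print("visible devices:", tf.config.list_physical_devices())

for dist in ("tensorflow", "tensorflow-gpu"):
    try:
        print(f"{dist}: {md.version(dist)}")
    except md.PackageNotFoundError:
        print(f"{dist}: not installed")
```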
Yes, I agree. Do you have some ideas on how to proceed?
I don't think this is the case, though. I did try yesterday with the exact same environment as in this report and still got the wrong results. If this were the case, we should be able to reproduce the results.
Hi @Radonirinaunimi, but the report only mentions a code version, right? Not the fact that the GPU results were obtained with https://pypi.org/project/tensorflow-gpu/2.10.0/ rather than the official TF 2.10. In any case, my plan is to take the current master and compare parallel and sequential losses for the first few epochs, probably using the simple runcard. If those are equivalent, the difference could be in the handling of the stopping conditions or some other post-processing step.
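As an illustration of that comparison (not n3fit's own tooling), a small sketch along these lines could diff the per-epoch training losses printed in the two logs; the log file names are hypothetical, and the regex simply matches the "[INFO]: Epoch N/M: loss: X" lines quoted above.

```python
# Minimal sketch: compare per-epoch training losses between a sequential-CPU
# run and a parallel-GPU run by parsing their n3fit stdout logs.
# The file names are placeholders; the pattern matches lines like
# "[INFO]: Epoch 100/17000: loss: 346479296.0000000".
import re

EPOCH_RE = re.compile(r"\[INFO\]: Epoch (\d+)/\d+: loss: ([\d.]+)")

def epoch_losses(logfile):
    """Return {epoch: training loss} extracted from an n3fit log file."""
    losses = {}
    with open(logfile) as handle:
        for line in handle:
            match = EPOCH_RE.search(line)
            if match:
                losses[int(match.group(1))] = float(match.group(2))
    return losses

seq = epoch_losses("sequential_cpu.log")  # hypothetical log of the CPU run
par = epoch_losses("parallel_gpu.log")    # hypothetical log of the GPU run

for epoch in sorted(seq.keys() & par.keys()):
    rel = abs(seq[epoch] - par[epoch]) / max(abs(seq[epoch]), 1e-30)
    print(f"epoch {epoch:6d}  cpu {seq[epoch]:.6g}  gpu {par[epoch]:.6g}  rel diff {rel:.2e}")
```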
Thanks for doing the tests. At that point there was some incompatibility between n3fit and TensorFlow 2.16, because Aron's implementation of the MultiDense stuff used a TensorFlow-internal object. The PR that made n3fit compatible with TF 2.16 is #1975.
I can have a look at this, but looking through the PR I'm no longer sure Aron ever found agreement with the full current version (multidense, trvl, etc.). I'll do it in the context of #1939.
@scarlehoff do you find agreement between sequential and parallel fits without the multidense layer type? I recall that @Radonirinaunimi reverted to the trvl-mask-layer branch and still didn't find agreement even without multidense layers...
Are you sure that the latest version of the trvl-mask-layer branch was not rebased on top of multidense? In any case, maybe there was another bug there in the middle. I don't think that branch existed by itself for long (it was constantly being merged/rebased onto other branches).
Yes, it had been rebased on top of master when multidense was already in there, I realize now. @Radonirinaunimi, this is perhaps why you can't reproduce the earlier results. We will test whether the baseline can be reproduced with the single dense layer and debug the multidense with masking layers.
Yes, that's right! It's a shame that I did not notice it before. Now with #2127 I can also get agreement (with
@APJansen did perform a regression test (https://vp.nnpdf.science/FGwCRitnQhmqBYBjeQWS7Q==/), but it's not clear to me whether it used parallel replicas (with the same training/validation mask) or sequential ones.
I performed fits using both the parallel GPU and the sequential CPU settings (https://vp.nnpdf.science/HrLhzWqgSEyMVbKRWuSmOA==/); unfortunately, the results do not agree. Agreement between the two was found by Aron, so I would like to understand why I do not find the same.
Possible reasons are a problem with my environment or hardware (or the interplay between them), or a bug introduced into the code over the past months.
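One hedged aside on separating environment effects from a genuine code bug: before comparing devices, it can help to verify that repeated runs on the same device with fixed seeds are bit-reproducible. In recent TensorFlow versions (assumed 2.9 or later) this can be requested explicitly; the snippet below is a generic TF-level sanity check, not part of n3fit's own configuration.

```python
# Generic TensorFlow reproducibility check (assumption: TF >= 2.9).
# Fix all seeds and request deterministic kernels, then run the same fit
# twice on the same device; if the losses still differ, run-to-run noise in
# the environment plays a role, otherwise the CPU/GPU discrepancy is more
# likely tied to the parallel-vs-sequential code path itself.
import tensorflow as tf

tf.keras.utils.set_random_seed(42)              # seeds Python, NumPy and TF
tf.config.experimental.enable_op_determinism()  # force deterministic ops
```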