No agreement between parallel gpu and sequential cpu fits #2118

Closed
RoyStegeman opened this issue Jun 26, 2024 · 15 comments · Fixed by #2127
Labels: bug (Something isn't working)

Comments

@RoyStegeman (Member)

I performed fits using both the parallel GPU and the sequential CPU settings: https://vp.nnpdf.science/HrLhzWqgSEyMVbKRWuSmOA==/. Unfortunately, the results do not agree. Aron previously found agreement between the two, so I would like to understand why I cannot reproduce that.

Possible reasons are a problem with my environment or hardware (or the interplay between them), or perhaps a bug introduced in the code over the past months.
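For illustration, a minimal sketch (not n3fit's actual reproducibility machinery) of the kind of environment check that helps rule out the hardware/software explanation before comparing fits; the seed value is arbitrary:

```python
# Hypothetical environment sanity check: record what could differ between
# the parallel-GPU and sequential-CPU runs before comparing fit results.
import tensorflow as tf

print("TensorFlow version:", tf.__version__)
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))

# Fix the Python/NumPy/TF seeds so two runs on the same machine are comparable.
tf.keras.utils.set_random_seed(42)

# Ask TF for deterministic ops (TF >= 2.9); non-deterministic GPU kernels are
# one known source of parallel-vs-sequential discrepancies, at a speed cost.
tf.config.experimental.enable_op_determinism()
```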

RoyStegeman added the bug label on Jun 26, 2024
@Cmurilochem (Collaborator) commented Jun 27, 2024

Hi @RoyStegeman and @Radonirinaunimi.

I ran the provided runcard using the current branch I was using for the hyperopt experiments. I added parallel_models: true to it and ran with 120 replicas (so that 100 survive postfit). The fit just stops at epoch 1065; see part of the output below:

[INFO]: Epoch 100/17000: loss: 346479296.0000000
Validation loss after training step: 6054287.0000000.
Validation chi2s:
[INFO]: Epoch 200/17000: loss: 2397694464.0000000
Validation loss after training step: 898655.7500000.
Validation chi2s:
[INFO]: Epoch 300/17000: loss: 210753856.0000000
Validation loss after training step: 709461.3750000.
Validation chi2s:
[INFO]: Epoch 400/17000: loss: 109637336.0000000
Validation loss after training step: 1609623.7500000.
Validation chi2s:
[INFO]: Epoch 500/17000: loss: 142071760.0000000
Validation loss after training step: 2423371.0000000.
Validation chi2s:
[INFO]: Epoch 600/17000: loss: 237521776.0000000
Validation loss after training step: 356872.8125000.
Validation chi2s:
[INFO]: Epoch 700/17000: loss: 81304864.0000000
Validation loss after training step: 4219186.0000000.
Validation chi2s:
[INFO]: Epoch 800/17000: loss: 125517096.0000000
Validation loss after training step: 5745484.5000000.
Validation chi2s:
[INFO]: Epoch 900/17000: loss: 65846024.0000000
Validation loss after training step: 1957665.2500000.
Validation chi2s:
[INFO]: Epoch 1000/17000: loss: 49473056.0000000
Validation loss after training step: 1433478.5000000.
Validation chi2s:
[WARNING]:  > NaN found, stopping activated
[INFO]: Stopped at epoch=1065

Just in case, I also attach the run_fit.slurm script I am using on Snellius, together with the output.

I am also trying to run it with parallel_models: false, but it is taking much longer than I expected.
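For context, the NaN-triggered stop in the log above behaves like a standard NaN guard on the training loss. A generic Keras illustration of the same idea (not n3fit's actual stopping code; the tiny model and the deliberately huge learning rate are only there to make the loss diverge):

```python
# Generic illustration of NaN-guard stopping, unrelated to n3fit internals:
# training halts as soon as the reported loss becomes NaN or infinite.
import numpy as np
import tensorflow as tf

x = np.random.rand(256, 8).astype("float32")
y = np.random.rand(256, 1).astype("float32")

model = tf.keras.Sequential(
    [tf.keras.layers.Dense(16, activation="relu"), tf.keras.layers.Dense(1)]
)
# An absurdly large learning rate makes the loss explode and eventually go NaN,
# mimicking the diverging losses seen in the log above.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e6), loss="mse")

history = model.fit(x, y, epochs=50, verbose=0,
                    callbacks=[tf.keras.callbacks.TerminateOnNaN()])
print("stopped after", len(history.history["loss"]), "epochs")
```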

@Radonirinaunimi (Member)

@Cmurilochem This seems really strange and indeed points to something being definitely wrong somewhere. But now I am wondering how the hyperopt runs are even working. Which commit exactly are you using?

@Cmurilochem (Collaborator) commented Jul 1, 2024

Hi @Radonirinaunimi. The last commit of my local master is 613bfbc651c4439ba6b555e6179e5a7b797dcdcf (Add 'average_best' and non-default 'penalties_in_loss' in hyperopt_studies runcards).

I am also interested in checking the results from before #2014, so I will run a test after checking out fb3f384495f1d0d1daa845cdb784e66e47558c66 (Merge pull request #2065 from NNPDF/add_hyperopt_limited_space_runcard).

@goord (Collaborator) commented Jul 10, 2024

Testing the pull request faadba0 with 20 replicas on the CPU does show a significant difference from the baseline on my system as well (Python 3.11, TF 2.16).

@Radonirinaunimi (Member)

> Testing the pull request faadba0 with 20 replicas on the CPU does show a significant difference from the baseline on my system as well (Python 3.11, TF 2.16).

Thanks for checking, @goord. So we indeed now have consensus that it is not working, even for the old commits.

@goord (Collaborator) commented Jul 11, 2024

Right, so instead of focusing on recreating the results posted previously, we will look at what exactly is happening in the current master.

I still suspect a TensorFlow issue, by the way. I believe the old test results were produced with the tensorflow-gpu pip package, which may somehow behave slightly differently from the official package...

@Radonirinaunimi (Member)

> Right, so instead of focusing on recreating the results posted previously, we will look at what exactly is happening in the current master.

Yes, I agree. Do you have some ideas on how to proceed?

> I still suspect a TensorFlow issue, by the way. I believe the old test results were produced with the tensorflow-gpu pip package, which may somehow behave slightly differently from the official package...

I don't think that is the case, though. Yesterday I tried with the exact same environment as in this report and still got the wrong results. If the package were the explanation, using the same environment should have reproduced the results.

@goord (Collaborator) commented Jul 11, 2024

Hi @Radonirinaunimi, but the report only mentions a code version, right? Not whether the GPU results were obtained with https://pypi.org/project/tensorflow-gpu/2.10.0/ rather than the official TF 2.10.

In any case, my plan is to take the current master and compare parallel and sequential losses for the first few epochs, probably using the simple runcard. If those are equivalent, the difference could be in the stopping-condition handling or some other post-processing step.
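A minimal sketch of such a comparison, assuming the per-epoch training losses have already been extracted from the two fit logs into plain text files (the file names and tolerance below are hypothetical):

```python
# Hypothetical comparison of the first few epochs of a sequential and a
# parallel fit, after extracting the per-epoch training losses from the logs.
import numpy as np

sequential = np.loadtxt("losses_sequential.txt")  # hypothetical file
parallel = np.loadtxt("losses_parallel.txt")      # hypothetical file

n = 10  # only the first few epochs matter for this check
for epoch, (ls, lp) in enumerate(zip(sequential[:n], parallel[:n]), start=1):
    rel = abs(ls - lp) / max(abs(ls), 1e-12)
    print(f"epoch {epoch}: sequential={ls:.6g} parallel={lp:.6g} rel.diff={rel:.2e}")

# If the early epochs already disagree beyond float32 noise, the problem is in
# the model/training path; if they agree, look at stopping or post-processing.
print("agree within tolerance:", np.allclose(sequential[:n], parallel[:n], rtol=1e-4))
```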

@RoyStegeman (Member, Author)

> Testing the pull request faadba0 with 20 replicas on the CPU does show a significant difference from the baseline on my system as well (Python 3.11, TF 2.16).

Thanks for doing the tests. At that point there was an incompatibility between n3fit and TensorFlow 2.16, because Aron's implementation of the MultiDense layers used the TensorFlow-internal attribute Dense.kernel, while at the moment we access Dense._kernel. Looking at the Keras code you can see that it is not a public attribute, so it is something that can break MultiDense without warning: https://github.com/keras-team/keras/blob/v3.3.3/keras/src/layers/core/dense.py#L15-L603

The PR that made n3fit compatible with tf2.16: #1975
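For illustration, a minimal sketch (not the MultiDense implementation) of how the public kernel attribute versus the private _kernel attribute can be probed; relying on _kernel is exactly the kind of private Keras internal that can change again without notice:

```python
# Minimal illustration of the public-vs-private kernel attribute issue in
# Keras Dense layers; not the actual MultiDense code.
import tensorflow as tf

layer = tf.keras.layers.Dense(4)
layer.build(input_shape=(None, 3))  # create the weights

# Depending on the Keras version, the weight variable is reachable as
# `kernel` (public property) or only as `_kernel` (private attribute).
kernel = getattr(layer, "kernel", None)
if kernel is None:
    kernel = getattr(layer, "_kernel", None)  # fragile: private Keras internals

print("kernel shape:", None if kernel is None else kernel.shape)
```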

@scarlehoff (Member)

I can have a look at this, but looking through the PR I'm no longer sure Aron ever found agreement with the full current version (MultiDense, trvl masking, etc.).

I'll do it in the context of #1939.

@goord (Collaborator) commented Jul 17, 2024

@scarlehoff do you find agreement between sequential and parallel fits without the MultiDense layer type? I recall that @Radonirinaunimi reverted to the trvl-mask-layer branch and still didn't find agreement even without MultiDense layers...

@scarlehoff (Member) commented Jul 17, 2024

Are you sure that the latest version of the trvl-mask-layer branch was not rebased on top of MultiDense?

In any case, maybe another bug was introduced somewhere in between. I don't think that branch existed by itself for long (it was constantly being merged into or rebased onto other branches).

@goord (Collaborator) commented Jul 18, 2024

Yes, I realize now that it had been rebased on top of master after MultiDense was already in there. @Radonirinaunimi, this is perhaps why you can't reproduce the earlier results. We will test whether the baseline can be reproduced with single dense layers, and debug MultiDense with the masking layers.

@Radonirinaunimi (Member)

> Yes, I realize now that it had been rebased on top of master after MultiDense was already in there. @Radonirinaunimi, this is perhaps why you can't reproduce the earlier results. We will test whether the baseline can be reproduced with single dense layers, and debug MultiDense with the masking layers.

Yes, that's right! It's a shame that I did not notice it before. Now with #2127 I also get agreement (on a V100).

@goord (Collaborator) commented Jul 18, 2024

@APJansen did perform a regression test (https://vp.nnpdf.science/FGwCRitnQhmqBYBjeQWS7Q==/), but it's not clear to me whether it uses parallel replicas (sharing the same training/validation mask) or sequential ones.
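As an aside on what sharing the training/validation mask means, here is a generic sketch (not n3fit's actual data handling; sizes and fractions are arbitrary) contrasting replicas that share one mask with replicas that each draw their own:

```python
# Hypothetical illustration of shared vs per-replica training/validation masks.
import numpy as np

n_data, n_replicas, frac = 20, 3, 0.75
rng = np.random.default_rng(0)

# Shared mask: every replica trains on the same subset of data points.
shared_mask = rng.random(n_data) < frac
shared_masks = np.tile(shared_mask, (n_replicas, 1))

# Per-replica masks: each replica draws its own training/validation split.
per_replica_masks = rng.random((n_replicas, n_data)) < frac

print("shared:\n", shared_masks.astype(int))
print("per replica:\n", per_replica_masks.astype(int))
```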
