No agreement between parallel gpu and sequential cpu fits #2118

Closed
RoyStegeman opened this issue Jun 26, 2024 · 15 comments · Fixed by #2127
Labels: bug (Something isn't working)

Comments

@RoyStegeman (Member)

I performed fits using both the parallel GPU and the sequential CPU settings: https://vp.nnpdf.science/HrLhzWqgSEyMVbKRWuSmOA==/. Unfortunately, the results do not agree. Aron previously found agreement between the two, so I would like to understand why I cannot reproduce that.

Possible reasons are a problem with my environment or hardware (or the interplay between them), or perhaps a bug introduced in the code over the past months.
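For illustration, a minimal sketch (not n3fit's actual reproducibility machinery) of the kind of environment check that helps rule out the hardware/software explanation before comparing fits; the seed value is arbitrary:

```python
# Hypothetical environment sanity check: record what could differ between
# the parallel-GPU and sequential-CPU runs before comparing fit results.
import tensorflow as tf

print("TensorFlow version:", tf.__version__)
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))

# Fix the Python/NumPy/TF seeds so two runs on the same machine are comparable.
tf.keras.utils.set_random_seed(42)

# Ask TF for deterministic ops (TF >= 2.9); non-deterministic GPU kernels are
# one known source of parallel-vs-sequential discrepancies, at a speed cost.
tf.config.experimental.enable_op_determinism()
```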

RoyStegeman added the bug label on Jun 26, 2024
@Cmurilochem (Collaborator) commented Jun 27, 2024

Hi @RoyStegeman and @Radonirinaunimi.

I ran the provided runcard using the current branch I was using for the hyperopt experiments. I added parallel_models: true to it and ran with 120 replicas (so that 100 survive postfit). The fit just stops at epoch 1065; see part of the output below:

[INFO]: Epoch 100/17000: loss: 346479296.0000000
Validation loss after training step: 6054287.0000000.
Validation chi2s:
[INFO]: Epoch 200/17000: loss: 2397694464.0000000
Validation loss after training step: 898655.7500000.
Validation chi2s:
[INFO]: Epoch 300/17000: loss: 210753856.0000000
Validation loss after training step: 709461.3750000.
Validation chi2s:
[INFO]: Epoch 400/17000: loss: 109637336.0000000
Validation loss after training step: 1609623.7500000.
Validation chi2s:
[INFO]: Epoch 500/17000: loss: 142071760.0000000
Validation loss after training step: 2423371.0000000.
Validation chi2s:
[INFO]: Epoch 600/17000: loss: 237521776.0000000
Validation loss after training step: 356872.8125000.
Validation chi2s:
[INFO]: Epoch 700/17000: loss: 81304864.0000000
Validation loss after training step: 4219186.0000000.
Validation chi2s:
[INFO]: Epoch 800/17000: loss: 125517096.0000000
Validation loss after training step: 5745484.5000000.
Validation chi2s:
[INFO]: Epoch 900/17000: loss: 65846024.0000000
Validation loss after training step: 1957665.2500000.
Validation chi2s:
[INFO]: Epoch 1000/17000: loss: 49473056.0000000
Validation loss after training step: 1433478.5000000.
Validation chi2s:
[WARNING]:  > NaN found, stopping activated
[INFO]: Stopped at epoch=1065

Just in case, I also attach the run_fit.slurm script I am using on Snellius, together with the output.

I am also trying to run it with parallel_models: false, but it is taking much longer than I expected.
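For context, the NaN-triggered stop in the log above behaves like a standard NaN guard on the training loss. A generic Keras illustration of the same idea (not n3fit's actual stopping code; the tiny model and the deliberately huge learning rate are only there to make the loss diverge):

```python
# Generic illustration of NaN-guard stopping, unrelated to n3fit internals:
# training halts as soon as the reported loss becomes NaN or infinite.
import numpy as np
import tensorflow as tf

x = np.random.rand(256, 8).astype("float32")
y = np.random.rand(256, 1).astype("float32")

model = tf.keras.Sequential(
    [tf.keras.layers.Dense(16, activation="relu"), tf.keras.layers.Dense(1)]
)
# An absurdly large learning rate makes the loss explode and eventually go NaN,
# mimicking the diverging losses seen in the log above.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e6), loss="mse")

history = model.fit(x, y, epochs=50, verbose=0,
                    callbacks=[tf.keras.callbacks.TerminateOnNaN()])
print("stopped after", len(history.history["loss"]), "epochs")
```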

@Radonirinaunimi (Member)

@Cmurilochem This seems really strange and indeed points to something being definitely wrong somewhere. But now I am wondering how the hyperopt runs are even working. Which commit exactly are you using?

@Cmurilochem (Collaborator) commented Jul 1, 2024

Hi @Radonirinaunimi. The last commit of my local master is 613bfbc651c4439ba6b555e6179e5a7b797dcdcf (Add 'average_best' and non-default 'penalties_in_loss' in hyperopt_studies runcards).

I am also interested in checking the results from before #2014, so I will run a test after checking out fb3f384495f1d0d1daa845cdb784e66e47558c66 (Merge pull request #2065 from NNPDF/add_hyperopt_limited_space_runcard).

@goord (Collaborator) commented Jul 10, 2024

Testing the pull request faadba0 with 20 replicas on the CPU does show a significant difference from the baseline on my system as well (Python 3.11, TF 2.16).

@Radonirinaunimi (Member)

> Testing the pull request faadba0 with 20 replicas on the CPU does show a significant difference from the baseline on my system as well (Python 3.11, TF 2.16).

Thanks for checking, @goord. So we indeed now have consensus that it is not working, even for the old commits.

@goord (Collaborator) commented Jul 11, 2024

Right, so instead of focusing on recreating the results posted previously, we will look at what exactly is happening in the current master.

I still suspect a TensorFlow issue, by the way. I believe the old test results were produced with the tensorflow-gpu pip package, which may somehow behave slightly differently from the official package...

@Radonirinaunimi (Member)

> Right, so instead of focusing on recreating the results posted previously, we will look at what exactly is happening in the current master.

Yes, I agree. Do you have some ideas on how to proceed?

> I still suspect a TensorFlow issue, by the way. I believe the old test results were produced with the tensorflow-gpu pip package, which may somehow behave slightly differently from the official package...

I don't think that is the case, though. Yesterday I tried with the exact same environment as in this report and still got the wrong results. If the package were the explanation, using the same environment should have reproduced the results.

@goord (Collaborator) commented Jul 11, 2024

Hi @Radonirinaunimi, but the report only mentions a code version, right? Not whether the GPU results were obtained with https://pypi.org/project/tensorflow-gpu/2.10.0/ rather than the official TF 2.10.

In any case, my plan is to take the current master and compare parallel and sequential losses for the first few epochs, probably using the simple runcard. If those are equivalent, the difference could be in the stopping-condition handling or some other post-processing step.
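A minimal sketch of such a comparison, assuming the per-epoch training losses have already been extracted from the two fit logs into plain text files (the file names and tolerance below are hypothetical):

```python
# Hypothetical comparison of the first few epochs of a sequential and a
# parallel fit, after extracting the per-epoch training losses from the logs.
import numpy as np

sequential = np.loadtxt("losses_sequential.txt")  # hypothetical file
parallel = np.loadtxt("losses_parallel.txt")      # hypothetical file

n = 10  # only the first few epochs matter for this check
for epoch, (ls, lp) in enumerate(zip(sequential[:n], parallel[:n]), start=1):
    rel = abs(ls - lp) / max(abs(ls), 1e-12)
    print(f"epoch {epoch}: sequential={ls:.6g} parallel={lp:.6g} rel.diff={rel:.2e}")

# If the early epochs already disagree beyond float32 noise, the problem is in
# the model/training path; if they agree, look at stopping or post-processing.
print("agree within tolerance:", np.allclose(sequential[:n], parallel[:n], rtol=1e-4))
```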

@RoyStegeman (Member, Author)

> Testing the pull request faadba0 with 20 replicas on the CPU does show a significant difference from the baseline on my system as well (Python 3.11, TF 2.16).

Thanks for doing the tests. At that point there was an incompatibility between n3fit and TensorFlow 2.16, because Aron's implementation of the MultiDense layers used the TensorFlow-internal attribute Dense.kernel, while at the moment we access Dense._kernel. Looking at the Keras code you can see that it is not a public attribute, so it is something that can break MultiDense without warning: https://github.com/keras-team/keras/blob/v3.3.3/keras/src/layers/core/dense.py#L15-L603

The PR that made n3fit compatible with tf2.16: #1975
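For illustration, a minimal sketch (not the MultiDense implementation) of how the public kernel attribute versus the private _kernel attribute can be probed; relying on _kernel is exactly the kind of private Keras internal that can change again without notice:

```python
# Minimal illustration of the public-vs-private kernel attribute issue in
# Keras Dense layers; not the actual MultiDense code.
import tensorflow as tf

layer = tf.keras.layers.Dense(4)
layer.build(input_shape=(None, 3))  # create the weights

# Depending on the Keras version, the weight variable is reachable as
# `kernel` (public property) or only as `_kernel` (private attribute).
kernel = getattr(layer, "kernel", None)
if kernel is None:
    kernel = getattr(layer, "_kernel", None)  # fragile: private Keras internals

print("kernel shape:", None if kernel is None else kernel.shape)
```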

@scarlehoff (Member)

I can have a look at this, but looking through the PR I'm no longer sure Aron ever found agreement with the full current version (MultiDense, trvl masking, etc.).

I'll do it in the context of #1939.

@goord (Collaborator) commented Jul 17, 2024

@scarlehoff do you find agreement between sequential and parallel fits without the MultiDense layer type? I recall that @Radonirinaunimi reverted to the trvl-mask-layer branch and still didn't find agreement even without MultiDense layers...

@scarlehoff (Member) commented Jul 17, 2024

Are you sure that the latest version of the trvl-mask-layer branch was not rebased on top of MultiDense?

In any case, maybe another bug was introduced somewhere in between. I don't think that branch existed by itself for long (it was constantly being merged into or rebased onto other branches).

@goord (Collaborator) commented Jul 18, 2024

Yes, I realize now that it had been rebased on top of master after MultiDense was already in there. @Radonirinaunimi, this is perhaps why you can't reproduce the earlier results. We will test whether the baseline can be reproduced with single dense layers, and debug MultiDense with the masking layers.

@Radonirinaunimi (Member)

> Yes, I realize now that it had been rebased on top of master after MultiDense was already in there. @Radonirinaunimi, this is perhaps why you can't reproduce the earlier results. We will test whether the baseline can be reproduced with single dense layers, and debug MultiDense with the masking layers.

Yes, that's right! It's a shame that I did not notice it before. Now with #2127 I also get agreement (on a V100).

@goord (Collaborator) commented Jul 18, 2024

@APJansen did perform a regression test (https://vp.nnpdf.science/FGwCRitnQhmqBYBjeQWS7Q==/), but it's not clear to me whether it uses parallel replicas (sharing the same training/validation mask) or sequential ones.
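As an aside on what sharing the training/validation mask means, here is a generic sketch (not n3fit's actual data handling; sizes and fractions are arbitrary) contrasting replicas that share one mask with replicas that each draw their own:

```python
# Hypothetical illustration of shared vs per-replica training/validation masks.
import numpy as np

n_data, n_replicas, frac = 20, 3, 0.75
rng = np.random.default_rng(0)

# Shared mask: every replica trains on the same subset of data points.
shared_mask = rng.random(n_data) < frac
shared_masks = np.tile(shared_mask, (n_replicas, 1))

# Per-replica masks: each replica draws its own training/validation split.
per_replica_masks = rng.random((n_replicas, n_data)) < frac

print("shared:\n", shared_masks.astype(int))
print("per replica:\n", per_replica_masks.astype(int))
```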
