This issue is transferred from the tensorflow repo as suggested by the developers. Original issue: tensorflow/tensorflow#71833.
I trained a small model in a distributed setting and observed inconsistent results. I would expect no difference when training the same model on the same input with different world sizes.
I first created a four-layer model containing a SeparableConv2D layer defined by the following code:
Then I trained it using MirroredStrategy with 1 CPU and with 2 CPUs, and compared the prediction results. The two sets of predictions differed substantially (Pred Linf: 0.0029296875). Further investigation shows that the loss and gradient values are close (Linf < 2e-6).
Note that I use a large learning rate (10.0) deliberately in the reproduction code below to show the large inconsistencies in the nightly version.
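For readers unfamiliar with the metric: "Linf" here is taken to mean the largest absolute element-wise difference between two results. A minimal sketch in plain Python follows; the helper name `linf` and the sample values are made up for illustration and are not taken from the reproduction notebook.

```python
def linf(a, b):
    """Largest absolute element-wise difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical flattened predictions from the 1-replica and 2-replica runs
# (illustrative numbers only, not the actual model outputs).
pred_1 = [0.25, 0.5, 0.75]
pred_2 = [0.25, 0.5, 0.75 + 0.0029296875]

print("Pred Linf:", linf(pred_1, pred_2))
```

With NumPy arrays the same quantity would typically be computed as `np.max(np.abs(a - b))`.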
Standalone code to reproduce the issue:
https://colab.research.google.com/drive/1XKqIS2_1Ho5d4KfO4uFI0ddMJGzlzSYf?usp=sharing
Relevant log output
Gradient 3 Linf: 2.9802322e-08