Inconsistent results from distributed training of models containing SeparableConv2D #20153

Open
jiannanWang opened this issue Aug 23, 2024 · 0 comments

@jiannanWang

This issue is transferred from the tensorflow repo as suggested by the developers. Original issue: tensorflow/tensorflow#71833.

I trained a small model in distributed settings and detected inconsistent results. I would expect no difference when training the same model on the same input with different world sizes.

I first created a four-layer model containing a SeparableConv2D layer defined by the following code:

from tensorflow import keras
from tensorflow.keras import layers

# Four-layer model: SeparableConv2D -> Flatten -> Dense -> Reshape.
layer_0 = layers.Input(shape=[32, 32, 3], dtype='float32', name='00_input_object')
layer_1 = layers.SeparableConv2D(
    filters=4, kernel_size=[27, 27], strides=[1, 1], padding='same',
    data_format='channels_last', dilation_rate=[1, 1], depth_multiplier=5,
    activation='sigmoid', use_bias=True, depthwise_initializer='random_uniform',
    pointwise_initializer='random_uniform', bias_initializer='random_uniform',
    name='01_separable_conv2D')(layer_0)
layer_2 = layers.Flatten(name='02_flatten')(layer_1)
layer_3 = layers.Dense(units=10, activation='linear', use_bias=False,
                       kernel_initializer='random_uniform', name='03_dense')(layer_2)
layer_4 = layers.Reshape(target_shape=[10], name='04_reshape')(layer_3)
model = keras.Model(layer_0, layer_4)

Then I trained it using MirroredStrategy with 1 CPU and 2 CPUs, respectively, and compared the prediction results. There was a large difference between the two predictions (Pred Linf: 0.0029296875), even though further investigation shows that the loss and gradient values are close (Linf < 2e-6).

Note that I deliberately use a large learning rate (10.0) in the reproduction code below to make the inconsistency clearly visible in the nightly version.
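For orientation, here is a minimal sketch of the comparison harness; the full reproduction is in the Colab linked below. The two logical CPU devices, the random data, the MSE loss, the SGD optimizer, and the build_model() helper (wrapping the model definition above) are all assumptions, not the exact setup from the Colab:

import numpy as np
import tensorflow as tf

# Split the physical CPU into two logical devices so MirroredStrategy can
# mirror across them (assumption: the Colab uses a similar setup).
tf.config.set_logical_device_configuration(
    tf.config.list_physical_devices('CPU')[0],
    [tf.config.LogicalDeviceConfiguration(),
     tf.config.LogicalDeviceConfiguration()])

x = np.random.uniform(size=(8, 32, 32, 3)).astype('float32')
y = np.random.uniform(size=(8, 10)).astype('float32')

def train_and_predict(devices):
    tf.keras.utils.set_random_seed(0)  # identical initial weights per run
    strategy = tf.distribute.MirroredStrategy(devices=devices)
    with strategy.scope():
        model = build_model()  # hypothetical helper wrapping the model code above
        model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=10.0),
                      loss='mse')
    model.fit(x, y, batch_size=8, epochs=1, verbose=0)
    return model.predict(x, verbose=0)

pred_1 = train_and_predict(['/cpu:0'])
pred_2 = train_and_predict(['/cpu:0', '/cpu:1'])
print('Pred Linf:', np.max(np.abs(pred_1 - pred_2)))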

Standalone code to reproduce the issue:
https://colab.research.google.com/drive/1XKqIS2_1Ho5d4KfO4uFI0ddMJGzlzSYf?usp=sharing

Relevant log output

1 CPU vs 2 CPUs:
Pred Linf: 0.0029296875
Loss Linf: 2.3841858e-07
Gradient 0 Linf: 8.381903e-09
Gradient 1 Linf: 1.8626451e-08
Gradient 2 Linf: 1.2777746e-06
Gradient 3 Linf: 2.9802322e-08
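For reference, the Linf numbers above are largest elementwise absolute differences; a sketch of how they can be computed, where grads_1/grads_2 are hypothetical per-variable gradient lists captured from each run:

import numpy as np

def linf(a, b):
    # Largest elementwise absolute difference between two tensors/arrays.
    return np.max(np.abs(np.asarray(a) - np.asarray(b)))

print('Pred Linf:', linf(pred_1, pred_2))
for i, (g1, g2) in enumerate(zip(grads_1, grads_2)):
    print(f'Gradient {i} Linf:', linf(g1, g2))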

@sachinprasadhs added the keras-team-review-pending (Pending review by a Keras team member.) and type:Bug labels on Aug 28, 2024
@mattdangerw removed the keras-team-review-pending label on Aug 29, 2024