Inconsistent results from distributed training of models containing SeparableConv2D #20153

Open
jiannanWang opened this issue Aug 23, 2024 · 0 comments

@jiannanWang

This issue is transferred from the tensorflow repo as suggested by the developers. Original issue: tensorflow/tensorflow#71833.

I trained a small model in distributed settings and detected inconsistent results. I would expect no difference when training the same model on the same input with different world sizes.

I first created a four-layer model containing a SeparableConv2D layer defined by the following code:

from tensorflow import keras
from tensorflow.keras import layers

# Four-layer model: SeparableConv2D -> Flatten -> Dense -> Reshape.
layer_0 = layers.Input(shape=[32, 32, 3], dtype='float32', name='00_input_object')
layer_1 = layers.SeparableConv2D(
    filters=4, kernel_size=[27, 27], strides=[1, 1], padding='same',
    data_format='channels_last', dilation_rate=[1, 1], depth_multiplier=5,
    activation='sigmoid', use_bias=True, depthwise_initializer='random_uniform',
    pointwise_initializer='random_uniform', bias_initializer='random_uniform',
    name='01_separable_conv2D')(layer_0)
layer_2 = layers.Flatten(name='02_flatten')(layer_1)
layer_3 = layers.Dense(units=10, activation='linear', use_bias=False,
                       kernel_initializer='random_uniform', name='03_dense')(layer_2)
layer_4 = layers.Reshape(target_shape=[10], name='04_reshape')(layer_3)
model = keras.Model(layer_0, layer_4)

Then I trained it using MirroredStrategy with 1 CPU and 2 CPUs, respectively, and compared the prediction results. There was a large difference between the two predictions (Pred Linf: 0.0029296875), even though further investigation shows that the loss and gradient values are close (Linf < 2e-6).

Note that I deliberately use a large learning rate (10.0) in the reproduction code below to make the inconsistency clearly visible in the nightly version.
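For orientation, here is a minimal sketch of the comparison harness; the full reproduction is in the Colab linked below. The two logical CPU devices, the random data, the MSE loss, the SGD optimizer, and the build_model() helper (wrapping the model definition above) are all assumptions, not the exact setup from the Colab:

import numpy as np
import tensorflow as tf

# Split the physical CPU into two logical devices so MirroredStrategy can
# mirror across them (assumption: the Colab uses a similar setup).
tf.config.set_logical_device_configuration(
    tf.config.list_physical_devices('CPU')[0],
    [tf.config.LogicalDeviceConfiguration(),
     tf.config.LogicalDeviceConfiguration()])

x = np.random.uniform(size=(8, 32, 32, 3)).astype('float32')
y = np.random.uniform(size=(8, 10)).astype('float32')

def train_and_predict(devices):
    tf.keras.utils.set_random_seed(0)  # identical initial weights per run
    strategy = tf.distribute.MirroredStrategy(devices=devices)
    with strategy.scope():
        model = build_model()  # hypothetical helper wrapping the model code above
        model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=10.0),
                      loss='mse')
    model.fit(x, y, batch_size=8, epochs=1, verbose=0)
    return model.predict(x, verbose=0)

pred_1 = train_and_predict(['/cpu:0'])
pred_2 = train_and_predict(['/cpu:0', '/cpu:1'])
print('Pred Linf:', np.max(np.abs(pred_1 - pred_2)))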

Standalone code to reproduce the issue:
https://colab.research.google.com/drive/1XKqIS2_1Ho5d4KfO4uFI0ddMJGzlzSYf?usp=sharing

Relevant log output

1 CPU vs 2 CPUs:
Pred Linf: 0.0029296875
Loss Linf: 2.3841858e-07
Gradient 0 Linf: 8.381903e-09
Gradient 1 Linf: 1.8626451e-08
Gradient 2 Linf: 1.2777746e-06
Gradient 3 Linf: 2.9802322e-08
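For reference, the Linf numbers above are largest elementwise absolute differences; a sketch of how they can be computed, where grads_1/grads_2 are hypothetical per-variable gradient lists captured from each run:

import numpy as np

def linf(a, b):
    # Largest elementwise absolute difference between two tensors/arrays.
    return np.max(np.abs(np.asarray(a) - np.asarray(b)))

print('Pred Linf:', linf(pred_1, pred_2))
for i, (g1, g2) in enumerate(zip(grads_1, grads_2)):
    print(f'Gradient {i} Linf:', linf(g1, g2))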

@sachinprasadhs added the keras-team-review-pending (Pending review by a Keras team member.) and type:Bug labels on Aug 28, 2024
@mattdangerw removed the keras-team-review-pending label on Aug 29, 2024