GRU + Large recurrent_dropout Bug #20276

Open · nokados opened this issue Sep 21, 2024 · 4 comments
Labels: keras-team-review-pending (Pending review by a Keras team member.), type:Bug

nokados commented Sep 21, 2024

Keras version: 3.5.0
Backend: TensorFlow 2.17.0

I encountered a bug when working with the GRU layer. If you create a simple model with a GRU layer and set recurrent_dropout=0.5, the behavior depends heavily on the sequence length:

  1. With sequence length 20: Everything works as expected.
  2. With sequence length 100: The output of the GRU layer during training, with the default tanh activation, produces very large values in the range of ±1e25, even though it should be constrained to [-1, 1]. This results in an extremely large loss.
  3. With sequence length 145: The behavior is unstable. I received the following warning:
   Object was never used (type <class 'tensorflow.python.ops.tensor_array_ops.TensorArray'>):
<tensorflow.python.ops.tensor_array_ops.TensorArray object at 0x7f6311231eb0>
If you want to mark it as used call its "mark_used()" method.
It was originally created here:
  File "/home/nokados/.pyenv/versions/qaclassifiers39-pdm/lib/python3.9/site-packages/keras/src/backend/tensorflow/rnn.py", line 419, in <genexpr>
    ta.write(ta_index_to_write, out)
  File "/home/nokados/.pyenv/versions/qaclassifiers39-pdm/lib/python3.9/site-packages/tensorflow/python/util/tf_should_use.py", line 288, in wrapped
    return _add_should_use_warning(fn(*args, **kwargs),

I was unable to reproduce this behavior in Colab; there, either the loss becomes inf, or it behaves similarly to the longer sequence lengths.
  4. With sequence length 200: It throws an error:

Epoch 1/50

2024-09-21 22:10:35.493005: I tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: INVALID_ARGUMENT: indices[0] = 2648522 is not in [0, 25601)

---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
Cell In[15], line 1
----> 1 model.fit(
      2     dataset, onehot_target,
      3     batch_size=128,
      4     epochs=50,     
     5 )

File ~/.pyenv/versions/qaclassifiers39-pdm/lib/python3.9/site-packages/keras/src/utils/traceback_utils.py:122, in filter_traceback.<locals>.error_handler(*args, **kwargs)
    119     filtered_tb = _process_traceback_frames(e.__traceback__)
    120     # To get the full stack trace, call:
    121     # `keras.config.disable_traceback_filtering()`
--> 122     raise e.with_traceback(filtered_tb) from None
    123 finally:
    124     del filtered_tb

File ~/.pyenv/versions/qaclassifiers39-pdm/lib/python3.9/site-packages/keras/src/backend/tensorflow/sparse.py:136, in indexed_slices_union_indices_and_values.<locals>.values_for_union(indices_expanded, indices_count, values)
    132 to_union_indices = tf.gather(indices_indices, union_indices)
    133 values_with_leading_zeros = tf.concat(
    134     [tf.zeros((1,) + values.shape[1:], values.dtype), values], axis=0
    135 )
--> 136 return tf.gather(values_with_leading_zeros, to_union_indices)

InvalidArgumentError: {{function_node __wrapped__GatherV2_device_/job:localhost/replica:0/task:0/device:CPU:0}} indices[0] = 2648522 is not in [0, 25601) [Op:GatherV2] name:

Key points:

  • This issue only occurs with GRU. LSTM works fine.
  • It requires a large recurrent_dropout (0.5 here); smaller values, such as 0.1, work fine.

Irrelevant factors:

  • Initialization does not affect the outcome.
  • The optimizer only slightly affects the behavior: I observed the errors with rmsprop; adam did not throw errors but resulted in loss = nan.
  • Regular dropout does not affect the issue.

I have prepared a minimal reproducible example in Colab. Here is the link: https://colab.research.google.com/drive/1msGuYB5E_eg_IIU_YK4cJcWrkEm3o0NL?usp=sharing.
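
For reference, the setup is roughly along these lines (the sizes and hyperparameters below are illustrative placeholders, not the exact values from my notebook):

```python
import numpy as np
from keras import layers, models

# Illustrative sizes only; the actual notebook uses different values.
vocab_size, seq_len, num_classes = 20_000, 100, 5

model = models.Sequential([
    layers.Embedding(vocab_size, 128),
    layers.GRU(64, recurrent_dropout=0.5),  # the problematic setting
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")

# Random integer sequences and one-hot targets, just to exercise the layer.
x = np.random.randint(0, vocab_size, size=(512, seq_len))
y = np.eye(num_classes)[np.random.randint(0, num_classes, size=512)]
model.fit(x, y, batch_size=128, epochs=1)
```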

mehtamansi29 (Collaborator) commented:

Hi @nokados -

Here are my responses to the points you mentioned:

  1. With sequence length 100: the model is simple while the input sequence is long. Adding a Dense layer with a tanh activation after the GRU to increase model capacity reduces the loss (see the sketch after this list).

  2. With sequence length 145: along with the additional Dense layer, increasing the number of units in the GRU and Dense layers gives good accuracy and loss.

  3. With sequence length 200: using the Adam optimizer with a tuned learning_rate, recurrent_dropout=0.2, and the same units and layers as for sequence length 145 trains properly without errors. Since the GRU layer increases the model complexity here, rmsprop adapts less well.
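
For illustration, the change from point 1 looks roughly like this (a sketch with placeholder sizes, not the exact code from the gist):

```python
from keras import layers, models

vocab_size, num_classes = 20_000, 5  # placeholder sizes

model = models.Sequential([
    layers.Embedding(vocab_size, 128),
    layers.GRU(64, recurrent_dropout=0.5),
    layers.Dense(64, activation="tanh"),   # extra Dense layer with tanh after the GRU
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```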

Also, a recurrent_dropout this large leads to underfitting, so reducing recurrent_dropout lets the model learn the patterns in the input data more easily.

The linked gist shows all the changes mentioned for the different sequence lengths. Let me know if anything more is required.


nokados commented Sep 26, 2024

This seems more like a workaround than a solution to the original problem. Adding an extra layer with a tanh activation doesn't address the issue; it merely "hides" the enormous outputs of the GRU by squashing them back into the range -1 to 1. The problem is that these values should already be in this range after the GRU, since tanh is built into it. Mathematically, it shouldn't produce values like -2.5e25, and Keras should behave accordingly.

mehtamansi29 added the keras-team-review-pending (Pending review by a Keras team member.) label on Sep 30, 2024
hertschuh (Contributor) commented:

@nokados ,

The output of the GRU layer during training, with the default tanh activation, produces very large values in the range of ±1e25, even though it should be constrained to [-1, 1]. This results in an extremely large loss.

The tanh activation is not applied to the final output of the GRU; it is applied to intermediate calculations. The output of the GRU can be outside of [-1, 1]; nothing prevents that.

If you create a simple model with a GRU layer and set recurrent_dropout=0.5, very strange behavior occurs:

With sequence length 20: Everything works as expected.
With sequence length 100: The output of the GRU layer during training [...] produces very large values in the range of ±1e25

What happens is that recurrent_dropout is applied to the intermediate state for each item in the sequence. So with a sequence length of 100, the recurrent_dropout of 0.5 is applied a hundred times. Almost all of the state gets dropped, to the point that the math becomes meaningless and the model cannot learn.

To avoid this, you have to adapt the recurrent_dropout to the sequence length. A recurrent_dropout of 0.5 may be fine for a sequence length of 20, but as you found experimentally, with a sequence length of 100 a recurrent_dropout of 0.1 is probably more appropriate.
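
For intuition, here is a toy simulation of just the rescaling effect (a deliberate oversimplification that ignores the gates and weights entirely; it is not the actual GRU cell code). With inverted dropout, kept units are rescaled by 1 / (1 - rate); if that rescaling hits the carried state at every timestep, it compounds geometrically over a long sequence:

```python
import numpy as np

rng = np.random.default_rng(0)
rate, units, steps = 0.5, 8, 100

# One recurrent dropout mask, reused at every timestep: dropped units are zeroed,
# kept units are rescaled by 1 / (1 - rate) (inverted dropout).
mask = rng.binomial(1, 1.0 - rate, size=units) / (1.0 - rate)

h = rng.uniform(-0.1, 0.1, size=units)
for _ in range(steps):
    h = mask * h  # only track the masking/rescaling, nothing else

print(np.abs(h).max())  # kept units have been scaled by 2**100 ≈ 1.3e30; dropped units are exactly 0
```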


nokados commented Oct 2, 2024

A. Improper behavior:

  1. It worked until Keras 3.
  2. It works fine with LSTM.
  3. Let's look at the GRU math:

Update gate:

$$ z_t = \sigma(W_z \cdot x_t + U_z \cdot h_{t-1} + b_z) $$

From 0 to 1.

Reset gate:

$$ r_t = \sigma(W_r \cdot x_t + U_r \cdot h_{t-1} + b_r) $$

From 0 to 1.

Candidate hidden state:

$$ \tilde{h}_t = \tanh(W_h \cdot x_t + U_h \cdot (r_t \odot h_{t-1}) + b_h) $$

From -1 to 1.

New hidden state:

$$ h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t $$

  • $1 - z_t$ ranges from 0 to 1.
  • $(1 - z_t) \odot h_{t-1}$ ranges from 0 to $h_{t-1}$.
  • $z_t \odot \tilde{h}_t$ ranges from -1 to 1.

Correct?

At each recurrent step, $h_t$ is a convex combination of $h_{t-1}$ and $\tilde{h}_t \in [-1, 1]$, so starting from $h_0 = 0$ the state should never leave $[-1, 1]$; even the looser per-step bound (a change of at most about 1 per step) would only allow $|h_t| \le 100$ after 100 steps, nowhere near $10^{25}$. In practice, the values stay in the $[-0.1, 0.1]$ range without recurrent dropout.

This behavior remains the same for the model before fitting, so we can ignore the trainable weights for now.
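
To back up this argument numerically, here is a quick standalone sketch (random untrained weights, no dropout; my own code, not Keras internals) showing that the hidden state never leaves [-1, 1]:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
units, steps = 8, 1000

# Random "untrained" weights; biases omitted for brevity.
Wz, Wr, Wh = [rng.normal(scale=0.5, size=(units, units)) for _ in range(3)]
Uz, Ur, Uh = [rng.normal(scale=0.5, size=(units, units)) for _ in range(3)]

h = np.zeros(units)
for _ in range(steps):
    x = rng.normal(size=units)
    z = sigmoid(Wz @ x + Uz @ h)              # update gate, in (0, 1)
    r = sigmoid(Wr @ x + Ur @ h)              # reset gate, in (0, 1)
    h_cand = np.tanh(Wh @ x + Uh @ (r * h))   # candidate state, in (-1, 1)
    h = (1 - z) * h + z * h_cand              # convex combination

print(np.abs(h).max())  # stays below 1 no matter how many steps are run
```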

Also, look at the relationship with recurrent_dropout: [attached plot]

How is the recurrent dropout applied? I am not sure, but I guess it happens inside the tanh in the $\tilde{h}_t$ calculation, so it shouldn't affect the output limits.
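
To make that guess concrete, here is a one-step sketch of that hypothetical placement (written from the equations above, not taken from the Keras source):

```python
import numpy as np

def gru_step_with_guessed_recurrent_dropout(x, h_prev, W, U, mask):
    """One GRU step where the recurrent dropout mask is applied only to the
    copy of h_prev that feeds the gates (a hypothetical placement)."""
    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    Wz, Wr, Wh = W  # input kernels
    Uz, Ur, Uh = U  # recurrent kernels
    h_masked = mask * h_prev                        # masked copy used only for the gate inputs
    z = sigmoid(Wz @ x + Uz @ h_masked)             # update gate
    r = sigmoid(Wr @ x + Ur @ h_masked)             # reset gate
    h_cand = np.tanh(Wh @ x + Uh @ (r * h_masked))  # candidate, still squashed by tanh
    return (1 - z) * h_prev + z * h_cand            # the carried h_prev itself is unmasked
```

Under this placement, $h_t$ stays a convex combination of the unmasked $h_{t-1}$ and a tanh output, so it would remain bounded; the bound would only break if the rescaled mask were also applied to the $h_{t-1}$ term carried into the final interpolation.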

B. Other problems

What about the exceptions? Is it acceptable that too large a recurrent_dropout causes `indices[0] = 2648522 is not in [0, 25601) [Op:GatherV2]`? Could this be a memory issue?
