GRU + Large recurrent_dropout Bug #20276

Open · nokados opened this issue Sep 21, 2024 · 4 comments
Labels: keras-team-review-pending (Pending review by a Keras team member.), type:Bug

nokados commented Sep 21, 2024

Keras version: 3.5.0
Backend: TensorFlow 2.17.0

I encountered a bug when working with the GRU layer. If you create a simple model with a GRU layer and set recurrent_dropout=0.5, the behavior depends heavily on the sequence length:

  1. With sequence length 20: Everything works as expected.
  2. With sequence length 100: The output of the GRU layer during training, with the default tanh activation, produces very large values in the range of ±1e25, even though it should be constrained to [-1, 1]. This results in an extremely large loss.
  3. With sequence length 145: The behavior is unstable. I received the following warning:
   Object was never used (type <class 'tensorflow.python.ops.tensor_array_ops.TensorArray'>):
<tensorflow.python.ops.tensor_array_ops.TensorArray object at 0x7f6311231eb0>
If you want to mark it as used call its "mark_used()" method.
It was originally created here:
  File "/home/nokados/.pyenv/versions/qaclassifiers39-pdm/lib/python3.9/site-packages/keras/src/backend/tensorflow/rnn.py", line 419, in <genexpr>
    ta.write(ta_index_to_write, out)
  File "/home/nokados/.pyenv/versions/qaclassifiers39-pdm/lib/python3.9/site-packages/tensorflow/python/util/tf_should_use.py", line 288, in wrapped
    return _add_should_use_warning(fn(*args, **kwargs),

I was unable to reproduce this behavior in Colab; there, either the loss becomes inf, or it behaves similarly to the longer sequence lengths.
  4. With sequence length 200: It throws an error:

Epoch 1/50

2024-09-21 22:10:35.493005: I tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: INVALID_ARGUMENT: indices[0] = 2648522 is not in [0, 25601)

---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
Cell In[15], line 1
----> 1 model.fit(
      2     dataset, onehot_target,
      3     batch_size=128,
      4     epochs=50,     
     5 )

File ~/.pyenv/versions/qaclassifiers39-pdm/lib/python3.9/site-packages/keras/src/utils/traceback_utils.py:122, in filter_traceback.<locals>.error_handler(*args, **kwargs)
    119     filtered_tb = _process_traceback_frames(e.__traceback__)
    120     # To get the full stack trace, call:
    121     # `keras.config.disable_traceback_filtering()`
--> 122     raise e.with_traceback(filtered_tb) from None
    123 finally:
    124     del filtered_tb

File ~/.pyenv/versions/qaclassifiers39-pdm/lib/python3.9/site-packages/keras/src/backend/tensorflow/sparse.py:136, in indexed_slices_union_indices_and_values.<locals>.values_for_union(indices_expanded, indices_count, values)
    132 to_union_indices = tf.gather(indices_indices, union_indices)
    133 values_with_leading_zeros = tf.concat(
    134     [tf.zeros((1,) + values.shape[1:], values.dtype), values], axis=0
    135 )
--> 136 return tf.gather(values_with_leading_zeros, to_union_indices)

InvalidArgumentError: {{function_node __wrapped__GatherV2_device_/job:localhost/replica:0/task:0/device:CPU:0}} indices[0] = 2648522 is not in [0, 25601) [Op:GatherV2] name:

Key points:

  • This issue only occurs with GRU. LSTM works fine.
  • It requires a large recurrent_dropout (0.5 here); smaller values, such as 0.1, work fine.

Irrelevant factors:

  • Initialization does not affect the outcome.
  • The optimizer only slightly affects the behavior: I observed the errors with rmsprop; adam did not throw errors but resulted in loss = nan.
  • Regular dropout does not affect the issue.

I have prepared a minimal reproducible example in Colab. Here is the link: https://colab.research.google.com/drive/1msGuYB5E_eg_IIU_YK4cJcWrkEm3o0NL?usp=sharing.
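
For reference, the setup is roughly along these lines (the sizes and hyperparameters below are illustrative placeholders, not the exact values from my notebook):

```python
import numpy as np
from keras import layers, models

# Illustrative sizes only; the actual notebook uses different values.
vocab_size, seq_len, num_classes = 20_000, 100, 5

model = models.Sequential([
    layers.Embedding(vocab_size, 128),
    layers.GRU(64, recurrent_dropout=0.5),  # the problematic setting
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")

# Random integer sequences and one-hot targets, just to exercise the layer.
x = np.random.randint(0, vocab_size, size=(512, seq_len))
y = np.eye(num_classes)[np.random.randint(0, num_classes, size=512)]
model.fit(x, y, batch_size=128, epochs=1)
```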

mehtamansi29 (Collaborator) commented:

Hi @nokados -

Here are my responses to the points you mentioned:

  1. With sequence length 100: the model is simple while the input sequence is long. Adding a Dense layer with a tanh activation after the GRU to increase model capacity reduces the loss (see the sketch after this list).

  2. With sequence length 145: along with the additional Dense layer, increasing the number of units in the GRU and Dense layers gives good accuracy and loss.

  3. With sequence length 200: using the Adam optimizer with a tuned learning_rate, recurrent_dropout=0.2, and the same units and layers as for sequence length 145 trains properly without errors. Since the GRU layer increases the model complexity here, rmsprop adapts less well.
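
For illustration, the change from point 1 looks roughly like this (a sketch with placeholder sizes, not the exact code from the gist):

```python
from keras import layers, models

vocab_size, num_classes = 20_000, 5  # placeholder sizes

model = models.Sequential([
    layers.Embedding(vocab_size, 128),
    layers.GRU(64, recurrent_dropout=0.5),
    layers.Dense(64, activation="tanh"),   # extra Dense layer with tanh after the GRU
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```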

Also, a recurrent_dropout this large leads to underfitting, so reducing recurrent_dropout lets the model learn the patterns in the input data more easily.

The linked gist shows all the changes mentioned for the different sequence lengths. Let me know if anything more is required.


nokados commented Sep 26, 2024

This seems more like a workaround than a solution to the original problem. Adding an extra layer with a tanh activation doesn't address the issue; it merely "hides" the enormous outputs of the GRU by squashing them back into the range -1 to 1. The problem is that these values should already be in this range after the GRU, since tanh is built into it. Mathematically, it shouldn't produce values like -2.5e25, and Keras should behave accordingly.

mehtamansi29 added the keras-team-review-pending (Pending review by a Keras team member.) label on Sep 30, 2024
hertschuh (Contributor) commented:

@nokados ,

The output of the GRU layer during training, with the default tanh activation, produces very large values in the range of ±1e25, even though it should be constrained to [-1, 1]. This results in an extremely large loss.

The tanh activation is not applied to the final output of the GRU; it is applied to intermediate calculations. The output of the GRU can be outside of [-1, 1]; nothing prevents that.

If you create a simple model with a GRU layer and set recurrent_dropout=0.5, very strange behavior occurs:

With sequence length 20: Everything works as expected.
With sequence length 100: The output of the GRU layer during training [...] produces very large values in the range of ±1e25

What happens is that recurrent_dropout is applied to the intermediate state for each item in the sequence. So with a sequence length of 100, the recurrent_dropout of 0.5 is applied a hundred times. Almost all of the state gets dropped, to the point that the math becomes meaningless and the model cannot learn.

To avoid this, you have to adapt the recurrent_dropout to the sequence length. A recurrent_dropout of 0.5 may be fine for a sequence length of 20, but as you found experimentally, with a sequence length of 100 a recurrent_dropout of 0.1 is probably more appropriate.
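
For intuition, here is a toy simulation of just the rescaling effect (a deliberate oversimplification that ignores the gates and weights entirely; it is not the actual GRU cell code). With inverted dropout, kept units are rescaled by 1 / (1 - rate); if that rescaling hits the carried state at every timestep, it compounds geometrically over a long sequence:

```python
import numpy as np

rng = np.random.default_rng(0)
rate, units, steps = 0.5, 8, 100

# One recurrent dropout mask, reused at every timestep: dropped units are zeroed,
# kept units are rescaled by 1 / (1 - rate) (inverted dropout).
mask = rng.binomial(1, 1.0 - rate, size=units) / (1.0 - rate)

h = rng.uniform(-0.1, 0.1, size=units)
for _ in range(steps):
    h = mask * h  # only track the masking/rescaling, nothing else

print(np.abs(h).max())  # kept units have been scaled by 2**100 ≈ 1.3e30; dropped units are exactly 0
```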


nokados commented Oct 2, 2024

A. Improper behavior:

  1. It worked until Keras 3.
  2. It works fine with LSTM.
  3. Let's look at the GRU math:

Update gate:

$$ z_t = \sigma(W_z \cdot x_t + U_z \cdot h_{t-1} + b_z) $$

From 0 to 1.

Reset gate:

$$ r_t = \sigma(W_r \cdot x_t + U_r \cdot h_{t-1} + b_r) $$

From 0 to 1.

Candidate hidden state:

$$ \tilde{h}_t = \tanh(W_h \cdot x_t + U_h \cdot (r_t \odot h_{t-1}) + b_h) $$

From -1 to 1.

New hidden state:

$$ h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t $$

  • $1 - z_t$ ranges from 0 to 1.
  • $(1 - z_t) \odot h_{t-1}$ ranges from 0 to $h_{t-1}$.
  • $z_t \odot \tilde{h}_t$ ranges from -1 to 1.

Correct?

At each recurrent step, $h_t$ is a convex combination of $h_{t-1}$ and $\tilde{h}_t \in [-1, 1]$, so starting from $h_0 = 0$ the state should never leave $[-1, 1]$; even the looser per-step bound (a change of at most about 1 per step) would only allow $|h_t| \le 100$ after 100 steps, nowhere near $10^{25}$. In practice, the values stay in the $[-0.1, 0.1]$ range without recurrent dropout.

This behavior remains the same for the model before fitting, so we can ignore the trainable weights for now.
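
To back up this argument numerically, here is a quick standalone sketch (random untrained weights, no dropout; my own code, not Keras internals) showing that the hidden state never leaves [-1, 1]:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
units, steps = 8, 1000

# Random "untrained" weights; biases omitted for brevity.
Wz, Wr, Wh = [rng.normal(scale=0.5, size=(units, units)) for _ in range(3)]
Uz, Ur, Uh = [rng.normal(scale=0.5, size=(units, units)) for _ in range(3)]

h = np.zeros(units)
for _ in range(steps):
    x = rng.normal(size=units)
    z = sigmoid(Wz @ x + Uz @ h)              # update gate, in (0, 1)
    r = sigmoid(Wr @ x + Ur @ h)              # reset gate, in (0, 1)
    h_cand = np.tanh(Wh @ x + Uh @ (r * h))   # candidate state, in (-1, 1)
    h = (1 - z) * h + z * h_cand              # convex combination

print(np.abs(h).max())  # stays below 1 no matter how many steps are run
```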

Also, look at the relationship with recurrent_dropout: [attached plot]

How is the recurrent dropout applied? I am not sure, but I guess it happens inside the tanh in the $\tilde{h}_t$ calculation, so it shouldn't affect the output limits.
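
To make that guess concrete, here is a one-step sketch of that hypothetical placement (written from the equations above, not taken from the Keras source):

```python
import numpy as np

def gru_step_with_guessed_recurrent_dropout(x, h_prev, W, U, mask):
    """One GRU step where the recurrent dropout mask is applied only to the
    copy of h_prev that feeds the gates (a hypothetical placement)."""
    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    Wz, Wr, Wh = W  # input kernels
    Uz, Ur, Uh = U  # recurrent kernels
    h_masked = mask * h_prev                        # masked copy used only for the gate inputs
    z = sigmoid(Wz @ x + Uz @ h_masked)             # update gate
    r = sigmoid(Wr @ x + Ur @ h_masked)             # reset gate
    h_cand = np.tanh(Wh @ x + Uh @ (r * h_masked))  # candidate, still squashed by tanh
    return (1 - z) * h_prev + z * h_cand            # the carried h_prev itself is unmasked
```

Under this placement, $h_t$ stays a convex combination of the unmasked $h_{t-1}$ and a tanh output, so it would remain bounded; the bound would only break if the rescaled mask were also applied to the $h_{t-1}$ term carried into the final interpolation.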

B. Other problems

What about the exceptions? Is it acceptable that too large a recurrent_dropout causes `indices[0] = 2648522 is not in [0, 25601) [Op:GatherV2]`? Could this be a memory issue?
