Incorrect implementation of self-attention #23

TrinitialChan · 2023-11-21T06:07:24Z

Your paper states that the decoder performs stacked multi-head self-attention, but the behavior of the DecoderLayer class in the code is inconsistent with that description. Printing the attn_output_weights of the self_attn module gives an attention map of shape '([L, 1, 1])'. With a 1 x 1 map per entry, each softmax runs over a single key, so every token can only attend to itself and no information is mixed across the L positions. That is clearly not a valid self-attention computation. #22
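To illustrate why a '([L, 1, 1])' map is degenerate, here is a minimal pure-Python sketch of scaled dot-product attention weights. It is not the repository's code: the helper names (`softmax`, `attention_weights`) and the toy `tokens` input are hypothetical, and projections and multiple heads are left out. One possible cause, which this issue does not confirm, is that the sequence axis is being read as the batch axis. For example, torch.nn.MultiheadAttention defaults to batch_first=False and expects input shaped (sequence, batch, embedding).

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Return a T x S matrix of scaled dot-product attention weights.

    queries: list of T vectors, keys: list of S vectors (same dimension).
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    weights = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights.append(softmax(scores))
    return weights


# Hypothetical toy input: L = 3 tokens with embedding size 2.
tokens = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]

# Intended behaviour: one sequence of L tokens gives a single L x L map.
full_map = attention_weights(tokens, tokens)

# Degenerate behaviour: each token is treated as its own length-1 sequence,
# which gives L separate 1 x 1 maps, i.e. an overall shape of [L, 1, 1].
degenerate_maps = [attention_weights([t], [t]) for t in tokens]

print(len(full_map), len(full_map[0]))  # rows and columns of the full map
print(degenerate_maps)                  # every weight is a softmax over one key
```

In the degenerate case every weight is a softmax over one score, so the "attention" output is just each token's own value vector. A map of shape [1, L, L] (batch, target length, source length) is what stacked self-attention over L tokens would be expected to produce for a single sequence.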
