Hey, why is the out projection from the Attention block optional? See vit-pytorch/vit_pytorch/vit.py, line 33 at commit d47c57e.

In the original "Attention Is All You Need" paper there is always an out projection $W^O$ from the Attention block, as given by the un-numbered equations in section 3.2.2. The projection is also always applied in the timm library, see:
https://github.com/huggingface/pytorch-image-models/blob/ae0737f5d098900180c4457845dda35433ab92c0/timm/models/vision_transformer.py#L105
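For context, the construction at the referenced line looks roughly like the sketch below. This is a paraphrase, not the exact code at commit d47c57e; the parameter names (`dim`, `heads`, `dim_head`, `dropout`) follow the usual vit-pytorch convention. The point is that `to_out` is replaced by an identity exactly when a single head already has width `dim`.

```python
import torch.nn as nn

class Attention(nn.Module):
    def __init__(self, dim, heads=8, dim_head=64, dropout=0.):
        super().__init__()
        inner_dim = dim_head * heads
        # The out projection is skipped only when there is exactly one head
        # AND its width equals the model dim (see the answer below for why).
        project_out = not (heads == 1 and dim_head == dim)

        self.heads = heads
        self.scale = dim_head ** -0.5
        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias=False)
        self.to_out = nn.Sequential(
            nn.Linear(inner_dim, dim),   # this is W^O
            nn.Dropout(dropout)
        ) if project_out else nn.Identity()
```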
To answer myself here: it seems the other implementations don't treat the heads = 1 case specially and simply use two linear layers in a row, which is redundant.

In general,

$$\left(\Vert_h A_h V_h\right) W^O \neq \Vert_h\, A_h V_h W^O_h,$$

since the left-hand side is a sum over heads of $A_h V_h$ times the corresponding row block of $W^O$, not a concatenation of per-head projections. The two sides only agree when $h = 1$, because then both collapse to $A V W^O$, and $V W^O$ can be merged into a single linear layer without loss of expressivity. The dim_head == dim condition is also necessary, since we would otherwise still have to project the $B \times L \times d_h$ output back to shape $B \times L \times d$ when $d_h \neq d$.
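To make the argument concrete, here is a small numerical check (a sketch assuming PyTorch; all matrices are random stand-ins for learned weights, not taken from any model). The first part verifies that for $h = 1$ the value and out projections collapse into a single linear map; the second verifies that a shared $W^O$ applied after concatenation acts as a sum over row blocks, so it mixes heads and cannot be folded into per-head projections.

```python
import torch

torch.manual_seed(0)
B, L, d = 2, 5, 8                                   # batch, sequence length, model dim
A = torch.softmax(torch.randn(B, L, L), dim=-1)     # arbitrary attention weights

# Case h = 1, dim_head == dim: applying W^V and then W^O equals applying the
# single merged matrix W^V @ W^O, so W^O adds no expressivity here.
x = torch.randn(B, L, d)
W_V, W_O = torch.randn(d, d), torch.randn(d, d)
two_layers = (A @ (x @ W_V)) @ W_O
merged     = A @ (x @ (W_V @ W_O))
print(torch.allclose(two_layers, merged, atol=1e-4))       # True

# Case h > 1: concatenating heads and applying a shared W^O equals a SUM of
# per-head terms A_h V_h W^O_h (row blocks of W^O), not a concatenation of
# them, so it cannot be absorbed into the per-head value projections.
h, d_h = 2, d // 2
head_outputs = [A @ torch.randn(B, L, d_h) for _ in range(h)]   # stand-ins for A_h V_h
W_O_multi = torch.randn(h * d_h, d)
concat_then_project = torch.cat(head_outputs, dim=-1) @ W_O_multi
sum_over_heads = sum(H @ W_O_multi[i * d_h:(i + 1) * d_h]
                     for i, H in enumerate(head_outputs))
print(torch.allclose(concat_then_project, sum_over_heads, atol=1e-4))  # True
```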