Hey, why is the out projection from the Attention block optional? See vit-pytorch/vit_pytorch/vit.py, line 33 at commit d47c57e.

In the original "Attention Is All You Need" paper there is always an out projection $W^O$ from the Attention block, as given by the un-numbered equations in section 3.2.2. The projection is also always applied in the timm library, see:
https://github.com/huggingface/pytorch-image-models/blob/ae0737f5d098900180c4457845dda35433ab92c0/timm/models/vision_transformer.py#L105
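For context, the construction at the referenced line looks roughly like the sketch below. This is a paraphrase, not the exact code at commit d47c57e; the parameter names (`dim`, `heads`, `dim_head`, `dropout`) follow the usual vit-pytorch convention. The point is that `to_out` is replaced by an identity exactly when a single head already has width `dim`.

```python
import torch.nn as nn

class Attention(nn.Module):
    def __init__(self, dim, heads=8, dim_head=64, dropout=0.):
        super().__init__()
        inner_dim = dim_head * heads
        # The out projection is skipped only when there is exactly one head
        # AND its width equals the model dim (see the answer below for why).
        project_out = not (heads == 1 and dim_head == dim)

        self.heads = heads
        self.scale = dim_head ** -0.5
        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias=False)
        self.to_out = nn.Sequential(
            nn.Linear(inner_dim, dim),   # this is W^O
            nn.Dropout(dropout)
        ) if project_out else nn.Identity()
```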
To answer myself here: it seems the other implementations don't treat the heads = 1 case specially and simply use two linear layers in a row, which is redundant.

In general,

$$\left(\Vert_h A_h V_h\right) W^O \neq \Vert_h\, A_h V_h W^O_h,$$

since the left-hand side is a sum over heads of $A_h V_h$ times the corresponding row block of $W^O$, not a concatenation of per-head projections. The two sides only agree when $h = 1$, because then both collapse to $A V W^O$, and $V W^O$ can be merged into a single linear layer without loss of expressivity. The dim_head == dim condition is also necessary, since we would otherwise still have to project the $B \times L \times d_h$ output back to shape $B \times L \times d$ when $d_h \neq d$.
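To make the argument concrete, here is a small numerical check (a sketch assuming PyTorch; all matrices are random stand-ins for learned weights, not taken from any model). The first part verifies that for $h = 1$ the value and out projections collapse into a single linear map; the second verifies that a shared $W^O$ applied after concatenation acts as a sum over row blocks, so it mixes heads and cannot be folded into per-head projections.

```python
import torch

torch.manual_seed(0)
B, L, d = 2, 5, 8                                   # batch, sequence length, model dim
A = torch.softmax(torch.randn(B, L, L), dim=-1)     # arbitrary attention weights

# Case h = 1, dim_head == dim: applying W^V and then W^O equals applying the
# single merged matrix W^V @ W^O, so W^O adds no expressivity here.
x = torch.randn(B, L, d)
W_V, W_O = torch.randn(d, d), torch.randn(d, d)
two_layers = (A @ (x @ W_V)) @ W_O
merged     = A @ (x @ (W_V @ W_O))
print(torch.allclose(two_layers, merged, atol=1e-4))       # True

# Case h > 1: concatenating heads and applying a shared W^O equals a SUM of
# per-head terms A_h V_h W^O_h (row blocks of W^O), not a concatenation of
# them, so it cannot be absorbed into the per-head value projections.
h, d_h = 2, d // 2
head_outputs = [A @ torch.randn(B, L, d_h) for _ in range(h)]   # stand-ins for A_h V_h
W_O_multi = torch.randn(h * d_h, d)
concat_then_project = torch.cat(head_outputs, dim=-1) @ W_O_multi
sum_over_heads = sum(H @ W_O_multi[i * d_h:(i + 1) * d_h]
                     for i, H in enumerate(head_outputs))
print(torch.allclose(concat_then_project, sum_over_heads, atol=1e-4))  # True
```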