Why not always project out from Attention block? #335

Open
fiskrt opened this issue Nov 19, 2024 · 1 comment

fiskrt commented Nov 19, 2024

Hey, why is the out projection from the Attention block optional? See:

project_out = not (heads == 1 and dim_head == dim)
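For context, that line sits in the attention module's constructor and decides whether the concatenated head outputs get a final linear projection. Below is a minimal sketch of how such a flag is typically wired in; the module layout and helper names (`to_qkv`, `to_out`, `attend`) are illustrative assumptions, not the repository's exact code.

```python
import torch
from torch import nn

class Attention(nn.Module):
    """Minimal multi-head self-attention sketch showing where project_out enters."""

    def __init__(self, dim, heads=8, dim_head=64, dropout=0.):
        super().__init__()
        inner_dim = dim_head * heads
        # Skip the output projection only when a single head already
        # produces vectors of the model dimension (inner_dim == dim).
        project_out = not (heads == 1 and dim_head == dim)

        self.heads = heads
        self.scale = dim_head ** -0.5
        self.attend = nn.Softmax(dim=-1)
        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias=False)
        self.to_out = (
            nn.Sequential(nn.Linear(inner_dim, dim), nn.Dropout(dropout))
            if project_out
            else nn.Identity()
        )

    def forward(self, x):
        b, n, _ = x.shape
        q, k, v = (
            t.reshape(b, n, self.heads, -1).transpose(1, 2)        # (b, h, n, d_h)
            for t in self.to_qkv(x).chunk(3, dim=-1)
        )
        attn = self.attend(q @ k.transpose(-2, -1) * self.scale)   # (b, h, n, n)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)          # (b, n, h * d_h)
        return self.to_out(out)
```

So `Attention(dim=64, heads=1, dim_head=64)` ends with `nn.Identity()`, while any other configuration keeps the `nn.Linear(inner_dim, dim)` projection.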

fiskrt commented Nov 22, 2024

To answer myself here: it seems the other implementations don't special-case heads = 1 and simply use two linear layers in a row, which is redundant.

Attention with a shared output projection computes $$\left(\Vert_h A_h V_h\right)W^O \neq \Vert_h A_h V_h W^O_h,$$ i.e. it is not the same as giving each head its own projection. The two sides coincide only when $h = 1$, since then both collapse to $$A V W^O,$$ and $V W^O$ can be merged into one linear layer $V_2$ without loss of expressivity. The dim_head == dim condition is necessary because if $d_h \neq d$ we still have to project the $B\times L\times d_h$ output back into shape $B\times L\times d$.
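A quick numerical check of the single-head merge (a sketch assuming bias-free projections): with no nonlinearity between them, the value projection and the output projection compose into a single weight matrix.

```python
import torch
from torch import nn

torch.manual_seed(0)

dim = 16
x = torch.randn(2, 5, dim)                        # (B, L, d)
A = torch.softmax(torch.randn(2, 5, 5), dim=-1)   # attention weights (B, L, L), single head

to_v = nn.Linear(dim, dim, bias=False)     # V = x @ W_v^T
to_out = nn.Linear(dim, dim, bias=False)   # out = (A V) @ W_o^T

# Fold the two maps into one linear layer: ((A x) W_v^T) W_o^T == (A x) (W_o W_v)^T
merged = nn.Linear(dim, dim, bias=False)
with torch.no_grad():
    merged.weight.copy_(to_out.weight @ to_v.weight)

two_layers = to_out(A @ to_v(x))
one_layer = merged(A @ x)
assert torch.allclose(two_layers, one_layer, atol=1e-5)
```

This is exactly the $h = 1$ redundancy described above; with several heads the concatenation sits between the per-head values and $W^O$, so the same collapse does not apply.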
