MAB Implementation diverges from Paper #8
Comments (excerpts from the replies to the issue below):

- "Hi Jannik, …"
- "Dear Jingweiz, …"
- "Hi, thanks for your interest!"
- "Hey Juho Lee! It makes sense that this does not give a big empirical difference. I just wanted to check if I missed something. And …"
- "Oh right, exactly, I got mixed up, thanks!"
- "I have a follow-up question linked to this topic. In the paper a row-wise FF block is used for the pooling, and unlike the rFF in the MAB, the rFF in the pooling doesn't have an activation function. Should the PMA rFF have an activation function or not?" (see the sketch below)
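For context on that last question: in the paper the pooling block is defined as PMA_k(Z) = MAB(S, rFF(Z)), and the question is whether that rFF is a plain row-wise linear map or a linear map followed by a non-linearity. A minimal PyTorch sketch of the two readings (illustrative only; the dimension `dim` and the choice of ReLU are assumptions, and this is not the repository's code):

```python
import torch
import torch.nn as nn

dim = 128                      # hypothetical feature dimension
Z = torch.randn(4, 10, dim)    # a batch of 4 sets with 10 elements each

# Reading 1: the pooling rFF is a plain row-wise linear map, no activation.
rff_linear = nn.Linear(dim, dim)

# Reading 2: the pooling rFF mirrors the rFF inside the MAB and includes an activation.
rff_relu = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())

# Either way, the rFF is applied independently to every set element before the
# learned seed vectors attend to the set: PMA_k(Z) = MAB(S, rFF(Z)).
Z1 = rff_linear(Z)   # shape (4, 10, dim)
Z2 = rff_relu(Z)     # shape (4, 10, dim)
```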
Dear Juho,

is it possible that the implementation of the MAB diverges from the paper?

In more detail: what the paper states for multihead attention does not seem to match what the code does in `MAB.forward()` (both are sketched below for reference).
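(The equation and the code snippet referenced here are reconstructed from the paper and from memory of the repository; treat them as a sketch rather than verbatim quotes.) The paper defines:

```latex
\mathrm{Att}(Q, K, V; \omega) = \omega(QK^\top)\,V
\mathrm{Multihead}(Q, K, V; \lambda, \omega) = \mathrm{concat}(O_1, \dots, O_h)\,W^{O},
\qquad O_j = \mathrm{Att}(QW_j^{Q},\, KW_j^{K},\, VW_j^{V}; \omega_j)
\mathrm{MAB}(X, Y) = \mathrm{LayerNorm}(H + \mathrm{rFF}(H)),
\qquad H = \mathrm{LayerNorm}(X + \mathrm{Multihead}(X, Y, Y; \omega))
```

whereas `MAB.forward()` in the repository does roughly the following (sketch, possibly not verbatim):

```python
def forward(self, Q, K):
    Q = self.fc_q(Q)                     # query is linearly projected first
    K, V = self.fc_k(K), self.fc_v(K)    # keys and values projected from K

    dim_split = self.dim_V // self.num_heads
    # split the feature dimension into heads, stacked along the batch dimension
    Q_ = torch.cat(Q.split(dim_split, 2), 0)
    K_ = torch.cat(K.split(dim_split, 2), 0)
    V_ = torch.cat(V.split(dim_split, 2), 0)

    A = torch.softmax(Q_.bmm(K_.transpose(1, 2)) / math.sqrt(self.dim_V), 2)
    # residual uses the projected, head-split query Q_; the heads are simply
    # re-concatenated, with no separate output matrix W_O mixing them
    O = torch.cat((Q_ + A.bmm(V_)).split(Q.size(0), 0), 2)
    O = O + F.relu(self.fc_o(O))         # row-wise feed-forward with residual
    return O                             # (optional LayerNorm steps omitted)
```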
It seems that the matrix `W_O` is not being used in the code at all to mix the outputs of the different heads? The skip connection `Q_ + A.bmm(V_)` also diverges from what is stated in the paper, given that `Q_` is derived from `Q`, which gets linearly transformed via `Q = self.fc_q(Q)` in the first line of `forward()` and is therefore no longer equal to the original query. (On second thought, this may be a necessary requirement, since the output of the MAB has a different shape than the input. That means in this case the paper is imprecise.)

Thanks a lot and best wishes
Jannik
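For comparison, a block that follows the paper's equations more literally would let an output matrix W^O mix the concatenated heads and would add the residual on the untransformed query X. A minimal sketch, assuming `nn.MultiheadAttention` (which bundles the per-head projections and the output matrix W^O) and made-up dimensions; it is an illustration, not the repository's code:

```python
import torch
import torch.nn as nn

class PaperStyleMAB(nn.Module):
    """Sketch of an MAB following the paper's equations:
    H = LayerNorm(X + Multihead(X, Y, Y)),  MAB(X, Y) = LayerNorm(H + rFF(H))."""

    def __init__(self, dim, num_heads):
        super().__init__()
        # nn.MultiheadAttention applies the per-head projections and the output matrix W^O
        self.mha = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln0 = nn.LayerNorm(dim)
        self.ln1 = nn.LayerNorm(dim)
        self.rff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, X, Y):
        # the residual uses the untransformed query X, as written in the paper;
        # this only works when the query and the output share the same dimension
        H = self.ln0(X + self.mha(X, Y, Y, need_weights=False)[0])
        return self.ln1(H + self.rff(H))

mab = PaperStyleMAB(dim=128, num_heads=4)
X, Y = torch.randn(2, 5, 128), torch.randn(2, 20, 128)
out = mab(X, Y)  # shape (2, 5, 128)
```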