MAB Implementation diverges from Paper #8
Comments (excerpts from the replies to the issue below):

- "Hi Jannik, …"
- "Dear Jingweiz, …"
- "Hi, thanks for your interest!"
- "Hey Juho Lee! It makes sense that this does not give a big empirical difference. I just wanted to check if I missed something. And …"
- "Oh right, exactly, I got mixed up, thanks!"
- "I have a follow-up question linked to this topic. In the paper a row-wise FF block is used for the pooling, and unlike the rFF in the MAB, the rFF in the pooling doesn't have an activation function. Should the PMA rFF have an activation function or not?" (see the sketch below)
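For context on that last question: in the paper the pooling block is defined as PMA_k(Z) = MAB(S, rFF(Z)), and the question is whether that rFF is a plain row-wise linear map or a linear map followed by a non-linearity. A minimal PyTorch sketch of the two readings (illustrative only; the dimension `dim` and the choice of ReLU are assumptions, and this is not the repository's code):

```python
import torch
import torch.nn as nn

dim = 128                      # hypothetical feature dimension
Z = torch.randn(4, 10, dim)    # a batch of 4 sets with 10 elements each

# Reading 1: the pooling rFF is a plain row-wise linear map, no activation.
rff_linear = nn.Linear(dim, dim)

# Reading 2: the pooling rFF mirrors the rFF inside the MAB and includes an activation.
rff_relu = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())

# Either way, the rFF is applied independently to every set element before the
# learned seed vectors attend to the set: PMA_k(Z) = MAB(S, rFF(Z)).
Z1 = rff_linear(Z)   # shape (4, 10, dim)
Z2 = rff_relu(Z)     # shape (4, 10, dim)
```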
Dear Juho,

is it possible that the implementation of the MAB diverges from the paper?

In more detail: what the paper states for multihead attention does not seem to match what the code does in `MAB.forward()` (both are sketched below for reference).
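(The equation and the code snippet referenced here are reconstructed from the paper and from memory of the repository; treat them as a sketch rather than verbatim quotes.) The paper defines:

```latex
\mathrm{Att}(Q, K, V; \omega) = \omega(QK^\top)\,V
\mathrm{Multihead}(Q, K, V; \lambda, \omega) = \mathrm{concat}(O_1, \dots, O_h)\,W^{O},
\qquad O_j = \mathrm{Att}(QW_j^{Q},\, KW_j^{K},\, VW_j^{V}; \omega_j)
\mathrm{MAB}(X, Y) = \mathrm{LayerNorm}(H + \mathrm{rFF}(H)),
\qquad H = \mathrm{LayerNorm}(X + \mathrm{Multihead}(X, Y, Y; \omega))
```

whereas `MAB.forward()` in the repository does roughly the following (sketch, possibly not verbatim):

```python
def forward(self, Q, K):
    Q = self.fc_q(Q)                     # query is linearly projected first
    K, V = self.fc_k(K), self.fc_v(K)    # keys and values projected from K

    dim_split = self.dim_V // self.num_heads
    # split the feature dimension into heads, stacked along the batch dimension
    Q_ = torch.cat(Q.split(dim_split, 2), 0)
    K_ = torch.cat(K.split(dim_split, 2), 0)
    V_ = torch.cat(V.split(dim_split, 2), 0)

    A = torch.softmax(Q_.bmm(K_.transpose(1, 2)) / math.sqrt(self.dim_V), 2)
    # residual uses the projected, head-split query Q_; the heads are simply
    # re-concatenated, with no separate output matrix W_O mixing them
    O = torch.cat((Q_ + A.bmm(V_)).split(Q.size(0), 0), 2)
    O = O + F.relu(self.fc_o(O))         # row-wise feed-forward with residual
    return O                             # (optional LayerNorm steps omitted)
```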
It seems that the matrix `W_O` is not being used in the code at all to mix the outputs of the different heads? The skip connection `Q_ + A.bmm(V_)` also diverges from what is stated in the paper, given that `Q_` is derived from `Q`, which gets linearly transformed via `Q = self.fc_q(Q)` in the first line of `forward()` and is therefore no longer equal to the original query. (On second thought, this may be a necessary requirement, since the output of the MAB has a different shape than the input. That means in this case the paper is imprecise.)

Thanks a lot and best wishes
Jannik
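For comparison, a block that follows the paper's equations more literally would let an output matrix W^O mix the concatenated heads and would add the residual on the untransformed query X. A minimal sketch, assuming `nn.MultiheadAttention` (which bundles the per-head projections and the output matrix W^O) and made-up dimensions; it is an illustration, not the repository's code:

```python
import torch
import torch.nn as nn

class PaperStyleMAB(nn.Module):
    """Sketch of an MAB following the paper's equations:
    H = LayerNorm(X + Multihead(X, Y, Y)),  MAB(X, Y) = LayerNorm(H + rFF(H))."""

    def __init__(self, dim, num_heads):
        super().__init__()
        # nn.MultiheadAttention applies the per-head projections and the output matrix W^O
        self.mha = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln0 = nn.LayerNorm(dim)
        self.ln1 = nn.LayerNorm(dim)
        self.rff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, X, Y):
        # the residual uses the untransformed query X, as written in the paper;
        # this only works when the query and the output share the same dimension
        H = self.ln0(X + self.mha(X, Y, Y, need_weights=False)[0])
        return self.ln1(H + self.rff(H))

mab = PaperStyleMAB(dim=128, num_heads=4)
X, Y = torch.randn(2, 5, 128), torch.randn(2, 20, 128)
out = mab(X, Y)  # shape (2, 5, 128)
```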