Hi, thanks for the good paper and for releasing the code.
I'm reading both the paper and the code, and I have a few questions about the multi-head attention case.
It seems that the code follows exactly the initialization procedure explained in Sec 3.2 (i.e., initialize the weights with Xavier and scale v by `0.67 * N**(-1/4)` or `(9 * N)**(-1/4)`), even for multi-head attention. But when v (`v_proj.weight`) is defined with shape `(embed_dim, embed_dim)`, `xavier_uniform_` initializes it from a uniform distribution U(-a, a) with `a = sqrt(6 / (fan_in + fan_out)) = sqrt(6 / (embed_dim + embed_dim))`. In multi-head attention, however, v is actually used as `num_heads` separate matrices of shape `(head_dim, embed_dim)`. In that case, shouldn't v be initialized from U(-a, a) with `a = sqrt(6 / (embed_dim + head_dim))`? For `num_heads=8`, this would increase the scale of v's weights by a factor of 4/3 (= `sqrt(2 / (1 + 1/num_heads))`).
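To make the discrepancy concrete, here is a minimal sketch (the `embed_dim` and `num_heads` values are just illustrative, not taken from the repo) comparing the bound `xavier_uniform_` actually uses on the full `(embed_dim, embed_dim)` matrix with the per-head bound suggested above:

```python
import math
import torch.nn as nn

embed_dim, num_heads = 512, 8          # illustrative sizes, not from the repo
head_dim = embed_dim // num_heads

# Bound used when v_proj.weight is initialized as one (embed_dim, embed_dim) matrix:
a_full = math.sqrt(6.0 / (embed_dim + embed_dim))

# Bound if each of the num_heads (head_dim, embed_dim) slices were initialized separately:
a_per_head = math.sqrt(6.0 / (embed_dim + head_dim))

print(a_per_head / a_full)             # sqrt(2 / (1 + 1/num_heads)) = 4/3 for 8 heads

# What the code effectively does (fan_in = fan_out = embed_dim):
v_proj = nn.Linear(embed_dim, embed_dim, bias=False)
nn.init.xavier_uniform_(v_proj.weight)
```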
Other questions:
In the paper, I assume that d and d' in Eq. 5 correspond to d_model and d_k, respectively, in the original Transformer paper, and to `embed_dim` and `head_dim` in the code. Is that correct? If so, should the 1/sqrt(d) in Eq. 5 perhaps be 1/sqrt(d')?
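For reference, here is a minimal sketch of the usual scaled dot-product attention (the names and shapes are my own, just to show where the 1/sqrt(d') factor enters), assuming the code follows the original Transformer convention of scaling by the per-head dimension:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (batch, num_heads, seq_len, head_dim)."""
    head_dim = q.size(-1)
    # The scores are divided by sqrt(head_dim), i.e. 1/sqrt(d') in the
    # question's notation, not 1/sqrt(embed_dim).
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(head_dim)
    attn = torch.softmax(scores, dim=-1)
    return torch.matmul(attn, v)
```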
In the code, `TransformerEncoderLayer.self_attn.v_proj.weight` is scaled as `(0.67 * (en_layers) ** (- 1. / 4.)) * (param * (2**0.5))` and `(9 * de_layers) ** (- 1. / 4.) * (param * (2**0.5))` for the encoder and decoder, respectively. I assume the extra `* (2**0.5)` factor is there to cancel the `gain=1/math.sqrt(2)` option used in `xavier_uniform_` when initializing `v_proj.weight`. Is that correct?
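In other words, my understanding of the combined effect is roughly the following sketch (again, `embed_dim` and `en_layers` are illustrative values and the variable names are mine):

```python
import math
import torch
import torch.nn as nn

embed_dim, en_layers = 512, 6          # illustrative values, not from the repo

v_proj = nn.Linear(embed_dim, embed_dim, bias=False)

# xavier_uniform_ with gain = 1/sqrt(2), as I read the v_proj init in the code
nn.init.xavier_uniform_(v_proj.weight, gain=1 / math.sqrt(2))

# The later multiplication by 2**0.5 cancels that gain, so the net result is
# plain Xavier init times the encoder-side factor 0.67 * N**(-1/4):
with torch.no_grad():
    v_proj.weight.mul_((0.67 * en_layers ** (-1.0 / 4.0)) * (2 ** 0.5))
```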
Best,
Tatsunori