Looking at the embedding scaling logic, it looks like you first scale the word embedding weights by `1/scale` during initialisation, and then scale the word embeddings by `scale` after lookup. Is this intentional? Don't the two cancel each other out?
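For concreteness, the pattern I'm referring to looks roughly like this (a minimal sketch assuming `scale = sqrt(d_model)` and a PyTorch-style normal init whose std is divided by `scale`; the names and the exact init in the repo may differ):

```python
import math
import torch
import torch.nn as nn

d_model, vocab_size = 512, 32000
scale = math.sqrt(d_model)

# Weights are scaled down by 1/scale at initialisation...
emb = nn.Embedding(vocab_size, d_model)
nn.init.normal_(emb.weight, mean=0.0, std=1.0 / scale)

def embed(token_ids: torch.Tensor) -> torch.Tensor:
    # ...and the looked-up vectors are scaled back up by scale,
    # which is what looks like it cancels out.
    return emb(token_ids) * scale
```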
We scale down the weights so that they can be shared with the output layer. To offset this scaling, we scale the values back up before feeding them to the transformer.
This is also different from many implementations, where the embedding is only scaled up before being fed to the transformer. Some discussions suggest that scaling is there to keep the positional embedding from overwhelming the word embeddings, but there is no consensus.
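Concretely, the intent is roughly the following (a sketch assuming `scale = sqrt(d_model)`, a normal init with std `1/scale`, and simple weight tying via a plain matmul; not the exact code from the post):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedEmbedding(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.scale = math.sqrt(d_model)
        # Stored weights are kept small (std = 1/sqrt(d_model)), sized like a
        # typical linear-layer init so the same matrix also works as the
        # output projection.
        self.weight = nn.Parameter(torch.empty(vocab_size, d_model))
        nn.init.normal_(self.weight, mean=0.0, std=1.0 / self.scale)

    def embed(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Input side: scale the lookup back up so the vectors fed to the
        # transformer are on a roughly unit scale.
        return F.embedding(token_ids, self.weight) * self.scale

    def logits(self, hidden: torch.Tensor) -> torch.Tensor:
        # Output side: reuse the *unscaled* shared weights as the
        # pre-softmax projection.
        return hidden @ self.weight.t()
```

So the two factors don't simply cancel: the input path sees their product (unit-scale embeddings), while the shared output projection sees only the scaled-down weights.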
Do you have any references for implementing it the way you have?
Thank you for the excellent blog post and code.