Looking at the embedding scaling logic, it looks like you first scale the word embedding weights by `1/scale` during initialisation, and then scale the word embeddings by `scale` after lookup. Is this intentional? Don't the two cancel each other out?
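For concreteness, the pattern I'm referring to looks roughly like this (a minimal sketch assuming `scale = sqrt(d_model)` and a PyTorch-style normal init whose std is divided by `scale`; the names and the exact init in the repo may differ):

```python
import math
import torch
import torch.nn as nn

d_model, vocab_size = 512, 32000
scale = math.sqrt(d_model)

# Weights are scaled down by 1/scale at initialisation...
emb = nn.Embedding(vocab_size, d_model)
nn.init.normal_(emb.weight, mean=0.0, std=1.0 / scale)

def embed(token_ids: torch.Tensor) -> torch.Tensor:
    # ...and the looked-up vectors are scaled back up by scale,
    # which is what looks like it cancels out.
    return emb(token_ids) * scale
```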
We scale down the weights so that they can be shared with the output layer. To offset this scaling, we scale the values back up before feeding them to the transformer.
This is also different from many implementations, where the embedding is only scaled up before being fed to the transformer. Some discussions suggest that scaling is there to keep the positional embedding from overwhelming the word embeddings, but there is no consensus.
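Concretely, the intent is roughly the following (a sketch assuming `scale = sqrt(d_model)`, a normal init with std `1/scale`, and simple weight tying via a plain matmul; not the exact code from the post):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedEmbedding(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.scale = math.sqrt(d_model)
        # Stored weights are kept small (std = 1/sqrt(d_model)), sized like a
        # typical linear-layer init so the same matrix also works as the
        # output projection.
        self.weight = nn.Parameter(torch.empty(vocab_size, d_model))
        nn.init.normal_(self.weight, mean=0.0, std=1.0 / self.scale)

    def embed(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Input side: scale the lookup back up so the vectors fed to the
        # transformer are on a roughly unit scale.
        return F.embedding(token_ids, self.weight) * self.scale

    def logits(self, hidden: torch.Tensor) -> torch.Tensor:
        # Output side: reuse the *unscaled* shared weights as the
        # pre-softmax projection.
        return hidden @ self.weight.t()
```

So the two factors don't simply cancel: the input path sees their product (unit-scale embeddings), while the shared output projection sees only the scaled-down weights.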
Do you have any references for implementing it the way you have?
Thank you for the excellent blog post and code.