
ViLT token_type_embeddings implemented twice #34758

Open
niktheod opened this issue Nov 16, 2024 · 3 comments

Comments

@niktheod

I loaded ViltModel.from_pretrained("dandelin/vilt-b32-mlm") and printed out the model's parameters. I noticed that token_type_embeddings appears twice.

Additionally, both of these token_type_embeddings are added to the text embeddings, which doesn't seem to agree with the original paper.

The image embeddings don't have the same issue: the token_type_embeddings are added to them only once, in ViltEmbeddings.

Is this intended behaviour for some reason I don't understand, or is it a mistake?
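For reference, the double addition can be sketched like this. This is a toy illustration with random stand-in tables and zero "embeddings", not the actual ViLT code or weights; the point is just that the text tokens pass through two separate token-type tables while the image patches pass through only one:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 4

# Two separate "token type" tables (hypothetical stand-ins, not ViLT weights):
segment_table = rng.normal(size=(2, hidden))   # BERT-style: sequence A vs sequence B
modality_table = rng.normal(size=(2, hidden))  # ViLT-specific: text (0) vs image (1)

text = np.zeros((3, hidden))    # stand-in word embeddings, 3 text tokens
image = np.zeros((2, hidden))   # stand-in patch embeddings, 2 image patches

# First addition: BERT-style segment embedding inside the text embedder.
segment_ids = np.zeros(3, dtype=int)           # all tokens belong to sequence A
text = text + segment_table[segment_ids]

# Second addition: modality embedding applied when text and image are fused.
text = text + modality_table[0]                # text tokens also get the text-modality vector
image = image + modality_table[1]              # image patches get the image-modality vector once

fused = np.concatenate([text, image], axis=0)
print(fused.shape)  # (5, 4)
```

Each text row ends up as the sum of a segment vector and a modality vector, whereas each image row carries only the modality vector, which matches what I'm seeing in the parameter usage.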

@NielsRogge
Contributor

NielsRogge commented Nov 16, 2024

Hi,

Yes, ViLT uses 2 types of token type embeddings:

  • The first is similar to what BERT uses: 0's for the first sequence and 1's for the second. These just indicate two different sequences of text, for instance in extractive question answering, where the input would be [CLS] what is his name [SEP] his name is Niels [SEP]. You mark the question tokens "what is his name" with 0's and the context "his name is Niels", from which the answer is extracted, with 1's. See also the explainer here.
  • The second is unique to ViLT and indicates whether a token is a text or image token (the modality of the token). Because ViLT concatenates the embeddings of text tokens and image patches, these token type embeddings tell the model which modality each token belongs to.
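For the question-answering example above, the BERT-style token type ids could be built like this (a plain-Python sketch at the word level; a real tokenizer produces subword tokens, so the actual ids would differ):

```python
# BERT-style segment ids: 0 for the question, 1 for the context.
question = ["[CLS]", "what", "is", "his", "name", "[SEP]"]
context = ["his", "name", "is", "Niels", "[SEP]"]

tokens = question + context
token_type_ids = [0] * len(question) + [1] * len(context)

print(token_type_ids)  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

The second, modality-level table works the same way, except the ids distinguish text tokens (0) from image patches (1) rather than two text sequences.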

@niktheod
Author

Hi,

Thank you for your reply.

Do you know whether the first type is included in the original ViLT implementation, or whether it's an addition of the Hugging Face implementation? The equations describing ViLT in the original paper don't mention the first token type embeddings:
[Screenshot of the embedding equations from the ViLT paper]

@NielsRogge
Contributor

It's included in the original implementation here, so I assume they just don't mention it in the paper.
