
ViLT token_type_embeddings implemented twice #34758

Open
niktheod opened this issue Nov 16, 2024 · 3 comments

Comments

@niktheod

I loaded ViltModel.from_pretrained("dandelin/vilt-b32-mlm") and printed out the model's parameters. I noticed that token_type_embeddings appears twice.

Additionally, both of these token_type_embeddings are added to the text embeddings, which doesn't seem to agree with the original paper.

The image embeddings don't have the same issue: the token_type_embeddings are added to them only once, in ViltEmbeddings.

Is this intended behaviour for some reason I don't understand, or is it a mistake?
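For reference, the double addition can be sketched like this. This is a toy illustration with random stand-in tables and zero "embeddings", not the actual ViLT code or weights; the point is just that the text tokens pass through two separate token-type tables while the image patches pass through only one:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 4

# Two separate "token type" tables (hypothetical stand-ins, not ViLT weights):
segment_table = rng.normal(size=(2, hidden))   # BERT-style: sequence A vs sequence B
modality_table = rng.normal(size=(2, hidden))  # ViLT-specific: text (0) vs image (1)

text = np.zeros((3, hidden))    # stand-in word embeddings, 3 text tokens
image = np.zeros((2, hidden))   # stand-in patch embeddings, 2 image patches

# First addition: BERT-style segment embedding inside the text embedder.
segment_ids = np.zeros(3, dtype=int)           # all tokens belong to sequence A
text = text + segment_table[segment_ids]

# Second addition: modality embedding applied when text and image are fused.
text = text + modality_table[0]                # text tokens also get the text-modality vector
image = image + modality_table[1]              # image patches get the image-modality vector once

fused = np.concatenate([text, image], axis=0)
print(fused.shape)  # (5, 4)
```

Each text row ends up as the sum of a segment vector and a modality vector, whereas each image row carries only the modality vector, which matches what I'm seeing in the parameter usage.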

@NielsRogge
Contributor

NielsRogge commented Nov 16, 2024

Hi,

Yes, ViLT uses 2 types of token type embeddings:

  • The first is similar to what BERT uses: 0's for the first sequence and 1's for the second. These just indicate two different sequences of text, for instance in extractive question answering, where the input would be [CLS] what is his name [SEP] his name is Niels [SEP]. You mark the question tokens "what is his name" with 0's and the context "his name is Niels", from which the answer is extracted, with 1's. See also the explainer here.
  • The second is unique to ViLT and indicates whether a token is a text or image token (the modality of the token). Because ViLT concatenates the embeddings of text tokens and image patches, these token type embeddings tell the model which modality each token belongs to.
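For the question-answering example above, the BERT-style token type ids could be built like this (a plain-Python sketch at the word level; a real tokenizer produces subword tokens, so the actual ids would differ):

```python
# BERT-style segment ids: 0 for the question, 1 for the context.
question = ["[CLS]", "what", "is", "his", "name", "[SEP]"]
context = ["his", "name", "is", "Niels", "[SEP]"]

tokens = question + context
token_type_ids = [0] * len(question) + [1] * len(context)

print(token_type_ids)  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

The second, modality-level table works the same way, except the ids distinguish text tokens (0) from image patches (1) rather than two text sequences.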

@niktheod
Author

Hi,

Thank you for your reply.

Do you know whether the first type is included in the original ViLT implementation, or whether it's an addition of the Hugging Face implementation? The equations describing ViLT in the original paper don't mention the first token type embeddings:
[Screenshot of the embedding equations from the ViLT paper]

@NielsRogge
Contributor

It's included in the original implementation here, so I assume they just don't mention it in the paper.
