
text embedding length and video length #4

Open
9B8DY6 opened this issue Oct 26, 2022 · 5 comments

Comments

@9B8DY6

9B8DY6 commented Oct 26, 2022

I have a question about the text embedding length and the video length.
When pretraining C-ViViT, the video sequence has shape (1, 11, 3, 128, 128) = (batch size, frames, channels, H, W).
Could you tell me the length of the text embedding that is cross-attended with the video tokens in your implementation?

@dome272
Collaborator

dome272 commented Oct 26, 2022

I'm not exactly sure I understand the question: when training the C-ViViT you are not using any text embeddings at all. But to try to answer anyway, you can see an example of the text embeddings here, which results in a bs x 1 x 512 shape. That is only for a small T5, though; I believe if you choose a larger T5, you will also get a higher embedding dimension than 512.
Does that help answer your question?

@9B8DY6
Author

9B8DY6 commented Oct 26, 2022

```python
caption = "the weather is so beautiful"
# text_tokens = t5_tokenizer(caption, return_tensors="pt", padding=True, truncation=True).input_ids
# text_tokens = text_tokens.to(device)
# text_embeddings = t5_model.encoder(input_ids=text_tokens).last_hidden_state
text_embeddings = torch.randn(1, 10, 512).to(device)
```

Is the text_embeddings size (batch size, 10, 512) or (bs, 1, 512)? Does the 10 mean the number of frames?
During inference, is the length of the video or text tokens fixed at 10?

@dome272
Collaborator

dome272 commented Oct 26, 2022

Sorry, I was not focused enough. The text embedding shape is bs x text_tokens x 512, so it depends on the caption. For example, "the weather is so beautiful" gives a shape of [1, 6, 512] and "the weather is so beautiful today" gives me [1, 7, 512]. Does this help?
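To illustrate why the second dimension tracks the caption rather than the frame count, here is a minimal sketch using a hypothetical one-token-per-word tokenizer in place of the real T5 SentencePiece tokenizer (actual subword counts can differ; the +1 models the end-of-sequence token T5 appends):

```python
EMBED_DIM = 512  # hidden size of the small T5 encoder

def embedding_shape(caption: str, batch_size: int = 1):
    """Return the (bs, text_tokens, dim) shape a T5-style encoder would
    produce, using whitespace splitting as a stand-in tokenizer."""
    # One token per word, plus the </s> end-of-sequence token T5 appends.
    num_tokens = len(caption.split()) + 1
    return (batch_size, num_tokens, EMBED_DIM)

print(embedding_shape("the weather is so beautiful"))        # (1, 6, 512)
print(embedding_shape("the weather is so beautiful today"))  # (1, 7, 512)
```

With this stand-in, the two captions from above reproduce the [1, 6, 512] and [1, 7, 512] shapes; a real T5 tokenizer may split rarer words into several subword tokens.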

@9B8DY6
Author

9B8DY6 commented Oct 26, 2022

Thank you for your help; your quick and kind replies really help me. I have one more question, about the length of the empty video tokens during inference. During inference, is the length of the video or text tokens the same 10 as during pretraining? I would think the length of the generated video should vary with the new prompt. Do I understand that right?

@dome272
Collaborator

dome272 commented Oct 28, 2022

Technically your model can only ever see a context of 10 frames. For that reason the authors proposed a shifting context: for example, if you want to generate 20 frames, you first generate the first 10, then shift the window, take only the last 5 frames, initialize the token sequence with them, and predict the next 5. Then you shift again by 5 frames and generate the last 5.
Does that help?
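A minimal sketch of that shifting-window schedule, with frame indices standing in for the actual video tokens (the context size of 10 and shift of 5 are taken from the example above; a real sampler would generate tokens instead of indices):

```python
def sliding_window_plan(total_frames, context=10, shift=5):
    """Return the (conditioning_frames, generated_frames) steps needed to
    produce `total_frames` frames with a model limited to `context` frames."""
    # First pass: generate a full context worth of frames from scratch.
    steps = [([], list(range(min(context, total_frames))))]
    generated = list(steps[0][1])
    # Each later pass conditions on the last (context - shift) frames
    # and predicts up to `shift` new ones.
    while len(generated) < total_frames:
        cond = generated[-(context - shift):]
        new = [len(generated) + i
               for i in range(min(shift, total_frames - len(generated)))]
        steps.append((cond, new))
        generated += new
    return steps

for cond, new in sliding_window_plan(20):
    print(f"condition on {cond} -> generate {new}")
```

For 20 frames this yields three steps: frames 0-9 from scratch, then frames 10-14 conditioned on 5-9, then frames 15-19 conditioned on 10-14, matching the description above.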
