
text embedding length and video length #4

Open
9B8DY6 opened this issue Oct 26, 2022 · 5 comments

Comments

@9B8DY6

9B8DY6 commented Oct 26, 2022

I have a question about the text embedding length and the video length.
When pretraining C-ViViT, the video sequence has shape (1, 11, 3, 128, 128) = (batch size, frames, channels, H, W).
Could you tell me the length of the text embedding that is cross-attended with the video tokens in your implementation?

@dome272
Collaborator

dome272 commented Oct 26, 2022

I'm not exactly sure I understand the question: when training the C-ViViT you are not using any text embeddings at all. But to try to answer anyway, you can see an example of the text embeddings here, which results in a bs x 1 x 512 shape. That is only for a small T5, though; I believe if you choose a larger T5, you will also get a higher embedding dimension than 512.
Does that help answer your question?

@9B8DY6
Author

9B8DY6 commented Oct 26, 2022

```python
caption = "the weather is so beautiful"
# text_tokens = t5_tokenizer(caption, return_tensors="pt", padding=True, truncation=True).input_ids
# text_tokens = text_tokens.to(device)
# text_embeddings = t5_model.encoder(input_ids=text_tokens).last_hidden_state
text_embeddings = torch.randn(1, 10, 512).to(device)
```

Is the text_embeddings size (batch size, 10, 512) or (bs, 1, 512)? Does the 10 mean the number of frames?
During inference, is the length of the video or text tokens fixed at 10?

@dome272
Collaborator

dome272 commented Oct 26, 2022

Sorry, I was not focused enough. The text embedding shape is bs x text_tokens x 512, so it depends on the caption. For example, "the weather is so beautiful" gives a shape of [1, 6, 512] and "the weather is so beautiful today" gives me [1, 7, 512]. Does this help?
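To illustrate why the second dimension tracks the caption rather than the frame count, here is a minimal sketch using a hypothetical one-token-per-word tokenizer in place of the real T5 SentencePiece tokenizer (actual subword counts can differ; the +1 models the end-of-sequence token T5 appends):

```python
EMBED_DIM = 512  # hidden size of the small T5 encoder

def embedding_shape(caption: str, batch_size: int = 1):
    """Return the (bs, text_tokens, dim) shape a T5-style encoder would
    produce, using whitespace splitting as a stand-in tokenizer."""
    # One token per word, plus the </s> end-of-sequence token T5 appends.
    num_tokens = len(caption.split()) + 1
    return (batch_size, num_tokens, EMBED_DIM)

print(embedding_shape("the weather is so beautiful"))        # (1, 6, 512)
print(embedding_shape("the weather is so beautiful today"))  # (1, 7, 512)
```

With this stand-in, the two captions from above reproduce the [1, 6, 512] and [1, 7, 512] shapes; a real T5 tokenizer may split rarer words into several subword tokens.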

@9B8DY6
Author

9B8DY6 commented Oct 26, 2022

Thank you for your help; your quick and kind replies really help me. I have one more question, about the length of the empty video tokens during inference. During inference, is the length of the video or text tokens the same 10 as during pretraining? I would think the length of the generated video should vary with the new prompt. Do I understand that right?

@dome272
Collaborator

dome272 commented Oct 28, 2022

Technically your model can only ever see a context of 10 frames. For that reason the authors proposed a shifting context: for example, if you want to generate 20 frames, you first generate the first 10, then shift the window, take only the last 5 frames, initialize the token sequence with them, and predict the next 5. Then you shift again by 5 frames and generate the last 5.
Does that help?
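A minimal sketch of that shifting-window schedule, with frame indices standing in for the actual video tokens (the context size of 10 and shift of 5 are taken from the example above; a real sampler would generate tokens instead of indices):

```python
def sliding_window_plan(total_frames, context=10, shift=5):
    """Return the (conditioning_frames, generated_frames) steps needed to
    produce `total_frames` frames with a model limited to `context` frames."""
    # First pass: generate a full context worth of frames from scratch.
    steps = [([], list(range(min(context, total_frames))))]
    generated = list(steps[0][1])
    # Each later pass conditions on the last (context - shift) frames
    # and predicts up to `shift` new ones.
    while len(generated) < total_frames:
        cond = generated[-(context - shift):]
        new = [len(generated) + i
               for i in range(min(shift, total_frames - len(generated)))]
        steps.append((cond, new))
        generated += new
    return steps

for cond, new in sliding_window_plan(20):
    print(f"condition on {cond} -> generate {new}")
```

For 20 frames this yields three steps: frames 0-9 from scratch, then frames 10-14 conditioned on 5-9, then frames 15-19 conditioned on 10-14, matching the description above.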
