text embedding length and video length #4
I'm not exactly sure I understand the question, but when training the C-ViViT you are not using any text embeddings. Still, to try to answer your question: you can see an example of the text embeddings here, which results in a bs x 1 x 512 shape. But this is only for a small T5. I believe if you choose a larger T5, you will get a higher embedding dimension than 512.
Is the text_embeddings size (batch_size, 10, 512) or (bs, 1, 512)? Does the 10 mean the number of frames?
Sorry, I was not focused enough. The text embedding shape is bs x text_tokens x 512, so it depends on the caption. For example: "the weather is so beautiful" gives a shape of [1, 6, 512] and "the weather is so beautiful today" gives [1, 7, 512]. Does this help?
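To make the shape convention concrete, here is a toy sketch. It uses whitespace splitting as a stand-in for the real T5 subword tokenizer (an assumption; real subword counts can differ), plus the end-of-sequence token that T5 appends, which is why a five-word caption yields six tokens in the shapes quoted above:

```python
import numpy as np

EMBED_DIM = 512  # hidden size of t5-small; larger T5 variants use more

def encode_caption(caption: str) -> np.ndarray:
    """Return a dummy embedding of shape (1, n_tokens, EMBED_DIM).

    Assumption: one token per whitespace-separated word, plus the
    </s> end-of-sequence token that the T5 tokenizer appends.
    """
    tokens = caption.split() + ["</s>"]
    return np.zeros((1, len(tokens), EMBED_DIM))

print(encode_caption("the weather is so beautiful").shape)        # (1, 6, 512)
print(encode_caption("the weather is so beautiful today").shape)  # (1, 7, 512)
```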
Thank you for your help. Your quick and kind reply really helps me. I have one more question, about the length of the empty video tokens during inference. During inference, is the length of the video or text tokens the same as the 10 used during pretraining? I think the length of the generated video would vary according to the new prompt. Do I understand it right?
Technically your model can only ever see a context of 10 frames. For that reason the authors proposed using a shifting context. For example, if you want to generate 20 frames, you first generate the first 10, then you shift the window: take only the last 5 frames, initialize the token sequence with them, and predict the next 5. Then shift again by 5 frames and generate the last 5 frames.
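The shifting-context scheme above can be sketched as follows. The window sizes and the `fake_model` stand-in are illustrative assumptions, not the repository's actual API; the point is only the bookkeeping of keeping the last 5 frames as context for each new window:

```python
CONTEXT = 10  # frames the model can attend to at once
SHIFT = 5     # new frames generated per window shift

def fake_model(context_frames, n_new):
    """Stand-in for the video model: returns n_new dummy frame ids."""
    start = context_frames[-1] + 1 if context_frames else 0
    return list(range(start, start + n_new))

def generate(total_frames: int) -> list:
    frames = fake_model([], CONTEXT)           # first full window of 10
    while len(frames) < total_frames:
        context = frames[-(CONTEXT - SHIFT):]  # keep only the last 5 frames
        n_new = min(SHIFT, total_frames - len(frames))
        frames += fake_model(context, n_new)   # predict the next chunk
    return frames

print(len(generate(20)))  # 20
```

With `total_frames=20` this produces one full window of 10 followed by two shifts of 5, exactly as described in the comment.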
I have a question about text embedding length and video length.
When pretraining C-ViViT, the shape of the video sequence is (1, 11, 3, 128, 128) = (batch size, frames, channels, H, W).
I want to know the length of the text embedding that is cross-attended with the video tokens. Since you implemented this code, could you let me know?
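For the shape bookkeeping the question is about, here is a minimal numpy sketch of cross-attention, assuming video tokens as queries and text embeddings as keys/values. It omits the learned q/k/v projection matrices a real implementation would have, and the token counts below are made-up placeholders; the point is that the text length only affects the attention matrix, not the output shape:

```python
import numpy as np

def cross_attention(video_tokens: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """video_tokens: (bs, n_video, d); text_emb: (bs, n_text, d).

    Returns (bs, n_video, d): one attended vector per video token,
    regardless of how many text tokens the caption produced.
    """
    d = video_tokens.shape[-1]
    scores = video_tokens @ text_emb.transpose(0, 2, 1) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over text tokens
    return weights @ text_emb

video = np.random.randn(1, 256, 512)  # hypothetical flattened video tokens
text = np.random.randn(1, 7, 512)     # e.g. "the weather is so beautiful today"
out = cross_attention(video, text)
print(out.shape)  # (1, 256, 512)
```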