model zero-shot retrieval capability of the videochat2 stage-1 model #212
Comments
Hi! You may refer to BLIP2 for help. As far as I remember, the stage-1 model does not work better than UMT.
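Since BLIP2 is pointed to for the retrieval evaluation, here is a minimal sketch of the BLIP2-style two-stage protocol that reply presumably refers to: rank all candidates by contrastive (VTC) similarity, then re-rank only the top-k with the matching (VTM) head. The `vtm_score` callable and all names below are hypothetical placeholders, not functions from the VideoChat2 codebase.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def two_stage_retrieval(video_embs, text_embs, vtm_score, k=16):
    """BLIP2-style retrieval sketch (assumed protocol):
    1) VTC stage: rank every text by cosine similarity of pooled embeddings;
    2) VTM stage: re-score only the top-k candidates with a matching head.
    `vtm_score(i, j)` is a hypothetical callable returning the match logit
    for video i and text j (e.g. a cross-attention VTM head).
    """
    # VTC stage: cosine similarity of L2-normalised pooled embeddings.
    video_embs = F.normalize(video_embs, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    sims = video_embs @ text_embs.t()            # [n_videos, n_texts]

    # VTM stage: rescore only the k best candidates per video.
    topk_sims, topk_idx = sims.topk(k, dim=1)
    reranked = torch.full_like(sims, float("-inf"))
    for i in range(sims.size(0)):
        for rank, j in enumerate(topk_idx[i].tolist()):
            # Combine match logit with VTC similarity, as in BLIP-style eval.
            reranked[i, j] = vtm_score(i, j) + topk_sims[i, rank]
    return sims, reranked   # VTC-only and VTM-reranked score matrices
```

Comparing the two returned matrices against ground-truth pairs would show separately how the VTC and VTM results behave, which is the discrepancy discussed below.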
Once I resume from your released stage-1 checkpoint and continue the stage-1 training process, the VTM results become better than the VTC results, whereas my evaluation of the released stage-1 checkpoint itself shows VTM results that are worse than the VTC results.
Hi! It may be difficult to release the stage-1 evaluation results, since the evaluation was done by another intern who has since left. 😭
Also, the code that loads the pretrained UMT model seems to have a problem caused by the misleading "vision_encoder." name prefix. The parameter names of the UMT vision encoder in the umt-l16 checkpoint contain the "vision_encoder." prefix, while in the code the parameters of the ViT model do not. This mismatch makes the pretrained weights fail to load, which in turn breaks the reimplementation of stage-1. I would appreciate it if you could check whether that bug exists. Thank you so much.
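For reference, a minimal sketch of a workaround for the mismatch described above: strip the "vision_encoder." prefix from the checkpoint keys before loading them into the bare ViT. The function name, the assumption that the checkpoint may nest its weights under a "model" entry, and the use of `strict=False` are my own; the actual key layout of umt-l16 may differ.

```python
import torch
from torch import nn

def load_umt_vision_encoder(vit_model: nn.Module, ckpt_path: str) -> None:
    """Strip the 'vision_encoder.' prefix from UMT checkpoint keys so
    they line up with the bare ViT parameter names."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    # Some checkpoints nest the weights under a 'model' entry (assumption).
    state_dict = ckpt.get("model", ckpt)

    prefix = "vision_encoder."
    stripped = {k[len(prefix):]: v for k, v in state_dict.items()
                if k.startswith(prefix)}

    # strict=False tolerates keys (e.g. projection heads) the ViT lacks;
    # inspect the returned lists to confirm the backbone actually loaded.
    missing, unexpected = vit_model.load_state_dict(stripped, strict=False)
    print(f"missing: {len(missing)}, unexpected: {len(unexpected)}")
```

Printing the `missing`/`unexpected` lists is a quick way to verify whether the silent load failure described above is actually happening.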
Could you tell me the name of the author who did the stage-1 training? Maybe I can email him for consultation. 😬
Yizhuo Li conducted the experiment~
Hello. Could you share the evaluation results (especially the zero-shot retrieval performance on the MSR-VTT dataset) for the VideoChat2 stage-1 model? Should it perform better than the UMT model or not? Thanks.