
model zero-shot retrieval capability of the videochat2 stage-1 model #212

wengzejia1 opened this issue Jul 26, 2024 · 7 comments

@wengzejia1

Hello. Could you provide the evaluation results (especially the zero-shot retrieval performance on the MSR-VTT dataset) for the VideoChat2 stage-1 model? Should it perform better than the UMT model or not? Thanks.

@wengzejia1
Author

wengzejia1 commented Jul 27, 2024

Also, the current code seems to have some problems with stage-1 evaluation. I modified the code to run the evaluation on your released stage-1 checkpoint, but the result is strange: the VTM scores are worse than the VTC scores. Can you help me verify this, or release the stage-1 evaluation code? Thank you.

[Screenshot: stage-1 retrieval evaluation output, with VTM scores lower than VTC scores]
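
For context, my expectation that VTM should beat VTC comes from how BLIP2/UMT-style retrieval evaluation is usually structured: VTC ranks all candidates by contrastive similarity, and VTM then re-ranks only the top-k candidates with the heavier cross-modal matching head, so VTM is normally the stronger score. Below is a minimal, self-contained sketch of that two-stage ranking; the toy data and the dummy `vtm_head` are illustrative stand-ins, not this repository's actual code.

```python
import torch
import torch.nn.functional as F

def rank_of_ground_truth(scores: torch.Tensor) -> torch.Tensor:
    """For each text (row), the rank of its paired video (the diagonal)."""
    order = scores.argsort(dim=1, descending=True)   # (N, N) permutation per row
    gt = torch.arange(scores.size(0)).unsqueeze(1)   # (N, 1) ground-truth index
    return (order == gt).float().argmax(dim=1)       # 0 means retrieved first

# Toy stand-ins for real model outputs: N paired text/video embeddings.
torch.manual_seed(0)
N, D, k = 8, 32, 4
text_feats = F.normalize(torch.randn(N, D), dim=1)
video_feats = F.normalize(text_feats + 0.5 * torch.randn(N, D), dim=1)

# VTC: score every text-video pair by contrastive cosine similarity.
vtc_scores = text_feats @ video_feats.T              # (N_texts, N_videos)

# Dummy matching head. The real VTM head cross-attends over full token
# sequences and outputs a match logit, which is far more expensive, so it
# is only applied to the top-k VTC candidates.
def vtm_head(t_idx: int, v_idx: int) -> float:
    return float(text_feats[t_idx] @ video_feats[v_idx])

# VTM: re-score only the top-k VTC candidates; everything else stays -inf.
vtm_scores = torch.full_like(vtc_scores, float("-inf"))
topk_idx = vtc_scores.topk(k, dim=1).indices
for t in range(N):
    for v in topk_idx[t].tolist():
        vtm_scores[t, v] = vtm_head(t, v)

print("VTC R@1:", (rank_of_ground_truth(vtc_scores) == 0).float().mean().item())
print("VTM R@1:", (rank_of_ground_truth(vtm_scores) == 0).float().mean().item())
```

With a trained matching head, the re-ranked VTM scores should match or improve on VTC, which is why the opposite result on the released checkpoint looks suspicious to me.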

@Andy1621
Collaborator

Hi! You may refer to BLIP2 for help. As I recall, the stage-1 model does not work better than UMT.

@wengzejia1
Author

When I resume from your released stage-1 checkpoint and continue the stage-1 training, the VTM results become better than the VTC results; however, my testing on the released stage-1 checkpoint itself shows VTM results worse than the VTC results.
I would appreciate it if you could update the stage-1 evaluation code and report the zero-shot retrieval results for your released stage-1 model.

@Andy1621
Collaborator

Hi! It may be difficult to release the stage-1 evaluation results, since the evaluation was done by another intern who has since left. 😭

@wengzejia1
Author

wengzejia1 commented Jul 29, 2024

Also, the code for loading the pretrained UMT model seems to have a problem caused by the "vision_encoder." name prefix. The parameter names of the UMT vision encoder in the umt-l16 checkpoint contain the "vision_encoder." prefix, while the parameters of the ViT model built in the code do not. This mismatch causes the pretrained weights to fail to load, which breaks the reimplementation of stage-1.

I would appreciate it if you could check whether this bug exists. Thank you so much.
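
For reference, here is a minimal sketch of the workaround I have in mind, assuming a standard PyTorch checkpoint; the path and variable names are illustrative, not the repository's exact ones:

```python
import torch

# Load the UMT-L/16 checkpoint (path is illustrative).
ckpt = torch.load("umt_l16.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)

# Keep only vision-encoder weights and strip the prefix so the keys match
# a bare ViT, e.g. "vision_encoder.blocks.0.attn.qkv.weight" becomes
# "blocks.0.attn.qkv.weight".
prefix = "vision_encoder."
vit_state_dict = {
    k[len(prefix):]: v for k, v in state_dict.items() if k.startswith(prefix)
}

# With `vit` being the stage-1 vision transformer built elsewhere,
# strict=False reports (rather than raises on) any remaining mismatches:
# missing, unexpected = vit.load_state_dict(vit_state_dict, strict=False)
# print("missing:", missing, "unexpected:", unexpected)
```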

@wengzejia1
Author

> Hi! It may be difficult to release the stage-1 evaluation results, since the evaluation was done by another intern who has since left. 😭

Could you tell me the name of the author who did the stage-1 training? Maybe I can email him for consultation. 😬

@Andy1621
Collaborator

> > Hi! It may be difficult to release the stage-1 evaluation results, since the evaluation was done by another intern who has since left. 😭
>
> Could you tell me the name of the author who did the stage-1 training? Maybe I can email him for consultation. 😬

Yizhuo Li conducted the experiments~
