Dear Authors,

First of all, thank you for your outstanding project.
I've been playing with your benchmark using the provided codebase, and noticed some discrepancies between the evaluation results I obtained and those reported in your paper.
Below is an overview table of my evaluation results on the VG datasets and on the order (PRC) tasks, COCO_order and Flickr30k_order:
| model | pretrained | vg_relation | vg_attribution | coco_order | flickr30k_order | Task Avg. |
|---|---|---|---|---|---|---|
| ViT-B-32 | openai | 59.9% | 63.2% | 47.4% | 58.8% | 57.3% |
| NegCLIP | coco ft | 80.2% | 70.5% | 86.8% | 89.7% | 81.8% |
| BLIP-base | flickr ft | 49.7% | 89.9% | 42.5% | 40.5% | 55.7% |
| BLIP-base | coco ft | 58.4% | 89.5% | 37.1% | 46.3% | 57.8% |
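For reference, the numbers above were computed the way I understand the order (PRC) tasks to be scored: an example counts as correct when the original caption out-scores all of its word-order-perturbed alternatives. The sketch below shows only my understanding; `score_image_text` is a placeholder for the model's image-text similarity (CLIP cosine similarity, BLIP ITM score, etc.), not a function from your codebase.

```python
import numpy as np

def order_task_accuracy(score_image_text, images, captions, perturbed_captions):
    """Fraction of images for which the original caption receives a higher
    score than every word-order-perturbed alternative.

    `score_image_text(image, text) -> float` is a hypothetical stand-in for
    whatever similarity the evaluated model exposes.
    """
    correct = 0
    for image, caption, perturbations in zip(images, captions, perturbed_captions):
        candidates = [caption] + list(perturbations)   # index 0 = true caption
        scores = np.array([score_image_text(image, c) for c in candidates])
        correct += int(scores.argmax() == 0)           # correct if true caption wins
    return correct / len(images)
```

Please let me know if this differs from how the reported numbers were computed, since a different scoring rule could explain part of the gap.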
My reproduced results for VG_Relation and VG_Attribution closely match the numbers in your paper. However, I have concerns about the NegCLIP result on flickr30k_order: your paper reports 91% (0.91) in Appendix Table 6, whereas I obtained 89.7%.
The discrepancy is larger for the BLIP models. Appendix Table 5 of your paper reports 0.369 for Flickr30k-PRC (BLIP-flickr-base) and 0.321 for COCO-PRC (BLIP-coco-base), whereas I obtained noticeably higher scores of 40.5% and 37.1%, respectively, for the same models.
Note 1: Some randomness arises when the order annotations are created from the original annotation file, but it does not appear large enough to explain the gap above.
Note 2: To control for this randomness, I kept the same order annotations across all models in my experiments (roughly as sketched below).
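For context, the way I fixed the annotations looks roughly like the following sketch. The helper name and the plain word shuffle are only illustrative and are not your actual generation script, which may perturb captions differently (e.g. swapping only certain word classes).

```python
import random

def make_order_annotations(captions, n_perturbations=4, seed=0):
    """Generate word-shuffled alternatives for each caption with a fixed seed,
    so that every model is evaluated on identical perturbations.

    Hypothetical helper, not the benchmark's own generation code.
    """
    rng = random.Random(seed)  # fixed seed => reproducible shuffles
    annotations = []
    for caption in captions:
        words = caption.split()
        perturbed = []
        for _ in range(n_perturbations):
            shuffled = words[:]
            rng.shuffle(shuffled)
            perturbed.append(" ".join(shuffled))
        annotations.append({"true_caption": caption, "perturbed": perturbed})
    return annotations

# I generate the annotations once, save them (e.g. to JSON), and load the same
# file when evaluating each model, so the comparison is not affected by the
# random shuffling.
```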
Since these results were obtained with the provided code and checkpoints, I would appreciate any pointers to possible mistakes on my side, or to anything else that might explain the difference.
Best regards,