I'm attempting to fine-tune the BLIP model using a custom ViT as the vision encoder. My ViT is trained to classify medical images into three classes: healthy, COVID, and other. I've replaced BLIP's vision transformer with this custom ViT and adjusted the image embedding shape to match what BLIP expects. However, after training, the model produces repetitive single-word predictions for all images, or even the same caption for every image.
Below are the details of my implementation, training setup, and the resulting predictions.
Steps to Reproduce:
Model Setup:
Replace the vision transformer in BLIP with a custom ViT model fine-tuned for image classification (loaded from a local folder path).
Ensure the ViT output embedding shape is padded to match BLIP’s expected input (1, 577, 768).
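For context, a minimal sketch of how such an encoder swap and padding can be wired up, using the Hugging Face transformers BLIP and ViT classes; the wrapper class name and the checkpoint path are placeholders, not my exact code:

```python
import torch
import torch.nn.functional as F
from transformers import BlipForConditionalGeneration, ViTModel
from transformers.modeling_outputs import BaseModelOutputWithPooling


class CustomViTEncoder(torch.nn.Module):
    """Stand-in for BLIP's vision_model, backed by the classification ViT."""

    def __init__(self, vit_path, target_len=577):
        super().__init__()
        # vit_path is a placeholder for the local folder with the fine-tuned ViT;
        # its hidden size must be 768 to match BLIP's text decoder.
        self.vit = ViTModel.from_pretrained(vit_path)
        self.target_len = target_len

    def forward(self, pixel_values, **kwargs):
        # interpolate_pos_encoding lets a 224x224 ViT accept BLIP's 384x384 inputs
        out = self.vit(pixel_values, interpolate_pos_encoding=True)
        h = out.last_hidden_state                      # (B, seq_len, 768)
        if h.size(1) < self.target_len:                # zero-pad e.g. 197 -> 577
            h = F.pad(h, (0, 0, 0, self.target_len - h.size(1)))
        # BLIP reads the patch embeddings from index 0 of the returned output
        return BaseModelOutputWithPooling(last_hidden_state=h, pooler_output=h[:, 0])


blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
blip.vision_model = CustomViTEncoder("path/to/my-covid-vit")  # placeholder path
```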
Training:
Use a custom dataset with captions.
Train using the code provided below with the AdamW optimizer and a batch size of 2.
Code:
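A simplified sketch of the training loop (not the exact script; `train_dataset` stands in for my caption dataset of (PIL image, caption) pairs, and `blip` is the patched model from the sketch above):

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")

def collate(batch):
    # batch: list of (PIL image, caption string) pairs from the custom dataset
    images, captions = zip(*batch)
    return processor(images=list(images), text=list(captions),
                     return_tensors="pt", padding=True)

loader = DataLoader(train_dataset, batch_size=2, shuffle=True, collate_fn=collate)
optimizer = AdamW(blip.parameters(), lr=5e-5)
device = "cuda" if torch.cuda.is_available() else "cpu"
blip.to(device)
blip.train()

for epoch in range(5):  # example epoch count
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        # BLIP returns a language-modeling loss when labels are supplied
        outputs = blip(pixel_values=batch["pixel_values"],
                       input_ids=batch["input_ids"],
                       attention_mask=batch["attention_mask"],
                       labels=batch["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```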
Inference Code:
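Likewise, a simplified sketch of the inference loop, continuing from the sketches above (`test_image_paths` is a placeholder for my test images, and the generation settings are illustrative):

```python
import torch
from PIL import Image

blip.eval()
for path in test_image_paths:  # placeholder list of test image paths
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(device)
    with torch.no_grad():
        generated_ids = blip.generate(**inputs, max_new_tokens=30, num_beams=3)
    caption = processor.decode(generated_ids[0], skip_special_tokens=True)
    print(path, "->", caption)
```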
Observed Output:
The model generates repetitive single-word captions as shown below (my test set had :
In other cases, the model generates the same caption for all the images in the test set.
Expected Behavior:
Generate unique captions that accurately describe each image in my dataset.
Any insights into why this repetitive output might occur or suggestions on how to adapt my ViT output to work better with BLIP's text generation would be greatly appreciated!
@switchingImageEncoder