
Fine-tuning on downstream tasks directly #84

Open
githubiubiu opened this issue Aug 27, 2024 · 6 comments

Comments

@githubiubiu

Hello!
Thanks for this amazing work. I would like to know how to use the RADIO model for fine-tuning on downstream tasks (possibly not classification tasks). For ViT-L/14, is it possible to load only the backbone parameters, including the multiple cls tokens (the way one would load ImageNet pre-trained weights), or is it necessary to also load the DINO/CLIP heads? My downstream task is similar to defect detection. Thank you very much for your reply!

@githubiubiu
Author

Also, when I replaced the existing DINOv2 ViT-L/14 (pretrained weights only) with RADIO ViT-L/14, the accuracy dropped. I am not sure whether this is caused by incorrect use of RADIO, which bothers me.

@gheinrich
Collaborator

Hello, yes, RADIO is very much designed to be used in downstream applications. We usually keep the backbone frozen and train a task-specific head on top of the shared backbone features. Have you seen the semantic segmentation example? RADIO has also been integrated as a backbone in Probe3D.

Can I check with you that your inputs into RADIO are RGB values in the [0,1] range?
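For reference, the frozen-backbone recipe described above looks roughly like the sketch below. This is a minimal sketch, not the repository's exact training code: the torch.hub entry point and the (summary, spatial_features) output format follow the RADIO README, while the linear head, num_classes, and input size are hypothetical placeholders.

```python
import torch
import torch.nn as nn

# Load RADIO through torch.hub (entry point per the RADIO README).
model = torch.hub.load('NVlabs/RADIO', 'radio_model',
                       version='radio_v2.5-l', progress=True)
model.eval()
for p in model.parameters():          # keep the backbone frozen
    p.requires_grad = False

x = torch.rand(1, 3, 512, 512)        # RGB in [0, 1], as RADIO expects
with torch.no_grad():
    summary, spatial = model(x)       # summary: (B, D); spatial: (B, T, D)

num_classes = 2                       # hypothetical, e.g. defect / no-defect
head = nn.Linear(summary.shape[-1], num_classes)  # only the head is trained
logits = head(summary)
```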

@githubiubiu
Author

Thanks for your quick reply. I followed your instructions to freeze the entire backbone and train only the head, but it didn't work for my task. Adding LoRA to the backbone gave me better results, but still inferior to DINOv2. As for data preprocessing, I first normalize and standardize the image myself and replace the input_conditioner with nn.Identity().
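The two preprocessing paths in play here are only equivalent if the external normalization uses the same statistics the built-in conditioner applies; otherwise the backbone sees a shifted input distribution. A minimal sketch of the comparison (the mean/std values below are the OpenAI-CLIP statistics, used purely as labeled placeholders; the real values must be taken from the model's own input_conditioner before it is replaced):

```python
import torch
import torch.nn as nn

model = torch.hub.load('NVlabs/RADIO', 'radio_model', version='radio_v2.5-l')
model.eval()
x = torch.rand(1, 3, 512, 512)   # RGB in [0, 1]

# Path A: feed [0, 1] inputs and let the built-in input_conditioner normalize.
with torch.no_grad():
    summary_a, spatial_a = model(x)

# Path B: normalize externally, then disable the built-in conditioner.
# ASSUMPTION: these are CLIP statistics, shown only as placeholders; they
# must match whatever the conditioner actually used.
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1)
std = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1)
model.input_conditioner = nn.Identity()
with torch.no_grad():
    summary_b, spatial_b = model((x - mean) / std)

# If the two paths disagree, the preprocessing mismatch itself may explain
# the accuracy drop reported above.
print(torch.allclose(summary_a, summary_b, atol=1e-4))
```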

@mranzinger
Collaborator

One thing that comes to mind is that RADIOv2.5-L is a ViT-L/16 model, not a /14 model. Have you ensured that you're handling that difference in patch size properly? For example, running DINOv2 at 448 px is equivalent to running RADIOv2.5-L at 512 px, given that both models then process the same number of tokens, +/- some negligible compute.
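The equivalence is just token-grid arithmetic; a quick sanity check:

```python
# Both setups produce the same 32x32 grid of patch tokens.
for name, res, patch in [('DINOv2 ViT-L/14 @ 448', 448, 14),
                         ('RADIOv2.5-L (ViT-L/16) @ 512', 512, 16)]:
    side = res // patch
    print(f'{name}: {side}x{side} = {side * side} patch tokens')
# DINOv2 ViT-L/14 @ 448: 32x32 = 1024 patch tokens
# RADIOv2.5-L (ViT-L/16) @ 512: 32x32 = 1024 patch tokens
```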

@githubiubiu
Author

Yes, I noticed this difference at first, but I found that in the paper, ViT-L means ViT-L/14 rather than ViT-L/16. Since the patch size in the code can easily be interpolated from 16 to 14, I used the interpolated patch size (16 -> 14). Could this be the key to the performance degradation? I will verify it experimentally. Thank you for your reply.
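For concreteness, "interpolating the patch size" typically means resampling the patch-embedding projection kernel, roughly as sketched below. A stand-in tensor is used here rather than RADIO's actual attribute path, which is not shown in this thread.

```python
import torch
import torch.nn.functional as F

# Stand-in for a ViT-L patch-embedding weight of shape
# (embed_dim, in_chans, patch, patch) = (1024, 3, 16, 16).
w16 = torch.randn(1024, 3, 16, 16)

# Bicubic resampling of the kernel from 16x16 to 14x14.
w14 = F.interpolate(w16, size=(14, 14), mode='bicubic', antialias=True)

# The resampled kernel is applied to 14x14 pixel patches it was never
# trained on, which is a plausible source of the degradation discussed here.
```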

@mranzinger
Collaborator

Yeah, it's very possible that interpolating to patch size 14 causes enough of an issue to degrade results. The choice between 14 and 16 is tricky for our models; I personally prefer 16 because it's a friendlier number for compute. From a modeling standpoint, this choice mostly affects what we call the "effective resolution", which is essentially the number of patch rows and columns. So a ViT-L/14 at resolution 448 is roughly identical to a ViT-L/16 at resolution 512: both have an effective resolution of 32x32.

Because you're using DINOv2-L/14, you'll want to account for the effective resolution when comparing against RADIOv2.5-L/16 by scaling the input resolution by 16/14 before handing it to RADIO. In doing so, you'll get an identical number of output patches from the two models, and each input patch will cover exactly the same image content, making the comparison fair.
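Concretely, the resolution scaling might look like the sketch below; F.interpolate stands in for whatever resize the actual data pipeline uses.

```python
import torch
import torch.nn.functional as F

img = torch.rand(1, 3, 448, 448)   # input as prepared for DINOv2-L/14
radio_res = round(448 * 16 / 14)   # 512

# Scale by 16/14 so RADIOv2.5-L/16 sees the same 32x32 token grid over
# the same image content.
img_radio = F.interpolate(img, size=(radio_res, radio_res),
                          mode='bilinear', align_corners=False)

assert img_radio.shape[-1] // 16 == 448 // 14  # both yield 32 tokens per side
```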
