Some Questions #5

Open
bryanwong17 opened this issue Nov 23, 2022 · 12 comments

Comments

bryanwong17 commented Nov 23, 2022

Hello, I would appreciate it if you could respond to some of my questions below:

  1. Could you clarify why the SelfPatch loss is only evaluated when v == iq (i.e., when the student and teacher operate on the same view)?
  2. Why does the teacher also receive the local views? From my understanding of the original DINO, only the two global views pass through the teacher.
  3. Why is loc=False passed when computing student_output?

```python
student_output = [student(torch.cat(images[:2]), head_only=True, loc=False),
                  student(torch.cat(images[2:]), head_only=True, loc=False)]
```

Thanks for your time and kindness!

yanjk3 commented Nov 25, 2022

I am not the author, but I hope my answers can help you.
A1: Because a patch's neighbors are defined within the same view. A "neighbor" is not easy to define in a cross-view situation (though you could try to define one with some spatial prior).
A2: The local views are fed into the teacher network so that they also contribute to the SelfPatch loss, i.e., the same-view loss mentioned above. This may not be strictly necessary, but it may accelerate convergence.
A3: loc=True means aggregating the neighbors' features, which is enabled in the teacher network; e.g., the i-th patch of the teacher aggregates its neighbors' features. In the student model, we do not aggregate them. Then we maximize the similarity between the student's i-th patch and the teacher's i-th patch (which includes the neighbors' features) to model patch-level representations.

I hope the above helps.
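
To make A1 and A3 concrete, here is a minimal toy sketch of a loss loop that adds the patch-level term only for same-view pairs (v == iq) and uses the teacher's neighbor-aggregated patches as targets for the student's non-aggregated patches. This is not the repository's actual loss code; the function name, argument layout, and shapes are assumptions, and DINO's teacher centering/sharpening is omitted:

```python
import torch
import torch.nn.functional as F

def toy_dino_selfpatch_loss(teacher_cls, student_cls, teacher_patch, student_patch):
    """All arguments are per-view lists of equal length.

    teacher_cls[i]  : (B, K)    teacher image-level probabilities for view i
    student_cls[i]  : (B, K)    student image-level logits for view i
    teacher_patch[i]: (B, N, K) teacher patch probabilities, neighbors aggregated (loc=True)
    student_patch[i]: (B, N, K) student patch logits, no aggregation (loc=False)
    """
    total, n_terms = 0.0, 0
    for iq in range(len(teacher_cls)):           # teacher views
        for v in range(len(student_cls)):        # student views
            if v == iq:
                # same-view pair -> patch-level (SelfPatch-style) term:
                # match each student patch to the teacher's neighbor-aggregated patch
                loss = torch.sum(
                    -teacher_patch[iq] * F.log_softmax(student_patch[v], dim=-1), dim=-1
                ).mean()
            else:
                # cross-view pair -> image-level (DINO-style) term
                loss = torch.sum(
                    -teacher_cls[iq] * F.log_softmax(student_cls[v], dim=-1), dim=-1
                ).mean()
            total, n_terms = total + loss, n_terms + 1
    return total / n_terms
```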

bryanwong17 (Author) commented:

Hi @yanjk3, thank you very much for the answers; I really appreciate it. It makes more sense now that I know the authors made a slight modification to the original DINO.

bryanwong17 (Author) commented:

Hi @yanjk3, when I use eval_knn.py from the original DINO repo to evaluate SelfPatch, it says:

size mismatch for pos_embed: copying a param with shape torch.Size([1, 196, 384]) from checkpoint, the shape in current model is torch.Size([1, 197, 384]).

Do you have any ideas on how I can fix it? Thank you.

bryanwong17 reopened this Dec 29, 2022
yanjk3 commented Dec 30, 2022

This is because the SelfPatch checkpoint does not contain a CLS token, so the position embedding's size is mismatched. In SelfPatch, the CLS token lives in the SelfPatchHead (https://github.com/alinlab/SelfPatch/blob/main/selfpatch_vision_transformer.py#L362), so the ViT backbone does not need one.

I think you can fix it by modifying DINO's ViT code (https://github.com/facebookresearch/dino/blob/main/vision_transformer.py#L147) from self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim)) to self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim)).
Then delete the '- 1' in lines 175 and 176, and swap lines 202 and 205.
However, since the SelfPatch checkpoint does not contain a CLS token, the ViT model will randomly initialize one, which can lead to a performance drop. Instead of using the CLS token, I think you can apply global average pooling over the output of the last transformer block to get the global image representation.
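
For reference, here is a rough, self-contained sketch of what the modified token preparation could look like after those edits. It is a toy stand-in, not a drop-in patch for dino/vision_transformer.py; the cited line numbers refer to the DINO file linked above:

```python
import torch
import torch.nn as nn

class PatchOnlyTokenizer(nn.Module):
    """Toy stand-in for the modified DINO ViT front end: pos_embed has no CLS slot,
    the positional encoding is added to the patch tokens only, and the (randomly
    initialised) CLS token is concatenated afterwards (the swap of lines 202/205)."""

    def __init__(self, img_size=224, patch_size=16, embed_dim=384):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
        # DINO's line ~147 used torch.zeros(1, num_patches + 1, embed_dim); the "+ 1" is dropped
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_drop = nn.Dropout(0.0)

    def prepare_tokens(self, x):
        B = x.shape[0]
        x = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, 196, 384) patch tokens
        # fixed 224x224 input assumed, so no positional-embedding interpolation here
        # (the "- 1" edits around lines 175-176 only matter for other resolutions)
        x = x + self.pos_embed
        cls_tokens = self.cls_token.expand(B, -1, -1)        # randomly initialised CLS
        x = torch.cat((cls_tokens, x), dim=1)
        return self.pos_drop(x)

tokens = PatchOnlyTokenizer().prepare_tokens(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 384])
```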


bryanwong17 commented Dec 30, 2022

Hi @yanjk3, thank you for your answers. Could you show me how to use global average pooling on the last transformer block?

yanjk3 commented Dec 30, 2022

Make sure you have deleted the CLS token from the ViT first. Then insert
x = x.mean(dim=1)
right after
x = self.norm(x)
and return x.
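
Put together, a small helper for extracting the pooled features might look like this sketch. It assumes the DINO-style attribute names (prepare_tokens, blocks, norm) and that the CLS token has been removed, so prepare_tokens returns patch tokens only:

```python
import torch

@torch.no_grad()
def gap_features(model, images):
    """Global-average-pooled features from a CLS-free ViT backbone, e.g. for k-NN eval."""
    x = model.prepare_tokens(images)   # (B, N, D) patch tokens only, no CLS token
    for blk in model.blocks:
        x = blk(x)
    x = model.norm(x)
    return x.mean(dim=1)               # (B, D) global image representation
```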

bryanwong17 (Author) commented:

Hi @yanjk3, I followed your advice, but the accuracy is about 3% lower than the original DINO under the same eval_knn.py settings. Do you have any suggestions? Is there a more reliable way to measure accuracy, and is it better to evaluate with eval_linear.py or eval_knn.py? Thanks

yanjk3 commented Dec 30, 2022

To avoid the performance drop, I recommend copying the SelfPatch ViT code into the DINO ViT.
The main difference between them is:

SelfPatch uses a cross-attention (CA) block after the ViT blocks to aggregate the global feature representation and output the CLS token.

If you use this CLS token, the performance may improve.
Unfortunately, the released checkpoint only contains the ViT backbone, so if you want a precise answer you would have to pre-train the entire model yourself.
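
For intuition only, here is a conceptual sketch of that kind of cross-attention pooling: a learnable query attends over the patch tokens and produces a CLS-like global token. This is not the repository's SelfPatchHead; the class name, head count, and layer layout are assumptions:

```python
import torch
import torch.nn as nn

class CrossAttentionPool(nn.Module):
    """Learnable query that cross-attends over patch tokens to produce a global token."""

    def __init__(self, dim=384, num_heads=6):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_tokens):                        # (B, N, dim)
        q = self.query.expand(patch_tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, patch_tokens, patch_tokens)
        return self.norm(pooled.squeeze(1))                 # (B, dim) CLS-like token

cls_like = CrossAttentionPool()(torch.randn(2, 196, 384))
print(cls_like.shape)  # torch.Size([2, 384])
```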

bryanwong17 (Author) commented:

Hi @yanjk3, sorry, I don't really get it. What do you mean by copying the SelfPatch ViT into the DINO ViT?

yanjk3 commented Dec 30, 2022

I mean you should replace the DINO ViT model's code with the SelfPatch ViT model's code.

bryanwong17 (Author) commented:

Hi @yanjk3, do you mean adding everything you previously suggested to the DINO ViT model's code (vision_transformer.py)?

alijavidani commented Jun 8, 2023

Hi @bryanwong17, @yanjk3, I'm having the same problem: I cannot run the evaluation with eval_knn.py.
Were you able to find a solution?
Thanks in advance.
