
Questions about prediction of SGNP #288

Open
JianxiangFENG opened this issue Feb 2, 2021 · 5 comments

@JianxiangFENG

Hi @jereliu ,

I have a few questions about the inference stage of SGNP:

  1. According to Eq. (9) and Algorithm 1 in the paper, shouldn't there be K precision matrices, one per output dimension, where K is the number of classes? Each one would have shape [batch_size, batch_size], so the total would be [K, batch_size, batch_size]. Am I misunderstanding something? In the code, I can only find a single covariance matrix of shape [batch_size, batch_size] (a minimal sketch of my reading follows this list).
  2. After searching the code for a while, I couldn't find the sampling step, i.e., step 5 in Algorithm 2. Without this sampling step, the prediction is similar to a MAP prediction except for the differences during training. This way of making predictions should be essential to the method, right?
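
To make sure I'm reading the algorithm correctly, here is a minimal numpy sketch of my understanding: per-class precision matrices as I read Eq. (9) versus a single shared matrix. The shapes and names below (D for the random-feature dimension, phi for the penultimate features) are my own illustration, not taken from the repo.

```python
import numpy as np

# Minimal numpy sketch of my reading (not the repo's code), assuming D random
# features phi(x) in R^D and K classes; all names/shapes here are illustrative.
rng = np.random.default_rng(0)
N, D, K = 32, 16, 10                       # batch size, feature dim, classes
phi = rng.normal(size=(N, D))              # penultimate features for a batch
probs = rng.dirichlet(np.ones(K), N)       # softmax outputs p_ik (placeholders)

# (a) Per-class Laplace precision, as I read Eq. (9): one D x D matrix per
#     class, each weighted by p_ik * (1 - p_ik), i.e. K matrices in total.
prec_per_class = np.stack([
    np.eye(D) + (phi * (probs[:, k] * (1 - probs[:, k]))[:, None]).T @ phi
    for k in range(K)
])                                         # shape [K, D, D]

# (b) A single precision matrix shared by all classes, as under a Gaussian
#     likelihood, where the p * (1 - p) weights drop out.
prec_shared = np.eye(D) + phi.T @ phi      # shape [D, D]

# The predictive covariance over a batch, phi Sigma phi^T, is what ends up
# with shape [batch_size, batch_size].
sigma = np.linalg.inv(prec_shared)
pred_cov = phi @ sigma @ phi.T             # shape [N, N]
print(prec_per_class.shape, prec_shared.shape, pred_cov.shape)
```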

I would appreciate it if you could explain these points in more detail.

Best,
Jianxiang

@jereliu
Collaborator

jereliu commented Feb 3, 2021

Hi Jianxiang,

Thanks for getting in touch! Sorry for the confusion about the mismatch between the paper and this implementation. Yes, we made two changes for computational feasibility / performance reasons:

  1. After some experimentation, we replaced the Laplace-approximated posterior variance with the one under a Gaussian likelihood, so that a single matrix is shared across all classes. The two reasons for this change are (1) computational feasibility (especially for ImageNet-scale tasks) and (2) empirically better OOD performance.

  2. We replaced the Monte-Carlo approximation with the mean-field approximation for computational feasibility (e.g., here; this is mentioned in Appendix A).
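
For concreteness, here is a minimal sketch of what this mean-field adjustment does; the function name and the pi/8 constant below are illustrative, not the exact code in this repo.

```python
import numpy as np

# Illustrative sketch of the mean-field adjustment (not the exact repo code):
# rather than Monte-Carlo sampling logits from N(mean, var), scale each logit
# mean by 1 / sqrt(1 + lambda * var) and apply the softmax once. lambda = pi/8
# is the standard logistic-Gaussian approximation constant.
def mean_field_softmax(logit_mean, logit_var, lam=np.pi / 8.0):
    """logit_mean: [N, K] means; logit_var: [N] or [N, K] predictive variances."""
    logit_var = np.reshape(logit_var, (logit_mean.shape[0], -1))
    scaled = logit_mean / np.sqrt(1.0 + lam * logit_var)   # broadcasts over K
    scaled -= scaled.max(axis=-1, keepdims=True)            # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)
```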

@JianxiangFENG
Author

JianxiangFENG commented Feb 5, 2021

Thank you for the quick reply!

  1. After some experimentation, we replaced the Laplace-approximated posterior variance with the one under a Gaussian likelihood, so that a single matrix is shared across all classes. The two reasons for this change are (1) computational feasibility (especially for ImageNet-scale tasks) and (2) empirically better OOD performance.

OK, that is more computationally efficient. However, I don't get the intuition for why a single variance shared across all classes can lead to better performance; it doesn't seem to make a lot of sense. It's just like temperature scaling with a single temperature hyperparameter, instead of modelling the uncertainty for each class. Maybe in other scenarios different variances for different classes are needed. But thanks for letting me know about this.

  2. We replaced the Monte-Carlo approximation with the mean-field approximation for computational feasibility (e.g., here; this is mentioned in Appendix A).

This is a neat and simple approximation. I am wondering how large the difference between the sampling and the approximation is. I am fairly sure you have run experiments on that. Are there any systematic comparisons or take-home messages about this?
Thank you in advance!
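
Out of curiosity, I also put together a tiny toy comparison (entirely my own illustration, not from the paper) of Monte-Carlo averaging over sampled logits versus the mean-field scaling for a single example:

```python
import numpy as np

# Toy comparison (my own illustration, not from the paper): Monte-Carlo
# averaging of the softmax over sampled logits vs. the mean-field scaling,
# for one example with independent Gaussian logits N(mu, var).
rng = np.random.default_rng(0)
mu = np.array([2.0, 0.5, -1.0])            # predictive logit means (arbitrary)
var = 1.5                                  # predictive logit variance (arbitrary)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Monte-Carlo estimate (the sampling step, i.e. step 5 of Algorithm 2).
samples = mu + np.sqrt(var) * rng.normal(size=(10000, mu.size))
p_mc = softmax(samples).mean(axis=0)

# Mean-field estimate: scale the mean logits by 1 / sqrt(1 + (pi/8) * var).
p_mf = softmax(mu / np.sqrt(1.0 + (np.pi / 8.0) * var))

print("Monte-Carlo:", np.round(p_mc, 3))
print("Mean-field :", np.round(p_mf, 3))
```

Increasing `var` flattens both predictions toward uniform, which makes it easy to eyeball how closely the two stay together.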

@mdabbah

mdabbah commented Mar 1, 2021

Hi,
Just throwing in a possible explanation for 1: maybe one covariance matrix for all classes is better because it reduces overfitting. On large datasets we might see the opposite (more intuitive) effect, i.e., better performance with a covariance matrix for each class, since there we would have enough data to approximate a per-class covariance matrix.

@Jordy-VL

Jordy-VL commented Jun 3, 2021


@JianxiangFENG Did you ever get or figure out an answer to your last question, i.e., how large the difference between the sampling and the mean-field approximation is? I am wondering this myself :)

@JianxiangFENG
Author

@Jordy-VL Hey, I did not follow up on it in the end. But the relevant paper (https://arxiv.org/abs/2006.0758) is worth reading.
