
Question about optimizing away nuisance hyperparameters and the need for a secondary validation set #79

Open
kkew3 opened this issue Dec 11, 2024 · 0 comments


I'm reading the subsection "Identify scientific, nuisance, and fixed hyperparameters", and I have the following question: do we need a secondary validation set in order to evaluate the scientific hyperparameters after optimizing away the nuisance hyperparameters on the (primary) validation set?

Denote the training set by $D_\mathrm{train}$, the validation set by $D_\mathrm{val}$, the secondary validation set by $D_\mathrm{val2}$, the learnable parameters by $\theta$, the nuisance hyperparameters by $\phi$, and the scientific hyperparameters by $\psi$. Assume there are no conditional hyperparameters, for simplicity.

For example, if our goal is to "determine whether a model with more hidden layers will reduce validation error", then the number of hidden layers is a scientific hyperparameter.

From a Bayesian point of view, we may restate the goal as evaluating $p(D_\mathrm{val} \mid D_\mathrm{train}, \psi)$ for different $\psi$. It follows that:

$$ \begin{aligned} p(D_\mathrm{val} \mid D_\mathrm{train}, \psi) &= \sum_{\theta,\phi} p(\theta, \phi, D_\mathrm{val} \mid D_\mathrm{train}, \psi)\\ &= \sum_{\theta,\phi} p(\phi) p(\theta \mid D_\mathrm{train}, \phi, \psi) p(D_\mathrm{val} \mid \theta, \phi, \psi).\\ \end{aligned} $$

If we learn $\theta$ by maximum likelihood, then the posterior of $\theta$ reduces to a point estimate at $\hat\theta(\phi, \psi) = \arg\max_\theta p(D_\mathrm{train} \mid \theta, \phi, \psi)$:

$$ p(D_\mathrm{val} \mid D_\mathrm{train}, \psi) = \sum_\phi p(\phi) p(D_\mathrm{val} \mid \hat\theta(\phi, \psi), \phi, \psi). $$
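To spell out the collapse of the sum over $\theta$ (a step the derivation uses implicitly): treating the maximum-likelihood posterior as a point mass, written here as a Kronecker delta in this discrete notation,

$$ p(\theta \mid D_\mathrm{train}, \phi, \psi) = \delta_{\theta,\, \hat\theta(\phi, \psi)} \quad\Longrightarrow\quad \sum_\theta \delta_{\theta,\, \hat\theta(\phi, \psi)}\, p(D_\mathrm{val} \mid \theta, \phi, \psi) = p(D_\mathrm{val} \mid \hat\theta(\phi, \psi), \phi, \psi). $$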

However, in this expression we can't see where the process of "optimizing away" the nuisance hyperparameters takes place.

On the other hand, suppose we rephrase our goal as "evaluating $p(D_\mathrm{val2} \mid D_\mathrm{train}, D_\mathrm{val}, \psi)$ for different $\psi$", i.e., measuring the effect of the scientific hyperparameters on the validation error over a secondary validation set after tuning the nuisance hyperparameters on the primary validation set:

$$ \begin{aligned} p(D_\mathrm{val2} \mid D_\mathrm{train}, D_\mathrm{val}, \psi) &= \sum_{\theta,\phi} p(\theta, \phi, D_\mathrm{val2} \mid D_\mathrm{train}, D_\mathrm{val}, \psi)\\ &= \sum_{\theta,\phi} p(\phi \mid D_\mathrm{train}, D_\mathrm{val}, \psi)\, p(\theta \mid D_\mathrm{train}, \phi, \psi)\, p(D_\mathrm{val2} \mid \theta, \phi, \psi), \end{aligned} $$

where, assuming an uninformative prior $p(\phi) \propto 1$ and again using the point-estimate posterior of $\theta$, the posterior of $\phi$ satisfies $p(\phi \mid D_\mathrm{train}, D_\mathrm{val}, \psi) \propto p(\phi, D_\mathrm{val} \mid D_\mathrm{train}, \psi) = \sum_\theta p(\phi)\, p(\theta \mid D_\mathrm{train}, \phi, \psi)\, p(D_\mathrm{val} \mid \theta, \phi, \psi) \propto p(D_\mathrm{val} \mid \hat\theta(\phi, \psi), \phi, \psi)$.

Here, the "optimizing away" of $\phi$ becomes apparent if we again resort to a point estimate of $\phi$, namely $\hat\phi(\psi) = \arg\max_\phi p(D_\mathrm{val} \mid \hat\theta(\phi, \psi), \phi, \psi)$: for each $\psi$, we find the best $\hat\phi(\psi)$ on the (primary) validation set. The effect of the scientific hyperparameters $\psi$ is then evaluated by:

$$ p(D_\mathrm{val2} \mid D_\mathrm{train}, D_\mathrm{val}, \psi) = p(D_\mathrm{val2} \mid \hat\theta(\hat\phi(\psi), \psi), \hat\phi(\psi), \psi). $$
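The two-stage procedure above can be sketched in a few lines of Python. Everything here is a toy stand-in (the `train` and `loss` functions, the datasets, and the hyperparameter grids are all made up for illustration), but the control flow mirrors the derivation: an inner optimization of $\phi$ on $D_\mathrm{val}$ for each $\psi$, followed by an outer evaluation of $\psi$ on $D_\mathrm{val2}$.

```python
# Toy stand-ins (hypothetical): "train" plays the role of computing
# theta_hat(phi, psi), and "loss" scores a trained model on a dataset.
# In practice these would be real training runs and validation losses.
def train(psi, phi, d_train):
    # Pretend the fitted parameters are fully determined by the
    # hyperparameters; we just carry them along.
    return {"psi": psi, "phi": phi}

def loss(model, dataset):
    # A made-up loss surface: best at phi == 0.1, and improving with
    # larger psi, so the example has a well-defined optimum.
    return (model["phi"] - 0.1) ** 2 + 1.0 / model["psi"] + 0.001 * sum(dataset)

d_train, d_val, d_val2 = [1, 2], [3], [4]

scientific = [1, 2, 3]        # psi: e.g. number of hidden layers
nuisance = [0.01, 0.1, 1.0]   # phi: e.g. learning rate

results = {}
for psi in scientific:
    # Optimize away the nuisance hyperparameters on the PRIMARY
    # validation set: phi_hat(psi) = argmin_phi loss(theta_hat(phi, psi)).
    phi_hat = min(nuisance, key=lambda phi: loss(train(psi, phi, d_train), d_val))
    # Evaluate the scientific hyperparameter on the SECONDARY set.
    results[psi] = loss(train(psi, phi_hat, d_train), d_val2)

best_psi = min(results, key=results.get)
```

Note that `results[psi]` is exactly the quantity the question asks about: a score for each $\psi$ computed on $D_\mathrm{val2}$, after $\phi$ has already been tuned against $D_\mathrm{val}$.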

Am I missing something? Thank you so much!
