
Question about optimizing away nuisance hyperparameters and the need for a secondary validation set #79

Open
kkew3 opened this issue Dec 11, 2024 · 0 comments


I'm reading the subsection "Identify scientific, nuisance, and fixed hyperparameters", and I have the following question: do we need a secondary validation set in order to evaluate the scientific hyperparameters after optimizing away the nuisance hyperparameters on the (primary) validation set?

Denote the training set by $D_\mathrm{train}$, the validation set by $D_\mathrm{val}$, the secondary validation set by $D_\mathrm{val2}$, the learnable parameters by $\theta$, the nuisance hyperparameters by $\phi$, and the scientific hyperparameters by $\psi$. Assume there are no conditional hyperparameters, for simplicity.

For example, if our goal is to "determine whether a model with more hidden layers will reduce validation error", then the number of hidden layers is a scientific hyperparameter.

From a Bayesian point of view, we may restate the goal as evaluating $p(D_\mathrm{val} \mid D_\mathrm{train}, \psi)$ for different $\psi$. It follows that:

$$ \begin{aligned} p(D_\mathrm{val} \mid D_\mathrm{train}, \psi) &= \sum_{\theta,\phi} p(\theta, \phi, D_\mathrm{val} \mid D_\mathrm{train}, \psi)\\ &= \sum_{\theta,\phi} p(\phi) p(\theta \mid D_\mathrm{train}, \phi, \psi) p(D_\mathrm{val} \mid \theta, \phi, \psi).\\ \end{aligned} $$

If we learn $\theta$ by maximum likelihood, then the posterior of $\theta$ reduces to a point estimate at $\hat\theta(\phi, \psi) = \arg\max_\theta p(D_\mathrm{train} \mid \theta, \phi, \psi)$:

$$ p(D_\mathrm{val} \mid D_\mathrm{train}, \psi) = \sum_\phi p(\phi) p(D_\mathrm{val} \mid \hat\theta(\phi, \psi), \phi, \psi). $$
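To spell out the collapse of the sum over $\theta$ (a step the derivation uses implicitly): treating the maximum-likelihood posterior as a point mass, written here as a Kronecker delta in this discrete notation,

$$ p(\theta \mid D_\mathrm{train}, \phi, \psi) = \delta_{\theta,\, \hat\theta(\phi, \psi)} \quad\Longrightarrow\quad \sum_\theta \delta_{\theta,\, \hat\theta(\phi, \psi)}\, p(D_\mathrm{val} \mid \theta, \phi, \psi) = p(D_\mathrm{val} \mid \hat\theta(\phi, \psi), \phi, \psi). $$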

However, in this expression we can't see where the process of "optimizing away" the nuisance hyperparameters takes place.

On the other hand, suppose we rephrase our goal as "evaluating $p(D_\mathrm{val2} \mid D_\mathrm{train}, D_\mathrm{val}, \psi)$ for different $\psi$", i.e., measuring the effect of the scientific hyperparameters on the validation error over a secondary validation set after tuning the nuisance hyperparameters on the primary validation set:

$$ \begin{aligned} p(D_\mathrm{val2} \mid D_\mathrm{train}, D_\mathrm{val}, \psi) &= \sum_{\theta,\phi} p(\theta, \phi, D_\mathrm{val2} \mid D_\mathrm{train}, D_\mathrm{val}, \psi)\\ &= \sum_{\theta,\phi} p(\phi \mid D_\mathrm{train}, D_\mathrm{val}, \psi)\, p(\theta \mid D_\mathrm{train}, \phi, \psi)\, p(D_\mathrm{val2} \mid \theta, \phi, \psi), \end{aligned} $$

where, assuming an uninformative prior $p(\phi) \propto 1$ and again using the point-estimate posterior of $\theta$, the posterior of $\phi$ satisfies $p(\phi \mid D_\mathrm{train}, D_\mathrm{val}, \psi) \propto p(\phi, D_\mathrm{val} \mid D_\mathrm{train}, \psi) = \sum_\theta p(\phi)\, p(\theta \mid D_\mathrm{train}, \phi, \psi)\, p(D_\mathrm{val} \mid \theta, \phi, \psi) \propto p(D_\mathrm{val} \mid \hat\theta(\phi, \psi), \phi, \psi)$.

Here, the "optimizing away" of $\phi$ becomes apparent if we again resort to a point estimate of $\phi$, namely $\hat\phi(\psi) = \arg\max_\phi p(D_\mathrm{val} \mid \hat\theta(\phi, \psi), \phi, \psi)$: for each $\psi$, we find the best $\hat\phi(\psi)$ on the (primary) validation set. The effect of the scientific hyperparameters $\psi$ is then evaluated by:

$$ p(D_\mathrm{val2} \mid D_\mathrm{train}, D_\mathrm{val}, \psi) = p(D_\mathrm{val2} \mid \hat\theta(\hat\phi(\psi), \psi), \hat\phi(\psi), \psi). $$
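The two-stage procedure above can be sketched in a few lines of Python. Everything here is a toy stand-in (the `train` and `loss` functions, the datasets, and the hyperparameter grids are all made up for illustration), but the control flow mirrors the derivation: an inner optimization of $\phi$ on $D_\mathrm{val}$ for each $\psi$, followed by an outer evaluation of $\psi$ on $D_\mathrm{val2}$.

```python
# Toy stand-ins (hypothetical): "train" plays the role of computing
# theta_hat(phi, psi), and "loss" scores a trained model on a dataset.
# In practice these would be real training runs and validation losses.
def train(psi, phi, d_train):
    # Pretend the fitted parameters are fully determined by the
    # hyperparameters; we just carry them along.
    return {"psi": psi, "phi": phi}

def loss(model, dataset):
    # A made-up loss surface: best at phi == 0.1, and improving with
    # larger psi, so the example has a well-defined optimum.
    return (model["phi"] - 0.1) ** 2 + 1.0 / model["psi"] + 0.001 * sum(dataset)

d_train, d_val, d_val2 = [1, 2], [3], [4]

scientific = [1, 2, 3]        # psi: e.g. number of hidden layers
nuisance = [0.01, 0.1, 1.0]   # phi: e.g. learning rate

results = {}
for psi in scientific:
    # Optimize away the nuisance hyperparameters on the PRIMARY
    # validation set: phi_hat(psi) = argmin_phi loss(theta_hat(phi, psi)).
    phi_hat = min(nuisance, key=lambda phi: loss(train(psi, phi, d_train), d_val))
    # Evaluate the scientific hyperparameter on the SECONDARY set.
    results[psi] = loss(train(psi, phi_hat, d_train), d_val2)

best_psi = min(results, key=results.get)
```

Note that `results[psi]` is exactly the quantity the question asks about: a score for each $\psi$ computed on $D_\mathrm{val2}$, after $\phi$ has already been tuned against $D_\mathrm{val}$.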

Am I missing something? Thank you so much!
