I'm reading the subsection identify scientific, nuisance, and fixed hyperparameters, and have the following question: do we need a secondary validation set in order to evaluate the scientific hyperparameters after optimizing away the nuisance hyperparameters on the (primary) validation set?
Denote the training set by $D_\mathrm{train}$, the validation set by $D_\mathrm{val}$, the secondary validation set by $D_\mathrm{val2}$, the learnable parameters by $\theta$, the nuisance hyperparameters by $\phi$, and the scientific hyperparameter by $\psi$. Assume there are no conditional hyperparameters, for simplicity.
For example, if our goal is to "determine whether a model with more hidden layers will reduce validation error", then the number of hidden layers is a scientific hyperparameter.
From a Bayesian point of view, we may restate the goal as evaluating $p(D_\mathrm{val} \mid D_\mathrm{train}, \psi)$ for different $\psi$. It follows that:

$$p(D_\mathrm{val} \mid D_\mathrm{train}, \psi) = \sum_\phi \sum_\theta p(\phi)\, p(\theta \mid D_\mathrm{train}, \phi, \psi)\, p(D_\mathrm{val} \mid \theta, \phi, \psi).$$
If we learn $\theta$ by maximum likelihood, the posterior of $\theta$ reduces to a point estimate at $\hat\theta(\phi, \psi) = \arg\max_\theta p(D_\mathrm{train} \mid \theta, \phi, \psi)$:

$$p(D_\mathrm{val} \mid D_\mathrm{train}, \psi) = \sum_\phi p(\phi)\, p(D_\mathrm{val} \mid \hat\theta(\phi, \psi), \phi, \psi).$$

However, this expression only averages over $\phi$; we cannot see where the process of "optimizing away" the nuisance hyperparameters takes place.
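As a tiny numerical illustration of this point (the likelihood values and the size of the $\phi$ grid below are made up for the sketch), Bayesian marginalization over $\phi$ blends the per-$\phi$ validation likelihoods rather than selecting the best one:

```python
import numpy as np

# Hypothetical validation likelihoods p(D_val | theta_hat(phi, psi), phi, psi)
# for three nuisance candidates phi, at one fixed scientific setting psi.
lik_per_phi = np.array([0.2, 0.9, 0.4])
prior = np.full(3, 1 / 3)  # uninformative prior p(phi)

marginal = float(prior @ lik_per_phi)  # marginalization: a weighted average
best = float(lik_per_phi.max())        # "optimizing away": a maximization

# The marginal (~0.5) blends all phi candidates; the max (0.9) picks one.
print(marginal, best)
```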
On the other hand, suppose we rephrase our goal as "evaluating $p(D_\mathrm{val2} \mid D_\mathrm{train}, D_\mathrm{val}, \psi)$ for different $\psi$", i.e., measuring the effect of the scientific hyperparameter on the error over a secondary validation set after tuning the nuisance hyperparameters on the primary validation set:

$$p(D_\mathrm{val2} \mid D_\mathrm{train}, D_\mathrm{val}, \psi) = \sum_\phi p(\phi \mid D_\mathrm{train}, D_\mathrm{val}, \psi)\, p(D_\mathrm{val2} \mid \hat\theta(\phi, \psi), \phi, \psi).$$
Here we clearly see where $\phi$ is optimized away if we again resort to a point estimate of $\phi$, namely $\hat\phi(\psi) = \arg\max_\phi p(D_\mathrm{val} \mid \hat\theta(\phi, \psi), \phi, \psi)$: for each $\psi$, we find the best $\hat\phi(\psi)$ over the primary validation set. Then, to evaluate the effect of the scientific hyperparameter $\psi$:

$$p(D_\mathrm{val2} \mid D_\mathrm{train}, D_\mathrm{val}, \psi) \approx p(D_\mathrm{val2} \mid \hat\theta(\hat\phi(\psi), \psi), \hat\phi(\psi), \psi).$$
Here the posterior is $p(\phi \mid D_\mathrm{train}, D_\mathrm{val}, \psi) \propto p(\phi, D_\mathrm{val} \mid D_\mathrm{train}, \psi) = \sum_\theta p(\phi)\, p(\theta \mid D_\mathrm{train}, \phi, \psi)\, p(D_\mathrm{val} \mid \theta, \phi, \psi) = p(D_\mathrm{val} \mid \hat\theta(\phi, \psi), \phi, \psi)$, assuming an uninformative prior $p(\phi) \propto 1$.
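To make the two-stage procedure concrete, here is a minimal runnable sketch under toy assumptions I'm inventing for illustration: $\psi$ = polynomial degree, $\phi$ = a ridge penalty $\lambda$, $\theta$ = coefficients fit by penalized least squares on $D_\mathrm{train}$, and validation MSE standing in for the negative log-likelihood (as under Gaussian noise):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: cubic ground truth with noise.
def make_split(n):
    x = rng.uniform(-1, 1, n)
    y = x**3 - x + 0.1 * rng.normal(size=n)
    return x, y

x_tr, y_tr = make_split(50)      # D_train: fit theta
x_val, y_val = make_split(50)    # D_val: tune nuisance phi (lambda)
x_val2, y_val2 = make_split(50)  # D_val2: compare scientific psi (degree)

def features(x, degree):
    return np.vander(x, degree + 1)

def fit_theta(x, y, degree, lam):
    # theta_hat(phi, psi): closed-form ridge solution on D_train.
    A = features(x, degree)
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)

def mse(theta, x, y, degree):
    return float(np.mean((features(x, degree) @ theta - y) ** 2))

lam_grid = [1e-4, 1e-2, 1e0]  # nuisance phi candidates
results = {}
for degree in [1, 3]:         # scientific psi candidates
    # Optimize away phi on the primary validation set, per psi.
    best_lam = min(
        lam_grid,
        key=lambda lam: mse(fit_theta(x_tr, y_tr, degree, lam), x_val, y_val, degree),
    )
    theta = fit_theta(x_tr, y_tr, degree, best_lam)
    # Judge psi on the untouched secondary validation set.
    results[degree] = mse(theta, x_val2, y_val2, degree)

print(results)
```

The degrees ($\psi$) are compared only through their $D_\mathrm{val2}$ scores, while $\lambda$ ($\phi$) was optimized away on $D_\mathrm{val}$ separately for each degree.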
Am I missing something? Thank you so much!