
[Docs] Way to model heteroskedastic noise? #982

Open · ArnoVel opened this issue Dec 2, 2019 · 10 comments
ArnoVel commented Dec 2, 2019

📚 Documentation/Examples

Hi,
I am fairly new to gpytorch and have only very basic knowledge of GPs in general.
I found this paper, which uses latent variables (with a Gaussian prior) as additional inputs to model heteroskedastic relationships.
What would be the easiest way to implement this using gpytorch?

Is there documentation missing?

Perhaps some documentation already exists on this topic. If so, please direct me to the relevant pages!

Thanks,
A V

ArnoVel commented Dec 2, 2019

If I were to do it my way, I would build a kernel in which the latent variables are extra parameters to optimise over, but I wouldn't know how to put priors on them, nor how to integrate over them as in the paper.

Balandat (Collaborator) commented Dec 2, 2019

Do you have observations for the observation noise? If so, take a look at the HeteroskedasticSingleTaskGP that is implemented in BoTorch. That uses a nested GP model to model the log-variance of the observation noise.

If not, we also have pytorch/botorch#250, which uses a "most likely heteroskedastic GP" to infer the noise. That one still needs some cleanup, though.
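For the HeteroskedasticSingleTaskGP route, usage is roughly along these lines (a minimal sketch; the synthetic data and noise variances below are just for illustration, you would plug in your own observed variances):

import torch
from botorch.models import HeteroskedasticSingleTaskGP
from botorch.fit import fit_gpytorch_model
from gpytorch.mlls import ExactMarginalLogLikelihood

# toy data with input-dependent noise; train_Yvar holds the observed noise variances
train_X = torch.rand(20, 1)
noise_sd = 0.05 + 0.2 * train_X
train_Y = torch.sin(6 * train_X) + noise_sd * torch.randn(20, 1)
train_Yvar = noise_sd ** 2  # per-point observation noise variances

model = HeteroskedasticSingleTaskGP(train_X=train_X, train_Y=train_Y, train_Yvar=train_Yvar)
mll = ExactMarginalLogLikelihood(model.likelihood, model)
fit_gpytorch_model(mll)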

ArnoVel commented Dec 2, 2019

Wouldn't you say that a 2-layer GP is sufficient?
g(f(X) + e1) + e2 = Y
allows me to fit both a noise distribution and a mapping (fix X = x, then adjust g so that the distribution of g(f(x) + e1) + e2 yields a better fit).
My goal is to have a flexible way to estimate a relationship that doesn't depend on X only through additive noise.
Edit: most papers related to deep GPs keep the Gaussian likelihood only on the final layer's outputs, therefore assuming Y = f(g(X)) + E.
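For concreteness, a toy version of the generative process I have in mind above (f, g, and the noise scales are just placeholders):

import torch

n = 200
X = torch.linspace(-3, 3, n).unsqueeze(-1)
f = lambda x: torch.sin(2 * x)    # inner latent mapping (placeholder)
g = lambda h: h ** 3              # outer mapping (placeholder)
e1 = 0.3 * torch.randn(n, 1)      # noise entering before g
e2 = 0.05 * torch.randn(n, 1)     # observation noise on the output
Y = g(f(X) + e1) + e2             # the noise level of Y varies with X through g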

ArnoVel commented Dec 5, 2019

Hi,

I've tried to run the example you gave, and the following issue appeared:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-55-029c43a7c5c1> in fit_most_likely_HeteroskedasticGP(train_X, train_Y, covar_module, num_var_samples, max_iter, atol_mean, atol_var)
     65         try:
---> 66             botorch.fit.fit_gpytorch_model(hetero_mll)
     67         except Exception as e:
.
.
.

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

I do believe it comes from this call

hetero_model = HeteroskedasticSingleTaskGP(train_X=train_X, train_Y=train_Y,
                                           train_Yvar=observed_var)

with the observed variances obtained from a homoskedastic GPyTorch GP fit:

botorch.fit.fit_gpytorch_model(homo_mll)

# get estimates of noise
homo_mll.eval()
with torch.no_grad():
    homo_posterior = homo_mll.model.posterior(train_X.clone())
    homo_predictive_posterior = homo_mll.model.posterior(train_X.clone(),
                                                         observation_noise=True)
sampler = IIDNormalSampler(num_samples=num_var_samples, resample=True)
predictive_samples = sampler(homo_predictive_posterior)
observed_var = 0.5 * ((predictive_samples - train_Y.reshape(-1, 1))**2).mean(dim=0)

I do not know what happens inside the sampler, but it appears that the initial variance estimate is still connected to the graph of the homoskedastic GP, so that when fitting the heteroskedastic GP the code ends up computing the gradient of the heteroskedastic noise likelihood w.r.t. the parameters of the homoskedastic GP (among other parameters).

However, the original paper does not seem to suggest that one should change the target GP's parameters in order to fit the likelihood of the noise model r(x) (it is written that the two GPs should be independent). While the combined model uses the most likely noise to then predict the targets and fit the GP f(x) = t under the new noise, it seems your implementation attempts to fit the GP twice.

I am referring to the following paragraphs:

To summarize the procedure, we take the current noise model and complete the data, i.e., make the noise levels observed.
We then fix the completed data cases and use them to compute the maximum likelihood parameters of G3

And for the definition of G3 and its likelihood:

  1. Given the input data D={(xi, ti)}, we estimate a standard, homoskedastic GP G1 maximizing the likelihood for predicting t from x.
  2. Given G1, we estimate the empirical noise levels for the training data, i.e., z′i=log (var[ti, G1(xi,D)]), forming a new data set D′={(x1, z′1),(x2, z′2), . . . ,(xn, z′n)}.
  3. On D′, we estimate a second GP G2.
  4. Now we estimate the combined GP G3 on D using G2 to predict the (logarithmic) noise levels ri.
  5. If not converged, we set G1=G3 and go to step 2

This makes clear that at each step the noise levels of the previous GP should be fixed, and should not change so as to fit G2.
After setting

hetero_model = HeteroskedasticSingleTaskGP(train_X=train_X, train_Y=train_Y,
                                           train_Yvar=observed_var.detach())

I no longer have the aforementioned issue.
However, I agree that it is not clear whether step 4 should involve a joint optimization over the parameters of both G1 and G2, but since the procedure is inspired by EM, it probably shouldn't.
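For reference, here is a rough sketch of the loop as I read the paper (import paths are those of the current botorch release, the stopping rule is a placeholder, and the observed variances stay fixed because they are computed under no_grad):

import torch
from botorch.models import SingleTaskGP, HeteroskedasticSingleTaskGP
from botorch.fit import fit_gpytorch_model
from botorch.sampling.samplers import IIDNormalSampler
from gpytorch.mlls import ExactMarginalLogLikelihood

def fit_most_likely_heteroskedastic_gp(train_X, train_Y, num_iters=10, num_var_samples=100):
    # step 1: homoskedastic GP G1
    model = SingleTaskGP(train_X, train_Y)
    mll = ExactMarginalLogLikelihood(model.likelihood, model)
    fit_gpytorch_model(mll)
    for _ in range(num_iters):
        # step 2: empirical (fixed) noise levels from the current model
        model.eval()
        with torch.no_grad():
            post = model.posterior(train_X, observation_noise=True)
            sampler = IIDNormalSampler(num_samples=num_var_samples, resample=True)
            samples = sampler(post)
            observed_var = 0.5 * ((samples - train_Y) ** 2).mean(dim=0)
        # steps 3-4: noise GP G2 and combined GP G3
        # (both live inside HeteroskedasticSingleTaskGP)
        model = HeteroskedasticSingleTaskGP(train_X, train_Y, observed_var)
        mll = ExactMarginalLogLikelihood(model.likelihood, model)
        fit_gpytorch_model(mll)
        # step 5: convergence check on the noise levels omitted here
    return model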

Essentially, I would like to know

  • whether HeteroskedasticSingleTaskGP models P(t | x, z_tr, x_tr) with z_tr fixed, doing so by fitting some GP P(z | x) and some GP P(t | x, z), or whether it trains the noise and target mappings in a different way/order;
  • in particular, if it assumes z_tr fixed and fits both the noise model and the target model, one does not need to keep observed_var attached to the graph to optimize over the noise model;
  • whether each call to HeteroskedasticSingleTaskGP resets the parameters of each of the GPs, or just resumes the optimization using the noise levels obtained from evaluating the targets at the last step.

Any links, explanations or hints would help me greatly!

ArnoVel commented Dec 6, 2019

Hi again,
I ran the code without knowing whether it was correct and obtained a very similar fit for 20,30,40,50 or 60 runs on my data:
[plot: heteroskedastic GP fit (test_htr)]

Compared to a standard homoskedastic GP fit exactly like the one you use to initialize the MLHGP method:
[plot: homoskedastic GP fit (test_homo)]

Would you say the predictions are as expected?
What seems different is the variance at X values where the points are few but concentrated around the mean. In this case the heteroskedastic model is much less certain about the correct value of Y, but that might be due to the underlying priors rather than to the type of noise?

ArnoVel commented Dec 6, 2019

For reference: the Colab link showing how to use a simpler model uses the following snippet to observe the variance:

with torch.no_grad():
    # watch broadcasting here
    observed_var = torch.tensor(
        np.power(mll.model.posterior(X_train).mean.numpy().reshape(-1,) - y_train.numpy(), 2),
        dtype=torch.float,
    )

Detaching the variance seems to be the correct way to go.

ArnoVel commented Dec 6, 2019

Another related question: if I wanted to save the previous heteroskedastic model and resume training instead of fitting a new one, would there be a principled/generic way of doing so?
I would somehow extract the parameters of the previous model and give them to the new one...

Balandat (Collaborator) commented Dec 7, 2019

Hi @ArnoVel, sorry for the delay here. Let me see if I can check off your questions.

Generally, I should say that #250 is quite old and does not make use of a number of changes that have happened in gpytorch since then. We need to clean it up and get it up to date.

Wouldn't you say that 2 layer GP is sufficient? g(f(X)+e1) +e2 = Y

That would be a reasonable way to model things as well, depending on the application. One way to do something like this would be to use a multi-task GP (e.g. with an ICM kernel) to model both the output and the noise level. I'm not sure how to go about sticking that modeled noise level into the kernel for computing the posterior, though.

Detaching the inferred observed variance

Yes, you'd want to do observed_var.detach() if you compute observed_var while constructing the graph. Alternatively, you can wrap the whole block (including the sampling and the computation of observed_var) in the no_grad() context. The issue could also be caused by the train_Y used in observed_var. Either way, as mentioned above, there have been some changes in gpytorch regarding auto-detaching test caches after back-propagating, which will likely require some slight changes to the implementation.
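I.e. something like this (sketch only, reusing the variable names from your snippet above):

import torch
from botorch.sampling.samplers import IIDNormalSampler

# assumes homo_mll, train_X, train_Y and num_var_samples from the snippet above
homo_mll.eval()
with torch.no_grad():
    homo_predictive_posterior = homo_mll.model.posterior(
        train_X, observation_noise=True
    )
    sampler = IIDNormalSampler(num_samples=num_var_samples, resample=True)
    predictive_samples = sampler(homo_predictive_posterior)
    observed_var = 0.5 * ((predictive_samples - train_Y.reshape(-1, 1)) ** 2).mean(dim=0)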

This makes clear that at each step, the noise levels of the previous GP should be fixed, and not change as to fit G2.

Correct, this should be a classic EM-style fitting procedure. If it's not, then something is likely going wrong.

Would you say the predictions are as expected?

From the plot, the noise model of the heteroskedastic version seems to overfit the noise variance (i.e. it ends up with very short lengthscales). Not quite sure which version you're running this on, but in the initial version we may not have properly accounted for the log-likelihood of the noise model itself. I put the infrastructure in place for fixing this in #870. Note that this requires adding the NoiseModelAddedLossTerm to the model to work properly.

If I wanted to save the previous heteroskedastic model and resume training instead of fitting a new one, would there be a principled/generic way of doing so?

Yes, you should be able to do the standard pytorch thing of calling state_dict() on the model to get the current parameter values, and then calling load_state_dict() on the new model. If you then fit, this basically amounts to warm-starting the model from the old values.
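Something along these lines (sketch, assuming the training data shapes are unchanged between the two models and using the variable names from your snippets):

from botorch.fit import fit_gpytorch_model
from botorch.models import HeteroskedasticSingleTaskGP
from gpytorch.mlls import ExactMarginalLogLikelihood

old_state = hetero_model.state_dict()  # previous model's parameter values
new_model = HeteroskedasticSingleTaskGP(train_X, train_Y, observed_var.detach())
new_model.load_state_dict(old_state)   # warm-start from the old parameters
mll = ExactMarginalLogLikelihood(new_model.likelihood, new_model)
fit_gpytorch_model(mll)                # optimization resumes from those values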

whether HeteroskedasticSingleTaskGP models P( t | x, z_tr , x_tr) with z_tr fixed, doing so by fitting some GP P(z|x) and some GP P(t| x,z)

Yes, that's how it works: z_tr is assumed fixed during the fitting of HeteroskedasticSingleTaskGP. It fits these two GPs jointly (via the effect of the fitted noise on the output model's likelihood, as well as the NoiseModelAddedLossTerm mentioned above).

whether each call to HeteroskedasticSingleTaskGP resets the parameters of each of the GPs, or just resumes the optimization using the noise levels obtained from evaluating the targets are the last step.

Depends on how you call it. If you use the constructor HeteroskedasticSingleTaskGP(train_X, ...) then it resets the parameters (though you can load the state dict of the previous model, see above). Or you can use the same model but update the training data using the set_train_data functionality.
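E.g. (sketch; note that for the heteroskedastic model the nested noise model holds its own training data, so updated noise levels may still require constructing a new model):

# keep the same model object and swap in new training data
# (gpytorch's ExactGP.set_train_data; strict=False allows a different number of points,
#  and targets should match the model's internal target shape)
hetero_model.set_train_data(inputs=new_train_X, targets=new_train_Y, strict=False)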

I hope this clarifies some things.

ArnoVel commented Dec 8, 2019

Hi @Balandat thanks for this detailed answer!

About "warm starting", I mentioned it because the paper suggests to "set G1=G3", which might imply that the current parameters of G3 should be kept as initial values for the next loop..

About NoiseModelAddedLossTerm: I'll look into it, but I'm not exactly sure how to tinker with the HeteroskedasticGP that's currently available.
If I understood the related paper correctly, it essentially only wants the "target process" fit to depend on the most likely noise (the mode of the noise GP once fit), so if we follow the paper carefully I'm not sure we should include anything that treats the likelihood of the noise and the targets jointly...

Maybe the overfitting comes from using the most likely noise, and not sampling from the noise posterior?

Anyway, I will be trying to implement different versions of this method in the future, so any help would be greatly appreciated 😀

julioasotodv commented
Hi @ArnoVel, you may find this answer in another issue useful: #1158 (comment)
