Variance inconsistency in HeteroskedasticSingleTaskGP #933

Closed

mklpr opened this issue Sep 6, 2021 · 6 comments

@mklpr commented Sep 6, 2021

Hi,
In HeteroskedasticSingleTaskGP, when computing the posterior with observation noise in different ways, I get different results that I can't explain or understand myself, so I'm asking for help here.

I use four ways to compute the posterior with observation noise:

  1. model_heter.posterior(scan_x, observation_noise=True)
  2. mll_heter.likelihood(model_heter.posterior(scan_x, observation_noise=False), scan_x)
  3. model_heter.likelihood.noise_covar.noise_model.posterior(scan_x).mean to compute the noise variance, then add the variance from model_heter.posterior(scan_x, observation_noise=False) to get the total posterior variance
  4. model_heter.likelihood.noise_covar.noise_model(scan_x).mean.exp() to compute the noise variance, then add the variance from model_heter.posterior(scan_x, observation_noise=False) to get the total posterior variance

Methods 1 and 2 give the same results, but methods 3 and 4 each differ from all the others. To my knowledge, the total posterior variance should equal the noise variance from the noise_model plus the variance from the GP kernel, and I verified this for SingleTaskGP. So what goes wrong in HeteroskedasticSingleTaskGP? Does it come from the log transform, and how does mll_heter.likelihood(model_heter.posterior(scan_x, observation_noise=False), scan_x) handle it internally? Thanks.
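For the plain SingleTaskGP this identity is easy to check directly. A minimal sketch, assuming the fitted SingleTaskGP model and the scan_x grid from the test code below:

with torch.no_grad():
    var_latent = model.posterior(scan_x, observation_noise=False).variance
    var_noisy = model.posterior(scan_x, observation_noise=True).variance
    # Homoskedastic case: the noisy variance should equal the latent variance
    # plus the single learned likelihood noise level.
    print((var_noisy - (var_latent + model.likelihood.noise)).abs().max())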

test code

Refer to https://colab.research.google.com/drive/1dOUHQzl3aQ8hz6QUtwRrXlQBGqZadQgG#scrollTo=D0A4Cf0W_QkZ

import os
import torch
import matplotlib.pyplot as plt
import warnings
import numpy as np

plt.rcParams['figure.figsize'] = (14, 8)
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
plt.rcParams['font.size'] = 14

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dtype = torch.double

# warnings.filterwarnings('ignore')

seed = 7433
torch.manual_seed(seed)
np.random.seed(seed)

W_x = np.random.uniform(0, np.pi, size=200)
W_x = np.sort(W_x)
W_y = np.random.normal(loc=(np.sin(2.5*W_x)*np.sin(1.5*W_x)),
                       scale=(0.01 + 0.25*(1-np.sin(2.5*W_x))**2),
                       size=200)

X_train = torch.tensor(W_x.reshape(-1,1), dtype=torch.double)
y_train = torch.tensor(W_y.reshape(-1, 1), dtype=torch.double)

from botorch.models import SingleTaskGP
from gpytorch.constraints import GreaterThan
from gpytorch.mlls import ExactMarginalLogLikelihood
from botorch import fit_gpytorch_model

model = SingleTaskGP(train_X=X_train, train_Y=y_train)
mll = ExactMarginalLogLikelihood(model.likelihood, model)
_ = fit_gpytorch_model(mll)

scan_x = torch.linspace(0, np.pi, 500, dtype=dtype).reshape(-1,1,1)

with torch.no_grad():
    scan_y = model.posterior(scan_x, observation_noise=False)
    plt.plot(scan_x.numpy().reshape(-1), scan_y.mean.reshape(-1))
    
    lower, upper = scan_y.mvn.confidence_region()
    plt.fill_between(scan_x.numpy().reshape(-1), lower.numpy().reshape(-1), upper.numpy().reshape(-1), alpha=0.2)
    
    scan_y_with_noise = model.posterior(scan_x, observation_noise=True)
    lower_with_noise, upper_with_noise = scan_y_with_noise.mvn.confidence_region()
    plt.fill_between(scan_x.numpy().reshape(-1), lower_with_noise.numpy().reshape(-1), upper_with_noise.numpy().reshape(-1), alpha=0.2)
    
    plt.scatter(X_train, y_train)
    
    plt.legend(['posterior mean', 'posterior confidence', 'posterior confidence with noise', 'observed data'])

with torch.no_grad():
    observed_var = torch.pow(model.posterior(X_train).mean - y_train, 2)

from botorch.models import HeteroskedasticSingleTaskGP

model_heter = HeteroskedasticSingleTaskGP(train_X=X_train, train_Y=y_train,
                                    train_Yvar=observed_var)
mll_heter = ExactMarginalLogLikelihood(model_heter.likelihood, model_heter)
_ = fit_gpytorch_model(mll_heter)

mll_heter.eval()
model_heter.eval()
with torch.no_grad():
    plt.figure()
    scan_y = model_heter.posterior(scan_x, observation_noise=False)
    plt.plot(scan_x.numpy().reshape(-1), scan_y.mean.reshape(-1))
    
    lower, upper = scan_y.mvn.confidence_region()
    plt.fill_between(scan_x.numpy().reshape(-1), lower.numpy().reshape(-1), upper.numpy().reshape(-1), alpha=0.2)
    
    scan_y_with_noise = model_heter.posterior(scan_x, observation_noise=True)
    lower_with_noise, upper_with_noise = scan_y_with_noise.mvn.confidence_region()
    plt.fill_between(scan_x.numpy().reshape(-1), lower_with_noise.numpy().reshape(-1), upper_with_noise.numpy().reshape(-1), alpha=0.2)

    scan_y_with_noise2 = mll_heter.likelihood(scan_y.mvn, scan_x)
    lower_with_noise2, upper_with_noise2 = scan_y_with_noise2.confidence_region()
    plt.fill_between(scan_x.numpy().reshape(-1), lower_with_noise2.numpy().reshape(-1), upper_with_noise2.numpy().reshape(-1), alpha=0.2)

    noise_var = model_heter.likelihood.noise_covar.noise_model.posterior(scan_x).mean
    std_with_noise = (scan_y.variance.reshape(-1) + noise_var.reshape(-1)).sqrt()
    plt.fill_between(scan_x.numpy().reshape(-1), (scan_y.mean.reshape(-1) - 2 * std_with_noise.reshape(-1)).numpy(),
                     (scan_y.mean.reshape(-1) + 2 * std_with_noise.reshape(-1)).numpy(), alpha=0.2)

    noise_var2 = model_heter.likelihood.noise_covar.noise_model(scan_x).mean.exp()
    std_with_noise2 = (scan_y.variance.reshape(-1) + noise_var2.reshape(-1)).sqrt()
    plt.fill_between(scan_x.numpy().reshape(-1), (scan_y.mean.reshape(-1) - 2 * std_with_noise2.reshape(-1)).numpy(),
                     (scan_y.mean.reshape(-1) + 2 * std_with_noise2.reshape(-1)).numpy(), alpha=0.2)
    
    plt.scatter(X_train, y_train)
    plt.legend(['posterior mean', 'posterior confidence', 'posterior confidence with noise', 'posterior confidence with noise2',
                'posterior confidence with noise3', 'posterior confidence with noise4' , 'observed data'])

[figure: SingleTaskGP posterior plot produced by the first code block]

[figure: HeteroskedasticSingleTaskGP posterior plot produced by the second code block, comparing the four noise-added confidence bands]

system info

  • botorch==0.5.0
  • gpytorch==1.5.0
  • torch==1.9.0
@saitcakmak (Contributor)

Hi @mklpr. There's a known bug in the noise model of HeteroskedasticSingleTaskGP, see #861. That may explain why you're running into this issue.

I haven't had time to look closely into the methods you're trying, but if I'm not mistaken, the issue in #861 is that the noise model is trained on log-transformed targets, which never get untransformed. So what you do with the 4th option may be the correct way around the bug. In addition, the bug may also cause issues during hyper-parameter training, so I'd not recommend using the packaged HeteroskedasticSingleTaskGP model right now. There are some fixes proposed in #861 that you could implement locally to get around the issue.

@mklpr (Author) commented Sep 8, 2021

Hi @saitcakmak, thanks for the helpful comment. Let's set the known bug aside for now; I have some other questions:

  1. Ignoring the overall model and how well it is fitted, and considering only the fitted noise model inside it: why does noise_model.posterior(t_X, observation_noise=False).mean differ from noise_model(t_X).mean.exp()? In my understanding, the posterior method differs from model.forward only by the added outcome untransform, so where does the difference come from?
# test code mainly from issue #861

import torch
from botorch import fit_gpytorch_model
from botorch.models.gp_regression import HeteroskedasticSingleTaskGP
from gpytorch import ExactMarginalLogLikelihood


torch.manual_seed(1)
t_X = torch.rand(10, 2)
t_Y_var = torch.ones(10, 1) * 10

model = HeteroskedasticSingleTaskGP(
    train_X=t_X,
    train_Y=torch.randn(10, 1),
    train_Yvar=t_Y_var,
)
mll = ExactMarginalLogLikelihood(model.likelihood, model)
fit_gpytorch_model(mll)

noise_model = model.likelihood.noise_covar.noise_model
noise_mean_predict1 = noise_model.posterior(t_X, observation_noise=False).mean
noise_mean_predict2 = noise_model(t_X).mean.exp()

print(noise_mean_predict1)
print(noise_mean_predict2)

# print output
tensor([[1.6492],
        [2.1872],
        [0.5681],
        [2.0095],
        [3.5275],
        [3.9483],
        [2.1735],
        [3.1278],
        [3.8957],
        [0.9924]], grad_fn=<ExpBackward>)
tensor([0.2114, 0.2991, 0.0394, 0.2630, 0.7618, 0.7673, 0.3464, 0.5275, 0.8492,
        0.0753], grad_fn=<ExpBackward>)
  2. The paper http://people.csail.mit.edu/kersting/papers/kersting07icml_mlHetGP.pdf seems to fit the noise model and the target model separately, with a convergence-based iterative process, whereas BoTorch fits the two models jointly and only once. Could you share some design considerations or practical experience? Also, I don't clearly understand what step 4 in the paper really does: does it mean using the predicted noise from G2 as train_Yvar to fit a FixedNoiseGP? I'd appreciate any tips on implementing the paper's method. Thanks.

[figure: screenshot of the algorithm steps from the Kersting et al. paper]

  3. Since the problem comes from the log transform: why should we model the log of the noise instead of modeling the noise itself directly? Intuitively, the log transform smooths large noise values but at the same time blows up noise near zero. What are the main benefits of the log transform?

@saitcakmak (Contributor)

Hi @mklpr.

why noise_model.posterior(t_X, observation_noise=False).mean differs from noise_model(t_X).mean.exp()

The reason is that mean.exp() is not the correct way to transform the posterior mean (I missed this at first as well). This is the difference between E[exp(posterior)] and exp(E[posterior]). Since exp() is not a linear operator, it does not commute with the expectation. The Log() transform uses a helper method to get the correct untransformed posterior mean.

def norm_to_lognorm_mean(mu: Tensor, var: Tensor) -> Tensor:
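For example, a minimal sketch of the distinction (assuming the norm_to_lognorm_mean helper is importable from botorch.models.transforms.utils):

import torch
from botorch.models.transforms.utils import norm_to_lognorm_mean

# If X ~ N(mu, var), then E[exp(X)] = exp(mu + var / 2), which is not exp(mu) = exp(E[X]).
mu = torch.tensor([0.0, 1.0], dtype=torch.double)
var = torch.tensor([0.5, 2.0], dtype=torch.double)

print(mu.exp())                       # exp(E[X])  -- what .mean.exp() computes
print(norm_to_lognorm_mean(mu, var))  # E[exp(X)]  -- what the Log transform's untransform uses
print((mu + 0.5 * var).exp())         # same quantity, written out explicitly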

BoTorch fits the two models jointly and only once; could you share some design considerations or practical experience?

From a theoretical perspective, the reason for fitting them jointly is that the solution you obtain by first solving y* = max_y f(x', y) for some x' and then solving x* = max_x f(x, y*) is generally worse than the solution of max_{x, y} f(x, y) (f here would be the MLL). I haven't read the paper, but it seems that it tries to improve the first version by iterating several times. BoTorch uses the second approach of maximizing the MLL over the two models' parameters jointly, which can in theory give you the best model fit. In practice, we only solve this problem to a local optimum, so the best model fit depends on the convexity. That's the theoretical motivation. From a computational perspective, I think it is more efficient as well, since you do not need to compute the MLL separately for each model and iterate.
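In symbols (my paraphrase of the argument above, with $f$ the MLL viewed as a function of the two models' hyperparameters $x$ and $y$):

$$
y^* = \arg\max_{y} f(x', y), \qquad x^* = \arg\max_{x} f(x, y^*) \;\;\Longrightarrow\;\; f(x^*, y^*) \;\le\; \max_{x,\,y} f(x, y)
$$

for any fixed $x'$; iterating the two single-variable maximizations (as the paper does) narrows this gap but need not close it in general.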

why should we model the log of the noise instead of modeling the noise itself directly

I don't know much about why one is used vs the other. I think you have the right intuition (I have the same intuition), and which one works better will ultimately depend on the particular problem instance. If you don't want to use the log transform (and want to avoid the bug around it), you can subclass HeteroskedasticSingleTaskGP and remove the transform from the noise model. I think it may even work better if the noise is relatively smooth.

@mklpr (Author) commented Sep 15, 2021

Hi @saitcakmak,
I tried removing the log transform in the noise model by modifying HeteroskedasticSingleTaskGP, but I still ran into problems, e.g. large negative predicted noise variances even though the noise model's training targets are all positive.

In my tests, iterative fitting is sufficient in practice, so I implemented an IterativeHeteroskedasticSingleTaskGP model for convenience; see the model source code, the Gaussian process regression demo, and the Bayesian optimization demo. I hope it helps anyone who wants to use a heteroskedastic model.
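For reference, a minimal sketch of the general iterative scheme (just an illustration, not the full implementation linked above; fit_iterative_heteroskedastic and n_iter are placeholder names, and step 4 follows my reading of the paper):

import torch
from botorch import fit_gpytorch_model
from botorch.models import FixedNoiseGP, SingleTaskGP
from gpytorch.mlls import ExactMarginalLogLikelihood

def fit_iterative_heteroskedastic(train_X, train_Y, n_iter=5):
    # Step 1: homoskedastic fit to get initial residuals.
    model = SingleTaskGP(train_X, train_Y)
    fit_gpytorch_model(ExactMarginalLogLikelihood(model.likelihood, model))
    noise_model = None
    for _ in range(n_iter):
        with torch.no_grad():
            # Step 2: empirical noise targets = squared residuals at the training points.
            resid_var = (model.posterior(train_X).mean - train_Y).pow(2)
        # Step 3: fit a GP to the log residual variances so predictions map back to positive values.
        noise_model = SingleTaskGP(train_X, resid_var.clamp_min(1e-6).log())
        fit_gpytorch_model(ExactMarginalLogLikelihood(noise_model.likelihood, noise_model))
        with torch.no_grad():
            post = noise_model.posterior(train_X)
            # Lognormal mean E[exp(.)] = exp(mu + var/2), per the discussion above.
            pred_var = (post.mean + 0.5 * post.variance).exp()
        # Step 4 (my reading of the paper): refit the main model with the
        # predicted noise as fixed observation noise.
        model = FixedNoiseGP(train_X, train_Y, train_Yvar=pred_var)
        fit_gpytorch_model(ExactMarginalLogLikelihood(model.likelihood, model))
    return model, noise_model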

[figures: gp_observed_noise_var, gp_predict_noise_var, bayesian_optimization]

@Balandat (Contributor)

the paper seems to fit the noise model and the target model separately, with a convergence-based iterative process, whereas BoTorch fits the two models jointly and only once

The main difference is that HeteroskedasticSingleTaskGP takes variance observations, whereas the most likely heteroskedastic GP from the paper does not and tries to infer them entirely from the data. So the iterative process is necessary there, whereas for HeteroskedasticSingleTaskGP we can fit jointly.

why should we model the log of the noise instead of modeling the noise itself directly

The main reason is that there is no guarantee that a GP fit on non-negative data will produce non-negative predictions (as you seem to have found out yourself). Using a log transform is one straightforward way of dealing with this.

In my tests, iterative fitting is sufficient in practice, so I implemented an IterativeHeteroskedasticSingleTaskGP model for convenience

We had a PR for this a long time ago that never got wrapped up: #250. @jelena-markovic worked on updating it and we have an internal version. It still needs some more work, but we could probably put it out as a PR if that would be helpful.

@saitcakmak (Contributor)

Closing this since the bug is being tracked in #861
