From 6cef99940e28eec8244d0637c29b58f95bea2c44 Mon Sep 17 00:00:00 2001
From: Elizabeth Santorella
Date: Sun, 11 Aug 2024 12:58:01 -0700
Subject: [PATCH] Speed up Warped GP tutorial by only running one replication and not comparing against other methods (#2462)

Summary:
Pull Request resolved: https://github.com/pytorch/botorch/pull/2462

Context: This tutorial has been taking too long to run. Also, a tutorial doesn't need to serve both as a demonstration that the method works better than other methods (in a statistically significant way) and as a demonstration of how to use it.

This PR:
* Only runs one replication, rather than 3. (Putting a CI on 3 data points is a little silly anyway.)
* Removes the comparison methods, Sobol and qNEI with a non-warped GP.
* Uses qLogNEI instead of qNEI
* Uses SingleTaskGP instead of deprecated FixedNoiseGP
* No longer manually specifies outcome transform (building on #2458)
* Makes copy edits

Differential Revision: D61054473
---
 tutorials/bo_with_warped_gp.ipynb | 264 +++++++++---------------------
 1 file changed, 77 insertions(+), 187 deletions(-)

diff --git a/tutorials/bo_with_warped_gp.ipynb b/tutorials/bo_with_warped_gp.ipynb index 75c4cc83fa..9d2bd2c971 100644 --- a/tutorials/bo_with_warped_gp.ipynb +++ b/tutorials/bo_with_warped_gp.ipynb @@ -6,9 +6,9 @@ "source": [ "## BO with Warped Gaussian Processes\n", "\n", - "In this tutorial, we illustrate how to use learned input warping functions for robust bayesian optimization when the outcome may be non-stationary functions. When the lenglescales are non-stationarity in the raw input space, learning a warping function that maps raw inputs to a warped space where the lengthscales are stationary can be useful because then standard stationary kernels can be used for to effectively model the function.\n", + "In this tutorial, we illustrate how to use learned input warping functions for robust Bayesian Optimization when the outcome may be non-stationary functions. 
When the lengthscales are non-stationarity in the raw input space, learning a warping function that maps raw inputs to a warped space where the lengthscales are stationary can be useful, because then standard stationary kernels can be used to effectively model the function.\n", "\n", - "In general, we recommend for a relatively simple setup (like this one) to use [Ax](https://ax.dev), since this will simplify your setup (including the amount of code you need to write) considerably. See the [Using BoTorch with Ax](./custom_botorch_model_in_ax) tutorial. To use input warping with `MODULAR_BOTORCH`, we can pass the `warp_tf`, constructed as below, by adding `input_transform=warp_tf` argument to the `Surrogate(...)` call. \n", + "In general, for a relatively simple setup (like this one), we recommend using [Ax](https://ax.dev), since this will simplify your setup (including the amount of code you need to write) considerably. See the [Using BoTorch with Ax](./custom_botorch_model_in_ax) tutorial. To use input warping with `MODULAR_BOTORCH`, we can pass the `warp_tf`, constructed as below, by adding `input_transform=warp_tf` argument to the `Surrogate(...)` call. \n", "\n", "We consider use a Kumaraswamy CDF as the class of input warping function and learn the concentration parameters ($a>0$ and $b>0$). Kumaraswamy CDFs are quite flexible and map inputs in [0, 1] to outputs in [0, 1]. This work follows the Beta CDF input warping proposed by Snoek et al., but replaces the Beta distribution Kumaraswamy distribution, which has a *differentiable* and closed-form CDF. \n", " \n", @@ -16,14 +16,14 @@ " \n", "This enables maximum likelihood (or maximum a posteriori) estimation of the CDF hyperparameters using gradient methods to maximize the likelihood (or posterior probability) jointly with the GP hyperparameters. (Snoek et al. use a fully Bayesian treatment of the CDF parameters). 
Each input dimension is transformed using a separate warping function.\n", "\n", - "We use the Noisy Expected Improvement (qNEI) acquisition function to optimize a synthetic Hartmann6 test function. The standard problem is\n", + "We use the Log Noisy Expected Improvement (qLogNEI) acquisition function to optimize a synthetic Hartmann6 test function. The standard problem is\n", "\n", "$$f(x) = -\\sum_{i=1}^4 \\alpha_i \\exp \\left( -\\sum_{j=1}^6 A_{ij} (x_j - P_{ij})^2 \\right)$$\n", "\n", "over $x \\in [0,1]^6$ (parameter values can be found in `botorch/test_functions/hartmann6.py`). For this demonstration,\n", "We first warp each input dimension through a different inverse Kumaraswamy CDF.\n", "\n", - "Since botorch assumes a maximization problem, we will attempt to maximize $-f(x)$ to achieve $\\max_{x} -f(x) = 3.32237$.\n", + "Since BoTorch assumes a maximization problem, we will attempt to maximize $-f(x)$ to achieve $\\max_{x} -f(x) = 3.32237$.\n", "\n", "[1] [J. Snoek, K. Swersky, R. S. Zemel, R. P. Adams. Input Warping for Bayesian Optimization of Non-Stationary Functions. Proceedings of the 31st International Conference on Machine Learning, PMLR 32(2):1674-1682, 2014.](http://proceedings.mlr.press/v32/snoek14.pdf)" ] @@ -126,7 +126,7 @@ "\n", "The models are initialized with 14 points in $[0,1]^6$ drawn from a scrambled sobol sequence.\n", "\n", - "We add observe the objectives with additive Gaussian noise with a standard deviation of 0.05." + "We observe the objectives with additive Gaussian noise with a standard deviation of 0.05." 
] }, { @@ -135,7 +135,7 @@ "metadata": {}, "outputs": [], "source": [ - "from botorch.models import FixedNoiseGP\n", + "from botorch.models import SingleTaskGP\n", "from gpytorch.mlls.sum_marginal_log_likelihood import ExactMarginalLogLikelihood\n", "from botorch.utils.sampling import draw_sobol_samples\n", "\n", @@ -145,16 +145,15 @@ "bounds = torch.tensor([[0.0] * 6, [1.0] * 6], device=device, dtype=dtype)\n", "\n", "\n", - "def generate_initial_data(n=14):\n", - " # generate training data\n", - " train_x = draw_sobol_samples(\n", - " bounds=bounds, n=n, q=1, seed=torch.randint(0, 10000, (1,)).item()\n", - " ).squeeze(1)\n", - " exact_obj = obj(train_x).unsqueeze(-1) # add output dimension\n", + "n = 14\n", + "# generate initial training data\n", + "train_x = draw_sobol_samples(\n", + " bounds=bounds, n=n, q=1, seed=torch.randint(0, 10000, (1,)).item()\n", + ").squeeze(1)\n", + "exact_obj = obj(train_x).unsqueeze(-1) # add output dimension\n", "\n", - " best_observed_value = exact_obj.max().item()\n", - " train_obj = exact_obj + NOISE_SE * torch.randn_like(exact_obj)\n", - " return train_x, train_obj, best_observed_value" + "best_observed_value = exact_obj.max().item()\n", + "train_obj = exact_obj + NOISE_SE * torch.randn_like(exact_obj)" ] }, { @@ -171,28 +170,24 @@ "metadata": {}, "outputs": [], "source": [ - "from botorch.utils.transforms import standardize\n", "from botorch.models.transforms.input import Warp\n", "from gpytorch.priors.torch_priors import LogNormalPrior\n", "\n", "\n", - "def initialize_model(train_x, train_obj, use_input_warping):\n", - " if use_input_warping:\n", - " # initialize input_warping transformation\n", - " warp_tf = Warp(\n", - " indices=list(range(train_x.shape[-1])),\n", - " # use a prior with median at 1.\n", - " # when a=1 and b=1, the Kumaraswamy CDF is the identity function\n", - " concentration1_prior=LogNormalPrior(0.0, 0.75**0.5),\n", - " concentration0_prior=LogNormalPrior(0.0, 0.75**0.5),\n", - " )\n", - " else:\n", - " 
warp_tf = None\n", + "def initialize_model(train_x, train_obj):\n", + " # initialize input_warping transformation\n", + " warp_tf = Warp(\n", + " indices=list(range(train_x.shape[-1])),\n", + " # use a prior with median at 1.\n", + " # when a=1 and b=1, the Kumaraswamy CDF is the identity function\n", + " concentration1_prior=LogNormalPrior(0.0, 0.75**0.5),\n", + " concentration0_prior=LogNormalPrior(0.0, 0.75**0.5),\n", + " )\n", " # define the model for objective\n", - " model = FixedNoiseGP(\n", - " train_x,\n", - " standardize(train_obj),\n", - " train_yvar.expand_as(train_obj),\n", + " model = SingleTaskGP(\n", + " train_X=train_x,\n", + " train_Y=train_obj,\n", + " train_Yvar=train_yvar.expand_as(train_obj),\n", " input_transform=warp_tf,\n", " ).to(train_x)\n", " mll = ExactMarginalLogLikelihood(model.likelihood, model)\n", @@ -235,17 +230,20 @@ " new_x = candidates.detach()\n", " exact_obj = obj(new_x).unsqueeze(-1) # add output dimension\n", " train_obj = exact_obj + NOISE_SE * torch.randn_like(exact_obj)\n", - " return new_x, train_obj\n", - "\n", + " return new_x, train_obj\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Perform Bayesian Optimization\n", + "The Bayesian optimization loop iterates the following steps:\n", + "1. given a surrogate model, choose a candidate point $x$\n", + "2. observe $f(x)$\n", + "3. update the surrogate model. \n", "\n", - "def update_random_observations(best_random):\n", - " \"\"\"Simulates a quasi-random policy by taking a the current list of best values observed randomly,\n", - " drawing a new random point, observing its value, and updating the list.\n", - " \"\"\"\n", - " rand_x = draw_sobol_samples(bounds=bounds, n=1, q=1).squeeze(1)\n", - " next_random_best = obj(rand_x).max().item()\n", - " best_random.append(max(best_random[-1], next_random_best))\n", - " return best_random" + "We do `N_BATCH=50` rounds of optimization." 
] }, { @@ -257,137 +255,48 @@ "name": "stdout", "output_type": "stream", "text": [ - "\n", - "Trial 1 of 3 ..................................................\n", - "Trial 2 of 3 ..................................................\n", - "Trial 3 of 3 .................................................." + ".................................................." ] } ], "source": [ "from botorch import fit_gpytorch_mll\n", - "from botorch.acquisition.monte_carlo import qNoisyExpectedImprovement\n", + "from botorch.acquisition.logei import qLogNoisyExpectedImprovement\n", "from botorch.exceptions import BadInitialCandidatesWarning\n", "\n", - "import time\n", "import warnings\n", "\n", - "\n", "warnings.filterwarnings(\"ignore\", category=BadInitialCandidatesWarning)\n", "warnings.filterwarnings(\"ignore\", category=RuntimeWarning)\n", "\n", - "\n", - "N_TRIALS = 3 if not SMOKE_TEST else 2\n", "N_BATCH = 50 if not SMOKE_TEST else 5\n", "\n", - "verbose = False\n", - "\n", - "best_observed_all_ei, best_observed_all_warp, best_random_all = [], [], []\n", - "\n", "torch.manual_seed(0)\n", "\n", + "best_observed = [best_observed_value]\n", + "mll, model = initialize_model(train_x, train_obj)\n", "\n", - "# average over multiple trials\n", - "for trial in range(1, N_TRIALS + 1):\n", + "# run N_BATCH rounds of BayesOpt after the initial random batch\n", + "for iteration in range(1, N_BATCH + 1):\n", "\n", - " print(f\"\\nTrial {trial:>2} of {N_TRIALS} \", end=\"\")\n", - " best_observed_ei, best_observed_warp, best_random = [], [], []\n", + " # fit the models\n", + " fit_gpytorch_mll(mll)\n", + " ei = qLogNoisyExpectedImprovement(model=model, X_baseline=train_x)\n", "\n", - " # call helper functions to generate initial training data and initialize model\n", - " train_x_ei, train_obj_ei, best_observed_value_ei = generate_initial_data(n=14)\n", - " mll_ei, model_ei = initialize_model(\n", - " train_x_ei, train_obj_ei, use_input_warping=False\n", - " )\n", + " # optimize and get new 
observation\n", + " new_x, new_obj = optimize_acqf_and_get_observation(ei)\n", "\n", - " train_x_warp, train_obj_warp, = (\n", - " train_x_ei,\n", - " train_obj_ei,\n", - " )\n", - " best_observed_value_warp = best_observed_value_ei\n", - " # use input warping\n", - " mll_warp, model_warp = initialize_model(\n", - " train_x_warp, train_obj_warp, use_input_warping=True\n", - " )\n", + " # update training points\n", + " train_x = torch.cat([train_x, new_x])\n", + " train_obj = torch.cat([train_obj, new_obj])\n", "\n", - " best_observed_ei.append(best_observed_value_ei)\n", - " best_observed_warp.append(best_observed_value_warp)\n", - " best_random.append(best_observed_value_ei)\n", - "\n", - " # run N_BATCH rounds of BayesOpt after the initial random batch\n", - " for iteration in range(1, N_BATCH + 1):\n", - "\n", - " t0 = time.monotonic()\n", - "\n", - " # fit the models\n", - " fit_gpytorch_mll(mll_ei)\n", - " fit_gpytorch_mll(mll_warp)\n", - "\n", - " ei = qNoisyExpectedImprovement(\n", - " model=model_ei,\n", - " X_baseline=train_x_ei,\n", - " )\n", - "\n", - " ei_warp = qNoisyExpectedImprovement(\n", - " model=model_warp,\n", - " X_baseline=train_x_warp,\n", - " )\n", - "\n", - " # optimize and get new observation\n", - " new_x_ei, new_obj_ei = optimize_acqf_and_get_observation(ei)\n", - " new_x_warp, new_obj_warp = optimize_acqf_and_get_observation(ei_warp)\n", - "\n", - " # update training points\n", - " train_x_ei = torch.cat([train_x_ei, new_x_ei])\n", - " train_obj_ei = torch.cat([train_obj_ei, new_obj_ei])\n", - "\n", - " train_x_warp = torch.cat([train_x_warp, new_x_warp])\n", - " train_obj_warp = torch.cat([train_obj_warp, new_obj_warp])\n", - "\n", - " # update progress\n", - " best_random = update_random_observations(best_random)\n", - " best_value_ei = obj(train_x_ei).max().item()\n", - " best_value_warp = obj(train_x_warp).max().item()\n", - " best_observed_ei.append(best_value_ei)\n", - " best_observed_warp.append(best_value_warp)\n", - "\n", - " 
mll_ei, model_ei = initialize_model(\n", - " train_x_ei, train_obj_ei, use_input_warping=False\n", - " )\n", - " mll_warp, model_warp = initialize_model(\n", - " train_x_warp, train_obj_warp, use_input_warping=True\n", - " )\n", - "\n", - " t1 = time.monotonic()\n", - "\n", - " if verbose:\n", - " print(\n", - " f\"\\nBatch {iteration:>2}: best_value (random, ei, ei_warp) = \"\n", - " f\"({max(best_random):>4.2f}, {best_value_ei:>4.2f}, {best_value_warp:>4.2f}), \"\n", - " f\"time = {t1-t0:>4.2f}.\",\n", - " end=\"\",\n", - " )\n", - " else:\n", - " print(\".\", end=\"\")\n", - "\n", - " best_observed_all_ei.append(best_observed_ei)\n", - " best_observed_all_warp.append(best_observed_warp)\n", - " best_random_all.append(best_random)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Perform Bayesian Optimization\n", - "The Bayesian optimization \"loop\" simply iterates the following steps:\n", - "1. given a surrogate model, choose a candidate point\n", - "2. observe $f(x)$ for each $x$ in the batch \n", - "3. update the surrogate model. \n", + " # update progress\n", + " best_value = obj(train_x).max().item()\n", + " best_observed.append(best_value)\n", "\n", + " mll, model = initialize_model(train_x, train_obj)\n", "\n", - "Just for illustration purposes, we run three trials each of which do `N_BATCH=50` rounds of optimization.\n", - "\n", - "*Note*: Running this may take a little while." + " print(\".\", end=\"\")" ] }, { @@ -395,7 +304,7 @@ "metadata": {}, "source": [ "#### Plot the results\n", - "The plot below shows the log regret at each step of the optimization for each of the algorithms. The confidence intervals represent the variance at that step in the optimization across the trial runs. In order to get a better estimate of the average performance early on, one would have to run a much larger number of trials `N_TRIALS` (we avoid this here to limit the runtime of this tutorial). 
" + "The plot below shows the log regret at each step of the optimization for each of the algorithms." ] }, { @@ -409,15 +318,15 @@ "" ] }, - "execution_count": 8, + "execution_count": 9, "metadata": {}, "output_type": "execute_result" }, { "data": { - "image/png": "\n", + "image/png": "", "text/plain": [ - "
" + "
" ] }, "metadata": { @@ -433,49 +342,30 @@ "%matplotlib inline\n", "\n", "\n", - "def ci(y):\n", - " return 1.96 * y.std(axis=0) / np.sqrt(N_TRIALS)\n", - "\n", - "\n", "GLOBAL_MAXIMUM = neg_hartmann6.optimal_value\n", "\n", - "\n", "iters = np.arange(N_BATCH + 1)\n", - "y_ei = np.log10(GLOBAL_MAXIMUM - np.asarray(best_observed_all_ei))\n", - "y_ei_warp = np.log10(GLOBAL_MAXIMUM - np.asarray(best_observed_all_warp))\n", - "y_rnd = np.log10(GLOBAL_MAXIMUM - np.asarray(best_random_all))\n", + "y_ei = np.log10(GLOBAL_MAXIMUM - np.asarray(best_observed))\n", "\n", "fig, ax = plt.subplots(1, 1, figsize=(8, 6))\n", - "ax.errorbar(\n", - " iters,\n", - " y_rnd.mean(axis=0),\n", - " yerr=ci(y_rnd),\n", - " label=\"Sobol\",\n", - " linewidth=1.5,\n", - " capsize=3,\n", - " alpha=0.6,\n", - ")\n", - "ax.errorbar(\n", - " iters,\n", - " y_ei.mean(axis=0),\n", - " yerr=ci(y_ei),\n", - " label=\"NEI\",\n", - " linewidth=1.5,\n", - " capsize=3,\n", - " alpha=0.6,\n", - ")\n", - "ax.errorbar(\n", + "\n", + "ax.plot(\n", " iters,\n", - " y_ei_warp.mean(axis=0),\n", - " yerr=ci(y_ei_warp),\n", - " label=\"NEI + Input Warping\",\n", + " y_ei,\n", " linewidth=1.5,\n", - " capsize=3,\n", " alpha=0.6,\n", ")\n", - "ax.set(xlabel=\"number of observations (beyond initial points)\", ylabel=\"Log10 Regret\")\n", - "ax.legend(loc=\"lower left\")" + "\n", + "ax.set_xlabel(\"number of observations (beyond initial points)\")\n", + "ax.set_ylabel(\"Log10 Regret\")" ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": {