Review #1 #9
We want to thank the reviewer for reviewing the article in such depth and providing actionable improvements. We will address the specific comments below.
Our article earlier had a modeling error that was pointed out in the review by Jasper Snoek. Moving forward with the suggestion in that review, the updated article's BO framework performs much better. We have further expanded the section introducing acquisition functions, where we try to give the reader the core ideas behind acquisition functions before introducing them individually.
We were not able to understand the issue that the above comment talks about, and would request that the reviewer clarify the point raised.
We would like to thank the reviewer for noticing this. Yes, we want to convey why it is more beneficial to optimize the acquisition function instead of the earlier question. We have made the recommended corrections to the title of the question in the article. FROM: TO:
Upon having a similar discussion with one of our reviewers, we introduced a slide deck at the end of the discussion where we summarise BO in a few slides. We have further moved the interactive plot the comment talks about below the discussion where we introduce PI. FROM: TO:
We have updated the article to no longer have buggy labels.
We have updated the plots with Distill's template.
We have reduced the number of animations and added a slider to each of these animations for better control.
We have re-framed the introduction in the newer version of the article. FROM: In this acquisition function, the (t+1)-th query point, x_{t+1}, is selected according to the equation below. TO:
We have added the explanation regarding our claim "It has a low overhead of setting up", in the newer version of the article. FROM: TO:
Our GP surrogate was not able to model the ground truth effectively due to the reason pointed out in the review by Jasper Snoek. After updating the article with his modeling suggestions, we no longer have the above issue, and using the Thompson sampling acquisition function we are able to reach the global maximum with ease.
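As an aside, a single Thompson-sampling step can be sketched as drawing one function sample from the surrogate posterior over a candidate grid and querying where that sample is largest. This is an illustrative sketch, not the article's code; it assumes the GP posterior has already been reduced to a mean vector and covariance matrix over the grid, and all names are hypothetical:

```python
import numpy as np

def thompson_step(x_grid, post_mean, post_cov, rng):
    """One Thompson-sampling iteration: sample a function from the
    surrogate posterior (assumed multivariate Gaussian over x_grid)
    and propose the point where that sample attains its maximum."""
    f_sample = rng.multivariate_normal(post_mean, post_cov)
    return x_grid[int(np.argmax(f_sample))]
```

Because each iteration queries the argmax of a random posterior draw, exploration comes entirely from posterior uncertainty — which is why an overconfident (misspecified) GP can stall, as the review notes.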
We have made the suggested change in the updated article.
We no longer have changing colormaps for each iteration.
As mentioned above, we have reduced the number of animations and added a slider to each of these animations for better control.
We thank the reviewer for taking the time to notice these minute details. We have updated the legends accordingly.
We have modified the description, which now gets directly to the main point we want to highlight. Below are the exact changes made.
FROM: Older problem - Earlier in the active learning problem, our motivation for drilling at locations was to predict the distribution of the gold content over all the locations in the one-dimensional line. We, therefore, had chosen the next location to drill where we had maximum uncertainty about our estimate. In this problem, we are instead interested to know the location at which we find the maximum gold. To get the location of maximum gold content, we might want to drill at the location where predicted mean is the highest, i.e. to exploit. But unfortunately our mean is not always accurate, so we need to correct our mean which can be done by reducing variance or exploration. Bayesian Optimization looks at both exploitation and exploration, whereas in the case of Active Learning Problem, we only cared about exploration.
TO: Given the fact that we are only interested in knowing the location where the maximum occurs, it might be a good idea to evaluate at locations where our surrogate model's predicted mean is the highest, i.e. to exploit. But unfortunately, our model mean is not always accurate (since we have limited observations), so we need to correct our model, which can be done by reducing variance or exploration. BO looks at both exploitation and exploration, whereas in the case of active learning, we only cared about exploration.
We no longer mention that acquisition functions are a function of mean and variance as done in the earlier article. That description limits the acquisition functional space and forces any acquisition function to be of the form g(mean(x), uncertainty(x)) which isn’t entirely true. We now have a discussion where we point out that acquisition functions are essentially a sequence of inexpensive optimizations focusing on three core ideas: i) they are a function of the surrogate posterior; ii) they combine exploration and exploitation, and iii) they are inexpensive to evaluate.
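To illustrate the three core ideas above (this is a sketch, not code from the article), a minimal acquisition function such as UCB is a cheap function of the surrogate posterior at candidate points, with an explicit exploitation term and exploration bonus; the `kappa` weight here is an illustrative choice:

```python
import numpy as np

def upper_confidence_bound(mu, sigma, kappa=2.0):
    """Toy UCB acquisition: exploitation (mu) plus an exploration
    bonus (kappa * sigma), evaluated cheaply from the surrogate
    posterior at candidate points."""
    return np.asarray(mu) + kappa * np.asarray(sigma)

def next_query(x_grid, mu, sigma, kappa=2.0):
    # The next query is simply the argmax over a candidate grid --
    # an inexpensive inner optimization.
    return x_grid[np.argmax(upper_confidence_bound(mu, sigma, kappa))]
```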
Based on the feedback received from one of the other reviews, we condensed some sections and the above issues are no longer present in the article.
We would like to thank the reviewer for the above comment regarding the terminology. We looked up various sources and understood that values of the CDF of a Gaussian distribution are calculated using pre-computed values of the error function, which are themselves calculated via a Taylor expansion. The Taylor expansion (an infinite sum of terms) always converges for the error function. Since convergent infinite sums are considered analytical expressions, we have updated the article to reflect this.
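The relationship described above can be checked directly: the standard normal CDF is conventionally expressed through the error function. A minimal sketch using only the standard library:

```python
import math

def gaussian_cdf(x):
    # Standard normal CDF via the error function:
    # Phi(x) = (1/2) * (1 + erf(x / sqrt(2)))
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
```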
Yes, we did mean to refer "posterior mean" instead of the phrase used above. We have updated the article based on this suggestion.
Thanks a lot for pointing this out. We had indeed used this qualifier too many times, and have removed it from almost all the cases where we thought it wasn't necessary.
We would like to thank the reviewer for the detailed and actionable comments above. The article improved significantly as we moved forward with the suggestions.
The following peer review was solicited as part of the Distill review process.
The reviewer chose to waive anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.
Distill is grateful to Austin Huang for taking the time to review this article.
General Comments
Missing Tools for Reasoning
Acquisition functions are introduced from a definitional standpoint and their behavior is illustrated for a relatively artificial example. Sometimes the methods are shown to work, sometimes they don't. How does one think about implementation alternatives when working on a new problem? The article provides few conceptual tools for the reader to apply these methods successfully.
There are also serious issues with model misspecification beneath the surface of these implementations (see, for example, the Thompson Sampling discussion). However, the article doesn't even raise the topic: the discussion starts from a fixed model specification and anecdotally shows methods either working or not on a narrow example.
Relatedly, there's a section entitled "Why is it easier to optimize the acquisition function?" This framing may be misleading since "easiness" isn't the goal. The real question seems to be "Why is it beneficial to optimize the acquisition function?" or perhaps "Is it even beneficial to optimize with respect to an acquisition function?"
Does the Hero Plot Illustrate a Central Aspect of the Discussion?
An interactive visualization communicates a response function to the variables that can be affected by input. In the hero plot, this corresponds to the response of the acquisition function as a function of the epsilon hyperparameter in a PI acquisition function for fixed data and ground truth. It also shows the CDF for two slices of X (1.0 and 5.0), which are intermediate computations used by the acquisition function.
Is that particular relationship sufficiently central to the article to be front and center? There are other relationships that seem more central to the topic that could have been highlighted (how choices of acquisition function compare, how the acquisition function changes with data). The plot is nice to interact with for thinking about exploration/exploitation in PI, but it doesn't seem to be an obvious choice as the hero plot.
Minor visual issue - the vertical labels look buggy, with 0.00e+0 cutting through the axis line.
Grey backgrounds don't fit Distill's Template
The patch of grey rectangle background behind each figure doesn't fit the aesthetic of the Distill template. The convention in other articles seems to be white-on-white with no boundary, or occasionally a horizontal ribbon that runs the width of the page for visualizations with lots of margin content.
Animations are Overused
Note that in other Distill articles, animations are used sparingly, and usually just in the top figure or concluding figure.
Looping animations were overused and ultimately not a good way to illustrate a dependency relationship compared to a visual with a control.
Even if the content in those figures is kept as is, replacing the looping with a slider (cf. http://worrydream.com/LadderOfAbstraction/) would be an improvement: it is less distracting and allows the reader to examine relationships between iterations more carefully.
Introduction to EI is Confusing
Perhaps the framing using the unknown ground truth was the original motivation, but here it just makes the reasoning convoluted without adding much insight. I don't see any reason not to jump straight to the definition as described by the name, expected improvement (i.e. the 2nd equation).
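For reference, the closed-form definition being alluded to — the expected improvement over the incumbent under a Gaussian posterior — can be sketched as follows (illustrative code, not from the article; `f_best` is the best value observed so far):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """E[max(f(x) - f_best, 0)] under a Gaussian posterior N(mu, sigma^2).
    Closed form: (mu - f_best) * Phi(z) + sigma * phi(z),
    where z = (mu - f_best) / sigma."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.maximum(np.asarray(sigma, dtype=float), 1e-12)  # avoid /0
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)
```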
Thompson Sampling
""It has a low overhead of setting up."" - not sure why this is specifically pointed out in the case of TS, is overhead any lower to set up than the other acquisition functions?
The statement that ""This will ensure an exploratory behaviour."" is contradicted by the animation demonstration that follows. From that demo's figures, it would actually seem nearly impossible to reach the global minimma without refining the underlying GP model - there's not enough noise in the function distribution to adequately explore. However the example is simply left without further comment.
Hyperparameter Tuning - Axis Labels
Using the horizontal label "# of Hyper-Parameters Tested" is a confusing label description, since it doesn't really refer to the # of hyper-parameters tested, but rather the # of values that have been evaluated.
Hyperparameter Tuning - Changing colormap scale makes it impossible to track the function evolution
The colormaps should probably not rescale with each iteration - it makes it very difficult to track the evolution of the acquisition function between frames.
As mentioned above, replacing all or most animations with a slider control would also improve the legibility of the figure.
Legend tweaks
"# Minor Writing Improvements
Concluding Comments
Bayesian optimization and active learning aren't particularly popular to write about currently. However, I suspect there's quite a bit of interest in the topic, particularly in industry and applied machine learning contexts.
Given that, this article does contribute to a notable gap in the research distillation space. However, I think more work needs to be put into this manuscript to raise the quality of communication to be comparable to other Distill articles.
Distill employs a reviewer worksheet as a help for reviewers.
The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better", please read the explanations for our expectations for each score—we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.
Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest
What type of contributions does this article make?: Explanation of existing results