BQL function reference describes a normalization constant without info on how to compute it #76
Comments
Related: |
It's true that the current documentation is problematic. Probability density values are tricky to describe: the phrase quoted above from the BQL documentation is poorly worded and borderline incorrect. It tries to explain the concept of a probability density value in layman's terms, but ends up giving the (incorrect) impressions that: (i) the returned probability density values are not "normalized";

Regarding (i): Every returned probability density value in BQL is correctly normalized, in the sense that it is the value of a non-negative function (the pdf, which integrates to unity) evaluated at some point in the support of the random variable.

Regarding (ii): There is no such normalizing constant that needs to be (or even could be) computed from the PDF values returned by
It's not possible in the general setting to map pdf values into [0,1] so that they are comparable across different distributions. One can compare probabilities of actual events, such as tail probabilities or variables taking values in a small neighborhood. However, BQL does not currently expose a cumulative distribution function (cdf) interface, so these would have to be estimated by simple Monte Carlo integration using samples from
To improve the documentation we should (i) remove mention of "normalizing constant", and (ii) note that PDF values can be thought of as "relative probabilities" in the sense that if
The design space is quite big here, especially for the CrossCat model class. Can discuss, but the scope will extend beyond this ticket. |
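As a rough illustration of the Monte Carlo estimate mentioned above, here is a minimal sketch in plain Python (not BQL). It assumes you already have a list of samples drawn from the model for a single numerical column; the standard normal draws below are only a stand-in for such samples, and the helper names `tail_probability` and `neighborhood_probability` are hypothetical, not part of BayesDB.

```python
import random


def tail_probability(samples, threshold):
    """Estimate P(X > threshold) as the fraction of samples above threshold."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(1 for x in samples if x > threshold) / len(samples)


def neighborhood_probability(samples, center, half_width):
    """Estimate P(|X - center| <= half_width) from the same samples."""
    if not samples:
        raise ValueError("need at least one sample")
    hits = sum(1 for x in samples if abs(x - center) <= half_width)
    return hits / len(samples)


# Stand-in for samples drawn from the model (assumption: standard normal).
rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

p_tail = tail_probability(samples, 1.96)
p_near = neighborhood_probability(samples, 0.0, 0.1)
print(p_tail, p_near)
```

Unlike raw density values, both estimates are genuine probabilities in [0, 1], so they can be compared across columns, at the cost of choosing a threshold or neighborhood width for each column.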
The BQL documentation ( http://probcomp.csail.mit.edu/dev/bayesdb/doc/bql.html ) states the following in the section for:
PROBABILITY DENSITY OF (<targets>) [GIVEN (<constraints>)]
:WARNING: The value this function returns is not a normalized probability in [0, 1], but rather a probability density with a normalization constant that is common to the column but may vary between columns. So it may take on values above 1.
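To see why a density can take on values above 1, here is a minimal sketch in plain Python (not BQL), assuming a Gaussian with a small standard deviation as a stand-in for a column's distribution. The helpers `normal_pdf` and `trapezoid` are written for this illustration only. The density at the mean is well above 1, yet the area under the curve is still 1.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def trapezoid(f, a, b, n):
    """Composite trapezoid rule for the integral of f over [a, b]."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


sigma = 0.1
peak = normal_pdf(0.0, 0.0, sigma)  # roughly 3.99, i.e. above 1
area = trapezoid(lambda x: normal_pdf(x, 0.0, sigma), -1.0, 1.0, 10_000)
print(peak, area)
```

The integration range of plus or minus ten standard deviations is an arbitrary choice that captures essentially all of the mass.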
Presumably, this also applies to ESTIMATE PREDICTIVE PROBABILITY, which also returns values greater than one.

As a user, it is challenging to make use of the results of ESTIMATE PREDICTIVE PROBABILITY for the most typical use cases (e.g., ranking the most improbable data in a .csv file, including across multiple columns) unless there is more clarity about the right way to compute the normalization constant in BQL. I understand in theory what a PDF is; however, it wasn't obvious to me what the right way is to compute the normalization constant using a series of BQL expressions or other code so that the probability densities could be compared across columns.

I think this is a documentation issue that could affect many typical users. Some examples of solutions to this issue are:
PROBABILITY DENSITY and PREDICTIVE PROBABILITY.

For the moment, some basic information in the form of a reference or explanation about the normalization constant would be very helpful.