BQL function reference describes a normalization constant without info on how to compute it #76

Open
versar opened this issue Apr 30, 2018 · 2 comments


versar commented Apr 30, 2018

The BQL documentation ( http://probcomp.csail.mit.edu/dev/bayesdb/doc/bql.html ) states the following in the section for: PROBABILITY DENSITY OF (<targets>) [GIVEN (<constraints>)] :

WARNING: The value this function returns is not a normalized probability in [0, 1], but rather a probability density with a normalization constant that is common to the column but may vary between columns. So it may take on values above 1.

Presumably, this also applies to ESTIMATE PREDICTIVE PROBABILITY, which also returns values greater than one.

As a user, it is challenging to make use of the results of ESTIMATE PREDICTIVE PROBABILITY for the most typical use cases (e.g., ranking the most improbable data in a .csv file, including across multiple columns) unless there is more clarity about the right way to compute the normalization constant in BQL. I understand in theory what a PDF is; however, it wasn't obvious to me how to compute the normalization constant using a series of BQL expressions or other code so that the probability densities could be compared across columns.

I think this is a documentation issue that could affect many typical users. Some examples of solutions to this issue are:

  1. A link, reference, or brief explanation in the documentation of how to compute the normalization constant that is currently mentioned in the documentation (probably easiest/fastest).
  2. An example showing how to compute the constant through a sequence of BQL expressions, and therefore how to compare probability densities for variables in different columns. This could go into one of the tutorial notebooks if not the BQL or bayeslite function references directly.
  3. A feature that returns the normalization constant and/or normalized versions of PROBABILITY DENSITY and PREDICTIVE PROBABILITY.

For the moment, some basic information in the form of a reference or explanation about the normalization constant would be very helpful.

@fsaad
Collaborator

fsaad commented May 1, 2018

Related:
probcomp/bayeslite#114: show probabilities in decibels
probcomp/bayeslite#360: PROBABILITY OF should yield log densities, not densities

@fsaad
Collaborator

fsaad commented May 1, 2018

> I think this is a documentation issue that could affect many typical users.

It's true that the current documentation is problematic. Probability density values are tricky to describe: the phrase quoted above from the BQL documentation is poorly worded and borderline incorrect. It tries to explain the concept of a probability density value in layman's terms, but ends up giving the (incorrect) impressions that:

(i) the returned probability density values are not "normalized";
(ii) there is a normalization constant that needs to be computed separately.

Regarding (i): Every returned probability density value in BQL is correctly normalized, in the sense that it is a value of a non-negative function (the pdf, which integrates to unity) evaluated at some point in the support of the random variable.
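
To make this concrete, here is a minimal sketch in plain Python (not BQL, and not how bayeslite computes densities) using a narrow normal distribution as a stand-in. The helper names `normal_pdf` and `integrate` are illustrative assumptions. It shows a pdf whose peak value is well above 1 while its integral is still 1:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a Normal(mu, sigma) distribution evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(f, lo, hi, n=100_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

sigma = 0.1  # a narrow distribution, chosen only for illustration

# Density at the mode: 1 / (sigma * sqrt(2 * pi)), which exceeds 1 here.
peak = normal_pdf(0.0, sigma=sigma)

# Integrating over +/- 10 standard deviations captures essentially all mass.
area = integrate(lambda x: normal_pdf(x, sigma=sigma), -1.0, 1.0)

print("density at the mode:", peak)
print("integral of the pdf:", area)
```

The density value above 1 is not a sign of a missing normalization step; it only reflects that the probability mass is concentrated on an interval much shorter than 1 unit.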

Regarding (ii): There is no such normalizing constant that needs to be (or even could be) computed from the PDF values returned by the PROBABILITY DENSITY (or PREDICTIVE PROBABILITY, which macro-expands into the former) BQL estimators.

> normalized versions of PROBABILITY DENSITY and PREDICTIVE PROBABILITY.

It's not possible in the general setting to map pdf values into [0,1] so that they are comparable across different distributions. One can compare probabilities of actual events, such as tail probabilities or variables taking values in a small neighborhood. However, BQL does not currently expose a cumulative distribution function (cdf) interface, so these would have to be estimated by simple Monte Carlo integration using samples from SIMULATE ... GIVEN.
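
As a rough sketch of that Monte Carlo approach, assuming the simulated values for one numerical column have already been fetched into a Python list: the function names `tail_probability` and `neighborhood_probability` are hypothetical, and the standard-normal draws below are only a stand-in for rows returned by a SIMULATE ... GIVEN query.

```python
import random

def tail_probability(samples, threshold):
    """Estimate P(X > threshold) as the fraction of samples above threshold."""
    return sum(1 for s in samples if s > threshold) / len(samples)

def neighborhood_probability(samples, center, radius):
    """Estimate P(|X - center| <= radius) as the fraction of samples in the ball."""
    return sum(1 for s in samples if abs(s - center) <= radius) / len(samples)

# Stand-in for simulated values of one column (assumption: in practice these
# would come from the result set of a SIMULATE ... GIVEN query).
rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

# For a standard normal the exact values are about 0.025 and 0.080.
p_tail = tail_probability(samples, 1.96)
p_near = neighborhood_probability(samples, 0.0, 0.1)

print("estimated P(X > 1.96):   ", p_tail)
print("estimated P(|X| <= 0.1): ", p_near)
```

Unlike raw density values, these estimates are genuine probabilities in [0, 1], so they can be compared across columns, at the cost of Monte Carlo error that shrinks roughly as one over the square root of the number of samples.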

> For the moment, some basic information in the form of a reference or explanation about the normalization constant would be very helpful.

To improve the documentation we should (i) remove mention of "normalizing constant", and (ii) note that PDF values can be thought of as "relative probabilities" in the sense that if pdf(x=a)/pdf(x=b) = 2, then, heuristically, x is "twice as likely" to take on value a as it is to take on value b. (Note this explanation is still quite slippery, however, as the intuition breaks down under nonlinear injective transforms of random variables, which rescale the density by a factor that varies from point to point.)
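
A small worked example of why the "relative probability" reading is slippery, written as a plain Python sketch with an Exponential(1) variable chosen purely for illustration (the function names are assumptions, and none of this is BQL). The density ratio between two points survives a linear rescaling Z = 3X, but changes under the nonlinear injective map Y = X^2, because the change-of-variables factor differs at the two points:

```python
import math

def pdf_x(x):
    """Density of X ~ Exponential(1), defined for x > 0."""
    return math.exp(-x)

def pdf_y(y):
    """Density of Y = X**2 via change of variables: pdf_x(sqrt(y)) / (2 * sqrt(y))."""
    root = math.sqrt(y)
    return pdf_x(root) / (2.0 * root)

def pdf_z(z):
    """Density of Z = 3 * X via change of variables: pdf_x(z / 3) / 3."""
    return pdf_x(z / 3.0) / 3.0

a, b = 1.0, 2.0

ratio_x = pdf_x(a) / pdf_x(b)            # equals e
ratio_z = pdf_z(3 * a) / pdf_z(3 * b)    # same points after rescaling: still e
ratio_y = pdf_y(a ** 2) / pdf_y(b ** 2)  # same points after squaring: 2 * e

print("ratio for X at (a, b):      ", ratio_x)
print("ratio for Z at (3a, 3b):    ", ratio_z)
print("ratio for Y at (a^2, b^2):  ", ratio_y)
```

The events are the same in all three cases (X near a versus X near b), yet the density ratio for Y is twice that for X. This is one reason ratios of densities are only a heuristic for "how much more likely", and why probabilities of actual events are the safer quantity to compare.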

> ranking the most improbable data in a .csv file -- including across multiple columns
> how to compare probability densities for variables in different columns

The design space is quite big here, especially for the CrossCat model class. Can discuss, but the scope will extend beyond this ticket.
