Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Binomial distribution #66

Merged
merged 3 commits into from
Mar 12, 2024
Merged

Binomial distribution #66

merged 3 commits into from
Mar 12, 2024

Conversation

eightysteele
Copy link
Contributor

Purpose

This PR adds support for the binomial distribution.

Log-likelihood

The log PMF for the binomial distribution gives the log probability of observing $k$ successes out of $n$ trials:

$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$

For computing that binomial coefficient, I used the existing log-gamma-fn. Otherwise for large values of $n$ and $k$ it would have overflown on factorials.

Testing

In distribution_test.clj I tested boundary conditions on $n$, $p$, and $k$ and verified some different properties of the distribution that should definitely hold (sum of probabilities, symmetry when $p=0$, mean and variance consistency). Both bb test:clj and bb test:cljs are passing for these.

I also spot checked a range of values against SciPy and Gen.jl (within a precision threshold of 1e-9 of expected values). One of theses tests is currently failing because the difference between expected and actual is right at the threshold boundary. If this is problematic, let me know and I can dig in!

      ;; TODO: failing test (off by 1.9e-9)
      ;; expected: (ish? -3.222306954272568 (dist/logpdf (->binomial 1000000 0.0001) 100))
      ;; actual: (not (ish? -3.222306954272568 -3.2223069561241857))
      (is (ish? -3.222306954272568 (dist/logpdf (->binomial 1000000 0.0001) 100)))

Next steps

A spot check and some feedback would be great. It's entirely possible that I'm missing something here. Happy to make changes.

Copy link
Collaborator

@sritchie sritchie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, just a few comments!

(defrecord Binomial [^SplittableRandom rnd n p]
d/Sample
(sample [_]
(.nextBinomial rnd n p))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no way, awesome! Is this a thing? I don't see it in the javadocs... https://docs.oracle.com/javase/8/docs/api/java/util/SplittableRandom.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sike! i just yanked out the binomial stuff from java-util. i'll follow up with a separate PR to get samples going. then we get benchmark it against kixi and commons.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(i also added binomial-gf-tests that generatively samples from a distribution, which would have caught .nextBinomial)

(>= v 0)
(>= n v)
(<= 0 p 1)]}
(cond
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you might use case here to make this more efficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh nice, done.

{:pre [(>= x 0)]}
(log-gamma-fn (inc x)))

(defn log-bico
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's make this private, or add ^:no-doc to keep it out of the public API

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would actually make this a local fn with letfn instead of making it its own thing. All of these preconditions are going to run every time, if you put it inside binomial you can skip running those since you're calling it in a place where you know you pass the preconditions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

totally. i made log-fact and log-bico local with letfn.

(is 0 (dist/logpdf (->binomial 10 1) 10)))

(testing "when p = 1 and v < n, probability is 0, log(0) = -Inf"
(is ##-Inf(dist/logpdf (->binomial 10 0) 1)))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoops, missing a space

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed. should bb lint normally catch that?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like it's still there, I'm not sure, maybe it should?

;; properties...
(testing "sum of probabilities equals 1"
(with-comparator (within 1e-9)
(let [n 100
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about making this a generative test? See the checking style from other tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great idea. i changed a bunch of tests over to generative. done.

(is (dist/logpdf (->binomial n p) v)
(dist/logpdf (->binomial n p) (- n v)))))

(testing "mean and variance consistency where mu = n * p and variance = mu(1 - p)"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you say more, even if just in comments below, about what you are testing here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, i added comments with more context. the basic idea is to spot check a couple of well known properties of the distribution. if that seems weird though i can leave them out for sure.

(with-comparator (within 1e-9)
(is (ish? -7.133546882067904 (dist/logpdf (->binomial 1000000 0.5) 500000)))

;; TODO: failing test (off by 1.9e-9)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete these if unnecessary

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. i dialed down the precision threshold from 1e-9 to 1e-6 and all of these tests are now passing. for spot checking against other libraries, that seems like a peaceful threshold but feel free to push back!

@codecov-commenter
Copy link

codecov-commenter commented Feb 27, 2024

Codecov Report

Attention: Patch coverage is 96.66667% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 70.78%. Comparing base (106b026) to head (fe62fbc).

Files Patch % Lines
src/gen/distribution/math/log_likelihood.cljc 94.11% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #66      +/-   ##
==========================================
+ Coverage   70.38%   70.78%   +0.39%     
==========================================
  Files          16       16              
  Lines        1084     1102      +18     
  Branches       23       24       +1     
==========================================
+ Hits          763      780      +17     
  Misses        298      298              
- Partials       23       24       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@eightysteele
Copy link
Contributor Author

@sritchie thanks for the fast feedback! just pushed updates and responded to comments.

i took a look at that score function in webppl to see how it compared to the lanczos approximation approach in log-gamma-fn:

https://github.com/probmods/webppl/blob/dev/src/dists/binomial.ad.js#L66-L107

i took a quick stab at porting to clojure so that i could run a comparison (this can almost certainly be optimized):

(defn binomial
  [n p v]
  {:pre [(integer? n)
         (integer? v)
         (>= v 0)
         (>= n v)
         (<= 0 p 1)]}
  (letfn [(log-fact
            [x]
            (if (< x 2)
              0
              (reduce + (map #(Math/log %) (range 2 (inc x))))))
          (log-bico
            [n m o]
            (let [x (reduce + (map #(Math/log %) (range (inc o) (inc n))))]
              (- x (log-fact m))))]
    (let [[m o] (if (< v (- n v)) [v (- n v)] [(- n v) v])]
      (case p
        0 (if (= v 0)
            0.0     ;; log(1)
            ##-Inf) ;; log(0))
        1 (if (= v n)
            0.0     ;; log(1)
            ##-Inf) ;; log(0)
        (+ (log-bico n m o)
           (* v (Math/log p))
           (* (- n v) (Math/log (- 1 p))))))))

this particular implementation turned out to be pretty slow with a bit of precision loss as $n$ and $v$ got large (7 minutes to run bb test:clj compared to 16 seconds with log-gamma-fn).

here's my hypothesis. the runtime complexity of my original binomial is basically $O(1)$, driven by the lanczos approximation in log-gamma-fn. that's because it involves a fixed number of coefficients and iterations that do not vary with the input. this is assuming constant time for arithmetic operations.

for the binomial test above, the complexity is $O(n)$ from the linear scaling of log summations within the range $n - o$ and the factorial computation $O(m)$. with the optimization to minimize computation range, the worst-case scenario scales linearly with $n$, which bumps the computational cost for larger inputs.

all to say, it seems reasonable to keep the original implementation for now.

@eightysteele
Copy link
Contributor Author

eightysteele commented Feb 29, 2024

purely for my own learning, i worked out a slightly more efficient implementation that memoizes factorials and swaps map/reduce for loop/recur (to factor out intermediate collections). it's roughly 78% faster than that first version, but still a minute slower than the log-gamma-fn version.

(def memoized-log-fact
  (memoize
   (fn [x]
     (loop [i x
            acc 0.0]
       (if (< i 2)
         acc
         (recur (dec i) (+ acc (Math/log i))))))))

(defn log-bico
  [n m o]
  (let [numerator (loop [i (inc o)
                         acc 0.0]
                    (if (> i n)
                      acc
                      (recur (inc i) (+ acc (Math/log i)))))]
    (- numerator (memoized-log-fact m))))

(defn binomial
  [n p v]
  {:pre [(integer? n)
         (integer? v)
         (>= v 0)
         (>= n v)
         (<= 0 p 1)]}
  (let [[m o] (if (< v (- n v)) [v (- n v)] [(- n v) v])]
    (case p
      0 (if (= v 0) 0.0 ##-Inf)
      1 (if (= v n) 0.0 ##-Inf)
      (+ (log-bico n m o)
         (* v (Math/log p))
         (* (- n v) (Math/log (- 1 p)))))))

@eightysteele
Copy link
Contributor Author

@sritchie with log-gamma-fn in terms of kixi, all tests are now passing (including binomial comparisons to scipy and gen.jl with a precision threshold dialed up to 1e-12). it is my belief that the requested changes here have been addressed, but lmk!

parameterized by `n` (number of trials) and `p` (probability of success in
each trial) at the value `v` (number of successes)."
[n p v]
{:pre [(integer? n)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in ClojureScript, number? checks if you have a goog.math/Integer... is that right? I want to check on how we did this in Emmy, I feel like there is some other predicate we should use. I could be wrong, flying blind here before a meeting..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah not sure, but it doesn't look like cljs checks for goog.math/Integer:

https://github.com/clojure/clojurescript/blob/acbefb9b1e79b659b639919fbf96cb3726719e25/src/main/cljs/cljs/core.cljs#L2328-L2334

in emmy i don't see an integer? defined, but the value namespace does have predicates like real? and number? that do use goog.math/...:

https://github.com/mentat-collective/emmy/blob/main/src/emmy/value.cljc

BUT if i understand the compatibility layer between clj and cljs, if we use integer? in clojure code and then compile it to cljs, the cljs version of integer? should be used when the code runs in a js environment. if that's the case, then it's plausible that this is the right predicate to use?

@sritchie
Copy link
Collaborator

Thanks @eightysteele , this looks great!!

@sritchie sritchie merged commit 3c03777 into ChiSym:main Mar 12, 2024
4 checks passed
@eightysteele
Copy link
Contributor Author

thanks @sritchie! \o/

@eightysteele eightysteele deleted the eighty-binomial branch April 11, 2024 02:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants