-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Binomial distribution #66
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work, just a few comments!
src/gen/distribution/java_util.clj
Outdated
(defrecord Binomial [^SplittableRandom rnd n p] | ||
d/Sample | ||
(sample [_] | ||
(.nextBinomial rnd n p)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no way, awesome! Is this a thing? I don't see it in the javadocs... https://docs.oracle.com/javase/8/docs/api/java/util/SplittableRandom.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sike! i just yanked out the binomial stuff from java-util. i'll follow up with a separate PR to get samples going. then we get benchmark it against kixi and commons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(i also added binomial-gf-tests
that generatively samples from a distribution, which would have caught .nextBinomial
)
(>= v 0) | ||
(>= n v) | ||
(<= 0 p 1)]} | ||
(cond |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you might use case
here to make this more efficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh nice, done.
{:pre [(>= x 0)]} | ||
(log-gamma-fn (inc x))) | ||
|
||
(defn log-bico |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's make this private, or add ^:no-doc
to keep it out of the public API
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would actually make this a local fn with letfn
instead of making it its own thing. All of these preconditions are going to run every time, if you put it inside binomial
you can skip running those since you're calling it in a place where you know you pass the preconditions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
totally. i made log-fact
and log-bico
local with letfn
.
(is 0 (dist/logpdf (->binomial 10 1) 10))) | ||
|
||
(testing "when p = 1 and v < n, probability is 0, log(0) = -Inf" | ||
(is ##-Inf(dist/logpdf (->binomial 10 0) 1))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whoops, missing a space
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed. should bb lint
normally catch that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like it's still there, I'm not sure, maybe it should?
test/gen/distribution_test.cljc
Outdated
;; properties... | ||
(testing "sum of probabilities equals 1" | ||
(with-comparator (within 1e-9) | ||
(let [n 100 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about making this a generative test? See the checking
style from other tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great idea. i changed a bunch of tests over to generative. done.
test/gen/distribution_test.cljc
Outdated
(is (dist/logpdf (->binomial n p) v) | ||
(dist/logpdf (->binomial n p) (- n v))))) | ||
|
||
(testing "mean and variance consistency where mu = n * p and variance = mu(1 - p)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you say more, even if just in comments below, about what you are testing here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, i added comments with more context. the basic idea is to spot check a couple of well known properties of the distribution. if that seems weird though i can leave them out for sure.
test/gen/distribution_test.cljc
Outdated
(with-comparator (within 1e-9) | ||
(is (ish? -7.133546882067904 (dist/logpdf (->binomial 1000000 0.5) 500000))) | ||
|
||
;; TODO: failing test (off by 1.9e-9) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delete these if unnecessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done. i dialed down the precision threshold from 1e-9
to 1e-6
and all of these tests are now passing. for spot checking against other libraries, that seems like a peaceful threshold but feel free to push back!
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #66 +/- ##
==========================================
+ Coverage 70.38% 70.78% +0.39%
==========================================
Files 16 16
Lines 1084 1102 +18
Branches 23 24 +1
==========================================
+ Hits 763 780 +17
Misses 298 298
- Partials 23 24 +1 ☔ View full report in Codecov by Sentry. |
@sritchie thanks for the fast feedback! just pushed updates and responded to comments. i took a look at that score function in webppl to see how it compared to the lanczos approximation approach in https://github.com/probmods/webppl/blob/dev/src/dists/binomial.ad.js#L66-L107 i took a quick stab at porting to clojure so that i could run a comparison (this can almost certainly be optimized): (defn binomial
[n p v]
{:pre [(integer? n)
(integer? v)
(>= v 0)
(>= n v)
(<= 0 p 1)]}
(letfn [(log-fact
[x]
(if (< x 2)
0
(reduce + (map #(Math/log %) (range 2 (inc x))))))
(log-bico
[n m o]
(let [x (reduce + (map #(Math/log %) (range (inc o) (inc n))))]
(- x (log-fact m))))]
(let [[m o] (if (< v (- n v)) [v (- n v)] [(- n v) v])]
(case p
0 (if (= v 0)
0.0 ;; log(1)
##-Inf) ;; log(0))
1 (if (= v n)
0.0 ;; log(1)
##-Inf) ;; log(0)
(+ (log-bico n m o)
(* v (Math/log p))
(* (- n v) (Math/log (- 1 p)))))))) this particular implementation turned out to be pretty slow with a bit of precision loss as here's my hypothesis. the runtime complexity of my original for the all to say, it seems reasonable to keep the original implementation for now. |
purely for my own learning, i worked out a slightly more efficient implementation that memoizes factorials and swaps map/reduce for loop/recur (to factor out intermediate collections). it's roughly 78% faster than that first version, but still a minute slower than the (def memoized-log-fact
(memoize
(fn [x]
(loop [i x
acc 0.0]
(if (< i 2)
acc
(recur (dec i) (+ acc (Math/log i))))))))
(defn log-bico
[n m o]
(let [numerator (loop [i (inc o)
acc 0.0]
(if (> i n)
acc
(recur (inc i) (+ acc (Math/log i)))))]
(- numerator (memoized-log-fact m))))
(defn binomial
[n p v]
{:pre [(integer? n)
(integer? v)
(>= v 0)
(>= n v)
(<= 0 p 1)]}
(let [[m o] (if (< v (- n v)) [v (- n v)] [(- n v) v])]
(case p
0 (if (= v 0) 0.0 ##-Inf)
1 (if (= v n) 0.0 ##-Inf)
(+ (log-bico n m o)
(* v (Math/log p))
(* (- n v) (Math/log (- 1 p))))))) |
@sritchie with |
parameterized by `n` (number of trials) and `p` (probability of success in | ||
each trial) at the value `v` (number of successes)." | ||
[n p v] | ||
{:pre [(integer? n) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in ClojureScript, number?
checks if you have a goog.math/Integer
... is that right? I want to check on how we did this in Emmy, I feel like there is some other predicate we should use. I could be wrong, flying blind here before a meeting..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah not sure, but it doesn't look like cljs checks for goog.math/Integer
:
in emmy i don't see an integer?
defined, but the value namespace does have predicates like real?
and number?
that do use goog.math/...
:
https://github.com/mentat-collective/emmy/blob/main/src/emmy/value.cljc
BUT if i understand the compatibility layer between clj and cljs, if we use integer? in clojure code and then compile it to cljs, the cljs version of integer? should be used when the code runs in a js environment. if that's the case, then it's plausible that this is the right predicate to use?
Thanks @eightysteele , this looks great!! |
thanks @sritchie! \o/ |
Purpose
This PR adds support for the binomial distribution.
Log-likelihood
The log PMF for the binomial distribution gives the log probability of observing$k$ successes out of $n$ trials:
For computing that binomial coefficient, I used the existing$n$ and $k$ it would have overflown on factorials.
log-gamma-fn
. Otherwise for large values ofTesting
In$n$ , $p$ , and $k$ and verified some different properties of the distribution that should definitely hold (sum of probabilities, symmetry when $p=0$ , mean and variance consistency). Both
distribution_test.clj
I tested boundary conditions onbb test:clj
andbb test:cljs
are passing for these.I also spot checked a range of values against SciPy and Gen.jl (within a precision threshold of
1e-9
of expected values). One of theses tests is currently failing because the difference between expected and actual is right at the threshold boundary. If this is problematic, let me know and I can dig in!Next steps
A spot check and some feedback would be great. It's entirely possible that I'm missing something here. Happy to make changes.