Feature/neg binomial dist #627
Conversation
Thank you for your contribution @dawidkopczyk. I'm reviewing the code. While doing that, I'm going to add a new benchmark problem so that we can confirm that we are converging to a similar solution to existing software (h2o, or statsmodels for unregularized models). I'll keep you posted.
This is great! I've been able to add the negative binomial case to our benchmarking suite and compare it against h2o. We are converging to the same solution. In terms of performance, we are still ahead by a nice margin.
Happy to merge this very soon.
upper_bound = np.Inf
include_upper_bound = False

def __init__(self, theta=1.0):
Is the default value of 1 sensible? h2o defaults to 1e-10, which is pretty different.
If we consider the target y in the negative binomial as the number of failures before the 1/theta-th success, we would have:
- for the theta = 1 case we model the number of failures before the 1st success;
- for the theta = 1e-10 case we model the number of failures before the 1e10-th success, so generally it is a count of failures.

On the other hand, for theta = 1e-10 the variance is just the mean (like in Poisson), whereas for theta = 1 the variance resembles that of a binomial distribution.

I don't have a strong feeling about which default should be used; I took 1.0 since 1e-10 seemed to me too close to the disallowed value of 0.
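A quick numeric illustration of the two regimes, using the variance function v(mu) = mu + theta * mu^2 (illustrative values only, not part of the PR):

mu = 5.0
for theta in (1e-10, 1.0):
    variance = mu + theta * mu ** 2  # negative binomial variance function
    print(f"theta={theta:g}: variance={variance:g}")
# theta=1e-10 -> variance ~ 5 (essentially Poisson: variance equals the mean)
# theta=1     -> variance = 30 (strong overdispersion)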
Let's stick with 1 then. I saw that h2o uses 0.5 in their example.
def theta(self, theta):

    if not isinstance(theta, (int, float)):
        raise TypeError(f"theta must be an int or float, input was {theta}")
I see that this is directly reused from the Tweedie distribution, so not a problem with this PR exactly. However, I think we should have a different approach for type-checking here. The problem is that int and float are too restrictive. The best demonstration of this is that this would not work:
dist = glum.NegativeBinomialDistribution()
dist.theta = dist.theta # This fails.
The failure is due to the fact that we are returning a float32 while checking the input to be of type float (i.e. float64).
The simplest approach here would be to ask for forgiveness: wrap line 1168 in a try/except block and delegate the type-checking task to numpy.
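A minimal sketch of what that could look like (the class and attribute names here are illustrative, not the actual glum code):

import numpy as np

class _ThetaExample:
    def __init__(self, theta=1.0):
        self.theta = theta

    @property
    def theta(self):
        return self._theta

    @theta.setter
    def theta(self, theta):
        try:
            # Let numpy decide what counts as a number; this accepts Python
            # ints/floats as well as numpy scalars such as float32.
            self._theta = np.float64(theta)
        except (TypeError, ValueError):
            raise TypeError(f"theta must be numeric, input was {theta}")

dist = _ThetaExample()
dist.theta = np.float32(1.5)  # works
dist.theta = dist.theta       # round-trips as well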
Just let me know whether this should be solved in this PR and what your suggested change is.
Let's solve this in a different PR.
Opened issue #631
y, mu, sample_weight = _as_float_arrays(y, mu, sample_weight)
sample_weight = np.ones_like(y) if sample_weight is None else sample_weight

return negative_binomial_deviance(y, sample_weight, mu, theta=float(theta))
With the setter decorator on theta, we are sure we are always getting a float32. I don't think the conversion to a float is necessary here.
To make sure it works with the Cython typing, the theta parameter can be set to be a C float (which is 32 bits) instead of floating.
Since the p and dispersion arguments are passed as floating, would it be better to change that for all cases in a separate PR? I am not comfortable changing it only for the negative binomial distribution.
src/glum/_distribution.py
)

@property
def lower_bound(self) -> float:
Just to avoid confusion, can you store lower_bound with upper_bound above as a simple class attribute? As far as I know, we won't need to do any computation here. Same comment for include_lower_bound.
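For reference, a sketch of the suggested shape (bounds as plain class attributes rather than properties; the particular values are my assumption about the negative binomial support, y >= 0):

import numpy as np

class NegativeBinomialDistributionSketch:
    # plain class attributes, no @property needed since nothing is computed
    lower_bound = 0.0
    upper_bound = np.inf
    include_lower_bound = True
    include_upper_bound = False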
ok, good point
)

def log_likelihood(self, y, mu, sample_weight=None, dispersion=1) -> float:
    r"""Compute the log likelihood.
Can you add a link to a source for the actual formula you used in negative_binomial_log_likelihood? The formula is different from what is listed on h2o, statsmodels, and Wikipedia (the sources you listed in the PR description).
I'm very confident that it is correct, but for future reference it would be easier not to have to make the derivation again.
OK, the same formula (although theta is called alpha there) is presented in Eq. (3) here: https://content.wolfram.com/uploads/sites/19/2013/04/Zwilling.pdf, so I will add it as a reference in the class.
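For future reference, a minimal standalone sketch of that log-likelihood in the (mu, theta) parametrization, cross-checked against scipy's nbinom with n = 1/theta and p = 1/(1 + theta*mu) (an illustration, not glum's actual implementation):

import numpy as np
from scipy import special, stats

def nb2_log_likelihood(y, mu, theta):
    """Per-observation NB2 log-likelihood with mean mu and variance mu + theta * mu**2."""
    r = 1.0 / theta
    return (
        special.gammaln(y + r)
        - special.gammaln(y + 1.0)
        - special.gammaln(r)
        + y * np.log(theta * mu / (1.0 + theta * mu))
        - r * np.log1p(theta * mu)
    )

y = np.array([0.0, 1.0, 3.0])
mu = np.array([0.5, 1.2, 2.5])
theta = 1.5
# scipy's nbinom uses (n, p) with n = 1/theta and p = 1/(1 + theta * mu)
reference = stats.nbinom.logpmf(y, 1.0 / theta, 1.0 / (1.0 + theta * mu))
assert np.allclose(nb2_log_likelihood(y, mu, theta), reference)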
tests/glm/test_distribution.py
@@ -27,6 +28,7 @@
    (GammaDistribution(), 0),
    (InverseGaussianDistribution(), 0),
    (TweedieDistribution(power=1.5), 0),
    (NegativeBinomialDistribution(theta=1.0), 0),
Ideally we should test a value other than 1. theta = 1 leads to a lot of simplifications in the formulas that may hide bugs.
ok, I will put 1.5 here
…feature/neg_binomial_dist
Thanks a lot for your contribution @dawidkopczyk. This looks very good.
This PR introduces the Negative Binomial distribution, partially answering the request from #569.
Implementation details:
- The distribution is parametrized by theta, which defines the variance of the negative binomial distribution: v(mu) = mu + theta * mu^2.
- In the number-of-failures-before-the-r-th-success view, r = 1/theta.
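A usage sketch of the new family (assuming the class is importable from the top-level glum namespace as in this PR; the parameter values and simulated data are arbitrary):

import numpy as np
from glum import GeneralizedLinearRegressor, NegativeBinomialDistribution

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
theta = 1.5
mu = np.exp(X @ np.array([0.3, -0.2, 0.1]))
# simulate NB2 counts with variance mu + theta * mu^2
y = rng.negative_binomial(n=1 / theta, p=1 / (1 + theta * mu))

model = GeneralizedLinearRegressor(
    family=NegativeBinomialDistribution(theta=theta), link="log", alpha=0
)
model.fit(X, y)
print(model.intercept_, model.coef_)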