Feature/neg binomial dist #627
Conversation
Thank you for your contribution @dawidkopczyk. I'm reviewing the code. While doing that, I'm going to add a new benchmark problem so that we can confirm that we are converging to a similar solution to existing software (h2o, or statsmodels for unregularized models). I'll keep you posted.
This is great! I've been able to add the negative binomial case to our benchmarking suite and compare it against h2o. We are converging to the same solution. In terms of performance, we are still ahead by a nice margin.
Happy to merge this very soon.
upper_bound = np.Inf
include_upper_bound = False

def __init__(self, theta=1.0):
Is the default value of 1 sensible? h2o defaults to 1e-10, which is pretty different.
If we consider the target y in the negative binomial as the number of failures before the 1/theta-th success, we would have:
- for the theta = 1 case we model the number of failures before the 1st success;
- for the theta = 1e-10 case we model the number of failures before the 1e10-th success, so generally it is a count of failures.

On the other hand, for theta = 1e-10 the variance is just the mean (like in Poisson), whereas for theta = 1 the variance resembles that of a binomial distribution.

I don't have a strong feeling about which default should be used; I took 1.0 since 1e-10 seemed to me too close to the disallowed value of 0.
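A quick numeric illustration of the two regimes, using the variance function v(mu) = mu + theta * mu^2 (illustrative values only, not part of the PR):

mu = 5.0
for theta in (1e-10, 1.0):
    variance = mu + theta * mu ** 2  # negative binomial variance function
    print(f"theta={theta:g}: variance={variance:g}")
# theta=1e-10 -> variance ~ 5 (essentially Poisson: variance equals the mean)
# theta=1     -> variance = 30 (strong overdispersion)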
Let's stick with 1 then. I saw that h2o uses 0.5 in their example.
def theta(self, theta):

    if not isinstance(theta, (int, float)):
        raise TypeError(f"theta must be an int or float, input was {theta}")
I see that this is directly reused from the Tweedie distribution, so not a problem with this PR exactly. However, I think we should have a different approach for type-checking here. The problem is that int and float are too restrictive. The best demonstration of this is that this would not work:
dist = glum.NegativeBinomialDistribution()
dist.theta = dist.theta # This fails.
The failure is due to the fact that we are returning a float32 while checking the input to be of type float (i.e. float64).
The simplest approach here would be to ask for forgiveness: wrap line 1168 in a try/except block and delegate the type-checking task to numpy.
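A minimal sketch of what that could look like (the class and attribute names here are illustrative, not the actual glum code):

import numpy as np

class _ThetaExample:
    def __init__(self, theta=1.0):
        self.theta = theta

    @property
    def theta(self):
        return self._theta

    @theta.setter
    def theta(self, theta):
        try:
            # Let numpy decide what counts as a number; this accepts Python
            # ints/floats as well as numpy scalars such as float32.
            self._theta = np.float64(theta)
        except (TypeError, ValueError):
            raise TypeError(f"theta must be numeric, input was {theta}")

dist = _ThetaExample()
dist.theta = np.float32(1.5)  # works
dist.theta = dist.theta       # round-trips as well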
Just let me know whether this should be solved in this PR and what your suggested change is.
Let's solve this in a different PR.
Opened issue #631
y, mu, sample_weight = _as_float_arrays(y, mu, sample_weight)
sample_weight = np.ones_like(y) if sample_weight is None else sample_weight

return negative_binomial_deviance(y, sample_weight, mu, theta=float(theta))
With the setter decorator on theta, we are sure we are always getting a float32. I don't think the conversion to a float is necessary here.
To make sure it works with the Cython typing, the theta parameter can be set to be a C float (which is 32 bits) instead of floating.
Since the p and dispersion arguments are passed as floating, would it be better to change that for all cases in a separate PR? I am not comfortable changing it only for the negative binomial distribution.
src/glum/_distribution.py
)

@property
def lower_bound(self) -> float:
Just to avoid confusion, can you store lower_bound with upper_bound above as a simple class attribute? As far as I know, we won't need to do any computation here. Same comment for include_lower_bound.
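For reference, a sketch of the suggested shape (bounds as plain class attributes rather than properties; the particular values are my assumption about the negative binomial support, y >= 0):

import numpy as np

class NegativeBinomialDistributionSketch:
    # plain class attributes, no @property needed since nothing is computed
    lower_bound = 0.0
    upper_bound = np.inf
    include_lower_bound = True
    include_upper_bound = False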
ok, good point
)

def log_likelihood(self, y, mu, sample_weight=None, dispersion=1) -> float:
    r"""Compute the log likelihood.
Can you add a link to a source for the actual formula you used in negative_binomial_log_likelihood? The formula is different from what is listed on h2o, statsmodels, and Wikipedia (the sources you listed in the PR description).
I'm very confident that it is correct, but for future reference it would be easier not to have to make the derivation again.
OK, the same formula (although theta is called alpha there) is presented in Eq. (3) here: https://content.wolfram.com/uploads/sites/19/2013/04/Zwilling.pdf, so I will add it as a reference in the class.
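For future reference, a minimal standalone sketch of that log-likelihood in the (mu, theta) parametrization, cross-checked against scipy's nbinom with n = 1/theta and p = 1/(1 + theta*mu) (an illustration, not glum's actual implementation):

import numpy as np
from scipy import special, stats

def nb2_log_likelihood(y, mu, theta):
    """Per-observation NB2 log-likelihood with mean mu and variance mu + theta * mu**2."""
    r = 1.0 / theta
    return (
        special.gammaln(y + r)
        - special.gammaln(y + 1.0)
        - special.gammaln(r)
        + y * np.log(theta * mu / (1.0 + theta * mu))
        - r * np.log1p(theta * mu)
    )

y = np.array([0.0, 1.0, 3.0])
mu = np.array([0.5, 1.2, 2.5])
theta = 1.5
# scipy's nbinom uses (n, p) with n = 1/theta and p = 1/(1 + theta * mu)
reference = stats.nbinom.logpmf(y, 1.0 / theta, 1.0 / (1.0 + theta * mu))
assert np.allclose(nb2_log_likelihood(y, mu, theta), reference)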
tests/glm/test_distribution.py
@@ -27,6 +28,7 @@
    (GammaDistribution(), 0),
    (InverseGaussianDistribution(), 0),
    (TweedieDistribution(power=1.5), 0),
    (NegativeBinomialDistribution(theta=1.0), 0),
Ideally we should test a value other than 1. theta = 1 leads to a lot of simplifications in the formulas that may hide bugs.
ok, I will put 1.5 here
…feature/neg_binomial_dist
Thanks a lot for your contribution @dawidkopczyk. This looks very good.
This PR introduces the Negative Binomial distribution, partially answering the request from #569.
Implementation details:
- The distribution is parametrized by theta, which defines the variance of the negative binomial distribution: v(mu) = mu + theta * mu^2.
- In the number-of-failures-before-the-r-th-success view, r = 1/theta.
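A usage sketch of the new family (assuming the class is importable from the top-level glum namespace as in this PR; the parameter values and simulated data are arbitrary):

import numpy as np
from glum import GeneralizedLinearRegressor, NegativeBinomialDistribution

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
theta = 1.5
mu = np.exp(X @ np.array([0.3, -0.2, 0.1]))
# simulate NB2 counts with variance mu + theta * mu^2
y = rng.negative_binomial(n=1 / theta, p=1 / (1 + theta * mu))

model = GeneralizedLinearRegressor(
    family=NegativeBinomialDistribution(theta=theta), link="log", alpha=0
)
model.fit(X, y)
print(model.intercept_, model.coef_)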