Added censored log-likelihood and some tests for it #1500
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

@@            Coverage Diff            @@
##               main    #1500   +/-  ##
=========================================
   Coverage   100.00%  100.00%
=========================================
   Files           97       97
   Lines         9577     9665    +88
=========================================
+  Hits         9577     9665    +88
@DavAug and @rccreswell I’ve had a go at the censored log-likelihood, which we discussed last year. I appreciate that there are quite a lot of tests, so I thought I’d spell out the cases I’m testing (for a variety of objects):
I’m testing that the likelihood and S1 are correct for these cases, for the following object types (for the values):
Hi @k-shep ,
Thank you for the great work! The likelihood looks very good 😊
I have some minor comments related to cosmetics, performance, and accuracy, as well as one more fundamental comment on the justification for the likelihood calculation. I have to admit, I am not super close to this topic at the moment, but it appears to me that censored values may get higher likelihood contributions than neighbouring uncensored values. I would just be curious to learn whether this is what we want and why that is a good thing 😊
I hope you have a nice evening and best wishes,
David
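To make that concern concrete, here is a small numerical check (hypothetical values, not code from this PR): with Gaussian noise, a point censored at threshold a contributes log Φ(a), while a nearby uncensored point contributes a log density, and the censored contribution can indeed be larger.

```python
# Illustrative numbers, not code from this PR: compare the log-likelihood
# contribution of a censored point (log CDF at the threshold a) with that
# of an uncensored point sitting just above the threshold.
from scipy.stats import norm

mu, sigma, a = 0.0, 1.0, 0.0   # model mean, noise sd, lower censoring limit

censored = norm.logcdf(a, loc=mu, scale=sigma)           # log(0.5) ~ -0.693
uncensored = norm.logpdf(a + 0.01, loc=mu, scale=sigma)  # ~ -0.919

# The censored point contributes more: a probability mass (CDF) is summed
# alongside density values (PDF), so the two terms are not on the same scale.
print(censored, uncensored)
```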
pints/_log_likelihoods.py
Outdated
scipy.stats.norm.cdf(x=self._values,
                     loc=self._problem.evaluate(x[:-self._no]),
                     scale=sigma)),
where=self._values == self._a)
Performance comment:
I would suggest reusing the output variable computed above here.
Content comment:
Shouldn't the values at which the CDF is evaluated be self._a, and the mask be where=self._values <= self._a?
As I understand the method, self._a is the threshold below which measurements are considered censored(?). Then I would want a likelihood contribution for each value below or equal to the threshold.
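A minimal sketch of what the comment is suggesting (the function name and the example numbers are hypothetical, not the PR's code): evaluate the CDF at the threshold rather than at the recorded values, and mask on observations at or below it.

```python
import numpy as np
from scipy.stats import norm

def censored_contribution(values, mean, sigma, a):
    """Log-likelihood contribution of lower-censored points: each
    observation at or below the threshold a contributes log CDF(a),
    evaluated at the model mean for that time point."""
    censored = values <= a  # the mask the comment suggests
    return np.sum(norm.logcdf(a, loc=mean, scale=sigma), where=censored)

# Example: three observations, the first recorded at the detection limit.
values = np.array([0.1, 0.8, 1.3])
mean = np.array([0.4, 0.9, 1.2])
print(censored_contribution(values, mean, sigma=0.5, a=0.1))
```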
Content comment:
I am wondering whether it makes sense to add the full CDF contribution or whether it may need to be normalised (I realise that this may be beyond this PR).
Thinking out loud here: if there is no uncertainty around a particular measurement, we sum up its PDF value (i.e., its likelihood). Intuitively, when a measurement becomes more uncertain, I would want to smear out its PDF value, i.e., I would want to average its PDF value over a range of possible values. By using the CDF, you integrate the PDF and kind of do that, but you overcount the contribution from the censored data. I think you need to divide by the range you integrated over to get an average likelihood contribution, instead of the cumulative likelihood contribution from the censored domain. Otherwise the contribution from an uncertain measurement suddenly contributes more to the likelihood than if it were certain (no censoring). The infinite domain of the Gaussian distribution of course makes this a little bit difficult...
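A hypothetical sketch of this averaging idea (the function name and the finite bound `lower` are illustrative, not part of the PR): divide the integrated PDF mass by the width of the integration range.

```python
from scipy.stats import norm

def averaged_contribution(a, lower, mean, sigma):
    """Average the PDF over [lower, a] instead of summing the full CDF
    mass: the integral of the PDF divided by the integration range.
    `lower` is a hypothetical finite bound; the Gaussian's infinite
    support is what makes this awkward, as noted above."""
    mass = (norm.cdf(a, loc=mean, scale=sigma)
            - norm.cdf(lower, loc=mean, scale=sigma))
    return mass / (a - lower)

# With a wide range the averaged contribution is small (~0.1 here), so a
# censored point no longer outweighs a nearby uncensored density (~0.4).
print(averaged_contribution(a=0.0, lower=-5.0, mean=0.0, sigma=1.0))
```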
pints/_log_likelihoods.py
Outdated
if b is not None:
    if self._values[j, i] > self._b[i]:
        raise ValueError(
            "A value is above the upper limit.")
Content comment:
Perhaps I misunderstood the use of the likelihood, but isn't the purpose of the censored likelihood that measurement values below the threshold exist, but cannot be trusted because the measuring device has reached its resolution limit (or equivalent)?
I think you could do without this check.
I’ve removed this check. I’ve seen values at the limit of detection recorded as equal to it, but I think you’re correct that in general this may not be the case.
Ok, just to follow up on my comment before: I had a quick look at the PK modelling reference and your implementation seems to be consistent with method M3 (except, perhaps, a factor of 2, but I am assuming this comes from the difference in the Phi and CDF definitions). I am still not quite convinced that this is the best method to handle censored data, because it will effectively drive the model to go as far into the censored region as possible to maximise the likelihood. So, for a model with sufficient flexibility, you might get strange results where the model shoots off to minus infinity whenever there is a censored measurement below the censoring threshold (or to infinity for censored measurements above the censoring threshold). That doesn't really seem to be what we want. Their M4 suggestion would do a little bit better by bounding the range of censored measurements from below for the lower censored measurements (and above for the upper censored measurements), but then you can already anticipate that the Gaussian likelihood will just be maximised when the model positions itself right in the middle between the lower bound and the censoring threshold whenever a censored measurement is present. I think these shortcomings are very exciting and leave quite a bit of room for some methodological work 😊
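For reference, here is a hedged sketch contrasting the two contributions discussed above; the function names are made up for illustration, and the M4 form shown assumes measurements are known to be non-negative (which may differ from the reference's exact formulation).

```python
import numpy as np
from scipy.stats import norm

def m3_contribution(a, mean, sigma):
    # M3-style term: log P(Y <= a), the full probability mass below the
    # censoring threshold a.
    return norm.logcdf(a, loc=mean, scale=sigma)

def m4_contribution(a, mean, sigma):
    # One common M4-style term (assumption: non-negative measurements):
    # log P(0 <= Y <= a | Y >= 0), which bounds the censored range below.
    mass = (norm.cdf(a, loc=mean, scale=sigma)
            - norm.cdf(0.0, loc=mean, scale=sigma))
    return np.log(mass) - norm.logsf(0.0, loc=mean, scale=sigma)

# M3 keeps improving as the model dives further below the threshold,
# illustrating the "shoots off to minus infinity" concern:
for mu in (0.5, -1.0, -5.0):
    print(mu, m3_contribution(a=0.0, mean=mu, sigma=1.0))
```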
Also one more methodological comment: given that a censored measurement contributes with the CDF to the likelihood, as opposed to uncensored measurements contributing with the PDF, the likelihood will be larger the more censored measurements are present (by construction, because the CDF is the integral over the PDF). For the maximum likelihood approach in the PK modelling reference this maybe matters less (although I still don't think it's ideal), but for Bayesian inference this will have the consequence that posterior distributions will be narrower than they should be (higher likelihood contributions have a similar effect to having more data). This feels very counterintuitive, as posteriors should probably be wider for censored measurements than for uncensored ones. Just something to think about. 😊
Thanks @DavAug for the review! I think I’ve made all the changes you suggested, and the code is much simpler now. Just as a general comment, I had previously tried to copy the code or coding style from elsewhere in the file, so I think some of your suggestions might also be relevant for that code.
Yes – my industrial supervisor recommended using the M3 method, which is why I started implementing this censored log-likelihood. I think in the reference they’re multiplying the whole log-likelihood (both censored and uncensored terms) by -2, although I’m not sure why. Thank you for your thoughts on this – I think it is potentially an interesting thing to investigate and, if I have time, I’m hoping I might be able to do this later in my PhD. In the papers that I came across when I was looking at this before, people dealt with the censored data in different ways – I came across the use of the M3 method and also simply removing the censored data, and I believe when I tried to compare these on simulated data they gave different results in terms of parameter inference (via finding the MLE). There didn’t seem to be a consensus on the best way to deal with censored data, at least in the field I was looking at – I haven’t, however, looked too much into the general theory of this. From the experiments I tried before, it looks like your thoughts on censoring ‘narrowing’ the likelihood (and posterior) are also correct. The likelihood surface does seem to look flatter when values are not censored, compared to when they are.
Hi @k-shep
Really good job, also with the extensive testing!
I have some minor additional cosmetic comments after a second pass, but I don't think there needs to be another review. If you can have a look at some of those, perhaps especially the ones in the doc string, I am happy for this to go in 😊
Thank you for the hard work!
pints/_log_likelihoods.py
Outdated
r""" Calculates a log-likelihood assuming independent Gaussian noise at | ||
each time point, and that data above and below certain limits are censored. | ||
In other words, any data values less than the lower limit are recorded as | ||
equal to it; and any values greater than the upper limit are recorded as |
Content comment:
I think this is not quite correct, since you are using the CDF at the threshold and not the PDF. So perhaps "any value below the censoring threshold is assumed to equal all values below the censoring threshold" would be more accurate.
Same with the values above the threshold.
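Putting the docstring's description together, here is a hedged sketch of the overall shape of such a likelihood (not the class's actual implementation; `censored_gaussian_loglik` and the example arrays are made up): uncensored points contribute log densities, points recorded at the lower limit contribute the log CDF at that limit, and points at the upper limit contribute the log survival function.

```python
import numpy as np
from scipy.stats import norm

def censored_gaussian_loglik(values, mean, sigma, a=None, b=None):
    """Sketch of a censored Gaussian log-likelihood: values recorded at
    the lower limit a contribute log P(Y <= a), values at the upper
    limit b contribute log P(Y >= b), everything else contributes the
    usual log density."""
    values, mean = np.asarray(values), np.asarray(mean)
    lower = (values == a) if a is not None else np.zeros(values.shape, bool)
    upper = (values == b) if b is not None else np.zeros(values.shape, bool)
    mid = ~(lower | upper)
    total = np.sum(norm.logpdf(values[mid], loc=mean[mid], scale=sigma))
    if a is not None:
        total += np.sum(norm.logcdf(a, loc=mean[lower], scale=sigma))
    if b is not None:
        total += np.sum(norm.logsf(b, loc=mean[upper], scale=sigma))
    return total

# Example: the first value is recorded at the lower limit a = 0.1.
print(censored_gaussian_loglik([0.1, 0.8, 1.3], [0.4, 0.9, 1.2],
                               sigma=0.5, a=0.1))
```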
Thanks @DavAug for reviewing this again. I’ve made the changes you’ve suggested. I have changed the documentation, although I’ve written it differently from how you suggested. Would you be OK just checking that you agree that what I’ve written makes sense now?
Hi @k-shep Of course, no problem! Yes, the docstring makes sense and reads very well! Good job 😊
Thanks @DavAug :) - should I now just click the merge button?
Yes 😊
Hi @DavAug, I just noticed that I didn't put this log-likelihood class in the correct position for it to be in alphabetical order in that file. Is the best way for me to correct this by opening a new issue and pull request, and making this change?
Hi, Good catch! Yes, create an issue for it and a PR when you have changed the order 😊 Thank you! Best,