Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect PPV estimate from riskProfile function #6

Closed
dhibar opened this issue Nov 15, 2024 · 10 comments
Closed

Incorrect PPV estimate from riskProfile function #6

dhibar opened this issue Nov 15, 2024 · 10 comments

Comments

@dhibar
Copy link

dhibar commented Nov 15, 2024

I noticed that the PPV estimate for a given cutoff in the riskProfile() function output is incorrect while NPV (and separately sensitivity and specificity) all look to be correct. Below is a working toy example to illustrate the problem.

library(stats4phc)
library(dplyr)

OUTCOME = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 
            1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 
            0, 0, 0, 1, 1, 0, 0, 0, 0)

SCORE = c(0.1717, 0.7474, 0.1368, 0.24947, 0.18512, 0.23712, 0.3914, 
          0.095, 0.59691, 0.56331, 0.5564, 0.18824, 0.17069, 0.1634, 0.6308, 
          0.1938, 0.37125, 0.19701, 0.13728, 0.0931, 0.35145, 0.17576, 
          0.1026, 0.49504, 0.41325, 0.30098, 0.57096, 0.54846, 0.25443, 
          0.14725, 0.1584, 0.30492, 0.49203, 0.12771, 0.2525, 0.28179, 
          0.22725, 0.57855, 0.17523, 0.13464, 0.22523, 0.26361, 0.12672, 
          0.14976, 0.49005, 0.73528, 0.15444, 0.47672, 0.0572, 0.15352)


CUTOFF = ifelse(SCORE > 0.41325, 1, 0)

comb_df <- data.frame(OUTCOME, SCORE, CUTOFF) %>%
  mutate(OUTCOME_f = factor(OUTCOME, levels = c("1", "0"))) %>%
  mutate(CUTOFF_f = factor(CUTOFF, levels = c("1", "0")))

myNPV <- comb_df %>%
  yardstick::npv(
    truth = OUTCOME_f,
    estimate = CUTOFF_f,
    event_level = "first"
  )

print(myNPV$.estimate)

myPPV <- comb_df %>%
  yardstick::ppv(
    truth = OUTCOME_f,
    estimate = CUTOFF_f,
    event_level = "first"
  )

print(myPPV$.estimate)

Our NPV and PPV for this example are 0.9189189 and 0.4615385 as expected.

With the riskProfile function we see a different estimate for PPV:

# risk profiling with stats4phc
p1cn <- riskProfile(outcome = comb_df$OUTCOME, score = comb_df$SCORE,
                    include = c("PPV", "1-NPV"), methods = c("cgam"))

s4p_perf <- p1cn$data %>%
  filter(method == "non-parametric") %>%
  filter(score == 0.41325)

print(s4p_perf$pvValue[s4p_perf$pv=="NPV"])

print(s4p_perf$pvValue[s4p_perf$pv=="PPV"])

Our NPV and PPV for this riskProfile example are 0.9189189 and 0.4285714 i.e. not the correct PPV!

If you take a deeper dive into the data output from the riskProfile function it looks like the PPV estimate at the next cutoff is the correct value for the previous cutoff so could be a merging or off-by-one type error (remember that the NPV is still correct for a given cutoff though).

# if you look at the next entry in the full output for PPV you can see that the correct PPV is assign to the next cutoff value
s4p_perf_all <- p1cn$data %>%
  filter(method == "non-parametric")

print(s4p_perf_all$pvValue[which(s4p_perf_all$pv=="PPV" & s4p_perf_all$score == 0.47672)])

Here's my session info as well JIC:

R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.1.3   stats4phc_0.1

loaded via a namespace (and not attached):
 [1] Matrix_1.6-1.1    gtable_0.3.4      coneproj_1.17     compiler_4.3.1    tidyselect_1.2.0  Rcpp_1.0.12       cgam_1.21         tidyr_1.3.0      
 [9] splines_4.3.1     scales_1.3.0      yaml_2.3.8        fastmap_1.1.1     boot_1.3-28.1     statmod_1.5.0     svGUI_1.0.1       lattice_0.21-9   
[17] ggplot2_3.5.0     R6_2.5.1          labeling_0.4.3    generics_0.1.3    knitr_1.44        MASS_7.3-60       backports_1.4.1   checkmate_2.2.0  
[25] tibble_3.2.1      nloptr_2.0.3      munsell_0.5.0     minqa_1.2.6       pillar_1.9.0      rlang_1.1.3       utf8_1.2.3        xfun_0.40        
[33] cli_3.6.2         withr_3.0.0       magrittr_2.0.3    digest_0.6.35     grid_4.3.1        rstudioapi_0.16.0 svDialogs_1.1.0   lme4_1.1-35.1    
[41] lifecycle_1.0.4   nlme_3.1-163      vctrs_0.6.5       evaluate_0.22     pracma_2.4.2      glue_1.7.0        yardstick_1.3.1   fansi_1.0.5      
[49] colorspace_2.1-0  rmarkdown_2.25    purrr_1.0.2       htmltools_0.5.8.1 tools_4.3.1       pkgconfig_2.0.3  
@dgkf-roche
Copy link

dgkf-roche commented Nov 26, 2024

I spent some time today investigating this a bit and just want to check my understanding before diving much further.

In your {yardstick} example you use a cutoff of SCORE > 0.41325, where the stats4phc example filters on score == 0.41325. Switching the {yardstick} example to SCORE >= 0.41325 produces the same result with both.

Let me know if this is intentional and if you still expect the results to match up even with the slightly different ways of setting the score cutoff.

@dhibar
Copy link
Author

dhibar commented Nov 26, 2024

I think that the riskProfile function is also using > to define status at a given cutoff. Specifically I think this line is where it happens:

thresh.predictions <- lapply(score, function(x) as.numeric(score > x))

The PPV that function produces all look correct but the data that is returned is shifted off by one for PPV (but not NPV)

@dgkf-roche
Copy link

Ah, I see - I was looking at this part in your code:

s4p_perf <- p1cn$data %>%
  filter(method == "non-parametric") %>%
  filter(score == 0.41325)

Would you need to do filter(score > 0.41325) to compare the PPV, or are you expecting that this is already accounted for when the PPV is calculated against each of the riskProfile scores?

@dhibar
Copy link
Author

dhibar commented Nov 26, 2024

My underlying assumption is that for a given entry in the data returned by the riskProfile function it returns the PPV or NPV for the scenario of score > cutoff. So in the text you quoted in this step filter(score == 0.41325) I am hoping to get the PPV and NPV for that single cutoff i.e. for score > 0.41325. That assumption holds for NPV and also it works the same way in the Sens Spec plot outputs.

@dgkf-roche
Copy link

Thanks for confirming! I'll see what I can do to make the values align.

@dgkf-roche
Copy link

dgkf-roche commented Nov 27, 2024

I think I've nailed down the culprit, but I want your advice on what you want the behavior of the package to be.

You were right to hone in on the lines where the prevalence is injected as a ppv/npv. To manage this, the calculated thresholds were being truncated. The assumption that the last threshold should be equal to the prevalence seemed to be the issue. I think the original goal here was to ensure that there is always a PPV & NPV calculated at score == prevalence.

There are two paths I see here, and it depends on what you're looking for:

We stop forcing the inclusion of ppv/npv where score == prevalence

We can simply calculate the ppv/npv only at the scores that are provided:

  # calculate ppv/npv at all thresholds provided by score
  ppv <- vapply(thresh.predictions, ...)
  npv <- vapply(thresh.predictions, ...)

  threshold.data <- data.frame(
    score = score,
    percentile = ecdf(score)(score),
    PPV = ppv,
    NPV = npv
  )

This package has changed a lot since I last touched it, but I think the impact would be that ppv/npv curves on the predictiveness plots might not appear to hit exactly the prevalence at the extremes of the risk percentile.

Continue including additional score at prevalence, but pad both sides of ppv/npv vectors

I think that by padding both ends of both the ppv/npv and adding both the prevalence and 1-prevalence values to the calculated scores we will address the off-by-one issue while continuing to include this particularly valuable score boundaries.

  # calculate ppv/npv at all thresholds provided by score
  ppv <- vapply(thresh.predictions, ...)
  npv <- vapply(thresh.predictions, ...)

  score <- sort(c(score, prev, 1 - prev))
  threshold.data <- data.frame(
    score = score,
    percentile = ecdf(score)(score),
    PPV = c(prev, ppv, 0       ),  # NOTE: padding both sides of ppv/npv
    NPV = c(1,    npv, 1 - prev)
  )

The downside with this approach is that it might include redundant records in instances where the prevalence is already one of the scores.

slamao added a commit that referenced this issue Feb 14, 2025
@slamao
Copy link
Collaborator

slamao commented Feb 14, 2025

Hey guys! Sorry for completely missing on this issue, I have now updated my notifications in this repo ...

Did you manage to solve this? I have created a PR that I think should address this issue, see #7. It is a mix of Doug's suggestions actually, because I now think that the following line

thresh.predictions[1:(length(score) - 1)],
in nonParametricPV function wasn't exactly correct.

@dgkf-roche
Copy link

Thanks for the ping @slamao - this definitely fell off my radar through the holiday season. I'd be happy for you to pick it up and make a judgement call on what is the appropriate solution here since I'm really having to jog my memory on the methods trying to jump back into the codebase.

Please let me know if there's anything else you might need from me.

@dhibar
Copy link
Author

dhibar commented Feb 14, 2025

Thanks @dgkf-roche and @slamao ! I was out for two months on sabbatical so this fell off my radar as well.

@slamao I am testing out your proposed changes in #7 now and will come back to you

slamao added a commit that referenced this issue Feb 14, 2025
Address #6. 

Since it was chosen by design to assess `score > x` for each `x in
score`, I don't think we can naturally end up with a PPV at prevalence.

Other option would be to inflate the resulting dataset manually for this
boundary value, but than we would need to artificially add one more
score and one more percentile to match the number of rows in the
resulting data frame.
@slamao
Copy link
Collaborator

slamao commented Feb 14, 2025

Thanks for the ping @slamao - this definitely fell off my radar through the holiday season. I'd be happy for you to pick it up and make a judgement call on what is the appropriate solution here since I'm really having to jog my memory on the methods trying to jump back into the codebase.

Please let me know if there's anything else you might need from me.

Thanks for the trust 😄 and also thanks for your analysis above, it did help me to prepare the PR, which is a mix of your suggestions.

And sorry to both of you for reacting so late. It is nice to see at least some activity here :) hope you enjoyed your days off Derrek!

@slamao slamao closed this as completed Feb 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants