Incorrect PPV estimate from riskProfile function #6

dhibar · 2024-11-15T18:13:22Z

I noticed that the PPV estimate for a given cutoff in the riskProfile() function output is incorrect while NPV (and separately sensitivity and specificity) all look to be correct. Below is a working toy example to illustrate the problem.

library(stats4phc)
library(dplyr)

OUTCOME = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 
            1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 
            0, 0, 0, 1, 1, 0, 0, 0, 0)

SCORE = c(0.1717, 0.7474, 0.1368, 0.24947, 0.18512, 0.23712, 0.3914, 
          0.095, 0.59691, 0.56331, 0.5564, 0.18824, 0.17069, 0.1634, 0.6308, 
          0.1938, 0.37125, 0.19701, 0.13728, 0.0931, 0.35145, 0.17576, 
          0.1026, 0.49504, 0.41325, 0.30098, 0.57096, 0.54846, 0.25443, 
          0.14725, 0.1584, 0.30492, 0.49203, 0.12771, 0.2525, 0.28179, 
          0.22725, 0.57855, 0.17523, 0.13464, 0.22523, 0.26361, 0.12672, 
          0.14976, 0.49005, 0.73528, 0.15444, 0.47672, 0.0572, 0.15352)


CUTOFF = ifelse(SCORE > 0.41325, 1, 0)

comb_df <- data.frame(OUTCOME, SCORE, CUTOFF) %>%
  mutate(OUTCOME_f = factor(OUTCOME, levels = c("1", "0"))) %>%
  mutate(CUTOFF_f = factor(CUTOFF, levels = c("1", "0")))

myNPV <- comb_df %>%
  yardstick::npv(
    truth = OUTCOME_f,
    estimate = CUTOFF_f,
    event_level = "first"
  )

print(myNPV$.estimate)

myPPV <- comb_df %>%
  yardstick::ppv(
    truth = OUTCOME_f,
    estimate = CUTOFF_f,
    event_level = "first"
  )

print(myPPV$.estimate)

Our NPV and PPV for this example are 0.9189189 and 0.4615385 as expected.

With the riskProfile function we see a different estimate for PPV:

# risk profiling with stats4phc
p1cn <- riskProfile(outcome = comb_df$OUTCOME, score = comb_df$SCORE,
                    include = c("PPV", "1-NPV"), methods = c("cgam"))

s4p_perf <- p1cn$data %>%
  filter(method == "non-parametric") %>%
  filter(score == 0.41325)

print(s4p_perf$pvValue[s4p_perf$pv=="NPV"])

print(s4p_perf$pvValue[s4p_perf$pv=="PPV"])

Our NPV and PPV for this riskProfile example are 0.9189189 and 0.4285714 i.e. not the correct PPV!

If you take a deeper dive into the data output from the riskProfile function it looks like the PPV estimate at the next cutoff is the correct value for the previous cutoff so could be a merging or off-by-one type error (remember that the NPV is still correct for a given cutoff though).

# if you look at the next entry in the full output for PPV you can see that the correct PPV is assign to the next cutoff value
s4p_perf_all <- p1cn$data %>%
  filter(method == "non-parametric")

print(s4p_perf_all$pvValue[which(s4p_perf_all$pv=="PPV" & s4p_perf_all$score == 0.47672)])

Here's my session info as well JIC:

R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.1.3   stats4phc_0.1

loaded via a namespace (and not attached):
 [1] Matrix_1.6-1.1    gtable_0.3.4      coneproj_1.17     compiler_4.3.1    tidyselect_1.2.0  Rcpp_1.0.12       cgam_1.21         tidyr_1.3.0      
 [9] splines_4.3.1     scales_1.3.0      yaml_2.3.8        fastmap_1.1.1     boot_1.3-28.1     statmod_1.5.0     svGUI_1.0.1       lattice_0.21-9   
[17] ggplot2_3.5.0     R6_2.5.1          labeling_0.4.3    generics_0.1.3    knitr_1.44        MASS_7.3-60       backports_1.4.1   checkmate_2.2.0  
[25] tibble_3.2.1      nloptr_2.0.3      munsell_0.5.0     minqa_1.2.6       pillar_1.9.0      rlang_1.1.3       utf8_1.2.3        xfun_0.40        
[33] cli_3.6.2         withr_3.0.0       magrittr_2.0.3    digest_0.6.35     grid_4.3.1        rstudioapi_0.16.0 svDialogs_1.1.0   lme4_1.1-35.1    
[41] lifecycle_1.0.4   nlme_3.1-163      vctrs_0.6.5       evaluate_0.22     pracma_2.4.2      glue_1.7.0        yardstick_1.3.1   fansi_1.0.5      
[49] colorspace_2.1-0  rmarkdown_2.25    purrr_1.0.2       htmltools_0.5.8.1 tools_4.3.1       pkgconfig_2.0.3

The text was updated successfully, but these errors were encountered:

dgkf-roche · 2024-11-26T21:32:17Z

I spent some time today investigating this a bit and just want to check my understanding before diving much further.

In your {yardstick} example you use a cutoff of SCORE > 0.41325, where the stats4phc example filters on score == 0.41325. Switching the {yardstick} example to SCORE >= 0.41325 produces the same result with both.

Let me know if this is intentional and if you still expect the results to match up even with the slightly different ways of setting the score cutoff.

dhibar · 2024-11-26T22:16:28Z

I think that the riskProfile function is also using > to define status at a given cutoff. Specifically I think this line is where it happens:

stats4phc/R/PV.R

Line 32 in e868e23

thresh.predictions <- lapply(score, function(x) as.numeric(score > x))

The PPV that function produces all look correct but the data that is returned is shifted off by one for PPV (but not NPV)

dgkf-roche · 2024-11-26T22:19:33Z

Ah, I see - I was looking at this part in your code:

s4p_perf <- p1cn$data %>%
  filter(method == "non-parametric") %>%
  filter(score == 0.41325)

Would you need to do filter(score > 0.41325) to compare the PPV, or are you expecting that this is already accounted for when the PPV is calculated against each of the riskProfile scores?

dhibar · 2024-11-26T22:37:49Z

My underlying assumption is that for a given entry in the data returned by the riskProfile function it returns the PPV or NPV for the scenario of score > cutoff. So in the text you quoted in this step filter(score == 0.41325) I am hoping to get the PPV and NPV for that single cutoff i.e. for score > 0.41325. That assumption holds for NPV and also it works the same way in the Sens Spec plot outputs.

dgkf-roche · 2024-11-26T23:05:10Z

Thanks for confirming! I'll see what I can do to make the values align.

dgkf-roche · 2024-11-27T20:46:51Z

I think I've nailed down the culprit, but I want your advice on what you want the behavior of the package to be.

You were right to hone in on the lines where the prevalence is injected as a ppv/npv. To manage this, the calculated thresholds were being truncated. The assumption that the last threshold should be equal to the prevalence seemed to be the issue. I think the original goal here was to ensure that there is always a PPV & NPV calculated at score == prevalence.

There are two paths I see here, and it depends on what you're looking for:

We stop forcing the inclusion of ppv/npv where `score == prevalence`

We can simply calculate the ppv/npv only at the scores that are provided:

  # calculate ppv/npv at all thresholds provided by score
  ppv <- vapply(thresh.predictions, ...)
  npv <- vapply(thresh.predictions, ...)

  threshold.data <- data.frame(
    score = score,
    percentile = ecdf(score)(score),
    PPV = ppv,
    NPV = npv
  )

This package has changed a lot since I last touched it, but I think the impact would be that ppv/npv curves on the predictiveness plots might not appear to hit exactly the prevalence at the extremes of the risk percentile.

Continue including additional score at prevalence, but pad both sides of ppv/npv vectors

I think that by padding both ends of both the ppv/npv and adding both the prevalence and 1-prevalence values to the calculated scores we will address the off-by-one issue while continuing to include this particularly valuable score boundaries.

  # calculate ppv/npv at all thresholds provided by score
  ppv <- vapply(thresh.predictions, ...)
  npv <- vapply(thresh.predictions, ...)

  score <- sort(c(score, prev, 1 - prev))
  threshold.data <- data.frame(
    score = score,
    percentile = ecdf(score)(score),
    PPV = c(prev, ppv, 0       ),  # NOTE: padding both sides of ppv/npv
    NPV = c(1,    npv, 1 - prev)
  )

The downside with this approach is that it might include redundant records in instances where the prevalence is already one of the scores.

slamao · 2025-02-14T15:12:57Z

Hey guys! Sorry for completely missing on this issue, I have now updated my notifications in this repo ...

Did you manage to solve this? I have created a PR that I think should address this issue, see #7. It is a mix of Doug's suggestions actually, because I now think that the following line

stats4phc/R/PV.R

Line 35 in e868e23

thresh.predictions[1:(length(score) - 1)],

in nonParametricPV function wasn't exactly correct.

dgkf-roche · 2025-02-14T16:16:46Z

Thanks for the ping @slamao - this definitely fell off my radar through the holiday season. I'd be happy for you to pick it up and make a judgement call on what is the appropriate solution here since I'm really having to jog my memory on the methods trying to jump back into the codebase.

Please let me know if there's anything else you might need from me.

dhibar · 2025-02-14T16:25:01Z

Thanks @dgkf-roche and @slamao ! I was out for two months on sabbatical so this fell off my radar as well.

@slamao I am testing out your proposed changes in #7 now and will come back to you

Address #6. Since it was chosen by design to assess `score > x` for each `x in score`, I don't think we can naturally end up with a PPV at prevalence. Other option would be to inflate the resulting dataset manually for this boundary value, but than we would need to artificially add one more score and one more percentile to match the number of rows in the resulting data frame.

slamao · 2025-02-14T17:14:48Z

Thanks for the ping @slamao - this definitely fell off my radar through the holiday season. I'd be happy for you to pick it up and make a judgement call on what is the appropriate solution here since I'm really having to jog my memory on the methods trying to jump back into the codebase.

Please let me know if there's anything else you might need from me.

Thanks for the trust 😄 and also thanks for your analysis above, it did help me to prepare the PR, which is a mix of your suggestions.

And sorry to both of you for reacting so late. It is nice to see at least some activity here :) hope you enjoyed your days off Derrek!

slamao added a commit that referenced this issue Feb 14, 2025

#6: addressing shift in PPV by one row

2b0ece7

slamao mentioned this issue Feb 14, 2025

#6: addressing shift in PPV by one row #7

Merged

slamao closed this as completed Feb 14, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incorrect PPV estimate from riskProfile function #6

Incorrect PPV estimate from riskProfile function #6

dhibar commented Nov 15, 2024

dgkf-roche commented Nov 26, 2024 •

edited

Loading

dhibar commented Nov 26, 2024

dgkf-roche commented Nov 26, 2024

dhibar commented Nov 26, 2024

dgkf-roche commented Nov 26, 2024

dgkf-roche commented Nov 27, 2024 •

edited

Loading

slamao commented Feb 14, 2025

dgkf-roche commented Feb 14, 2025

dhibar commented Feb 14, 2025

slamao commented Feb 14, 2025

Incorrect PPV estimate from riskProfile function #6

Incorrect PPV estimate from riskProfile function #6

Comments

dhibar commented Nov 15, 2024

dgkf-roche commented Nov 26, 2024 • edited Loading

dhibar commented Nov 26, 2024

dgkf-roche commented Nov 26, 2024

dhibar commented Nov 26, 2024

dgkf-roche commented Nov 26, 2024

dgkf-roche commented Nov 27, 2024 • edited Loading

We stop forcing the inclusion of ppv/npv where score == prevalence

Continue including additional score at prevalence, but pad both sides of ppv/npv vectors

slamao commented Feb 14, 2025

dgkf-roche commented Feb 14, 2025

dhibar commented Feb 14, 2025

slamao commented Feb 14, 2025

dgkf-roche commented Nov 26, 2024 •

edited

Loading

dgkf-roche commented Nov 27, 2024 •

edited

Loading

We stop forcing the inclusion of ppv/npv where `score == prevalence`