Revised CI for props docs #367

Merged
merged 6 commits into from
Dec 13, 2024
Changes from 5 commits
74 changes: 64 additions & 10 deletions Comp/r-sas_ci_for_prop.qmd

Large diffs are not rendered by default.

102 changes: 88 additions & 14 deletions R/ci_for_prop.qmd
Expand Up @@ -4,7 +4,15 @@ title: "Confidence Intervals for Proportions in R"

## Introduction

There are several ways to calculate a confidence interval (CI) for a proportion. You need to select the method based on if you have a 1 sample proportion (1 proportion calculated from 1 group of subjects), or if you have 2 samples and you want a CI for the difference in the 2 proportions. The difference in proportion can come from either 2 independent samples (different subjects in each of the 2 groups), or can be matched (the same subject with 1 result in 1 group and 1 result in the other group \[paired data\]).
The methods to use for calculating a confidence interval (CI) for a proportion depend on the type of proportion you have.

- 1 sample proportion (1 proportion calculated from 1 group of subjects)

- 2 sample proportions and you want a CI for the difference in the 2 proportions.

- If the 2 samples come from 2 independent samples (different subjects in each of the 2 groups)

- If the 2 samples are matched (i.e. the same subject has 2 results, one in each group \[paired data\]).

The method selected is also dependent on whether your proportion is close to 0 or 1 (or near to the 0.5 midpoint), and your sample size.

Expand Down Expand Up @@ -48,9 +56,11 @@ adcibc %>%

**The {cardx} package** is an extension of the {cards} package, providing additional functions to create Analysis Results Data Objects (ARDs)^1^. It was developed as part of {NEST} and pharmaverse. This package requires the binary endpoint to be a logical (TRUE/FALSE) vector or a numeric/integer coded as (0, 1) with 1 (TRUE) being the success you want to calculate the confidence interval for.

See [here](R:%20Functions%20for%20Calculating%20Proportion%20Confidence%20Intervals) for full description of the {cardx} proportions equations.

If calculating the CI for a difference in proportions, the package requires both the response and the treatment variable to be numeric/integer coded as (0, 1) (or logical vector).

Instead of the code presented below, you can use `ard_categorical_ci(data, variables=resp, method ='wilson')` for example. This invokes the code below but returns an analysis results dataset format as the output. Methods included are waldcc, wald, clopper-pearson, wilson, wilsoncc, strat_wilson, strat_wilsoncc, agresti-coull and jeffreys.
Instead of the code presented below, you can use, for example, `ard_categorical_ci(data, variables=resp, method ='wilson')`. This invokes the code below but returns the output in analysis results dataset (ARD) format. Methods included are waldcc, wald, clopper-pearson, wilson, wilsoncc, strat_wilson, strat_wilsoncc, agresti-coull and jeffreys for one-sample proportions. The package also has methods for 2 independent samples; however, it currently does not have a method for 2 matched proportions.

Code example: `proportion_ci_clopper_pearson(<resp_var>,conf.level=0.95) %>% as_tibble()`

Expand Down Expand Up @@ -83,11 +93,11 @@ Code example for Clopper-pearson:\
1) x = 0 (0% responders), in which case the lower limit does not match.
2) x = n (100% responders), in which case the upper limit does not match.

Because of the relationship between the binomial distirbution and the beta distribution. This package uses quantiles of the beta distribution to derive exact confidence intervals.
Because of the relationship between the binomial distribution and the beta distribution, this package uses quantiles of the beta distribution to derive exact confidence intervals.

$$ B(\alpha/2;x, n-x+1) < p < B(1-\alpha/2; x+1, n-x)$$

RBesT equations are: \
RBesT equations are:\
pLow \<- qbeta(Low, r + (r == 0), n - r + 1)\
pHigh \<- qbeta(High, r + 1, n - r + ((n - r) == 0))

Expand All @@ -99,6 +109,8 @@ pHigh \<- qbeta(High, r + 1, n - r)

It is currently unclear why the RBesT script has the logical conditions (r==0) and ((n-r)==0). Therefore, we currently do not recommend using this package and suggest that cardx or Hmisc is used instead.
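The beta-quantile form of the exact interval given above can be sketched in base R as follows. This is only an illustration under stated assumptions: `cp_ci()` is a hypothetical helper (not part of any package discussed here) and the counts are made up. It also evaluates the quoted RBesT lower-limit expression at r = 0 to show how it differs from the exact limit in that boundary case.

```r
# Exact (Clopper-Pearson) interval from beta quantiles, base R only.
# cp_ci() is a hypothetical helper written for this illustration.
cp_ci <- function(x, n, conf.level = 0.95) {
  alpha <- 1 - conf.level
  # Boundary cases: the exact lower limit is 0 when x = 0, the upper is 1 when x = n
  lower <- if (x == 0) 0 else qbeta(alpha / 2, x, n - x + 1)
  upper <- if (x == n) 1 else qbeta(1 - alpha / 2, x + 1, n - x)
  c(lower = lower, upper = upper)
}

x <- 36; n <- 154            # illustrative counts (assumed, not from the source)
cp_ci(x, n)
binom.test(x, n)$conf.int    # base R exact interval, for comparison

# The quoted RBesT lower-limit expression evaluated at r = 0:
r <- 0; n0 <- 10
rbest_low <- qbeta(0.025, r + (r == 0), n0 - r + 1)
rbest_low                    # strictly above 0
cp_ci(r, n0)                 # exact lower limit is 0
```

For 0 < x < n the helper and `binom.test()` agree; the difference at r = 0 is consistent with the note above that the lower limit does not match when there are 0% responders.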

**The {ExactCIdiff} package** produces CIs for two dependent proportions (matched pairs) and two independent proportions (unmatched pairs).

## Methods for Calculating Confidence Intervals for a single proportion using cardx

For more technical derivation and reasons for use of each of the methods listed below, see the corresponding [SAS page](https://psiaims.github.io/CAMIS/SAS/ci_for_prop.html).
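As a cross-check on any of the single-proportion methods, the interval can be reproduced by hand. The sketch below does this for the Wilson (score) interval without continuity correction, using base R only; `wilson_ci()` is a hypothetical helper and the counts are illustrative assumptions, not values taken from this page.

```r
# Wilson (score) interval for a single proportion, computed by hand.
# wilson_ci() is a hypothetical helper written for this illustration.
wilson_ci <- function(x, n, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)
  p <- x / n
  denom <- 1 + z^2 / n
  centre <- (p + z^2 / (2 * n)) / denom
  half <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / denom
  c(lower = centre - half, upper = centre + half)
}

x <- 36; n <- 154   # illustrative counts (assumed)
wilson_ci(x, n)

# stats::prop.test() with correct = FALSE returns the same score interval
# for a single proportion
prop.test(x, n, correct = FALSE)$conf.int
```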
Expand Down Expand Up @@ -199,11 +211,59 @@ proportion_ci_jeffreys(act2,conf.level=0.95) %>%
```
```

## Methods for Calculating Confidence Intervals for a matched pair proportion
## Methods for Calculating Confidence Intervals for a matched pair proportion using {ExactCIdiff}

For more information about the detailed methods for calculating confidence intervals for a matched pair proportion see [here](https://psiaims.github.io/CAMIS/SAS/ci_for_prop.html#methods-for-calculating-confidence-intervals-for-a-matched-pair-proportion). When you have 2 measurements on the same subject, the 2 sets of measures are not independent and you have a matched pair of responses.

To date we have not found an R package which calculates a CI for matched pair proportions using the normal approximation or Wilson methods although they can be done by hand using the equations provided on the SAS page link above.

**The {ExactCIdiff} package** produces exact CIs for two dependent proportions (matched pairs), claiming to be the first package in R to do this method. However, it should only be used when the sample size is not too large as it can be computationally intensive.\
NOTE that the {ExactNumCI} package should not be used for this task. More detail on these two packages can be found [here](RJ-2013-026.pdf).

Using a cross over study as our example, a 2 x 2 table can be formed as follows:

+-----------------------+---------------+---------------+--------------+
| | Placebo\ | Placebo\ | Total |
| | Response= Yes | Response = No | |
+=======================+===============+===============+==============+
| Active Response = Yes | r | s | r+s |
+-----------------------+---------------+---------------+--------------+
| Active Response = No | t | u | t+u |
+-----------------------+---------------+---------------+--------------+
| Total | r+t | s+u | N = r+s+t+u |
+-----------------------+---------------+---------------+--------------+

: The proportions of subjects responding on each treatment are:

Active: $\hat p_1 = (r+s)/N$ and Placebo: $\hat p_2= (r+t)/N$

When you have 2 measurements on the same subject, the 2 sets of measures are not independent and you have matched pair of responses.
The difference between the proportions for each treatment is: $D=\hat p_1-\hat p_2=(s-t)/N$

This section is work in progress. For more information about methods for calculating confidence intervals for a matched pair proportion see [here](https://psiaims.github.io/CAMIS/SAS/ci_for_prop.html#methods-for-calculating-confidence-intervals-for-a-matched-pair-proportion)
Suppose :

+-----------------------+---------------+---------------+------------------+
| | Placebo\ | Placebo\ | Total |
| | Response= Yes | Response = No | |
+=======================+===============+===============+==================+
| Active Response = Yes | r = 20 | s = 15 | r+s = 35 |
+-----------------------+---------------+---------------+------------------+
| Active Response = No | t = 6 | u = 5 | t+u = 11 |
+-----------------------+---------------+---------------+------------------+
| Total | r+t = 26 | s+u = 20 | N = r+s+t+u = 46 |
+-----------------------+---------------+---------------+------------------+

Active: $\hat p_1 = (r+s)/N = 35/46 = 0.761$ and Placebo: $\hat p_2 = (r+t)/N = 26/46 = 0.565$

The difference is 0.761 - 0.565 = 0.196. The `PairedCI()` function then provides an exact confidence interval of (-0.00339 to 0.38065), as shown below.

```{r}
#| eval: FALSE
# ExactCIdiff: PairedCI(s, r+u, t, conf.level = 0.95),
# using the cell labels from the 2 x 2 table above
library(ExactCIdiff)
CI <- PairedCI(15, 25, 6, conf.level = 0.95)$ExactCI
CI
```
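The arithmetic in the worked example above can be checked in a few lines of base R. The sketch below also computes a normal approximation (Wald) interval for the paired difference by hand, assuming the usual standard error for a difference of matched proportions, $\sqrt{(s+t) - (s-t)^2/N}\,/N$; treat that formula as an assumption to be checked against the SAS page rather than as output of any package named here. The object `t_cell` is used for the cell labelled t to avoid masking base R's `t()` function.

```r
# Cell counts from the 2 x 2 table above
r <- 20; s <- 15; t_cell <- 6; u <- 5
N <- r + s + t_cell + u

p1 <- (r + s) / N        # proportion responding on Active
p2 <- (r + t_cell) / N   # proportion responding on Placebo
D  <- p1 - p2            # equals (s - t_cell) / N

# Wald interval for the paired difference (assumed standard formula)
se_D <- sqrt((s + t_cell) - (s - t_cell)^2 / N) / N
wald_ci <- D + c(-1, 1) * qnorm(0.975) * se_D

round(c(p1 = p1, p2 = p2, D = D), 3)
wald_ci
```

Only the discordant cells (s and t) drive the difference and its standard error, which is why `PairedCI()` takes s, r+u and t as its first three arguments.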

## Methods for Calculating Confidence Intervals for 2 independent samples proportion using {cardx}

Expand All @@ -215,23 +275,37 @@ For more technical information see the corresponding [SAS page](https://psiaims.

#### Example code

`cardx::ard_stats_prop_test function` uses `stats::prop.test` which also allows a continuity correction to be applied. More research is needed into this method.
The `cardx::ard_stats_prop_test` function uses `stats::prop.test`, which also allows a continuity correction to be applied.

Although this website [here](https://rdrr.io/r/stats/prop.test.html) and this one [here](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/prop.test) both reference Newcombe for the CI that this function uses, replication of the results by hand and compared to SAS show that the results below match the Normal Approximation (Wald method) not the Newcome method? Further research is needed into this topic.
Although the documentation [here](https://rdrr.io/r/stats/prop.test.html) and [here](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/prop.test) both reference Newcombe for the CI that this function uses, replicating the results by hand and comparing them to SAS shows that the results below match the Normal Approximation (Wald method).

Both the Treatment variable (ACT,PBO) and the Response variable (Yes,No) have to be numeric (0,1) or logical (TRUE,FALSE) variables.

```{R}
adcibc2<-select(adcibc,trtn,respn)
cardx::ard_stats_prop_test(data=adcibc2, by=trtn, variables=respn, conf.level = 0.95, correct=FALSE)
cardx::ard_stats_prop_test(data=adcibc2, by=trtn, variables=respn, conf.level = 0.95, correct=TRUE)
With 2 groups, the `prop.test` default is to test the null hypothesis that the proportions in each group are the same, and to produce a 2-sided CI.

```{r}

indat1<- adcibc2 %>%
select(AVAL,TRTP) %>%
mutate(resp=if_else(AVAL>4,"Yes","No")) %>%
mutate(respn=if_else(AVAL>4,1,0)) %>%
mutate(trt=if_else(TRTP=="Placebo","PBO","ACT"))%>%
mutate(trtn=if_else(TRTP=="Placebo",1,0)) %>%
select(trt,trtn,resp, respn)

# cardx package required a vector with 0 and 1s for a single proportion CI
# To get the comparison the correct way around Placebo must be 1, and Active 0

indat<- select(indat1, trtn,respn)
cardx::ard_stats_prop_test(data=indat, by=trtn, variables=respn, conf.level = 0.95, correct=FALSE)
cardx::ard_stats_prop_test(data=indat, by=trtn, variables=respn, conf.level = 0.95, correct=TRUE)
```
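The statement that the two-group `prop.test` interval matches the Normal Approximation (Wald) method can be illustrated with base R alone. The counts below are hypothetical and are not taken from `adcibc`; the sketch simply computes the Wald interval by hand and sets it next to the `stats::prop.test` result without continuity correction.

```r
# Hypothetical responder counts and group sizes (assumed for illustration)
x <- c(36, 12)    # responders in group 1 and group 2
n <- c(154, 77)   # subjects in group 1 and group 2

p <- x / n
d <- p[1] - p[2]

# Wald (normal approximation) interval for the difference in proportions
se <- sqrt(sum(p * (1 - p) / n))
wald_ci <- d + c(-1, 1) * qnorm(0.975) * se
wald_ci

# stats::prop.test() interval for the same data, no continuity correction
pt_ci <- as.numeric(prop.test(x, n, conf.level = 0.95, correct = FALSE)$conf.int)
pt_ci
```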

### Wilson Method (Also known as the Score method or the Altman, Newcombe method^3^ )

For more technical information see the corresponding [SAS page](https://psiaims.github.io/CAMIS/SAS/ci_for_prop.html).

Further research is needed into this topic.
We have not yet found a package in R which produces these confidence intervals for 2 independent proportions.
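In the absence of a package, the interval can be computed by hand. The following is a minimal sketch, assuming the method intended here is Newcombe's hybrid score interval, which combines the single-sample Wilson limits of each group; the counts are hypothetical, and the result should be verified against the equations on the SAS page before being relied upon.

```r
# Hypothetical responder counts and group sizes (assumed for illustration)
x <- c(36, 12)
n <- c(154, 77)
p <- x / n
d <- p[1] - p[2]

# Single-sample Wilson limits for each group (prop.test without correction)
w1 <- as.numeric(prop.test(x[1], n[1], correct = FALSE)$conf.int)
w2 <- as.numeric(prop.test(x[2], n[2], correct = FALSE)$conf.int)

# Newcombe hybrid score limits for the difference p1 - p2
lower <- d - sqrt((p[1] - w1[1])^2 + (w2[2] - p[2])^2)
upper <- d + sqrt((w1[2] - p[1])^2 + (p[2] - w2[1])^2)
c(lower = lower, upper = upper)
```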

## References

Expand Down
16 changes: 11 additions & 5 deletions R/logistic_regr.qmd
@@ -1,5 +1,5 @@
---
title: "Logistic Regression"
title: "Logistic Regression in R"
---

```{r}
Expand All @@ -8,7 +8,7 @@ title: "Logistic Regression"
library(tidyverse)
```

In binary logistic regression, there is a single binary dependent variable, coded by an indicator variable. For example, if we respresent a response as 1 and non-response as 0, then the corresponding probability of response, can be between 0 (certainly not a response) and 1 (certainly a response) - hence the labeling !
In binary logistic regression, there is a single binary dependent variable, coded by an indicator variable. For example, if we represent a response as 1 and non-response as 0, then the corresponding probability of response can be between 0 (certainly not a response) and 1 (certainly a response), hence the labeling!

The logistic model models the log-odds of an event as a linear combination of one or more independent variables (explanatory variables). If we observed $(y_i, x_i),$ where $y_i$ is a Bernoulli variable and $x_i$ a vector of explanatory variables, the model for $\pi_i = P(y_i=1)$ is

Expand All @@ -31,14 +31,18 @@ glimpse(lung)

# Model Fit

We analyze the weight loss in lung cancer patients in dependency of age, sex, ECOG performance score and calories consumed at meals.
We analyze the event of weight gain (or staying the same weight) in lung cancer patients as a function of age, sex, ECOG performance score and calories consumed at meals. In the original data, a positive value of the `wt.loss` variable is a weight loss and a negative value is a gain. We start by dichotomising the response, such that a result \>0 is a weight loss and a result \<=0 is a weight gain (or no change), and creating a factor variable `wt_grp`.

One of the most important things to remember is to tell R what your event is! We want to model Events / Non-events, and hence the reference category for the dichotomous variable `wt_grp` below is the weight loss level. By telling R that your reference category is weight loss, you are effectively telling R that your Event = Weight Gain!
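How `glm()` decides which level is the event can be seen on a toy example. With a factor response and `family = binomial`, the first factor level is treated as the non-event and the remaining level as the event. The data below are hypothetical and unrelated to the `lung` dataset.

```r
# Toy data (hypothetical): 3 losses and 1 gain
y <- factor(c("loss", "loss", "loss", "gain"))
levels(y)   # alphabetical order, so "gain" is the reference (non-event)

# Intercept-only model: the fitted probability is for the second level ("loss")
fit1 <- glm(y ~ 1, family = binomial)
p_event1 <- unname(plogis(coef(fit1)))
p_event1

# Make "loss" the reference level, so the modelled event becomes "gain"
y2 <- relevel(y, ref = "loss")
fit2 <- glm(y2 ~ 1, family = binomial)
p_event2 <- unname(plogis(coef(fit2)))
p_event2
```

Checking `levels()` before fitting is a cheap way to confirm which outcome the coefficients and odds ratios refer to.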

```{r}
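lung2 <- survival::lung %>% 
  mutate(
    # wt.loss > 0 (TRUE) is a weight loss; list TRUE first so the labels line up
    wt_grp = factor(wt.loss > 0, levels = c(TRUE, FALSE), labels = c("weight loss", "weight gain"))
  )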

#specify that weight loss should be used as baseline level (i.e we want to model weight gain as the event)
lung2$wt_grp <- relevel(lung2$wt_grp, ref='weight loss')

m1 <- glm(wt_grp ~ age + sex + ph.ecog + meal.cal, data = lung2, family = binomial(link="logit"))
summary(m1)
Expand All @@ -53,15 +57,17 @@ exp(confint(m1))

# Model Comparison

To compare two logistic models, one tests the difference in residual variances from both models using a $\chi^2$-distribution with a single degree of freedom (here at the $5$% level):
To compare two nested logistic models, the difference in their residual deviances (-2 \* log likelihood) is compared against a $\chi^2$-distribution with degrees of freedom equal to the difference in the number of parameters between the two models. Below, the only difference is the inclusion/exclusion of age in the model, hence we test using $\chi^2$ with 1 df, here at the $5$% level.

```{r}
m2 <- glm(wt_grp ~ sex + ph.ecog + meal.cal, data = lung2, family = binomial(link="logit"))
summary(m2)

anova(m1, m2, test = "Chisq")
anova(m1, m2, test = "LRT")
```

Stack Exchange [here](https://stats.stackexchange.com/questions/59879/logistic-regression-anova-chi-square-test-vs-significance-of-coefficients-ano) has a good discussion of this method and of the difference between comparing 2 models using likelihood ratio tests versus using Wald tests and Pr\>chisq (from the maximum likelihood estimates). Note: `anova(m1, m2, test = "Chisq")` and `anova(m1, m2, test = "LRT")` as above are synonymous in this context.
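The likelihood ratio test can also be reproduced by hand from the two residual deviances. The sketch below uses the built-in `mtcars` data purely as a stand-in (an assumption made so the example is self-contained; it is not the `lung2` model above) and compares the manual p-value with the one reported by `anova()`.

```r
# Two nested logistic models on a built-in dataset (illustrative stand-in)
m_full    <- glm(am ~ wt + hp, data = mtcars, family = binomial(link = "logit"))
m_reduced <- glm(am ~ wt,      data = mtcars, family = binomial(link = "logit"))

# Likelihood ratio statistic = difference in residual deviances
lr_stat <- deviance(m_reduced) - deviance(m_full)
df_diff <- df.residual(m_reduced) - df.residual(m_full)   # 1 extra parameter
p_manual <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)

# Same test via anova(); "LRT" and "Chisq" give the same p-value here
p_lrt   <- anova(m_reduced, m_full, test = "LRT")$`Pr(>Chi)`[2]
p_chisq <- anova(m_reduced, m_full, test = "Chisq")$`Pr(>Chi)`[2]

c(manual = p_manual, LRT = p_lrt, Chisq = p_chisq)
```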

# Prediction

Predictions from the model for the log-odds of a patient with new data to experience the modelled event (weight gain) are derived using `predict()`:
Expand Down