-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathPropensity_Score.qmd
213 lines (146 loc) · 9.12 KB
/
Propensity_Score.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
---
title: "Propensity Score"
format: html
---
## Propensity Score
> 参考自一篇公众号:https://mp.weixin.qq.com/s/nfHSlr1TZrPzU8U8wu76aQ
> https://statsnotebook.io/blog/analysis/matching/
> https://sejdemyr.github.io/r-tutorials/statistics/
倾向性评分(Propensity Score, PS)是一种控制混杂因素的统计学方法,通过倾向性评分的方法,可以把基线控制在可比的水平,这样就可以比较处理因素带来的差异了。
比如,现在要比较A,B两种方法治疗肥胖的效果,随机分两组,分别使用A,B两种方法半年,看哪个方法效果好。一般收集到的病人性别、年龄、身高、体重、家族史、生活习惯、收入水平、生活地区等都是不一样的,这些混杂因素不控制,你很难说清楚到底是两种方法的效果还是混杂因素导致的。
利用倾向性评分就可以控制这些混杂,但是控制混杂因素的方法非常多,不要拘泥于此。对于类似上面这种情况,你还可以用协方差分析、多因素分析(统计学中的3大回归!线性回归、逻辑回归、Cox回归)、分层分析等,也可以起到控制混杂因素的效果,这些在之前的推文中全都有涉及,大家可以自行查看。
倾向性评分最大的*优势是将多个混杂因素的影响用一个综合的值来表示*,即倾向性评分值(Propensity Score, PS),从而降低协变量的维度,因此该方法尤其适用于协变量较多的情况。
倾向性评分的一般步骤为:
估计 PS 值;
利用 PS 值均衡协变量分布;
均衡性检验及模型评价;
处理效应估计。
其中,*PS 值的估计是以处理因素作为因变量,其他混杂因素作为自变量*,通过建立一个模型(可以是传统的回归模型,也可以是机器学习方法)来估计每个研究对象接受处理因素的可能性。
目前用于估计 PS 值的方法有 logistic 回归,Probit 回归、神经网络、支持向量机、分类与回归数、Boosting 算法、SuperLearner 等。其中 logistic 回归是目前最常用的方法。
倾向性评分是一个分数(P值),自己并没有均衡协变量(混杂因素)的能力,利用 PS 值均衡组间协变量分布的方法有匹配(matching)、分层(stratification)、协变量调整(covariate adjustment)和加权(weighting)等。其中协变量调整又可以称为倾向性评分回归、倾向性评分矫正等。
```{r, include=FALSE}
library(tidyverse)
library(MatchIt)
library(lmtest)
library(sandwich)
library(cobalt)
```
### matching 匹配
After matching, the treatment group and the control group should have very similar characteristics. A simple regression model can be used to estimate the treatment effect on the outcome. Cluster-robust standard errors are required for correct inference.
Propensity score matching analysis involves two steps.
1. Match each smoker to a non-smoker based on propensity score, which is calculated based on a range of covariates.
2. Check if balance between smokers (treatment/exposure group) and non-smokers (control group) is achieved (i.e., both groups having similar characteristics). If balance is achieved, a simple regression analysis with cluster-robust standard error can be used to estimate the effect of smoking on psychological distress.
```{r}
currentDataset <- read_csv("./datasets/smoking_psyc_distress.csv")
currentDataset$remoteness <- factor(currentDataset$remoteness, exclude = c("", NA))
currentDataset |> head()
# Sex (0: Female; 1: Male)
# indigenous - Ingigenous status (0: Non-indigenous; 1: indigenous)
# high_school - Education level (0: not finished high school; 1: finished high school)
# partnered - Marital status (0: not partnered; 1: partnered)
# remoteness - region of residence (0: major cities; 1: inner regional; 2: outer regional)
# language - Language background (0: non-English speaking; 1: English speaking)
# Smoker - Smoking status (0: non-smoker; 1: smoker)
# risky_alcohol - Risky alcohol use (0: not risky; 1: risky)
# psyc_distress - Psychological distress. Measure ranges from 10 to 50.
# age - Age of the participants
```
原始数据基线资料表,首先检查一下入组数据的均衡与否,示例数据中sex、age为混杂因素,目的在于探讨smoke与否是否影响psyc_distress.
从下表中可以看到,
```{r}
library(tableone)
table2 <- CreateTableOne(vars = c('age', 'sex', 'psyc_distress'),
data = currentDataset,
factorVars = c('sex'),
strata = 'smoker',
smd=TRUE)
table2 <- print(table2,smd=TRUE,
showAllLevels = TRUE,
noSpaces = TRUE,
printToggle = FALSE)
table2
```
模型中因变量为研究中主要关注的处理因素,
```{r}
# 这里为了方便演示直接删掉了缺失值
data.complete <- na.omit(currentDataset)
# 因变量是处理因素,自变量是需要平衡的协变量
# distance参数为用logit方法计算PS值,based on纳入的变量
# method参数,match方法用的nearest
# The ratio parameter specifies that one to one matching is used (ratio = 1). A higher ratio will increase the number of samples from the control group that will be included. Therefore, less observations will be discarded. However a higher matching ratio by definition will produce worse matches.
match_obj <- matchit(smoker~age+sex+remoteness,
data = data.complete,
method = "nearest",
replace = FALSE,
distance = "logit" # 选择logistic回归 or glm
)
match_obj
```
```{r}
summary(match_obj)
summary(match_obj, standardize = TRUE)
```
*interpret:*`Std. Mean Diff.`大于0.1认为为substantial difference,`All Data`为smoker与否两组的所有观察,`Matched Data`为match之后的数据,`Std. Mean Diff.`的值在match之后都接近于0,表明a good balance is achieved.
sample size部分为,原始数据中样本的分布,以及match的样本数量,其中没有match的6052 non-smoker在后续的分析中可以discarded。Applied researchers sometimes are concerned about discarding a large number of unmatched participants. However, this is usually not an issue if all treated participants are matched, and good balance is archived.
关于这个倾向性评分匹配后数据的平衡性检验,文献中比较推荐使用SMD和VR(variance ratio),SMD<0.25说明均衡了,
VR>2.0或者VR<0.5说明很不均衡(越接近1越均衡)!
```{r}
#plotting the balance between smokers and non-smokers
plot(match_obj, type = "jitter", interactive = FALSE)
plot(summary(match_obj), abs = FALSE)
```
*interpret: *第一图为,PS值在四个分组中的分布,Propensity score matching will not be appropriate if there is not a satisfactory overlap in the propensity score distribution between the matched treated group and the matched untreated group.
第二幅图,match前后,`Std. Mean Diff.`的分布。可以看到match之后的黑点在0附近。
可通过以下方法获得算法估计的PS值:
```{r}
eps <- match_obj$distance
length(eps)
```
参考文档中计算了实际PS值(tps),所以我们可以画一个tps和估计ps的散点图,以tps为横坐标,以eps为纵坐标:
```{r, eval=FALSE}
# 摘抄自文档,此处仅作记录
# 去掉缺失值
tps.comp <- tps[complete.cases(data)]
Smoke.comp <- as.factor(Smoke[complete.cases(data)])
df <- data.frame(True = tps.comp, Estimated = eps, Smoke = Smoke.comp)
ggplot(df, aes(x = True, y = Estimated, colour = Smoke)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, colour = "#990000", linetype = "dashed") +
expand_limits(x = c(0, 1), y = c(0, 1))
# 对x.Age平方
m.out <- matchit(Smoke ~ I(x.Age^2) + x.Age + x.Gender,
data = data.complete
)
eps <- m.out$distance
tps.comp <- tps[complete.cases(data)]
Smoke.comp <- as.factor(Smoke[complete.cases(data)])
df <- data.frame(True = tps.comp, Estimated = eps, Smoke = Smoke.comp)
ggplot(df, aes(x = True, y = Estimated, colour = Smoke)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, colour = "#990000", linetype = "dashed") +
expand_limits(x = c(0, 1), y = c(0, 1))
```
#### 匹配后数据的平衡性检验
除了上述依据STM等值进行平衡性好坏与否的判断外,还可以利用`cobalt`包进行检验。
```{r}
cobalt::bal.tab(match_obj, m.threshold = 0.1, un = TRUE)
```
#### 匹配后的数据的后续分析
首先提取match后的数据,
```{r}
matched_data <- match.data(match_obj)
matched_data
```
However, for correct inference, *cluster-robust standard error* is needed.
```{r}
#Run regression model with psychological distress as the outcome, and smoker as the only predictor
#We need to specify the weights - Matched participants have a weight of 1, unmatched participants
res <- lm(psyc_distress ~ smoker, data = matched_data, weights = weights)
#Test the coefficient using cluster robust standard error
coeftest(res, vcov. = vcovCL, cluster = ~subclass)
#Calculate the confidence intervals based on cluster robust standard error
coefci(res, vcov. = vcovCL, cluster = ~subclass, level = 0.95)
```
### stratification
```{r}
```