Enhanced Inference for Finite Population Sampling-Based Prevalence Estimation with Misclassification Errors
This page aims to illustrate calculations and simulation studies for the manuscript “Enhanced Inference for Finite Population Sampling-Based Prevalence Estimation with Misclassification Errors”.
All functions are available in this github repository and can be loaded as follows.
source("FUN_RS_misclassification.R")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: viridisLite
source("simulation_main.R")
Here is a simple example to generate data for individual level imperfect test data based on given Sensitivity () and Specificity () parameters. Firstly we need to specify the total population size () and the true disease prevalence (), e.g., we set and . WLOG, we can let the first 100 () people to be diseased and the remains are health, i.e.,
N = 1000
p_case = 0.1
N_true = c(rep(1,round(N*p_case)),rep(0,round(N*(1-p_case))))
Then we could generate the imperfect testing data with the sampling rate 20% and 90% 95% by the function below.
p.RS = 0.2;Se = 0.9;Sp = 0.95
simu.RS = simu_sym_RS(N,N_true,p.RS,Se,Sp)
test.RS = simu.RS$test
testpos.RS = simu.RS$testpos
n.RS = sum(test.RS)
n_pos.RS = max(sum(testpos.RS),0.001)
c(n.RS, n_pos.RS)
## [1] 200 24
Here we generate a imperfect testing data set with size 200 and among this data set, 24 people are observed positive test results.
Next we could calculate the estimate of test positive frequency ( in paper), the bias-corrected disease prevalence ( in paper),
pi_hat = n_pos.RS/n.RS
pi_hat_c = max((pi_hat+Sp-1)/(Se+Sp-1),0)
c(pi_hat, pi_hat_c)
## [1] 0.12000000 0.08235294
as well as three types of standard errors of introduced in the paper:
# V1(pi) in eqn.(1)
V1_pi = pi_hat*(1-pi_hat)/n.RS
# V2(pi) in eqn.(2)
fpc2 = n.RS*(N-n.RS)/(N*(n.RS-1))
V2_pi = V1_pi*fpc2
# V3(pi) in eqn.(12)
V_pi_extra = (pi_hat_c*Se*(1-Se) + (1-pi_hat_c)*Sp*(1-Sp))/N
V3_pi = V2_pi+V_pi_extra
sqrt(c(V1_pi, V2_pi, V3_pi))
## [1] 0.02297825 0.02060395 0.02180648
By inserting eqn.(1), (2), (12) into eqn. (6), we could obtain three types of standard errors of as well.
se.vect = 1/(Se+Sp-1)*sqrt(c(V1_pi, V2_pi, V3_pi))
se.vect
## [1] 0.02703324 0.02423994 0.02565468
Then we can output the wald-type 95% CIs as well as their width based on three standard erros as follows.
# SE w/o fpc
c(pi_hat_c-1.96*se.vect[1],pi_hat_c+1.96*se.vect[1], 2*1.96*se.vect[1])
## [1] 0.0293678 0.1353381 0.1059703
# SE w/ fpc
c(pi_hat_c-1.96*se.vect[2],pi_hat_c+1.96*se.vect[2], 2*1.96*se.vect[2])
## [1] 0.03484266 0.12986322 0.09502055
# SE new
c(pi_hat_c-1.96*se.vect[3],pi_hat_c+1.96*se.vect[3], 2*1.96*se.vect[3])
## [1] 0.03206976 0.13263612 0.10056636
In addition to the wald type confidence interval calculated based on the standard errors estimated above, we introduced a Bayesian credible interval approach to improve the coverage performance below.
BC_interval = RS_BC(N,n_pos.RS,n.RS,Se,Sp)
c(BC_interval$BC_lower, BC_interval$BC_upper, BC_interval$BC_width)
## [1] 0.03806355 0.13805190 0.09998835
To reproduce the simulation results in table 1-3, we can loop for all testing scenario to calculate all numbers we need.
N = 500
Se = 0.9
Sp = 0.95
tmp = NULL
for(p_case in c(seq(0.1,0.5,0.2))){
for(p_stream2 in seq(0.1,0.5,0.2)){
#print(c(p_case,p_stream2))
res = find_p_p2(N, p_case,p_stream2,Se,Sp)
tmp = rbind(tmp,t(c(p_case,p_stream2,res$res.true,res$res.RS[1:5],res$res.BC_width,
res$res.BC_pct)))
}
}
colnames(tmp) = c('p_case','p_stream2','N_true','RS.mean','RS.sd','RS.avgse','RS.width','RS.CIpct',
'RS.BC_width','RS.BC_CIpct')
round(tmp[,],3)
## p_case p_stream2 N_true RS.mean RS.sd RS.avgse RS.width RS.CIpct
## [1,] 0.1 0.1 0.1 0.098 0.053 0.054 0.213 91.44
## [2,] 0.1 0.3 0.1 0.100 0.030 0.030 0.117 94.56
## [3,] 0.1 0.5 0.1 0.100 0.022 0.022 0.085 94.42
## [4,] 0.3 0.1 0.3 0.300 0.073 0.074 0.289 95.14
## [5,] 0.3 0.3 0.3 0.300 0.040 0.039 0.154 95.22
## [6,] 0.3 0.5 0.3 0.300 0.027 0.027 0.108 94.76
## [7,] 0.5 0.1 0.5 0.502 0.080 0.080 0.314 94.38
## [8,] 0.5 0.3 0.5 0.500 0.042 0.042 0.166 94.82
## [9,] 0.5 0.5 0.5 0.500 0.029 0.030 0.116 95.48
## RS.BC_width RS.BC_CIpct
## [1,] 0.204 95.20
## [2,] 0.116 94.74
## [3,] 0.084 94.64
## [4,] 0.281 95.24
## [5,] 0.152 95.08
## [6,] 0.107 94.56
## [7,] 0.305 94.90
## [8,] 0.164 94.88
## [9,] 0.115 94.94
Here we prepare the values to plot the figure 1.
se_rst = NULL
N_tmp = seq(120,2000,10)
Se = 0.9
Sp = 0.95
# observed imperfect testing sample
n.RS=100
p_case = 0.2
for(N_i in N_tmp){
tmp = RS_SE_compare_main(N=N_i, p_case=p_case, p.RS=n.RS/N_i, Se=Se, Sp=Sp)
se_rst = rbind(se_rst,tmp)
}
plot_data = data.frame(N = N_tmp, MLE_SE_new = se_rst[,4], MLE_SE_fpc = se_rst[,3], MLE_SE = se_rst[,2],
MLE_SE_true = se_rst[,1])
plot_data_long = plot_data %>% gather(SE_type, value, -c(N))
Finally we could draw the figure through ggplot function.
My_Theme = theme(
axis.text.x = element_text(size = 16),
axis.title.x = element_text(size = 20),
axis.text.y = element_text(size = 16),
axis.title.y = element_text(size = 20),
legend.position = c(0.8, 0.2),
legend.title=element_text(size=16),legend.text=element_text(size=12))
p<-ggplot(plot_data_long, aes(x=N, y=value, color = SE_type)) +
geom_line(aes(linetype=SE_type),linewidth = 1)+
scale_color_manual(values = viridis::viridis(6)[1:4])+
labs(x = "Population Size N", y = expression('SE('*hat(pi)[c]*')'))
p+theme_light()+My_Theme