dgirt is an R package for dynamic group-level IRT models, as developed in Caughey and Warshaw 2014:
Over the past eight decades, millions of people have been surveyed on their political opinions. Until recently, however, polls rarely included enough questions in a given domain to apply scaling techniques such as IRT models at the individual level, preventing scholars from taking full advantage of historical survey data. To address this problem, we develop a Bayesian group-level IRT approach that models latent traits at the level of demographic and/or geographic groups rather than individuals. We use a hierarchical model to borrow strength cross-sectionally and dynamic linear models to do so across time. The group-level estimates can be weighted to generate estimates for geographic units. This framework opens up vast new areas of research on historical public opinion, especially at the subnational level.
devtools::install_github("jamesdunham/dgirt")
Get updates by reinstalling. dgirt is in early stages and under development. See NEWS, last updated 2016-01-28.
wrangle
prepares datadgirt
fits modelspoststratify
reweights estimates
state_opinion
is a dataset included with dgirt
in which rows correspond to survey responses from individuals.
library(dgirt)
data(state_opinion)
Data formatted in this way need to be restructured with the wrangle
function. We'll pass state_opinion
to wrangle
's data
argument, which takes a list of data to be used in modeling. Our table of survey responses will be an element named level1
, for the lowest hierarchical level in the model. (Note: at the moment, there are limitations on model specifications. Level-one data is required and a second hierarchical level is optional.)
We'll use the vars
argument to identify variables of importance in data
(e.g. which represent item responses). vars
is a list of named character vectors with (at least) these elements:
items
: Names of item response variables indata$level1
.groups
: Names of respondent characteristic variables indata$level1
. (Note: at this time,wrangle
requires that the user exclude the geographic indicator fromgroups
and name it instead ingeo_id
. Modeling any group predictor is coming.)time_id
: Name of time period variable indata$level1
.geo_id
: Name of geographic identifier variable indata$level1
.survey_id
: Name of survey identifier variable indata$level1
.survey_weight
: Name of weight variable indata$level1
.
The names of the item response variables start with "Q_", so we'll pass them using grep
.
state_opinion_fmt = wrangle(
data = list(level1 = state_opinion),
vars = list(items = grep("^Q_", colnames(state_opinion), value = TRUE),
groups = c("race"),
time_id = "year",
geo_id = "state",
survey_id = "source",
survey_weight = "weight"),
filters = list(periods = c(2006:2010)))
This output omits verbose messages. wrangle
returns a list of objects that dgirt
expects as its first argument. We'll also set its n_iter
and n_chain
arguments to minimize its run time, but otherwise rely on the defaults.
dgirt()
calls rstan()
, which reports any problems it encounters when compiling the model and sampling. Reporting is verbose and not all messages indicate problems. If sampling is successful, dgirt()
returns an object of class stanfit
. (See rstan documentation.)
A short trial run is often a good idea.
dgirt_estimates = dgirt(state_opinion_fmt, n_iter = 3, n_chain = 1)
#>
#> SAMPLING FOR MODEL '84f408a951e786210b324c1e54eefa8c' NOW (CHAIN 1).
#>
#> Chain 1, Iteration: 1 / 3 [ 33%] (Warmup)
#> Chain 1, Iteration: 2 / 3 [ 66%] (Warmup)
#> Chain 1, Iteration: 3 / 3 [100%] (Sampling)#
#> # Elapsed Time: 0.123141 seconds (Warm-up)
#> # 0.013794 seconds (Sampling)
#> # 0.136935 seconds (Total)
#> #
We omit verbose messages here. Now, a longer run:
# Not run
dgirt_estimates = dgirt(state_opinion_fmt, n_iter = 2000, n_chain = 4)
To examine the the dgirt()
results we can use extract_dgirt()
, which attaches labels to the saved parameters according to the variable names originally passed to wrangle()
and any factor levels. Right now, extract_dgirt()
shows only the posterior means.
Note that dgirt()
returns a stanfit
object when method = "rstan"
(as above, the default) and a list of point estimates if method = "optimize"
(details below); extract_dgirt()
only works with stanfit
objects. (The inconsistency in return types isn't desirable and will change in the future.)
The group means can be found as theta_bar
.
dgirt_extract = extract_dgirt(dgirt_estimates, state_opinion_fmt)
head(dgirt_extract$theta_bar)
#> Source: local data frame [6 x 6]
#>
#> Var1 Var2 value year state race
#> (int) (int) (dbl) (dbl) (fctr) (fctr)
#> 1 1 1 -1.7944927 2006 AK white
#> 2 2 1 1.0435053 2007 AK white
#> 3 3 1 0.4599502 2008 AK white
#> 4 4 1 -0.4227409 2009 AK white
#> 5 5 1 1.0708391 2010 AK white
#> 6 1 2 0.2298131 2006 AL white
We can use the method
argument of dgirt
to choose an alternative to MCMC sampling if cmdstan
is available. See http://mc-stan.org/interfaces/cmdstan.html for installation instructions. For example, setting method = "optimize"
will call cmdstan optimize
.
First, a trial run.
optimize_estimates = dgirt(state_opinion_fmt, n_iter = 20, method = "optimize",
init_range = 0.5)
#> Started: Fri Feb 12 09:51:42 2016
#> Reading results from disk.
#> Ended: Fri Feb 12 09:51:44 2016
head(optimize_estimates$theta_bar)
#> Source: local data frame [6 x 5]
#>
#> param value year state race
#> (fctr) (dbl) (dbl) (fctr) (fctr)
#> 1 theta_bar.1.1 0.821362 2006 AK white
#> 2 theta_bar.2.1 1.200950 2007 AK white
#> 3 theta_bar.3.1 0.238978 2008 AK white
#> 4 theta_bar.4.1 0.225778 2009 AK white
#> 5 theta_bar.5.1 0.539578 2010 AK white
#> 6 theta_bar.1.2 0.649381 2006 AL white
And now a longer run.
# Not run
optimize_estimates = dgirt(state_opinion_fmt, n_iter = 20000,
method = "optimize", init_range = 0.5)
poststratify()
can reweight estimates from dgirt()
(if method = "optimize"
) or extract_dgirt()
(if method = "rstan"
, the default). postratify()
returns weighted means for groups or arbitrary aggregations of groups.
The state_demographics
dataset contains population proportions for demographic strata by year. At the moment, it's necessary to relabel the group factor levels in the dgirt()
results to match those in the population proportion data.
data(state_demographics)
head(state_demographics)
#> Source: local data frame [6 x 7]
#>
#> state year race female education age proportion
#> (fctr) (int) (fctr) (fctr) (int) (int) (dbl)
#> 1 AK 1960 white or hispanic male 1 1 8.857296e-05
#> 2 AL 1960 white or hispanic male 1 1 6.986948e-04
#> 3 AR 1960 white or hispanic male 1 1 3.831912e-04
#> 4 AZ 1960 white or hispanic male 1 1 3.518153e-04
#> 5 CA 1960 white or hispanic male 1 1 3.463380e-03
#> 6 CO 1960 white or hispanic male 1 1 3.543790e-04
state_demographics$race = factor(state_demographics$race, labels = c("white", "black", "other"))
Now we pass these data, the same groups
argument as used originally with wrangle
, and a vector of variable names as strata
that define aggregations of interest in the data. For exposition we'll set two optional variables. We give the name of the variable in the demographic data for the population proportion as prop_var
. And passing a variable name to summands
will test the demographic data for whether population proportions sum to one within groups defined by the values of that variable.
dgirt_extract$theta_bar$year = as.integer(dgirt_extract$theta_bar$year)
group_means = poststratify(
group_means = dgirt_extract$theta_bar,
targets = state_demographics,
groups = c("race"),
strata = c("state", "year"),
prop_var = "proportion",
summands = "year")
#> Warning in poststratify(group_means = dgirt_extract$theta_bar, targets =
#> state_demographics, : More rows of proportions than combinations of its
#> strata and grouping variables. Summing proportions over other variables.
head(group_means)
#> Source: local data frame [6 x 3]
#>
#> state year value
#> (fctr) (int) (dbl)
#> 1 AK 2006 0.01213206
#> 2 AK 2007 -0.68538500
#> 3 AK 2008 -0.38938174
#> 4 AK 2009 0.98444936
#> 5 AK 2010 1.18867906
#> 6 AL 2006 1.15451333
The same approach works after dgirt()
if method = "optimize"
.
optimize_estimates$theta_bar$year = as.integer(optimize_estimates$theta_bar$year)
optimize_group_means = poststratify(
group_means = optimize_estimates$theta_bar,
targets = state_demographics,
groups = c("race"),
strata = c("state", "year"),
prop_var = "proportion",
summands = "year")
#> Warning in poststratify(group_means = optimize_estimates$theta_bar, targets
#> = state_demographics, : More rows of proportions than combinations of its
#> strata and grouping variables. Summing proportions over other variables.
head(optimize_group_means)
#> Source: local data frame [6 x 3]
#>
#> state year value
#> (fctr) (int) (dbl)
#> 1 AK 2006 1.3835744
#> 2 AK 2007 0.5601803
#> 3 AK 2008 0.7081640
#> 4 AK 2009 0.2544385
#> 5 AK 2010 1.1107925
#> 6 AL 2006 0.9187339
We can quickly plot group means with plot_means
. It can handle the theta_bar
element of the value of poststratify()
. (Figures omitted.)
plot_means(group_means, "year", "state", jitter = TRUE)
Or that of dgirt_extract()
where dgirt(method = "rstan"
),
plot_means(dgirt_extract$theta_bar, "year", "state", jitter = TRUE)
Or the value of dgirt()
in the case of method = "optimize"
.
plot_means(optimize_estimates$theta_bar, "year", "state", jitter = TRUE)