diff --git a/eIRT-CERAD-main/ItemCovStudy/Markdown Documents/README.md b/eIRT-CERAD-main/ItemCovStudy/Markdown Documents/README.md
deleted file mode 100644
index 4e02ecde..00000000
--- a/eIRT-CERAD-main/ItemCovStudy/Markdown Documents/README.md
+++ /dev/null
@@ -1,7 +0,0 @@
-# Convenience .html File Viewer
-
-GitHub, unfortunately, does not automatically render .html files. The RMarkdown files are saved as .html files, meaning that readers must either download these files and view them on their own computers or use an alternative method for viewing them in their browser. Toward this end, hyperlinks to the files in their rendered format have been created below so that readers may view them easily. Click the links below to view the rendered .html versions of the RMarkdown supplemental materials.
-
-[Multiple Imputation and Final Sample Inclusion Criteria](https://htmlpreview.github.io/?https://github.com/w-goette/eIRT-CERAD/blob/main/ItemCovStudy/Markdown%20Documents/Supplemental%20Inclusion%20Criteria%20File.html)
-
-[Detailed Model Fitting Results](https://htmlpreview.github.io/?https://github.com/w-goette/eIRT-CERAD/blob/main/ItemCovStudy/Markdown%20Documents/Supplemental%20Models%20File.html)
diff --git a/eIRT-CERAD-main/ItemCovStudy/Markdown Documents/Supplemental Inclusion Criteria File.html b/eIRT-CERAD-main/ItemCovStudy/Markdown Documents/Supplemental Inclusion Criteria File.html
deleted file mode 100644
index d388c49c..00000000
--- a/eIRT-CERAD-main/ItemCovStudy/Markdown Documents/Supplemental Inclusion Criteria File.html
+++ /dev/null
Imputation and Inclusion Criteria File
Purpose of the Document

This document serves to provide further detail regarding the multiple imputation and multivariate, multivariable regression methods used to define the sample’s inclusion criteria. The HRS and HCAP data are rich with many important clinical data points; however, they do not include clinical diagnoses of cognitive status. As the aim of the CERAD eIRT study was to model memory, it was important to consider only those cases whose cognitive status was normal. Differences in response styles between cognitively normal controls and various dementia etiologies are well known, so accidental inclusion of individuals with a neurocognitive disorder could affect the parameters of the models. In other words, it is reasonable to expect, based on existing literature, that the models predicting and describing item-level responses to the CERAD list learning test are different for those with and without neurocognitive disorders.
Based on this a priori consideration, the inclusion criteria were an important starting point. Readers who have already reviewed the AsPredicted pre-registration of the study’s hypotheses may have noticed that the inclusion criteria outlined there and those discussed in the manuscript are different. Indeed, all of the models were fit on two different datasets because the initial criteria (based on a uniform MMSE cutoff) resulted in over-exclusion of Black participants (\(\chi^2(2) = 29.57, p < .001, V = 0.13\), with Black-identifying participants having a Pearson residual of -2.76 for being labeled non-impaired compared to 3.15 for being labeled impaired). The fact that racial and ethnic minorities tend to score, on average, lower than non-Hispanic White individuals on raw cognitive tests is well documented and should have been a consideration in the initial inclusion criteria. Regardless, it was determined that an alternative, more methodologically robust inclusion criterion ought to be established to satisfy the need to exclude individuals with possible neurocognitive impairments without under-representing individuals of minority backgrounds by over-emphasizing raw, unadjusted cognitive test scores.
In order to pursue a more equitable inclusion criterion, multiple regression was considered as a basic starting place. Within neuropsychology, regression-based norms are relatively commonplace, with these equations often including race, age, education, and sex as predictors of cognitive test performance. The benefit of the HRS and HCAP database is that it includes many more socioeconomic and sociocontextual variables that could be used to inform the regression model. A further modeling hurdle was the fact that only raw scores were available for analysis. Raw test scores can often violate assumptions of OLS regression as they are clearly bounded, count variables with (often inherent) skew. Additionally, it is well known that neuropsychological/cognitive tests tend to be correlated with one another even if they measure purportedly different cognitive domains/skills. A solution to these problems was to use multivariate, multiple regression that allowed for correlation among the residuals of each cognitive test. In order to allow for this correlation among residuals within the brms syntax, the likelihood of the regression had to be either multivariate normal or Student t. Since the raw data are unlikely to be multivariate normal and may be subject to outliers, the more robust multivariate Student t distribution was fit. The advantage of this specification is that the degrees of freedom of the Student t distribution can be specified as an unknown parameter with a weakly informative prior and thus estimated directly from the data themselves. The ability to directly model the shape of the Student t distribution helps somewhat in capturing potentially skewed distributions where the central tendency is heavily left or right shifted but then fades rapidly into narrower tails.
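To make that specification concrete, the snippet below sketches how such a model can be written in brms. This is a minimal illustration rather than the study’s actual call (which is in the accompanying R script): the outcomes and predictors shown are a small subset of the variables listed later in this document, and the gamma(2, 0.1) prior on the degrees of freedom is simply the brms default.

```r
# Minimal sketch of a multivariate Student-t regression in brms with
# correlated residuals and an estimated degrees-of-freedom parameter.
# Outcomes/predictors are an illustrative subset; see the R script for the
# actual model.
library(brms)

fit_sketch <- brm(
  bf(mvbind(MMSE, Animals, CERADimm) ~ Age + Education + Race) +
    set_rescor(TRUE),                # allow residual correlations among outcomes
  data   = df_resp,
  family = student(),                # robust Student-t likelihood
  prior  = set_prior("gamma(2, 0.1)", class = "nu"),  # weakly informative prior on the df
  chains = 4, cores = 4
)
```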
While there is very little missing data in the HRS and HCAP data sources, it was important to be able to estimate cognitive testing scores for every participant so as not to introduce potential selection bias by omitting individuals with less complete data. Since a goal was to include a large amount of sociocontextual and demographic information, this meant using multiple imputation in order to get reasonable guesses for the missing data and then propagate this uncertainty to the Bayesian-estimated posteriors. Once this information was obtained, the regression was used to identify those with questionable cognitive status. This was done by comparing the observed scores to their predictions, standardizing this difference, and then dichotomizing the result into either “normal” (no more than a standard deviation below the expected score) or “abnormal” (a standard deviation or more below the expected score). The result of this procedure is a series of dichotomous 0/1 variables for each cognitive test. In order to maintain an empirically based understanding of the sample, these dichotomous variables were then subjected to latent class analysis to identify homogeneous groups of individuals based on their relative cognitive performances across all tests. Inclusion was then based on membership in the latent class that appears to correspond to a cognitively normal group of individuals.
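The sketch below illustrates this logic for a single test with hypothetical object names (`observed` in particular is a placeholder); the study’s actual code is in the accompanying R script.

```r
# Sketch of the dichotomization step for one cognitive test.
pp   <- posterior_predict(CogRegression)  # posterior predictive draws (rows) by cases (columns)
mu   <- apply(pp, 2, mean)                # expected score for each participant
sdev <- apply(pp, 2, sd)                  # predictive SD for each participant
z    <- (observed - mu) / sdev            # standardized observed-minus-expected difference
flag <- as.integer(z <= -1)               # 1 = "abnormal" (a standard deviation or more below)

# The 0/1 flags across all tests then feed the latent class analysis, e.g.,
# with the poLCA package (which expects categories coded 1, 2, ...):
# library(poLCA)
# LCAres <- poLCA(cbind(MMSE, Animals) ~ 1, data = flags + 1, nclass = 3)
```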
This document provides the details of each of these steps since there was not space in the manuscript to describe the process in adequate detail. Accompanying this document is also the R script file that was created during the study. Someone who acquires access to the appropriate files through the HRS should be able to run the script file (with some minimal alterations to ensure that the files are read from the correct file paths) and get results that match those shown here (within some margin of error due to randomness in some of the methods, though the use of the same seed in R and brms should reduce the impact of this).
Note that, by default, the code used to generate the results in this document is hidden. As some readers may be interested in this code, its display can be turned on by toggling the switch in the top right of this page. Every output in the document also has a toggle button (a small button with the word “code”) that, when pressed, shows the code used to produce the output. Additionally, the raw Markdown file (*.rmd) is available on the GitHub repository with the R script.
Overview of Sections

While the preceding section describes the general motivations for each step of the inclusion criterion analyses, it is still expedient to spell out what each section of this document will do so that readers can jump to whichever section is of interest and anticipate what to find there.
The first section overviews the multiple imputation process and results. Shown in this section are general descriptions of the data’s missingness as well as diagnostic analyses of the multiple imputation results. These diagnostic plots include visual inspection of convergence, overlay of imputed and observed values, and propensity score plots for missing versus observed cases as a function of each variable.
The second section then provides details on the multivariate, multiple regression. This includes discussion of the included predictors, their influence in prediction of each outcome, posterior predictive checks of the model, and general model performance (i.e., \(R^2\)).
Finally, the last section provides the results of the latent class analysis. This includes the number of classes identified, the response patterns (i.e., patterns of 0/1 across cognitive tests) that correspond to each latent class, and the reasoning for choosing the latent class that we did as the “cognitively normal” group.
What’s Being Done

Since the corresponding R script for running all the models and analyses of this study is included, it can be helpful to see which objects from that script are being used here in case someone wants to replicate the results. The code below shows the objects being called for this document, and they are named in accordance with the fitted and saved objects in the R script. Readers may need to toggle the “code” button below to see this output. Alternatively, all the raw code for this Markdown document is also uploaded to GitHub, so every code block can be seen in that RMD file.
#read in needed data
df_resp <- readRDS("Data_wide.rds")
df_imp <- readRDS("Data_imputed.rds")
CogRegression <- readRDS("Fitted Models/cognitiveDiagnosisRegression.rds")
LCAres <- readRDS("LatentClassModel.rds")

#load in required packages
#wrapping in suppressPackageStartupMessages() done just to reduce print out in document
suppressPackageStartupMessages(library(brms))
suppressPackageStartupMessages(library(mice))
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(kableExtra))
Multiple Imputation Results

Multiple imputation is a robust method for estimating unobserved/missing data points by capitalizing on the relationship between missing values and the observed data. In the simplest case, imputation involves substituting some general value (e.g., the mean or median) for each missing value. This is a highly conservative method for accounting for missing data as it assumes that the best we can do for guessing the possible value of the data is to assume that it is likely close to some central tendency. Regression methods are a logical next step for making more informed guesses for missing data by drawing on the fact that variables are correlated, and thus information about one can be approximated by looking at other observed data points. These simple imputation methods still have significant methodological challenges. The largest challenge is that simply estimating a value for a missing variable and treating this estimate as a “close enough” observation will introduce bias. While powerful, regression-based methods have multiple sources of error: there is uncertainty in the prediction itself (i.e., prediction error or residuals) and there is uncertainty about the true values of the regression parameters (i.e., estimation error or sampling error). One method for addressing this bias is to iteratively impute missing values so that there is a range of different estimates for a single missing value. This range of estimated values can then be passed to statistical analyses in order to empirically propagate the uncertainty about the true unobserved value. This study utilized the mice package for R in order to perform multiple imputation of missing data points. The mice package performs multiple imputation by chained equations, and details of the underlying methodology can be found in [van Buuren and Groothuis-Oudshoorn (2011)](https://www.jstatsoft.org/article/view/v045i03) and [van Buuren (2018)](https://stefvanbuuren.name/fimd/). The remainder of this section describes the missingness of the data of interest and the imputation methods and results.
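For readers unfamiliar with the package, the generic shape of a mice workflow is sketched below using the package’s built-in nhanes example data; the study-specific arguments are described in the rest of this section and in the R script.

```r
# Generic multiple-imputation workflow using mice's built-in nhanes data.
library(mice)

imp  <- mice(nhanes, m = 5, seed = 500)  # 5 imputed datasets via chained equations
fits <- with(imp, lm(chl ~ age + bmi))   # refit the analysis model on each imputed dataset
pool(fits)                               # pool the estimates across imputations (Rubin's rules)
```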
The first issue with missing data is that only some missing data matters. In the case of the regressions to predict missing cognitive raw scores, we were interested only in those variables that might be usefully predictive of most cognitive tests. Historically in neuropsychology, these variables have been demographic variables like age, sex, race, and years of education. Given more recent research on racial and ethnic differences in raw scores across neuropsychological tests, it also made sense to extend these variables to broader socioeconomic and contextual factors like rurality, annual income, and parental educational status. While the HRS data resources include thousands of possible variables to choose from, the general recommendation in multiple imputation is to choose between 15 and 25 variables, including all variables that will appear in the complete-data model (van Buuren, 2018). We do not claim that the variables chosen were the best possible variables to include in the model; however, we believe them to be superior to standard demographic regression-based models, and they achieved the intended goal of producing a classification standard that did not systematically over-exclude racial minorities. Further research on which variables are most useful is certainly needed, and hopefully this serves as a starting place for researchers.
The data variables selected for the regression model and thus the multiple imputation were all of the cognitive tests, standard demographic variables, and then sociocontextual variables. The list below summarizes these groupings of variables:
Cognitive Tests
  • MMSE
  • Animals
  • CERAD List Learning (immediate recall, delayed recall, and discriminability)
  • WMS Logical Memory Story B (immediate recall, delayed recall, and recognition)
  • CERAD Constructional Praxis (immediate and delayed recall)
  • Symbol Digit Modalities Test (total correct)
  • Trail Making Test Parts A and B (seconds to complete)
  • Raven’s Progressive Matrices (Blocks A and B)
Demographic Variables
  • Age
  • Sex
  • Ethnicity
  • Race
  • Years of education
  • Impaired (by informant report of prior cognitive diagnosis or on the Blessed Dementia Rating Scale)
Sociocontextual Variables
  • Rurality (urban, suburban, or exurban)
  • Total value of assets
  • Financial wealth
  • Total annual income
  • Ratio of financial means to household’s federal poverty line
  • Maternal years of education
  • Paternal years of education

The resulting list includes 27 variables. With a reduced dataset ready for the multiple imputation, the next consideration is the number of imputations needed to reasonably approximate the missing data. While a greater number of imputations is always better, there is a sharp computational trade-off. The imputation process itself can be fairly intensive, meaning that running more iterations or imputations can require a sizable amount of computing time and power. Furthermore, each imputed dataset has to be run through the model. Since the model is a Bayesian model, this means that each dataset must be run over four chains for a set number of warmup and post-warmup samples. Those familiar with Bayesian methods will recognize the computational demand inherent in running the same model multiple times for different imputed datasets. For general illustration, some of the IRT models run for this study required 15 hours to fit, so if this had to be done on imputed data, then fitting a single model would take 15 hours times the number of imputations. To determine the number of reasonable imputations needed, it is useful to evaluate the overall level of missingness. Shown below is a figure to visualize the overall level and pattern of missingness.
empty <- md.pattern(df_resp[, c("MMSE", "Animals", "CERADdel", "LMIM", "CPIM", "SDMT", "CPDM", "LMDM", "LMrcg", "TMTA", "TMTB", "CERADimm", "CERADdisc",
                        "Ravens", "UrbanRural", "Age", "TotalAssets", "FinancialWealth", "TotalIncome", "PovertyRatioInst", "Gender", "Ethnicity",
                        "Race", "Education", "MaternalEdu", "PaternalEdu", "Impaired")])
Due to the size of the dataset, the missingness plot can be challenging to read. Each pattern of missing data is listed on a unique row, and each column corresponds to one of the variables of interest. The cell produced from an intersecting row and column is colored blue if that participant has an observation for that variable, and it is colored red if that observation is missing. The numbers on the left count the number of participants with that row’s pattern of missing data, while the numbers on the right count the number of missing variables in that pattern. Numbers at the bottom of the table correspond to the number of cases missing the respective variable. A visual gestalt of the figure shows that most participants have the majority of the data. Still, the figure does not tell us much about the level of missingness since it is very cluttered. One potentially useful value to know is the number of missing-data patterns that occur as a function of whether a given variable is missing. The following table shows the number of patterns of missing data corresponding to each variable:
matrix(apply(empty[, 1:27], 2, function(x) {sum(x == 0)}),
       dimnames = list(c("MMSE", "Rurality", "Age", "Total Assets", "Financial Wealth", "Total Income", "Ratio to Poverty Line", "Sex", "Impaired", "Animals", "Race", "Ethnicity", "Education", "CERAD Immediate", "CERAD Discriminability", "CERAD Delayed", "Constructional Praxis Immediate", "Logical Memory Immediate", "Ravens", "Constructional Praxis Delayed", "Logical Memory Delayed", "Logical Memory Recognition", "Trails A", "Symbol Digit Modality", "Trails B", "Maternal Education", "Paternal Education"), "Number of Patterns")) %>%
  kable(caption = "Patterns of Missing Data by Variable", align = rep('c', 27)) %>%
  kable_classic(full_width = FALSE, position = "center")
Patterns of Missing Data by Variable

| Variable                        | Number of Patterns |
|---------------------------------|--------------------|
| MMSE                            | 1                  |
| Rurality                        | 1                  |
| Age                             | 1                  |
| Total Assets                    | 1                  |
| Financial Wealth                | 1                  |
| Total Income                    | 1                  |
| Ratio to Poverty Line           | 1                  |
| Sex                             | 1                  |
| Impaired                        | 1                  |
| Animals                         | 1                  |
| Race                            | 1                  |
| Ethnicity                       | 2                  |
| Education                       | 3                  |
| CERAD Immediate                 | 11                 |
| CERAD Discriminability          | 11                 |
| CERAD Delayed                   | 18                 |
| Constructional Praxis Immediate | 19                 |
| Logical Memory Immediate        | 26                 |
| Ravens                          | 23                 |
| Constructional Praxis Delayed   | 27                 |
| Logical Memory Delayed          | 40                 |
| Logical Memory Recognition      | 48                 |
| Trails A                        | 39                 |
| Symbol Digit Modality           | 51                 |
| Trails B                        | 57                 |
| Maternal Education              | 25                 |
| Paternal Education              | 34                 |

Just as a quick point of clarification, the md.pattern() function from mice does alter the order of the variables, which is why there is no clear logical order to the variables in the table above. Rather than re-arrange these rows, the order was retained as it corresponds (from top to bottom) to the columns in the preceding figure (from left to right). As can be seen from this simple table, missing Trails B and Symbol Digits Modality scores were associated with the most patterns of missingness. The next variable that, when missing, was associated with a fair number of missingness patterns was the delayed trial of Logical Memory. These patterns generally suggest that those with greater cognitive impairment were more likely to have missing data elsewhere since these tests are among the most challenging included in the battery. Another metric to consider is which pattern(s) correspond to the largest amount of missing data. The table below shows the missingness patterns that involve the largest number of missing variables.
matrix(empty[which(empty[, 28] == max(empty[1:104, 28])), 1:27],
       nrow = 2, ncol = 27, byrow = FALSE,
       dimnames = list(c("Pattern 1", "Pattern 2"), c("MMSE", "Rurality", "Age", "Total Assets", "Financial Wealth", "Total Income", "Ratio to Poverty Line", "Sex", "Impaired", "Animals", "Race", "Ethnicity", "Education", "CERAD Immediate", "CERAD Discriminability", "CERAD Delayed", "Constructional Praxis Immediate", "Logical Memory Immediate", "Ravens", "Constructional Praxis Delayed", "Logical Memory Delayed", "Logical Memory Recognition", "Trails A", "Symbol Digit Modality", "Trails B", "Maternal Education", "Paternal Education"))) %>%
  kable(caption = "Patterns with the Most Missing Data", align = rep('c', 27)) %>%
  kable_classic(full_width = FALSE, position = "center")
Patterns with the Most Missing Data (1 = observed, 0 = missing; shown transposed for readability)

| Variable                        | Pattern 1 | Pattern 2 |
|---------------------------------|-----------|-----------|
| MMSE                            | 1         | 1         |
| Rurality                        | 1         | 1         |
| Age                             | 1         | 1         |
| Total Assets                    | 1         | 1         |
| Financial Wealth                | 1         | 1         |
| Total Income                    | 1         | 1         |
| Ratio to Poverty Line           | 1         | 1         |
| Sex                             | 1         | 1         |
| Impaired                        | 1         | 1         |
| Animals                         | 1         | 0         |
| Race                            | 1         | 1         |
| Ethnicity                       | 1         | 1         |
| Education                       | 1         | 1         |
| CERAD Immediate                 | 0         | 1         |
| CERAD Discriminability          | 0         | 0         |
| CERAD Delayed                   | 0         | 0         |
| Constructional Praxis Immediate | 0         | 0         |
| Logical Memory Immediate        | 0         | 0         |
| Ravens                          | 0         | 0         |
| Constructional Praxis Delayed   | 0         | 0         |
| Logical Memory Delayed          | 0         | 0         |
| Logical Memory Recognition      | 0         | 0         |
| Trails A                        | 0         | 0         |
| Symbol Digit Modality           | 0         | 0         |
| Trails B                        | 0         | 0         |
| Maternal Education              | 1         | 1         |
| Paternal Education              | 1         | 1         |

Consistent with the previous observation, it appears that a likely explanation for missingness is the level of cognitive impairment. As demonstrated in the table above, there are two different patterns that have the most missing variables (12 of the 27 examined), and the missingness in these patterns clearly occurs among the cognitive tests. While not certain, a reasonable guess for why this would be the case is that a participant either declined the cognitive tests or was considered too impaired to complete or attempt them. This indicates that it is important to include the informant-based report of impairment in the imputation step since this seems likely to be related to the missingness. Up until this point, however, we have not actually looked at the overall rate of missingness in the dataset, just patterns of it. The following table summarizes the percentage of missing data points for each variable of interest:
Missingness <- function(x) {
  sum(is.na(x)) / length(x) * 100
} #function computes percent of missing data

matrix(sapply(df_resp[, c("MMSE", "Animals", "CERADimm", "CERADdel", "CERADdisc", "LMIM",  "LMDM", "LMrcg", "CPIM", "CPDM", "SDMT", "TMTA", "TMTB", "Ravens", "Gender", "Ethnicity", "Race", "Education", "Age", "UrbanRural", "TotalAssets", "FinancialWealth", "TotalIncome", "PovertyRatioInst", "MaternalEdu", "PaternalEdu", "Impaired")], function(x) Missingness(x)), nrow = 27, ncol = 1,
       dimnames = list(c("MMSE", "Animals", "CERAD Immediate", "CERAD Delayed", "CERAD Discriminability", "Logical Memory Immediate", "Logical Memory Delayed", "Logical Memory Recognition", "Constructional Praxis Immediate", "Constructional Praxis Delayed", "Symbol Digits Modality", "Trails A", "Trails B", "Ravens", "Sex", "Ethnicity", "Race", "Education", "Age", "Rurality", "Total Assets", "Financial Wealth", "Total Income", "Ratio to Poverty Line", "Maternal Education", "Paternal Education", "Impaired"), "Percent Missing")) %>%
  kable(caption = "Percentage of Missing Data by Variable", digits = 2, align = rep('c', 27)) %>%
  kable_classic(full_width = FALSE, position = "center")
Percentage of Missing Data by Variable

| Variable                        | Percent Missing |
|---------------------------------|-----------------|
| MMSE                            | 0.00            |
| Animals                         | 0.03            |
| CERAD Immediate                 | 0.41            |
| CERAD Delayed                   | 0.60            |
| CERAD Discriminability          | 0.41            |
| Logical Memory Immediate        | 1.17            |
| Logical Memory Delayed          | 2.72            |
| Logical Memory Recognition      | 3.00            |
| Constructional Praxis Immediate | 1.14            |
| Constructional Praxis Delayed   | 1.33            |
| Symbol Digits Modality          | 4.96            |
| Trails A                        | 3.03            |
| Trails B                        | 6.19            |
| Ravens                          | 1.20            |
| Sex                             | 0.00            |
| Ethnicity                       | 0.06            |
| Race                            | 0.03            |
| Education                       | 0.09            |
| Age                             | 0.00            |
| Rurality                        | 0.00            |
| Total Assets                    | 0.00            |
| Financial Wealth                | 0.00            |
| Total Income                    | 0.00            |
| Ratio to Poverty Line           | 0.00            |
| Maternal Education              | 7.51            |
| Paternal Education              | 13.10           |
| Impaired                        | 0.00            |

As is clear from the table above, missing data in the database are quite scarce considering the total number of individuals. The overall average percentage of missing data is less than 2% for the whole sample, and of the 3167 participants in the sample, 2431 (76.76%) have no missing data on any of these variables. To further clarify the rate of completeness in this particular dataset, 3075 participants (97.10% of the total sample) are missing data for just 3 or fewer variables. van Buuren (2018) notes that between 5 and 20 imputations are sufficient for “moderate missingness.” While moderate missingness is not precisely defined, it is likely fair to conclude that the current dataset has no worse than moderate missingness and may even have “mild” missingness. While not a hard rule, van Buuren (2018) does suggest the following guidance: “if calculation is not prohibitive, we may set m [number of imputations] to the average percentage of missing data.” In this case, that average is less than 2%, so the total number of imputations was set to 5 to be on the low end of the recommended imputations for moderate missingness.
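The quantities quoted above are straightforward to verify; the sketch below assumes `vars` is a character vector holding the names of the 27 variables listed earlier.

```r
# Checking the missingness summaries quoted above (`vars` is an assumed
# character vector of the 27 variable names).
pct_missing <- colMeans(is.na(df_resp[, vars])) * 100  # per-variable percent missing
mean(pct_missing)                     # average percent missing (< 2% here)
sum(complete.cases(df_resp[, vars]))  # participants with no missing data (2431)
```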
Multiple imputation was then conducted, generating 5 different datasets with unique imputed values for every missing value. There are many different ways of imputing reasonable values, and the mice package offers a wide array of these methods. In this case, the default methods provided by the mice package made sense for the missing data types. To make this step more transparent to readers, the table below shows the imputation method used for each variable that required imputation.
methods <- df_imp$method
methods <- methods[which(methods != "")]
methods <- methods[c(1, 11, 2, 12, 3, 7:8, 4, 6, 5, 9:10, 13:18)]
methods <- ifelse(methods == "pmm", "Predictive Mean Matching", ifelse(methods == "logreg", "Logistic Regression", "Polytomous Logistic Regression"))
matrix(methods, nrow = 18, ncol = 1, dimnames = list(c("Animals", "CERAD Immediate", "CERAD Delayed", "CERAD Discriminability", "Logical Memory Immediate", "Logical Memory Delayed", "Logical Memory Recognition", "Constructional Praxis Immediate", "Constructional Praxis Delayed", "Symbol Digits Modality", "Trails A", "Trails B", "Ravens", "Ethnicity", "Race", "Education", "Maternal Education", "Paternal Education"), "Method")) %>%
  kable(caption = "Imputation Method by Missing Variable", align = 'c') %>%
  kable_classic(full_width = FALSE, position = "center")
Imputation Method by Missing Variable

| Variable                        | Method                         |
|---------------------------------|--------------------------------|
| Animals                         | Predictive Mean Matching       |
| CERAD Immediate                 | Predictive Mean Matching       |
| CERAD Delayed                   | Predictive Mean Matching       |
| CERAD Discriminability          | Predictive Mean Matching       |
| Logical Memory Immediate        | Predictive Mean Matching       |
| Logical Memory Delayed          | Predictive Mean Matching       |
| Logical Memory Recognition      | Predictive Mean Matching       |
| Constructional Praxis Immediate | Predictive Mean Matching       |
| Constructional Praxis Delayed   | Predictive Mean Matching       |
| Symbol Digits Modality          | Predictive Mean Matching       |
| Trails A                        | Predictive Mean Matching       |
| Trails B                        | Predictive Mean Matching       |
| Ravens                          | Predictive Mean Matching       |
| Ethnicity                       | Logistic Regression            |
| Race                            | Polytomous Logistic Regression |
| Education                       | Predictive Mean Matching       |
| Maternal Education              | Predictive Mean Matching       |
| Paternal Education              | Predictive Mean Matching       |

The methods are most easily separated into those for continuous versus discrete variables. For all of the continuous variables (all variables except race and ethnicity), the method used is predictive mean matching. This is a particularly flexible imputation method that is well suited to this application. A common issue with standard regression methods for count variables (e.g., total number of correct responses on a test) is that it is easy to estimate impossible values. For example, a regression might predict a score that is less than 0 or that exceeds the maximum number of points possible. Additionally, a regression-based predicted score is almost always some fractional value that is not possible in count data. Predictive mean matching essentially truncates estimates to observed data by taking the estimate and then creating a subset of candidate values from the complete cases whose predicted values are closest to the predicted value of the missing observation. From this subset of candidate values, one is randomly selected to replace the missing value. In this way, the imputed values are only ever values that are actually observable.
The categorical variables with missing data points were just race and ethnicity. In the case of ethnicity, a simple logistic regression is useful for predicting the missing value since the variable is dichotomous (Hispanic or non-Hispanic). In the case of race, there are three levels of the factor (White, Black, and Other), so a multinomial/polytomous expansion of the logistic regression is needed. Importantly, these methods assume no ordering of the categories and are thus appropriate for purely nominal data types.
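The sketch below shows how these per-variable methods are passed to mice(). The defaults would already select these methods from each variable’s type, so the explicit assignments are purely illustrative, and the seed shown is a placeholder rather than the one used in the script.

```r
# Sketch of specifying per-variable imputation methods in mice.
meth <- make.method(df_resp)      # default method chosen from each variable's type
meth["Ethnicity"] <- "logreg"     # binary factor -> logistic regression
meth["Race"]      <- "polyreg"    # unordered factor -> polytomous logistic regression
meth["CERADimm"]  <- "pmm"        # count/continuous -> predictive mean matching
df_imp <- mice(df_resp, m = 5, method = meth, seed = 12345)  # placeholder seed
```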
With these methods and the number of imputations in mind, it is important to visually inspect the resulting imputations. The first step is to consider whether the imputation chains mixed. The trace plots of the chains are shown below.
plot(df_imp)
There are some points of clarification to make for understanding the plots. First, the color of each line corresponds to one of the imputations, so each plot has 5 differently colored lines. Second, the standard deviations of the estimated values are also estimated. In cases where there was only 1 case with a missing value (i.e., Animals and race), it is not possible to estimate the standard deviation of the imputed values. Third, interpretation of the categorical variables needs to be done cautiously since they are different from the continuous variables. The plots can look very jagged and seem to mix poorly until we recall that these can only take values of 0 or 1 for ethnicity (0, 1, or 2 for race). Thus, it may seem like there is poor mixing of chains, but this is potentially just an artifact of the limited range of estimates. Interpretation of the trace plots for multiple imputation is similar to that of MCMC chains; however, there are far fewer iterations to examine. Overall, these plots do not suggest any issues with estimates getting caught at certain values, and they generally converge despite the relatively small number of iterations.
Once we are comfortable with the mixing of the imputation chains, it is important to also examine what those imputed distributions look like in comparison to the observed data. The plots below overlay the observed and missing distributions of each variable that had missing data.
densityplot(df_imp)
Consistent with our earlier hypotheses, the missing data distributions are all generally left shifted, with a few notable exceptions. In the case of the Trail Making Tests, there is a bimodal distribution of the missing data (red), with one mode slightly above the mean time to complete and another mode at the discontinuation time (300 seconds or 5 minutes). Unlike all the other cognitive tests, worse scores on the Trail Making Test produce right-shifted distributions (longer completion times). With that exception in mind, the imputed data distributions for all of the cognitive tests suggest that those with missing values were also the most impaired individuals in the sample. The other important exception to the left-shift rule is the demographic variables that were estimated. In these plots, there is little difference between the observed and missing data distributions, suggesting that the educational attainment of the participant or their parents was not related to missingness.
The missingness patterns can be examined further by visualizing the probability that a participant has missing data as a function of each variable. This is done by computing the propensity score for having missing data (i.e., the probability that a case is missing data) and then plotting that propensity score against each of the imputed variables. For clarity, propensity scores are the estimated probabilities of group membership as predicted from a logistic regression. In this case, the logistic regression is based on all the same predictors as the multiple imputation. The plots below provide this visual description of missingness propensity.
#run model for propensity scores of missing vs. complete observations
prop <- with(df_imp, glm(ici(df_imp) ~ MMSE + Animals + CERADimm + CERADdel + CERADdisc + LMIM + LMDM + LMrcg + CPIM + CPDM + SDMT + TMTA + TMTB + Ravens + Gender + Ethnicity + Race + Education + Age + UrbanRural + TotalAssets + FinancialWealth + TotalIncome + PovertyRatioInst + MaternalEdu + PaternalEdu + Impaired,
                         family = binomial))

#generate the propensity scores
ps <- rep(rowMeans(sapply(prop$analyses, fitted.values)),
          df_imp$m + 1)

#generate plots
xyplot(df_imp, Animals ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2),
       xlab = "Probability of Missing Observation",
       ylab = "Animal Fluency Raw Score")

xyplot(df_imp, CERADimm ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2),
       xlab = "Probability of Missing Observation",
       ylab = "CERAD Words Recalled (Immediate)")

xyplot(df_imp, CERADdel ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2),
       xlab = "Probability of Missing Observation",
       ylab = "CERAD Words Recalled (Delayed)")

xyplot(df_imp, CERADdisc ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2),
       xlab = "Probability of Missing Observation",
       ylab = "CERAD Recognition Discriminability")

xyplot(df_imp, LMIM ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2),
       xlab = "Probability of Missing Observation",
       ylab = "Logical Memory Recall (Immediate)")

xyplot(df_imp, LMDM ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2),
       xlab = "Probability of Missing Observation",
       ylab = "Logical Memory Recall (Delayed)")

xyplot(df_imp, LMrcg ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2),
       xlab = "Probability of Missing Observation",
       ylab = "Logical Memory Recognition")

xyplot(df_imp, CPIM ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2),
       xlab = "Probability of Missing Observation",
       ylab = "Constructional Praxis (Immediate)")

xyplot(df_imp, CPDM ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2),
       xlab = "Probability of Missing Observation",
       ylab = "Constructional Praxis (Delayed)")

xyplot(df_imp, SDMT ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2),
       xlab = "Probability of Missing Observation",
       ylab = "Symbol Digits Modality Test")

xyplot(df_imp, TMTA ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2),
       xlab = "Probability of Missing Observation",
       ylab = "Trail Making Test Part A")

xyplot(df_imp, TMTB ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2),
       xlab = "Probability of Missing Observation",
       ylab = "Trail Making Test Part B")

xyplot(df_imp, Education ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2),
       xlab = "Probability of Missing Observation",
       ylab = "Years of Education")

xyplot(df_imp, MaternalEdu ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2),
       xlab = "Probability of Missing Observation",
       ylab = "Maternal Educational Attainment")

xyplot(df_imp, PaternalEdu ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2),
       xlab = "Probability of Missing Observation",
       ylab = "Paternal Educational Attainment")

xyplot(df_imp, Race ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2),
       xlab = "Probability of Missing Observation",
       ylab = "Race")

xyplot(df_imp, Ethnicity ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2),
       xlab = "Probability of Missing Observation",
       ylab = "Ethnicity")

To clarify the observed plots, each grid (numbered 0-5) corresponds to either the observed data (0) or one of the imputed datasets (1-5). The blue diamonds correspond to observed data points while the red circles each represent an imputed data point. Note that the blue diamonds across plots 0-5 are all identical. Potentially the easiest plot to examine first is the one for Animal fluency since there is just a single missing data point that was imputed. In this plot, it is relatively clear that this observation had a low propensity score for being missing, and its imputed estimate was that the person would have scored somewhere in the average range compared to the rest of the sample. The 0 grid for Animal fluency is also useful to examine in the context of the working hypothesis that missingness was related to cognitive impairment. Looking at the left-hand side of the plot (i.e., the lowest propensity for having missing observations), there is a good overall range of possible Animal fluency scores, but as we move right (i.e., toward higher probabilities of missing data), the range of Animal fluency scores becomes progressively restricted. This overall shape is mirrored in the other cognitive test plots and suggests that cognitive impairment is certainly one reason for missing data; however, there also appear to be other factors, since there are plenty of cases with very poor scores who are still given low propensity estimates for being missing.
Another important takeaway from the above plots is that there does not seem to be a clear pattern of missingness by race or ethnicity. Examination of these plots demonstrates that, broadly speaking, individuals of all racial and ethnic groups had similarly wide ranges of estimated propensity scores. Similarly, there was no heavy concentration of high propensity scores that separated the racial or ethnic groups. Some additional exploration is possible to further inspect the reasons for missing data in the sample. The plots below show the propensity scores against MMSE raw scores and informant-based report of impairment.
xyplot(df_imp, MMSE ~ ps, pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2),
       xlab = "Probability of Missing Observation",
       ylab = "MMSE Raw Score")

xyplot(df_imp, Impaired ~ ps, pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2),
       xlab = "Probability of Missing Observation",
       ylab = "Informant Report of Impairment")
Again, the plots suggest there is a relationship between having impairment and missing data; however, this is clearly not the only cause of missingness. If it were, then the differences in propensity would be much clearer. In the case of informant report in particular, there is clear evidence that those without impairment are still sometimes estimated as having a high probability of missing data. It is possible that this reflects a multivariate dependency between objective cognitive testing and functional impairment; however, it is also possible that there are other sociodemographic variables or neuropsychiatric factors that might explain missingness. Regardless of the cause of any missingness not at random, the multiple imputation methods used and described here appear to be robust and reasonable approximations.
Cognitive Prediction Regression

With the imputed data prepared, the regression model to predict raw scores and serve as a standard for “abnormal” cognitive test performance could be fit. As with the regression models fit in the main study, these models were fit in brms. The regression model included all of the same variables used in the multiple imputation model. The multivariate outcomes were all of the cognitive tests with the exception of Raven’s Progressive Matrices. The remaining variables (including Raven’s Progressive Matrices) were used as predictors.
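The general shape of that model is sketched below. This is not the verbatim call from the script (the priors, discussed next, are omitted, and the seed is a placeholder), but it shows the multivariate outcome formula and the use of brm_multiple() to fit the same model across all 5 imputed datasets.

```r
# Sketch of the multivariate regression fit across the imputed datasets; the
# exact call (including priors and seed) is in the accompanying R script.
bform <- bf(mvbind(MMSE, Animals, CERADimm, CERADdel, CERADdisc,
                   LMIM, LMDM, LMrcg, CPIM, CPDM, SDMT, TMTA, TMTB) ~
              Ravens + UrbanRural + Age + TotalAssets + FinancialWealth +
              TotalIncome + PovertyRatioInst + Gender + Ethnicity + Race +
              Education + MaternalEdu + PaternalEdu + Impaired) +
  set_rescor(TRUE)                 # correlated residuals across the outcomes

CogRegression <- brm_multiple(bform, data = df_imp,  # df_imp is a mice mids object
                              family = student(),    # robust Student-t likelihood
                              chains = 4, seed = 12345)  # placeholder seed
```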
Priors for the regression were specified per outcome, with generic priors for the multivariate component. All regression coefficient priors were specified as normal distributions with mean 0 and standard deviation 1. Under this parameterization, we are placing approximately 68% probability on a regression coefficient being no larger than |1| and approximately 95% probability on it being no larger than |2|. The data in this case are large, so the priors can easily be dominated by the observations where they are incorrect; however, given the scales of the predictors and outcomes, these priors on the regression coefficients appeared reasonable. For priors on the intercepts, each outcome variable was considered separately. These intercept priors were specified as Student t distributions with 3 degrees of freedom, centered on the approximate mean of the raw distribution, and with a standard deviation three times the magnitude of the raw standard deviation. In the case where the regression coefficients are all 0, the intercept of a regression is the mean of the outcome variable, so centering the prior around this “null” value is a reasonable first guess for its true value. The scale of the intercept prior is then set to three times that of the raw distribution to ensure that the prior is wide and thus weakly informative. Again, the data are very large and could overpower even fairly strong priors, but this was still undertaken as a disciplined prior specification.
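A sketch of this prior layout is below; the student_t() locations and scales are made-up placeholder numbers, whereas in the script each is computed from the corresponding raw outcome’s mean and three times its raw standard deviation.

```r
# Sketch of the prior layout; intercept locations/scales are placeholders.
priors <- c(
  set_prior("normal(0, 1)", class = "b"),  # all regression coefficients
  set_prior("student_t(3, 27, 9)",  class = "Intercept", resp = "MMSE"),    # placeholder values
  set_prior("student_t(3, 17, 18)", class = "Intercept", resp = "Animals")  # placeholder values
)
```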
A special caveat and warning should be provided for readers and those who intend to reproduce the analyses. In order to ultimately compute the residuals of the model, the number of iterations and warmup samples had to be left at the defaults of 1,000 each per chain. Running more iterations or warmup samples ultimately caused the array of predictive errors to be too large to handle within R. As currently specified, the resulting array of residuals was 6.6 GB in size. To explain why this is the case, one needs to remember that each model is run on 4 chains (1,000 warmup + 1,000 post-warmup iterations each). The model then has to be run on the 5 imputed datasets (resulting in 40,000 total iterations). In the end, a total of 20,000 posterior samples are saved for inference, so when residuals need to be computed, the 13 outcome variables all produce a combined multivariate predictive distribution based on these 20,000 posterior samples. In short, the model had to be fit on fewer than ideal iterations. While so few iterations are not a problem for Stan’s adaptive Hamiltonian Monte Carlo sampler with respect to convergence or performance, they do limit the ability to reliably estimate the tails of distributions. Fortunately, the primary aim of this particular model is to characterize central tendency since the interest is ultimately in below-average performances. For any reader seeking to replicate the methods here, it is important to consider computing resources before running the model and before trying to run the model for more iterations. The personal laptop used for these analyses has an Intel i7-9750H CPU @ 2.60GHz, 32GB of RAM, and runs all 64-bit applications.
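As a quick back-of-the-envelope check on that figure: the residual array holds one double-precision (8-byte) value per posterior draw, participant, and outcome, which, assuming the stated sample of 3,167 participants, lines up with the roughly 6.6 GB reported.

```r
# draws x participants x outcomes x bytes per double
20000 * 3167 * 13 * 8  # ~6.59e9 bytes, i.e., roughly 6.6 GB
```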
With this background in mind, it is now appropriate to examine the model. First, we examine the validity of the modeling process by ensuring that the chains mixed and that the posterior was adequately explored and sampled. This is done by analyzing the \(\hat{R}\) statistics, effective sample sizes, trace plots, and diagnostic plots for the model. Details about these plots are provided in the other supplementary file addressing the model fits, so readers are encouraged to review that document for explanations of the following plots.
mcmc_plot(CogRegression, type = "rhat_hist", binwidth = 0.0001)

## Warning: Dropped 13 NAs from 'new_rhat(rhat)'.

mcmc_plot(CogRegression, type = "neff_hist", binwidth = 0.1)

## Warning: Dropped 13 NAs from 'new_neff_ratio(ratio)'.

mcmc_plot(CogRegression, type = "trace", regex_pars = "b_MMSE")

## No divergences to plot.

mcmc_plot(CogRegression, type = "trace", regex_pars = "b_Animals")

## No divergences to plot.

mcmc_plot(CogRegression, type = "trace", regex_pars = "b_CERADimm")

## No divergences to plot.

mcmc_plot(CogRegression, type = "trace", regex_pars = "b_CERADdel")

## No divergences to plot.

mcmc_plot(CogRegression, type = "trace", regex_pars = "b_CERADdisc")

## No divergences to plot.

mcmc_plot(CogRegression, type = "trace", regex_pars = "b_LMIM")

## No divergences to plot.

mcmc_plot(CogRegression, type = "trace", regex_pars = "b_LMDM")

## No divergences to plot.

mcmc_plot(CogRegression, type = "trace", regex_pars = "b_LMrcg")

## No divergences to plot.

mcmc_plot(CogRegression, type = "trace", regex_pars = "b_CPIM")

## No divergences to plot.

mcmc_plot(CogRegression, type = "trace", regex_pars = "b_CPDM")

## No divergences to plot.

mcmc_plot(CogRegression, type = "trace", regex_pars = "b_SDMT")

## No divergences to plot.

mcmc_plot(CogRegression, type = "trace", regex_pars = "b_TMTA")

## No divergences to plot.

mcmc_plot(CogRegression, type = "trace", regex_pars = "b_TMTB")

## No divergences to plot.

mcmc_plot(CogRegression, type = "trace", regex_pars = "sigma_")

## No divergences to plot.

mcmc_plot(CogRegression, type = "nuts_divergence")

mcmc_plot(CogRegression, type = "nuts_treedepth")

## Warning: Groups with fewer than two data points have been dropped.
## Warning: Groups with fewer than two data points have been dropped.

One issue that comes from fitting the same model to multiple datasets is that there can be false positives on the \(\hat{R}\) statistic (see [here](https://cran.r-project.org/web/packages/brms/vignettes/brms_missings.html) for details). We can confirm that there is no issue with chain mixing from visual inspection of the trace plots and by calling the \(\hat{R}\) statistics directly from the model itself, as done in the table below.
CogRegression$rhats %>%
  select(starts_with("b_")) %>%
  kable(caption = "Chain Convergence by Imputed Dataset", align = 'c', digits = 4) %>%
  kable_classic(full_width = FALSE, position = "center")
Chain Convergence by Imputed Dataset

(The rendered table lists the \(\hat{R}\) value for every regression coefficient — the 13 outcome-specific intercepts plus each outcome’s full set of slopes — with one row per imputed dataset. It is too wide to reproduce legibly here; all of the \(\hat{R}\) values shown fall between roughly 0.999 and 1.006, indicating adequate chain mixing within every imputed dataset.)
-1.0007 - -1.0003 - -1.0020 - -1.0007 - -1.0002 - -1.0004 - -1.0011 - -1.0000 - -1.0000 - -1.0004 - -0.9997 - -1.0020 - -1.0016 - -1.0009 - -1.0005 - -1.0005 - -0.9994 - -1.0007 - -1.0015 - -0.9996 - -0.9998 - -0.9996 - -1.0003 - -1.0021 - -1.0001 - -1.0000 - -0.9998 - -0.9994 - -0.9998 - -1.0016 - -1.0021 - -1.0000 - -1.0001 - -1.0007 - -1.0007 - -1.0001 - -0.9997 - -0.9997 - -0.9998 - -0.9995 - -1.0002 - -0.9999 - -0.9999 - -1.0003 - -1.0000 - -0.9998 - -1.0007 - -1.0036 - -0.9999 - -0.9998 - -0.9997 - -1.0010 - -1.0035 - -1.0020 - -1.0027 - -1.0027 - -1.0004 - -1.0013 - -1.0013 - -0.9998 - -1.0011 - -1.0001 - -1.0001 - -1.0020 - -1.0027 - -1.0005 - -0.9999 - -1.0005 - -1.0008 - -1.0021 - -1.0006 - -1.0012 - -1.0012 - -0.9999 - -1.0018 - -1.0009 - -1.0000 - -1.0002 - -1.0003 - -0.9997 - -1.0015 - -1.0006 - -0.9999 - -1.0000 - -0.9998 - -0.9999 - -1.0001 - -0.9996 - -1.0025 - -1.0019 - -0.9997 - -0.9996 - -0.9997 - -0.9998 - -1.0000 - -1.0001 - -1.0003 - -0.9995 - -1.0002 - -1.0003 - -1.0002 - -1.0011 - -1.0002 - -1.0002 - -1.0013 - -0.9996 - -0.9996 - -0.9998 - -1.0001 - -1.0002 - -1.0005 - -0.9994 - -0.9994 - -1.0001 - -1.0014 - -1.0016 - -1.0000 - -0.9999 - -1.0001 - -1.0003 - -1.0025 - -0.9998 - -1.0012 - -1.0007 - -1.0008 - -0.9998 - -0.9999 - -0.9997 - -0.9998 - -1.0003 - -0.9995 - -1.0004 - -1.0032 - -1.0001 - -1.0000 - -1.0008 - -1.0009 - -1.0023 - -1.0013 - -1.0011 - -1.0012 - -0.9998 - -1.0028 - -1.0009 - -1.0007 - -0.9997 - -1.0005 - -0.9993 - -1.0029 - -1.0011 - -1.0002 - -1.0001 - -1.0005 - -1.0002 - -1.0017 - -1.0000 - -1.0007 - -1.0004 - -0.9997 - -1.0011 - -1.0005 - -1.0001 - -0.9996 - -0.9998 - -0.9996 - -1.0013 - -1.0003 - -1.0000 - -1.0003 - -0.9996 - -0.9999 - -1.0002 - -0.9997 - -0.9996 - -0.9996 - -1.0000 - -1.0001 - -0.9997 - -1.0001 - -0.9993 - -0.9994 - -0.9995 - -1.0011 - -0.9996 - -1.0011 - -1.0010 - -1.0000 - -0.9994 - -1.0000 - -1.0000 - -1.0000 - -1.0001 - -1.0002 - -1.0001 - -0.9995 - -1.0019 - -0.9991 - -0.9995 - -1.0000 - -1.0007 - -1.0046 - -1.0005 - -1.0004 - -0.9996 - -1.0008 - -1.0026 - -1.0014 - -1.0015 - -1.0011 - -1.0007 - -1.0007 - -0.9998 - -0.9995 - -1.0005 - -0.9995 - -0.9996 - -1.0024 - -1.0021 - -0.9996 - -0.9997 - -0.9992 - -1.0007 - -1.0030 - -1.0007 - -1.0015 - -1.0013 - -0.9996 - -1.0007 - -1.0009 - -0.9995 - -1.0002 - -0.9992 - -0.9994 - -1.0009 -
-1.0000 - -0.9999 - -1.0003 - -1.0021 - -0.9994 - -1.0005 - -1.0006 - -1.0011 - -1.0006 - -0.9997 - -0.9998 - -0.9999 - -1.0010 - -0.9998 - -1.0008 - -1.0005 - -1.0005 - -0.9998 - -0.9996 - -0.9997 - -1.0012 - -1.0013 - -1.0014 - -1.0004 - -0.9999 - -1.0002 - -1.0003 - -1.0007 - -0.9992 - -0.9999 - -0.9997 - -0.9997 - -0.9998 - -1.0019 - -1.0000 - -0.9998 - -0.9998 - -0.9994 - -0.9995 - -0.9999 - -0.9994 - -0.9995 - -0.9997 - -1.0000 - -1.0005 - -0.9996 - -0.9996 - -1.0000 - -1.0011 - -1.0014 - -1.0045 - -1.0011 - -1.0000 - -1.0010 - -1.0011 - -1.0008 - -1.0002 - -1.0004 - -0.9999 - -1.0008 - -1.0018 - -1.0011 - -1.0000 - -0.9994 - -1.0021 - -1.0007 - -1.0010 - -1.0026 - -1.0036 - -0.9999 - -1.0002 - -1.0002 - -1.0002 - -1.0016 - -1.0021 - -1.0008 - -1.0007 - -1.0020 - -0.9993 - -0.9998 - -0.9993 - -0.9995 - -0.9999 - -1.0003 - -1.0002 - -0.9994 - -0.9997 - -0.9995 - -1.0000 - -0.9999 - -1.0001 - -0.9996 - -0.9997 - -1.0001 - -1.0007 - -1.0002 - -1.0007 - -0.9998 - -0.9991 - -0.9998 - -1.0005 - -1.0006 - -1.0000 - -1.0002 - -1.0008 - -1.0004 - -1.0006 - -1.0013 - -1.0011 - -1.0002 - -1.0001 - -1.0017 - -1.0001 - -0.9998 - -1.0001 - -0.9994 - -1.0006 - -1.0012 - -1.0009 - -1.0013 - -0.9995 - -0.9998 - -1.0016 - -1.0015 - -1.0001 - -0.9998 - -0.9998 - -1.0001 - -1.0006 - -1.0013 - -0.9998 - -0.9993 - -1.0015 - -1.0012 - -1.0012 - -1.0028 - -1.0023 - -0.9997 - -1.0000 - -1.0007 - -1.0007 - -1.0005 - -1.0014 - -1.0005 - -1.0013 - -1.0019 - -1.0001 - -1.0001 - -0.9994 - -1.0019 - -1.0004 - -1.0002 - -1.0001 - -1.0018 - -0.9994 - -0.9993 - -1.0004 - -1.0004 - -1.0002 - -1.0007 - -1.0004 - -1.0004 - -1.0008 - -0.9999 - -1.0000 - -1.0002 - -0.9999 - -1.0006 - -1.0005 - -1.0002 - -0.9994 - -1.0003 - -1.0014 - -1.0003 - -1.0004 - -1.0002 - -1.0022 - -0.9999 - -1.0012 - -1.0007 - -1.0006 - -0.9996 - -1.0002 - -0.9995 - -1.0007 - -1.0006 - -0.9997 - -0.9998 - -0.9993 - -1.0004 - -1.0001 - -1.0008 - -0.9994 - -1.0002 - -0.9997 - -1.0006 - -1.0013 - -1.0011 - -1.0002 - -1.0003 - -1.0005 - -1.0011 - -1.0010 - -1.0034 - -1.0009 - -0.9994 - -1.0001 - -1.0004 - -1.0002 - -1.0006 - -0.9996 - -0.9999 - -1.0003 - -1.0011 - -1.0018 - -1.0009 - -0.9993 - -1.0008 - -1.0003 - -1.0008 - -1.0024 - -1.0009 - -0.9998 - -0.9996 - -1.0011 - -1.0007 - -1.0011 - -0.9997 - -1.0000 - -0.9996 - -1.0008 - -1.0009 - -0.9994 - -0.9994 -
-1.0005 - -0.9993 - -1.0022 - -1.0023 - -0.9995 - -1.0007 - -0.9996 - -1.0019 - -1.0003 - -1.0008 - -1.0001 - -1.0017 - -1.0003 - -1.0005 - -1.0007 - -1.0004 - -0.9993 - -1.0003 - -1.0002 - -1.0000 - -0.9998 - -0.9996 - -0.9996 - -1.0003 - -0.9998 - -1.0007 - -0.9996 - -0.9993 - -0.9995 - -1.0020 - -1.0001 - -1.0007 - -1.0002 - -1.0003 - -0.9994 - -0.9999 - -1.0016 - -1.0001 - -1.0001 - -1.0006 - -0.9996 - -1.0002 - -1.0000 - -1.0006 - -0.9993 - -0.9994 - -1.0028 - -1.0028 - -1.0011 - -0.9994 - -1.0010 - -1.0026 - -1.0012 - -1.0004 - -0.9999 - -0.9998 - -1.0017 - -1.0016 - -1.0022 - -1.0007 - -1.0007 - -0.9998 - -1.0008 - -1.0037 - -1.0016 - -1.0017 - -1.0007 - -1.0004 - -1.0032 - -1.0002 - -1.0020 - -0.9995 - -0.9994 - -1.0009 - -1.0007 - -1.0014 - -0.9995 - -1.0012 - -1.0001 - -1.0001 - -1.0031 - -0.9992 - -0.9996 - -0.9996 - -0.9993 - -0.9997 - -0.9994 - -0.9992 - -1.0004 - -1.0007 - -0.9998 - -1.0002 - -0.9998 - -0.9998 - -0.9998 - -1.0011 - -0.9995 - -1.0007 - -1.0016 - -0.9994 - -0.9993 - -1.0010 - -1.0008 - -1.0001 - -1.0005 - -1.0001 - -0.9998 - -1.0009 - -1.0002 - -0.9995 - -1.0003 - -0.9994 - -0.9993 - -0.9997 - -1.0019 - -1.0008 - -1.0019 - -1.0006 - -1.0003 - -1.0000 - -1.0000 - -0.9998 - -1.0000 - -0.9998 - -1.0007 - -1.0007 - -1.0009 - -0.9999 - -1.0008 - -1.0005 - -0.9999 - -1.0050 - -1.0010 - -1.0015 - -1.0006 - -1.0008 - -1.0026 - -0.9998 - -1.0015 - -1.0003 - -1.0001 - -1.0014 - -1.0011 - -1.0018 - -0.9999 - -1.0016 - -1.0001 - -1.0005 - -1.0047 - -1.0013 - -1.0024 - -1.0015 - -1.0011 - -1.0008 - -0.9999 - -0.9999 - -0.9993 - -0.9992 - -1.0001 - -1.0005 - -1.0001 - -1.0000 - -1.0007 - -1.0000 - -0.9998 - -1.0029 - -1.0011 - -1.0000 - -1.0002 - -1.0001 - -1.0004 - -0.9995 - -1.0004 - -1.0005 - -1.0000 - -1.0003 - -1.0004 - -1.0000 - -1.0009 - -0.9994 - -0.9994 - -0.9996 - -1.0021 - -1.0000 - -0.9997 - -1.0001 - -1.0012 - -1.0003 - -0.9998 - -1.0005 - -1.0003 - -0.9999 - -1.0008 - -0.9997 - -0.9996 - -1.0011 - -1.0000 - -0.9994 - -1.0002 - -1.0017 - -1.0022 - -1.0000 - -0.9992 - -1.0008 - -1.0020 - -1.0012 - -0.9999 - -1.0002 - -1.0001 - -1.0012 - -1.0010 - -1.0018 - -1.0008 - -1.0010 - -0.9998 - -1.0012 - -1.0042 - -1.0019 - -1.0003 - -0.9994 - -0.9999 - -1.0013 - -1.0003 - -0.9997 - -0.9998 - -0.9996 - -1.0019 - -1.0015 - -1.0007 - -1.0007 - -1.0008 - -1.0000 - -1.0002 - -1.0028 -
-

Note that, for convenience, the full set of values is summarized rather than printed. Since trace plots for every estimated predictor are shown above, a full table reporting the \(\hat{R}\) for every parameter across the five imputed datasets was deemed unnecessary. For readers interested in the values of the various predictors, the following provides a description of the posterior estimates of each predictor in the model.

#label the fixed effects and format them as a grouped table with 'kableExtra'
matrix(fixef(CogRegression), ncol = 4,
       dimnames = list(c("MMSE", "Animals", "CERAD Delayed", "Logical Memory Immediate", "Constructional Praxis Immediate", "Symbol Digits Modality", "Constructional Praxis Delayed", "Logical Memory Delayed", "Logical Memory Recognition", "Trails A", "Trails B", "CERAD Immediate", "CERAD Discriminability", rep(c("Ravens", "Rurality: Suburban", "Rurality: Exurban", "Rurality: Not Classified", "Age", "Total Assets", "Financial Wealth", "Total Income", "Poverty Line Ratio", "Sex: Female", "Ethnicity: Hispanic", "Race: Black", "Race: Other", "Education", "Maternal Education", "Paternal Education", "Impaired: Yes"), 13)), c("Estimate", "Std. Error", "95% CI LB", "95% CI UB"))) %>%
  kable(caption = "Regression Coefficients Estimated in the Multivariate Regression", align = 'c', digits = 2) %>%
  kable_classic(full_width = FALSE, position = "center") %>%
  pack_rows("Intercepts", 1, 13) %>%
  pack_rows("MMSE", 14, 30) %>%
  pack_rows("Animals", 31, 47) %>%
  pack_rows("CERAD Delayed", 48, 64) %>%
  pack_rows("Logical Memory Immediate", 65, 81) %>%
  pack_rows("Constructional Praxis Immediate", 82, 98) %>%
  pack_rows("Symbol Digits Modality", 99, 115) %>%
  pack_rows("Constructional Praxis Delayed", 116, 132) %>%
  pack_rows("Logical Memory Delayed", 133, 149) %>%
  pack_rows("Logical Memory Recognition", 150, 166) %>%
  pack_rows("Trails A", 167, 183) %>%
  pack_rows("Trails B", 184, 200) %>%
  pack_rows("CERAD Immediate", 201, 217) %>%
  pack_rows("CERAD Discriminability", 218, 234)
Regression Coefficients Estimated in the Multivariate Regression

|                                 | Estimate | Std. Error | 95% CI LB | 95% CI UB |
|---------------------------------|:--------:|:----------:|:---------:|:---------:|
| **Intercepts**                  |          |            |           |           |
| MMSE                            | 24.51    | 0.59       | 23.35     | 25.69     |
| Animals                         | 16.09    | 1.42       | 13.31     | 18.87     |
| CERAD Delayed                   | 6.98     | 0.56       | 5.87      | 8.09      |
| Logical Memory Immediate        | 10.73    | 1.12       | 8.55      | 12.93     |
| Constructional Praxis Immediate | 4.50     | 0.50       | 3.52      | 5.47      |
| Symbol Digits Modality          | 43.03    | 2.11       | 38.92     | 47.18     |
| Constructional Praxis Delayed   | 5.54     | 0.69       | 4.20      | 6.90      |
| Logical Memory Delayed          | 10.54    | 1.22       | 8.16      | 12.93     |
| Logical Memory Recognition      | 11.72    | 0.63       | 10.49     | 12.96     |
| Trails A                        | 42.31    | 5.76       | 31.08     | 53.74     |
| Trails B                        | 134.03   | 10.94      | 112.69    | 155.49    |
| CERAD Immediate                 | 18.62    | 1.05       | 16.56     | 20.68     |
| CERAD Discriminability          | 10.52    | 0.44       | 9.67      | 11.38     |
| **MMSE**                        |          |            |           |           |
| Ravens                          | 0.27     | 0.02       | 0.25      | 0.30      |
| Rurality: Suburban              | -0.19    | 0.20       | -0.58     | 0.22      |
| Rurality: Exurban               | -0.09    | 0.16       | -0.41     | 0.22      |
| Rurality: Not Classified        | -0.03    | 0.10       | -0.22     | 0.17      |
| Age                             | -0.04    | 0.01       | -0.05     | -0.03     |
| Total Assets                    | 0.00     | 0.00       | 0.00      | 0.00      |
| Financial Wealth                | 0.00     | 0.00       | 0.00      | 0.00      |
| Total Income                    | 0.00     | 0.00       | 0.00      | 0.00      |
| Poverty Line Ratio              | 0.00     | 0.02       | -0.04     | 0.04      |
| Sex: Female                     | 0.81     | 0.08       | 0.65      | 0.97      |
| Ethnicity: Hispanic             | -0.30    | 0.19       | -0.68     | 0.08      |
| Race: Black                     | -0.37    | 0.13       | -0.62     | -0.12     |
| Race: Other                     | -0.61    | 0.23       | -1.06     | -0.17     |
| Education                       | 0.17     | 0.02       | 0.13      | 0.20      |
| Maternal Education              | 0.00     | 0.02       | -0.04     | 0.03      |
| Paternal Education              | 0.00     | 0.01       | -0.03     | 0.02      |
| Impaired: Yes                   | -0.28    | 0.08       | -0.44     | -0.13     |
| **Animals**                     |          |            |           |           |
| Ravens                          | 0.46     | 0.03       | 0.39      | 0.53      |
| Rurality: Suburban              | 0.45     | 0.47       | -0.47     | 1.38      |
| Rurality: Exurban               | -0.03    | 0.37       | -0.75     | 0.71      |
| Rurality: Not Classified        | -0.06    | 0.24       | -0.53     | 0.40      |
| Age                             | -0.13    | 0.01       | -0.16     | -0.10     |
| Total Assets                    | 0.00     | 0.00       | 0.00      | 0.00      |
| Financial Wealth                | 0.00     | 0.00       | 0.00      | 0.00      |
| Total Income                    | 0.00     | 0.00       | 0.00      | 0.00      |
| Poverty Line Ratio              | 0.00     | 0.05       | -0.09     | 0.10      |
| Sex: Female                     | 0.32     | 0.20       | -0.07     | 0.72      |
| Ethnicity: Hispanic             | 0.37     | 0.45       | -0.51     | 1.25      |
| Race: Black                     | -0.93    | 0.30       | -1.52     | -0.34     |
| Race: Other                     | -1.43    | 0.53       | -2.46     | -0.42     |
| Education                       | 0.32     | 0.04       | 0.23      | 0.40      |
| Maternal Education              | 0.03     | 0.04       | -0.05     | 0.10      |
| Paternal Education              | 0.03     | 0.04       | -0.04     | 0.10      |
| Impaired: Yes                   | -0.52    | 0.19       | -0.89     | -0.14     |
| **CERAD Delayed**               |          |            |           |           |
| Ravens                          | 0.16     | 0.01       | 0.13      | 0.18      |
| Rurality: Suburban              | 0.12     | 0.19       | -0.25     | 0.49      |
| Rurality: Exurban               | 0.07     | 0.15       | -0.22     | 0.36      |
| Rurality: Not Classified        | 0.10     | 0.10       | -0.09     | 0.28      |
| Age                             | -0.07    | 0.01       | -0.08     | -0.06     |
| Total Assets                    | 0.00     | 0.00       | 0.00      | 0.00      |
| Financial Wealth                | 0.00     | 0.00       | 0.00      | 0.00      |
| Total Income                    | 0.00     | 0.00       | 0.00      | 0.00      |
| Poverty Line Ratio              | 0.01     | 0.02       | -0.03     | 0.06      |
| Sex: Female                     | 1.23     | 0.08       | 1.08      | 1.38      |
| Ethnicity: Hispanic             | 0.00     | 0.18       | -0.35     | 0.35      |
| Race: Black                     | -0.08    | 0.12       | -0.32     | 0.15      |
| Race: Other                     | 0.06     | 0.21       | -0.36     | 0.47      |
| Education                       | 0.11     | 0.02       | 0.08      | 0.15      |
| Maternal Education              | 0.01     | 0.02       | -0.02     | 0.04      |
| Paternal Education              | -0.03    | 0.01       | -0.06     | -0.01     |
| Impaired: Yes                   | -0.42    | 0.08       | -0.57     | -0.27     |
| **Logical Memory Immediate**    |          |            |           |           |
| Ravens                          | 0.34     | 0.03       | 0.28      | 0.39      |
| Rurality: Suburban              | 0.51     | 0.39       | -0.26     | 1.26      |
| Rurality: Exurban               | 0.33     | 0.31       | -0.27     | 0.93      |
| Rurality: Not Classified        | 0.31     | 0.19       | -0.07     | 0.68      |
| Age                             | -0.12    | 0.01       | -0.14     | -0.10     |
| Total Assets                    | 0.00     | 0.00       | 0.00      | 0.00      |
| Financial Wealth                | 0.00     | 0.00       | 0.00      | 0.00      |
| Total Income                    | 0.00     | 0.00       | 0.00      | 0.00      |
| Poverty Line Ratio              | 0.03     | 0.04       | -0.05     | 0.11      |
| Sex: Female                     | 1.26     | 0.16       | 0.95      | 1.57      |
| Ethnicity: Hispanic             | -0.55    | 0.36       | -1.26     | 0.16      |
| Race: Black                     | -0.16    | 0.24       | -0.64     | 0.30      |
| Race: Other                     | -0.79    | 0.42       | -1.61     | 0.04      |
| Education                       | 0.26     | 0.04       | 0.18      | 0.32      |
| Maternal Education              | 0.08     | 0.03       | 0.02      | 0.15      |
| Paternal Education              | -0.03    | 0.03       | -0.09     | 0.02      |
| Impaired: Yes                   | -0.19    | 0.15       | -0.49     | 0.11      |
| **Constructional Praxis Immediate** |      |            |           |           |
| Ravens                          | 0.25     | 0.01       | 0.22      | 0.27      |
| Rurality: Suburban              | -0.17    | 0.17       | -0.50     | 0.16      |
| Rurality: Exurban               | 0.13     | 0.13       | -0.13     | 0.39      |
| Rurality: Not Classified        | -0.15    | 0.08       | -0.31     | 0.02      |
| Age                             | 0.00     | 0.01       | -0.01     | 0.01      |
| Total Assets                    | 0.00     | 0.00       | 0.00      | 0.00      |
| Financial Wealth                | 0.00     | 0.00       | 0.00      | 0.00      |
| Total Income                    | 0.00     | 0.00       | 0.00      | 0.00      |
| Poverty Line Ratio              | 0.02     | 0.02       | -0.01     | 0.05      |
| Sex: Female                     | -0.07    | 0.07       | -0.20     | 0.06      |
| Ethnicity: Hispanic             | -0.21    | 0.16       | -0.53     | 0.09      |
| Race: Black                     | -0.47    | 0.11       | -0.69     | -0.26     |
| Race: Other                     | 0.16     | 0.18       | -0.20     | 0.52      |
| Education                       | 0.08     | 0.02       | 0.05      | 0.11      |
| Maternal Education              | -0.03    | 0.01       | -0.06     | -0.01     |
| Paternal Education              | 0.02     | 0.01       | 0.00      | 0.04      |
| Impaired: Yes                   | 0.01     | 0.07       | -0.12     | 0.14      |
| **Symbol Digits Modality**      |          |            |           |           |
| Ravens                          | 0.95     | 0.05       | 0.85      | 1.05      |
| Rurality: Suburban              | -0.07    | 0.71       | -1.45     | 1.31      |
| Rurality: Exurban               | 0.00     | 0.57       | -1.11     | 1.11      |
| Rurality: Not Classified        | -0.20    | 0.36       | -0.90     | 0.50      |
| Age                             | -0.41    | 0.02       | -0.45     | -0.37     |
| Total Assets                    | 0.00     | 0.00       | 0.00      | 0.00      |
| Financial Wealth                | 0.00     | 0.00       | 0.00      | 0.00      |
| Total Income                    | 0.00     | 0.00       | 0.00      | 0.00      |
| Poverty Line Ratio              | -0.05    | 0.08       | -0.20     | 0.10      |
| Sex: Female                     | 2.84     | 0.29       | 2.26      | 3.41      |
| Ethnicity: Hispanic             | -1.60    | 0.67       | -2.91     | -0.29     |
| Race: Black                     | -3.52    | 0.45       | -4.40     | -2.64     |
| Race: Other                     | -1.35    | 0.79       | -2.91     | 0.21      |
| Education                       | 0.68     | 0.07       | 0.55      | 0.82      |
| Maternal Education              | -0.02    | 0.06       | -0.13     | 0.09      |
| Paternal Education              | 0.03     | 0.05       | -0.07     | 0.13      |
| Impaired: Yes                   | -1.08    | 0.29       | -1.64     | -0.51     |
| **Constructional Praxis Delayed** |        |            |           |           |
| Ravens                          | 0.32     | 0.02       | 0.28      | 0.35      |
| Rurality: Suburban              | 0.14     | 0.23       | -0.32     | 0.59      |
| Rurality: Exurban               | 0.18     | 0.19       | -0.18     | 0.54      |
| Rurality: Not Classified        | -0.11    | 0.12       | -0.34     | 0.12      |
| Age                             | -0.05    | 0.01       | -0.07     | -0.04     |
| Total Assets                    | 0.00     | 0.00       | 0.00      | 0.00      |
| Financial Wealth                | 0.00     | 0.00       | 0.00      | 0.00      |
| Total Income                    | 0.00     | 0.00       | 0.00      | 0.00      |
| Poverty Line Ratio              | 0.00     | 0.03       | -0.05     | 0.05      |
| Sex: Female                     | 0.00     | 0.10       | -0.19     | 0.19      |
| Ethnicity: Hispanic             | 0.13     | 0.22       | -0.30     | 0.57      |
| Race: Black                     | -0.13    | 0.15       | -0.43     | 0.15      |
| Race: Other                     | 0.03     | 0.26       | -0.47     | 0.53      |
| Education                       | 0.10     | 0.02       | 0.06      | 0.14      |
| Maternal Education              | -0.04    | 0.02       | -0.07     | 0.00      |
| Paternal Education              | 0.00     | 0.02       | -0.03     | 0.04      |
| Impaired: Yes                   | -0.34    | 0.09       | -0.53     | -0.16     |
| **Logical Memory Delayed**      |          |            |           |           |
| Ravens                          | 0.34     | 0.03       | 0.28      | 0.40      |
| Rurality: Suburban              | 0.55     | 0.42       | -0.27     | 1.36      |
| Rurality: Exurban               | 0.41     | 0.33       | -0.24     | 1.05      |
| Rurality: Not Classified        | 0.28     | 0.21       | -0.13     | 0.68      |
| Age                             | -0.15    | 0.01       | -0.18     | -0.13     |
| Total Assets                    | 0.00     | 0.00       | 0.00      | 0.00      |
| Financial Wealth                | 0.00     | 0.00       | 0.00      | 0.00      |
| Total Income                    | 0.00     | 0.00       | 0.00      | 0.00      |
| Poverty Line Ratio              | 0.03     | 0.04       | -0.06     | 0.12      |
| Sex: Female                     | 1.41     | 0.17       | 1.07      | 1.74      |
| Ethnicity: Hispanic             | -0.33    | 0.39       | -1.08     | 0.42      |
| Race: Black                     | -0.37    | 0.26       | -0.89     | 0.13      |
| Race: Other                     | -0.69    | 0.46       | -1.58     | 0.23      |
| Education                       | 0.28     | 0.04       | 0.20      | 0.35      |
| Maternal Education              | 0.07     | 0.03       | 0.00      | 0.14      |
| Paternal Education              | -0.02    | 0.03       | -0.08     | 0.04      |
| Impaired: Yes                   | -0.22    | 0.17       | -0.55     | 0.10      |
| **Logical Memory Recognition**  |          |            |           |           |
| Ravens                          | 0.13     | 0.02       | 0.10      | 0.17      |
| Rurality: Suburban              | 0.15     | 0.21       | -0.28     | 0.57      |
| Rurality: Exurban               | -0.03    | 0.17       | -0.36     | 0.31      |
| Rurality: Not Classified        | 0.18     | 0.11       | -0.03     | 0.39      |
| Age                             | -0.05    | 0.01       | -0.07     | -0.04     |
| Total Assets                    | 0.00     | 0.00       | 0.00      | 0.00      |
| Financial Wealth                | 0.00     | 0.00       | 0.00      | 0.00      |
| Total Income                    | 0.00     | 0.00       | 0.00      | 0.00      |
| Poverty Line Ratio              | 0.05     | 0.02       | 0.00      | 0.09      |
| Sex: Female                     | 0.49     | 0.09       | 0.32      | 0.66      |
| Ethnicity: Hispanic             | -0.42    | 0.20       | -0.81     | -0.03     |
| Race: Black                     | -0.13    | 0.14       | -0.40     | 0.13      |
| Race: Other                     | -0.64    | 0.24       | -1.12     | -0.16     |
| Education                       | 0.08     | 0.02       | 0.04      | 0.12      |
| Maternal Education              | 0.04     | 0.02       | 0.00      | 0.07      |
| Paternal Education              | -0.01    | 0.02       | -0.05     | 0.02      |
| Impaired: Yes                   | -0.20    | 0.09       | -0.37     | -0.03     |
| **Trails A**                    |          |            |           |           |
| Ravens                          | -1.95    | 0.14       | -2.23     | -1.67     |
| Rurality: Suburban              | -2.58    | 1.93       | -6.40     | 1.15      |
| Rurality: Exurban               | -1.21    | 1.51       | -4.21     | 1.69      |
| Rurality: Not Classified        | -0.63    | 0.97       | -2.55     | 1.27      |
| Age                             | 0.49     | 0.06       | 0.38      | 0.61      |
| Total Assets                    | 0.00     | 0.00       | 0.00      | 0.00      |
| Financial Wealth                | 0.00     | 0.00       | 0.00      | 0.00      |
| Total Income                    | 0.00     | 0.00       | 0.00      | 0.00      |
| Poverty Line Ratio              | 0.02     | 0.21       | -0.39     | 0.43      |
| Sex: Female                     | -2.11    | 0.79       | -3.67     | -0.56     |
| Ethnicity: Hispanic             | 2.35     | 1.82       | -1.22     | 5.90      |
| Race: Black                     | 4.99     | 1.23       | 2.58      | 7.40      |
| Race: Other                     | 2.05     | 2.15       | -2.12     | 6.24      |
| Education                       | -0.43    | 0.18       | -0.77     | -0.07     |
| Maternal Education              | -0.02    | 0.16       | -0.32     | 0.29      |
| Paternal Education              | -0.03    | 0.14       | -0.30     | 0.24      |
| Impaired: Yes                   | 1.86     | 0.78       | 0.34      | 3.37      |
| **Trails B**                    |          |            |           |           |
| Ravens                          | -2.76    | 0.26       | -3.28     | -2.24     |
| Rurality: Suburban              | 2.06     | 3.79       | -5.45     | 9.40      |
| Rurality: Exurban               | -0.32    | 2.97       | -6.17     | 5.52      |
| Rurality: Not Classified        | 0.46     | 1.90       | -3.27     | 4.23      |
| Age                             | 0.76     | 0.11       | 0.53      | 0.98      |
| Total Assets                    | 0.00     | 0.00       | 0.00      | 0.00      |
| Financial Wealth                | 0.00     | 0.00       | 0.00      | 0.00      |
| Total Income                    | 0.00     | 0.00       | 0.00      | 0.00      |
| Poverty Line Ratio              | 0.31     | 0.40       | -0.48     | 1.09      |
| Sex: Female                     | -2.78    | 1.56       | -5.86     | 0.26      |
| Ethnicity: Hispanic             | 1.87     | 3.55       | -5.06     | 8.85      |
| Race: Black                     | 12.14    | 2.36       | 7.50      | 16.80     |
| Race: Other                     | 3.07     | 4.09       | -4.92     | 11.10     |
| Education                       | -1.42    | 0.35       | -2.10     | -0.75     |
| Maternal Education              | -0.06    | 0.30       | -0.64     | 0.51      |
| Paternal Education              | -0.30    | 0.27       | -0.84     | 0.23      |
| Impaired: Yes                   | 4.36     | 1.54       | 1.29      | 7.36      |
| **CERAD Immediate**             |          |            |           |           |
| Ravens                          | 0.36     | 0.03       | 0.31      | 0.41      |
| Rurality: Suburban              | 0.34     | 0.35       | -0.36     | 1.04      |
| Rurality: Exurban               | 0.08     | 0.28       | -0.47     | 0.63      |
| Rurality: Not Classified        | 0.31     | 0.18       | -0.03     | 0.66      |
| Age                             | -0.12    | 0.01       | -0.14     | -0.09     |
| Total Assets                    | 0.00     | 0.00       | 0.00      | 0.00      |
| Financial Wealth                | 0.00     | 0.00       | 0.00      | 0.00      |
| Total Income                    | 0.00     | 0.00       | 0.00      | 0.00      |
| Poverty Line Ratio              | 0.05     | 0.04       | -0.03     | 0.13      |
| Sex: Female                     | 2.52     | 0.14       | 2.24      | 2.81      |
| Ethnicity: Hispanic             | -0.17    | 0.33       | -0.82     | 0.48      |
| Race: Black                     | 0.69     | 0.22       | 0.25      | 1.12      |
| Race: Other                     | -0.20    | 0.39       | -0.97     | 0.56      |
| Education                       | 0.22     | 0.03       | 0.16      | 0.29      |
| Maternal Education              | -0.03    | 0.03       | -0.08     | 0.03      |
| Paternal Education              | -0.01    | 0.03       | -0.05     | 0.04      |
| Impaired: Yes                   | -0.67    | 0.14       | -0.94     | -0.38     |
| **CERAD Discriminability**      |          |            |           |           |
| Ravens                          | 0.11     | 0.01       | 0.09      | 0.13      |
| Rurality: Suburban              | 0.21     | 0.15       | -0.07     | 0.50      |
| Rurality: Exurban               | 0.12     | 0.12       | -0.11     | 0.35      |
| Rurality: Not Classified        | 0.11     | 0.07       | -0.03     | 0.26      |
| Age                             | -0.04    | 0.00       | -0.05     | -0.03     |
| Total Assets                    | 0.00     | 0.00       | 0.00      | 0.00      |
| Financial Wealth                | 0.00     | 0.00       | 0.00      | 0.00      |
| Total Income                    | 0.00     | 0.00       | 0.00      | 0.00      |
| Poverty Line Ratio              | 0.03     | 0.02       | 0.00      | 0.06      |
| Sex: Female                     | 0.40     | 0.06       | 0.29      | 0.52      |
| Ethnicity: Hispanic             | -0.15    | 0.14       | -0.42     | 0.12      |
| Race: Black                     | 0.04     | 0.09       | -0.14     | 0.23      |
| Race: Other                     | 0.17     | 0.16       | -0.14     | 0.49      |
| Education                       | 0.03     | 0.01       | 0.01      | 0.06      |
| Maternal Education              | -0.01    | 0.01       | -0.03     | 0.01      |
| Paternal Education              | -0.02    | 0.01       | -0.04     | 0.00      |
| Impaired: Yes                   | -0.20    | 0.06       | -0.32     | -0.09     |

Readers are cautioned against over-interpreting the results in the table above. The objective of this analysis was to produce a means of reasonably estimating cognitive test performance in a way that reduces the risk of biased over-exclusion of individuals from minority backgrounds. As a result, the scales of some variables make the results seem meaningless. For example, when rounded to two decimal places, the coefficients for the various financial variables are essentially zero and thus seemingly irrelevant. In reality, however, these very small coefficients are sensible considering that, for example, the average financial wealth in the sample is $178,415.80 (SD = $678,850.10). Considering that the largest raw score being predicted is the 300-second cutoff for the Trail Making Test, the regression coefficient scale is necessarily very small. If the intention were to present this model and test whether certain variables mediated, for example, the effects of race, then the data would need to be appropriately transformed to communicate these variables' importance adequately. In this case, such transformations do not change the final predictions, so they were not made a priori. While this reduced some data cleaning efforts (e.g., no need to consider how to scale data across multiply imputed datasets), the downside is that it is difficult to understand what the variables might convey for applications like clinical practice. Thus, again, readers are encouraged to evaluate the above results cautiously with this in mind.
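To make this point concrete, the sketch below shows one way the financial variables could have been rescaled had interpretability been a goal. This step was not part of the actual pipeline, and the object name impList is a placeholder for the list of multiply imputed datasets:

#hypothetical rescaling sketch; impList stands in for the list of multiply imputed datasets
#a coefficient on FinancialWealth10k then reads as "points per $10,000" rather than rounding to 0.00
impList <- lapply(impList, function(d) {
  d$FinancialWealth10k <- d$FinancialWealth / 10000
  d
})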


Another aspect to consider from the regression is how well it recovers the data generation process that led to the observed data to begin with. A particularly useful way of checking whether a model may be misspecified is via posterior predictive checks. Again, details of posterior predictive checks are discussed in the manuscript and also in the other supplementary material. Readers less familiar with these plots are encouraged to review those documents. Shown below are the posterior predictive checks for each of the outcomes in the multivariate regression.

#note that functions are wrapped in suppressWarnings() to avoid reprinting the same default warning from 'brms'
#Warning reads: "Using only the first imputed data set. Please interpret the results with caution until a more principled approach has been implemented"
#Warning stems from 'brms' storing just one of the imputed datasets, which then gets used for the post-processing (meaning the other 4 datasets used to derive the posterior are not stored and thus unused in post-processing)

suppressWarnings(pp_check(CogRegression, resp = "MMSE", nsamples = 25) +
  xlim(-5, 35)) #used to zoom into actual MMSE range

## Warning: Removed 8992 rows containing non-finite values (stat_density).

[posterior predictive density plot for MMSE]

suppressWarnings(pp_check(CogRegression, resp = "CERADimm", nsamples = 25) +
  xlim(-5, 35))

## Warning: Removed 9992 rows containing non-finite values (stat_density).

[posterior predictive density plot for CERADimm]

suppressWarnings(pp_check(CogRegression, resp = "CERADdel", nsamples = 25) +
  xlim(-5, 15))

## Warning: Removed 10351 rows containing non-finite values (stat_density).

[posterior predictive density plot for CERADdel]

suppressWarnings(pp_check(CogRegression, resp = "CERADdisc", nsamples = 25) +
  xlim(-15, 15))

## Warning: Removed 8093 rows containing non-finite values (stat_density).

[posterior predictive density plot for CERADdisc]

suppressWarnings(pp_check(CogRegression, resp = "LMIM", nsamples = 25) +
  xlim(-5, 28))

## Warning: Removed 12645 rows containing non-finite values (stat_density).

[posterior predictive density plot for LMIM]

suppressWarnings(pp_check(CogRegression, resp = "LMDM", nsamples = 25) +
  xlim(-5, 30))

## Warning: Removed 13950 rows containing non-finite values (stat_density).

[posterior predictive density plot for LMDM]

suppressWarnings(pp_check(CogRegression, resp = "LMrcg", nsamples = 25) +
  xlim(-5, 20))

## Warning: Removed 9686 rows containing non-finite values (stat_density).

[posterior predictive density plot for LMrcg]

suppressWarnings(pp_check(CogRegression, resp = "CPIM", nsamples = 25) +
  xlim(-5, 16))

## Warning: Removed 9237 rows containing non-finite values (stat_density).

[posterior predictive density plot for CPIM]

suppressWarnings(pp_check(CogRegression, resp = "CPDM", nsamples = 25) +
  xlim(-5, 26))

## Warning: Removed 8993 rows containing non-finite values (stat_density).

[posterior predictive density plot for CPDM]

suppressWarnings(pp_check(CogRegression, resp = "Animals", nsamples = 25) +
  xlim(-5, 50))

## Warning: Removed 10082 rows containing non-finite values (stat_density).

[posterior predictive density plot for Animals]

suppressWarnings(pp_check(CogRegression, resp = "SDMT", nsamples = 25) +
  xlim(-5, 75))

## Warning: Removed 10815 rows containing non-finite values (stat_density).

[posterior predictive density plot for SDMT]

suppressWarnings(pp_check(CogRegression, resp = "TMTA", nsamples = 25) +
  xlim(-5, 305))

## Warning: Removed 12718 rows containing non-finite values (stat_density).

[posterior predictive density plot for TMTA]

suppressWarnings(pp_check(CogRegression, resp = "TMTB", nsamples = 25) +
  xlim(-5, 305))

## Warning: Removed 19280 rows containing non-finite values (stat_density).

[posterior predictive density plot for TMTB]

As would be expected, the model predicts some outcomes better than others. That said, the model does a generally good job of capturing the rough shape of most of the outcomes and approximates the central tendency of each test fairly well. The central tendency is the more valuable aspect of this regression, as the ultimate goal is to identify individuals who are below average rather than assigning or estimating some quantile at which a performance falls. The model fit could likely be improved in a few ways. First, some outcomes clearly reflect a mixture of distributions. These mixtures might be as simple as two classes, a "can do the task" group and a "cannot do the task" group, which appears plausible for Trails B and CERAD Delayed Recall. This being said, mixture models in brms are challenging to run, and finding the correct number of mixture components to evaluate is non-trivial. Second, the models could benefit from being respecified away from a Student t likelihood. While better than a standard normal distribution for these data, the Student t distribution still makes assumptions that are not appropriate for the bounded count data that characterize the majority of these outcomes. Likelihoods that may be more useful include the binomial, the skew normal, and potentially the log-normal. Alternative likelihoods were not explored because brms currently supports estimation of the residual correlations only when the data are fit with a multivariate normal or Student t likelihood. Third, a distributional model that estimates a separate sigma (residual variance) for each outcome may lead to even better, more individualized prediction. Again, the current results are sufficient for the purposes needed, but it is important that readers recognize that these methods should not be used as a basis for inference, as that was not the intent of the methods.
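As an illustration of the third suggestion, the sketch below shows how a distributional model might be specified in 'brms' for a single outcome, giving sigma its own linear predictor. This is a sketch of the idea rather than a model that was fit, and the predictor set shown is arbitrary:

#not fit in this study: distributional-model sketch in which sigma gets
#its own linear predictor alongside the mean
library(brms)

f <- bf(MMSE ~ Ravens + Age + Education,
        sigma ~ Age + Education)
#fit <- brm(f, data = dat, family = student())  #dat is a placeholder for the analysis data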


As has been highlighted in this document, one of the benefits of the multivariate multiple regression is the ability to estimate the correlation among the residuals. This multivariate nature is important for analyzing cognitive test performance across a battery rather than on a single test. Because the aim is to base cognitive diagnostic classification on the residuals from the model's predictions, having information about the correlation among those residuals helps improve accuracy. Toward this end, the correlation matrix for the residuals is shown here:

#extract the residual correlation point estimates and format them with 'kableExtra'
matrix(VarCorr(CogRegression)$residual__$cor[, 1, ],
       nrow = 13, ncol = 13, byrow = FALSE, dimnames = list(c("MMSE", "Animals", "CERAD Delayed", "Logical Memory Immediate", "Constructional Praxis Immediate", "Symbol Digits Modality", "Constructional Praxis Delayed", "Logical Memory Delayed", "Logical Memory Recognition", "Trails A", "Trails B", "CERAD Immediate", "CERAD Discriminability"), c("MMSE", "Animals", "CERAD Delayed", "Logical Memory Immediate", "Constructional Praxis Immediate", "Symbol Digits Modality", "Constructional Praxis Delayed", "Logical Memory Delayed", "Logical Memory Recognition", "Trails A", "Trails B", "CERAD Immediate", "CERAD Discriminability"))) %>%
  kable(caption = "Residual Correlations from the Multivariate Regression", align = 'c', digits = 2) %>%
  kable_classic(full_width = FALSE, position = "center")
Residual Correlations from the Multivariate Regression

|                                 | MMSE | Animals | CERAD Delayed | Logical Memory Immediate | Constructional Praxis Immediate | Symbol Digits Modality | Constructional Praxis Delayed | Logical Memory Delayed | Logical Memory Recognition | Trails A | Trails B | CERAD Immediate | CERAD Discriminability |
|---------------------------------|:----:|:-------:|:-------------:|:------------------------:|:-------------------------------:|:----------------------:|:-----------------------------:|:----------------------:|:--------------------------:|:--------:|:--------:|:---------------:|:----------------------:|
| MMSE                            | 1.00 | 0.20    | 0.33          | 0.25                     | 0.19                            | 0.32                   | 0.24                          | 0.22                   | 0.23                       | -0.38    | -0.31    | 0.35            | 0.35                   |
| Animals                         | 0.20 | 1.00    | 0.30          | 0.23                     | 0.13                            | 0.30                   | 0.23                          | 0.27                   | 0.17                       | -0.19    | -0.27    | 0.32            | 0.18                   |
| CERAD Delayed                   | 0.33 | 0.30    | 1.00          | 0.37                     | 0.10                            | 0.32                   | 0.36                          | 0.42                   | 0.29                       | -0.20    | -0.23    | 0.71            | 0.50                   |
| Logical Memory Immediate        | 0.25 | 0.23    | 0.37          | 1.00                     | 0.03                            | 0.18                   | 0.19                          | 0.80                   | 0.52                       | -0.14    | -0.18    | 0.35            | 0.25                   |
| Constructional Praxis Immediate | 0.19 | 0.13    | 0.10          | 0.03                     | 1.00                            | 0.22                   | 0.44                          | 0.05                   | 0.05                       | -0.18    | -0.18    | 0.12            | 0.04                   |
| Symbol Digits Modality          | 0.32 | 0.30    | 0.32          | 0.18                     | 0.22                            | 1.00                   | 0.28                          | 0.22                   | 0.15                       | -0.48    | -0.52    | 0.30            | 0.19                   |
| Constructional Praxis Delayed   | 0.24 | 0.23    | 0.36          | 0.19                     | 0.44                            | 0.28                   | 1.00                          | 0.23                   | 0.18                       | -0.21    | -0.23    | 0.29            | 0.27                   |
| Logical Memory Delayed          | 0.22 | 0.27    | 0.42          | 0.80                     | 0.05                            | 0.22                   | 0.23                          | 1.00                   | 0.48                       | -0.13    | -0.18    | 0.36            | 0.27                   |
| Logical Memory Recognition      | 0.23 | 0.17    | 0.29          | 0.52                     | 0.05                            | 0.15                   | 0.18                          | 0.48                   | 1.00                       | -0.14    | -0.17    | 0.25            | 0.25                   |
| Trails A                        | -0.38 | -0.19  | -0.20         | -0.14                    | -0.18                           | -0.48                  | -0.21                         | -0.13                  | -0.14                      | 1.00     | 0.49     | -0.22           | -0.20                  |
| Trails B                        | -0.31 | -0.27  | -0.23         | -0.18                    | -0.18                           | -0.52                  | -0.23                         | -0.18                  | -0.17                      | 0.49     | 1.00     | -0.25           | -0.14                  |
| CERAD Immediate                 | 0.35 | 0.32    | 0.71          | 0.35                     | 0.12                            | 0.30                   | 0.29                          | 0.36                   | 0.25                       | -0.22    | -0.25    | 1.00            | 0.43                   |
| CERAD Discriminability          | 0.35 | 0.18    | 0.50          | 0.25                     | 0.04                            | 0.19                   | 0.27                          | 0.27                   | 0.25                       | -0.20    | -0.14    | 0.43            | 1.00                   |
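As an aside, one way these residual correlations can sharpen classification is by conditioning one test's standardized residual on the others. The sketch below is illustrative only, not part of the original pipeline; it assumes standardized residuals and simply applies the usual multivariate normal conditioning formula to the estimated correlation matrix:

#illustrative only: conditional mean/variance of one test's standardized residual
#given the other 12, using the estimated residual correlation matrix
R <- matrix(VarCorr(CogRegression)$residual__$cor[, 1, ], nrow = 13, ncol = 13)
i <- 1                                #index of the test of interest (here, MMSE)
w <- R[i, -i] %*% solve(R[-i, -i])    #weights on the other tests' standardized residuals
cond_var <- 1 - w %*% R[-i, i]        #conditional variance of residual i
#conditional mean for a person with standardized residuals z: as.numeric(w %*% z[-i])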

The final detail to provide regarding the regression model is a gross summary of its performance. The following table summarizes the Bayesian \(R^2\) statistic for each outcome estimated.

#summarize the Bayesian R2 for each outcome; the quantiles give the 95% and 90% interval bounds
matrix(posterior_summary(CogRegression$criteria$bayes_R2, probs = c(0.025, 0.05, 0.95, 0.975)),
       nrow = 13, ncol = 6, byrow = FALSE,
       dimnames = list(c("MMSE", "Animals", "CERAD Delayed", "Logical Memory Immediate", "Constructional Praxis Immediate", "Symbol Digits Modality", "Constructional Praxis Delayed", "Logical Memory Delayed", "Logical Memory Recognition", "Trails A", "Trails B", "CERAD Immediate", "CERAD Discriminability"), c("Estimate", "Std. Error", "95% CI LB", "90% CI LB", "90% CI UB", "95% CI UB"))) %>%
  kable(caption = "Model Performance from the Multivariate Regression", align = 'c', digits = c(2, 2, 4, 4, 4, 4)) %>%
  kable_classic(full_width = FALSE, position = "center")
Model Performance from the Multivariate Regression

|                                 | Estimate | Std. Error | 95% CI LB | 90% CI LB | 90% CI UB | 95% CI UB |
|---------------------------------|:--------:|:----------:|:---------:|:---------:|:---------:|:---------:|
| MMSE                            | 0.19     | 0.01       | 0.1666    | 0.1701    | 0.2031    | 0.2111    |
| Animals                         | 0.23     | 0.01       | 0.1991    | 0.2033    | 0.2444    | 0.2535    |
| CERAD Delayed                   | 0.23     | 0.01       | 0.2016    | 0.2057    | 0.2448    | 0.2539    |
| Logical Memory Immediate        | 0.23     | 0.01       | 0.1992    | 0.2037    | 0.2449    | 0.2544    |
| Constructional Praxis Immediate | 0.26     | 0.01       | 0.2344    | 0.2389    | 0.2811    | 0.2907    |
| Symbol Digits Modality          | 0.36     | 0.01       | 0.3355    | 0.3396    | 0.3800    | 0.3888    |
| Constructional Praxis Delayed   | 0.25     | 0.01       | 0.2201    | 0.2245    | 0.2670    | 0.2767    |
| Logical Memory Delayed          | 0.24     | 0.01       | 0.2111    | 0.2152    | 0.2571    | 0.2662    |
| Logical Memory Recognition      | 0.13     | 0.01       | 0.1027    | 0.1066    | 0.1409    | 0.1493    |
| Trails A                        | 0.07     | 0.01       | 0.0555    | 0.0573    | 0.0747    | 0.0789    |
| Trails B                        | 0.07     | 0.01       | 0.0535    | 0.0553    | 0.0733    | 0.0778    |
| CERAD Immediate                 | 0.24     | 0.01       | 0.2140    | 0.2180    | 0.2565    | 0.2653    |
| CERAD Discriminability          | 0.08     | 0.01       | 0.0644    | 0.0666    | 0.0896    | 0.0952    |

As is relatively apparent from the table, the predictors do not explain large proportions of variance in expected scores; this is particularly true for the Trail Making Test. On the one hand, this is good and makes sense: we do not expect cognitive test performance to be entirely (or even primarily) due to demographic and background factors. On the other hand, it is not ideal for producing highly accurate predictions of performance. Regardless, the residual distributions are ultimately the point of interest here, as they are used as the metric for inferring cognitive impairment on a given test.


Latent Class Analysis Results


The latent class analysis was the ultimate end goal of the entire process. In the absence of any clinically defined diagnostic status, there needed to be a method for detecting these unobserved groups of individuals. There are two primary methods for doing this: latent profile analysis and latent class analysis. Both share the same basic purpose: utilize assumed parametric statistical properties to classify individuals into some number of homogeneous classes. The distinction between the two methods is that latent profile analysis takes continuous variables (e.g., raw scores) as input while latent class analysis takes dichotomous variables as input. A special note is made here that we deliberately excluded methods like cluster analysis, which impose no statistical assumptions and instead group the observed data based on differences within the data; there is no statistical framework for making inferences about whether such clusters extend to other samples or settings. While clustering methods are ideal for many large-data and machine learning settings, where the available data either represent the true population or are so large that sampling error is negligible, a deliberate effort was made to rely on inferential methods for this particular application because the HRS very carefully samples a population of interest.


Latent class analysis was selected over latent profile analysis because the interest was in clinically defined diagnostic groups. Arguably, latent profile analysis could be considered easier since it would not have required developing a separate regression equation to predict performance. With latent profile analysis, however, many different latent classes may be identified, and there is no guarantee that these classes correspond to individuals with cognitive impairment; instead, latent profiles may correspond to groups of individuals with relative strengths in certain areas. Relying only on dichotomous indicators of impaired versus not impaired ensures that the model is making decisions about only that aspect of testing performance. Additionally, recent work by Jak, Bondi, and colleagues has emphasized the utility of a simple criterion that dichotomizes cognitive data at >1 SD below expectations for detecting mild cognitive impairment (MCI). Detection of individuals with likely dementia is not necessarily difficult given the battery of tests administered; however, discriminating between MCI and normal cognition is trickier, so a method built on existing literature to detect MCI was preferred (a sketch of this dichotomization step is shown below).
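The following is a minimal sketch of that dichotomization step; the object name obs_scores is hypothetical, and for the timed tests (where higher scores are worse) the sign of the comparison would need to be flipped:

#illustrative sketch: flag a test as impaired (1) when the observed score falls
#more than 1 SD below the model-predicted score, in the spirit of the >1SD criterion above
pred <- fitted(CogRegression)               #array: observations x summaries x outcomes
resid <- obs_scores - pred[, "Estimate", ]  #obs_scores is a hypothetical matrix of raw scores
impaired01 <- apply(resid, 2, function(r) as.integer(r < -sd(r, na.rm = TRUE)))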


The primary challenge of any latent class analysis is choosing the number of latent classes to model. The BayesLCA package provides a convenient method for deriving the number of latent classes via a variational Bayes estimator. Specifically, the Dirichlet prior for class membership/mixture can be set to some value less than 1, and the number of classes to fit can be set to some value larger than what is expected (i.e., the model can be deliberately overfit). The result is that any unused classes are returned empty, and only as many classes as are needed to describe the data are filled. In this case, the Dirichlet mixing prior was specified as 1/10 and 10 latent classes were fit. This procedure yielded 8 uniquely estimated latent classes. Based on these 8 classes, Class 1 was used as the "cognitively normal" group, as this was the most clearly non-impaired group identified. To help understand this decision-making process, we can look at the probability of group membership as a function of each test being scored as either 0 (not impaired relative to expected performance) or 1 (impaired relative to expected performance).
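The call that such a procedure corresponds to is sketched below, assuming impaired01 is the n x 13 binary impairment matrix (as in the earlier sketch); blca.vb() is the variational Bayes estimator in the 'BayesLCA' package, and delta = 1/10 is the sparse Dirichlet prior just described:

#overfitted variational Bayes LCA: request 10 classes with a sparse mixing prior
#so that unneeded classes are emptied out
library(BayesLCA)

LCAres <- blca.vb(impaired01, G = 10, delta = 1/10)
round(LCAres$classprob, 3) #mixing proportions; ~8 classes remain non-empty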

#display the per-class probabilities of an impaired score on each test
matrix(LCAres$itemprob[1:8, ], nrow = 8, ncol = 13, byrow = FALSE,
       dimnames = list(paste("Class", 1:8), c("MMSE", "Animals", "CERAD Delayed", "Logical Memory Immediate", "Constructional Praxis Immediate", "Symbol Digits Modality", "Constructional Praxis Delayed", "Logical Memory Delayed", "Logical Memory Recognition", "Trails A", "Trails B", "CERAD Immediate", "CERAD Discriminability"))) %>%
  kable(caption = "Probabilities of Class Membership by Dichotomous Test Performance", align = 'c', digits = 2) %>%
  kable_classic(full_width = FALSE, position = "center")
Probabilities of Class Membership by Dichotomous Test Performance

|         | MMSE | Animals | CERAD Delayed | Logical Memory Immediate | Constructional Praxis Immediate | Symbol Digits Modality | Constructional Praxis Delayed | Logical Memory Delayed | Logical Memory Recognition | Trails A | Trails B | CERAD Immediate | CERAD Discriminability |
|---------|:----:|:-------:|:-------------:|:------------------------:|:-------------------------------:|:----------------------:|:-----------------------------:|:----------------------:|:--------------------------:|:--------:|:--------:|:---------------:|:----------------------:|
| Class 1 | 0.03 | 0.03    | 0.03          | 0.03                     | 0.06                            | 0.05                   | 0.04                          | 0.03                   | 0.04                       | 0.00     | 0.00     | 0.02            | 0.03                   |
| Class 2 | 0.26 | 0.10    | 0.75          | 0.19                     | 0.03                            | 0.14                   | 0.40                          | 0.08                   | 0.22                       | 0.00     | 0.01     | 0.42            | 0.56                   |
| Class 3 | 0.04 | 0.19    | 0.08          | 0.69                     | 0.11                            | 0.11                   | 0.12                          | 0.67                   | 0.39                       | 0.00     | 0.00     | 0.04            | 0.02                   |
| Class 4 | 0.86 | 0.15    | 0.26          | 0.05                     | 0.31                            | 0.53                   | 0.03                          | 0.01                   | 0.19                       | 0.00     | 0.01     | 0.43            | 0.58                   |
| Class 5 | 0.86 | 0.51    | 0.78          | 0.93                     | 0.24                            | 0.51                   | 0.97                          | 0.47                   | 0.46                       | 0.00     | 0.00     | 0.86            | 0.63                   |
| Class 6 | 0.16 | 0.39    | 0.97          | 0.65                     | 0.01                            | 0.18                   | 0.60                          | 0.93                   | 0.42                       | 0.00     | 0.00     | 0.54            | 0.53                   |
| Class 7 | 0.14 | 0.09    | 0.15          | 0.02                     | 0.95                            | 0.26                   | 0.92                          | 0.20                   | 0.03                       | 0.00     | 0.00     | 0.09            | 0.04                   |
| Class 8 | 0.99 | 0.83    | 0.91          | 0.47                     | 0.32                            | 0.99                   | 0.11                          | 0.06                   | 0.88                       | 0.00     | 0.00     | 1.00            | 0.98                   |

As can be seen from these probabilities, Class 1 is the only class with a uniformly small probability of an abnormal score on any test. The clearest contrast is with Class 8, where the probability of membership increases dramatically with each abnormal test result (most strongly among the memory tests). All of the other classes likewise show an increased probability of membership with successive abnormal test results. Class 2 appears to be defined by abnormal CERAD performance, and Class 3 by abnormal Logical Memory performance. Classes 4 and 5 are both more likely when the MMSE is abnormal, but Class 5 also shows more gross impairment across all memory tests. Class 6 is defined by an amnestic profile with intact general cognition (i.e., MMSE). Class 7 appears more visuospatial in nature, as the Constructional Praxis tests define its membership. Based on inspection of these results, any case grouped into Class 1 was considered cognitively normal and all other individuals were considered impaired. As a final note of clarification, a cognitively normal assignment did not trump the informant report with regard to inclusion: individuals who were found to be cognitively normal using the methods described throughout this document, but whose informant reported any impairment, were still classified as impaired and thus excluded from the study.
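A sketch of how that final rule might be expressed in code follows; LCAres$Z is the class-membership probability matrix returned by 'BayesLCA', while df and its informant-based Impaired flag are placeholders for the analysis data described elsewhere in this repository:

#illustrative only: combine the modal LCA class with the informant report
class_assign <- apply(LCAres$Z, 1, which.max)      #modal class per respondent
include <- class_assign == 1 & df$Impaired == "No" #Class 1 AND no informant-reported impairment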

diff --git a/eIRT-CERAD-main/ItemCovStudy/R Script Files/1 - Data Creation and Setup.R b/eIRT-CERAD-main/ItemCovStudy/R Script Files/1 - Data Creation and Setup.R
deleted file mode 100644
index 4f823b29..00000000
--- a/eIRT-CERAD-main/ItemCovStudy/R Script Files/1 - Data Creation and Setup.R
+++ /dev/null
@@ -1,336 +0,0 @@

#################################################
## DATA CREATION FROM HRS AND RAND DATA FILES
#################################################

#load in needed packages
library(asciiSetupReader) #needed to read in HRS data
library(haven) #needed to read in RAND data
library(tidyverse) #needed for data wrangling
library(reshape2) #provides simpler wide-to-long transformation

########################
## CREATE DATA FOR STUDY
########################

#read respondent and informant files into R
df_resp <- read_ascii_setup(data = "Data Files/HC16HP_R.da", setup_file = "Data Files/HC16HP_R.sas", use_value_labels = TRUE, use_clean_names = TRUE)
df_info <- read_ascii_setup(data = "Data Files/HC16HP_I.da", setup_file = "Data Files/HC16HP_I.sas", use_value_labels = TRUE, use_clean_names = TRUE)

#get rid of cases with diagnoses
stroke <- which(df_info$INFORMANT_R_DIAGNOSED_WITH_STROKE == 1)
parkin <- which(df_info$INFORMANT_R_DIAGNOSED_WITH_PARKINSON_S == 1)
alzhie <- which(df_info$INFORMANT_R_DIAGNOSED_WITH_ALZHEIMER_S == 1)
memory <- which(df_info$INFORMANT_R_DIAGNOSED_WITH_MEM_PROBLEMS == 1)
impaired <- unique(c(stroke, parkin, alzhie, memory))

df_imp <- df_resp[impaired, ]
df_resp <- df_resp[-impaired, ]
df_info <- df_info[-impaired, ]

#get rid of cases with functional impairment related to Blessed
#(note: '&' binds tighter than '|' in R, so each condition evaluates as
#(df_info[, a] > 1 & df_info[, b] == 2) | df_info[, b] == 3)
Blessed1 <- which(df_info[, 159] > 1 & df_info[, 160] == 2 | df_info[, 160] == 3)
Blessed2 <- which(df_info[, 161] > 1 & df_info[, 162] == 2 | df_info[, 162] == 3)
Blessed3 <- which(df_info[, 163] > 1 & df_info[, 164] == 2 | df_info[, 164] == 3)
Blessed4 <- which(df_info[, 165] > 1 & df_info[, 166] == 2 | df_info[, 166] == 3)
Blessed5 <- which(df_info[, 167] > 1 & df_info[, 168] == 2 | df_info[, 168] == 3)
Blessed6 <- which(df_info[, 169] > 1 & df_info[, 170] == 2 | df_info[, 170] == 3)
Blessed7 <- which(df_info[, 171] > 1 & df_info[, 172] == 2 | df_info[, 172] == 3)
Blessed8 <- which(df_info[, 173] > 1 & df_info[, 174] == 2 | df_info[, 174] == 3)
blessed <- unique(c(Blessed1, Blessed2, Blessed3, Blessed4, Blessed5, Blessed6, Blessed7, Blessed8))

df_blsd <- df_resp[blessed, ]
df_resp <- df_resp[-blessed, ]

#merge the data
df_resp$Impaired <- rep("No", nrow(df_resp))
df_imp$Impaired <- rep("Yes", nrow(df_imp))
df_blsd$Impaired <- rep("Yes", nrow(df_blsd))
df_resp <- rbind(df_resp, df_imp)
df_resp <- rbind(df_resp, df_blsd)
df_resp$Impaired <- as.factor(df_resp$Impaired)

#clean up data environment
rm(df_info, df_imp, df_blsd, alzhie, blessed, Blessed1, Blessed2, Blessed3, Blessed4, Blessed5, Blessed6, Blessed7, Blessed8, impaired, memory, parkin, stroke)

#get just the CERAD data
df_resp <- df_resp[which(df_resp$CERAD_WORD_LIST_IMMEDIATE_COMPLETION_STATUS == 1), ] #keep only cases who completed CERAD trials
df_resp <- df_resp[which(df_resp$IWER_CHECKPOINT_IW_LANGUAGE == 1), ] #keep only those who completed the interview in English
df_resp <- df_resp[, c(1:2, 51:60, 66:75, 81:90, 14, 49, 64, 79, 95, 125, 149, 176:177, 200, 213, 220, 242, 269, 327, 346, 355, 364, 395)]

#combine some variables
df_resp$CERADImm <- rowSums(df_resp[, 34:36])
df_resp$CERADDisc <- df_resp[, 40] - (10 - df_resp[, 41])
df_resp[, c(34:36, 40:41)] <- NULL

#recode missing values
df_resp[, 37] <- ifelse(df_resp[, 37] == 97, NA, df_resp[, 37]) #Constructional Praxis - 97 = cannot draw
df_resp[, 39] <- ifelse(df_resp[, 39] == 97, NA, df_resp[, 39]) #Constructional Praxis - 97 = cannot draw
df_resp[, 43] <- ifelse(df_resp[, 43] == 997, 300, df_resp[, 43]) #TMTA - 997 = could not complete in 5 minutes
df_resp[, 44] <- ifelse(df_resp[, 44] == 997, 300, df_resp[, 44]) #TMTB - 997 = could not complete in 5 minutes

#get demographic variables from broader HRS data set
df_geogr <- read_ascii_setup(data = "Data Files/HRSXREGION16.da", setup_file = "Data Files/HRSXREGION16.sas", use_value_labels = TRUE, use_clean_names = TRUE)

df_sesvb <- read_sas(data = "Data Files/randhrs1992_2016v2.sas7bdat")

#simplify the demographic variables to just those of interest
df_sesvb <- select(df_sesvb, c(1, contains("R13") | contains("H13") | starts_with("RA"))) #R13 = respondent answers in 2016, H13 = household answers in 2016, RA = cross-wave stable variables

df_geogr <- df_geogr[, c(1:2, 22, 96)]
df_sesvb <- df_sesvb[, c(1, 14:15, 30, 35, 38, 50:52, 62:67, 70,
                         73, 78, 83, 88, 93, 98, 103, 108,
                         113, 118, 126:127, 215:216, 280:281, 283,
                         430, 433, 435, 522:526, 539, 541, 550,
                         553:554, 558:559, 570:572, 583:585,
                         588:590, 597, 574)]

#recode the variables
df_geogr$CENSUS_REGION_DIVISION_WHERE_LIVE_WHEN_IN_SCHOOL <- factor(df_geogr$CENSUS_REGION_DIVISION_WHERE_LIVE_WHEN_IN_SCHOOL, levels = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 98, 99), labels = c("NEr_NEd", "NEr_MAd", "MWr_ENd", "MWr_WNd", "Sr_SAd", "Sr_ESd", "Sr_WSd", "Wr_Md", "Wr_Pd", "US_na", "Foreign", "NA", "NA"))
df_geogr$HRS_URBAN_RURAL_CODE_2016_BEALE_2013 <- ordered(factor(df_geogr$HRS_URBAN_RURAL_CODE_2016_BEALE_2013, levels = c(1, 2, 3, 9), labels = c("Urban", "Suburban", "Exurban", "NA")))

df_sesvb$R13CENREG <- factor(df_sesvb$R13CENREG, levels = c(1, 2, 3, 4, 5), labels = c("NE", "MW", "S", "W", "Other"))
df_sesvb$R13CENDIV <- factor(df_sesvb$R13CENDIV, levels = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 11), labels = c("NE", "MA", "ENC", "WNC", "SA", "ESC", "WSC", "Mtn", "Pcf", "NotUS"))
df_sesvb$R13SHLT <- ordered(df_sesvb$R13SHLT, levels = c(5, 4, 3, 2, 1), labels = c("Poor", "Fair", "Good", "Very Good", "Excellent"))
df_sesvb$R13HLTC <- ordered(df_sesvb$R13HLTC, levels = c(5, 4, 3, 2, 1), labels = c("MuchWorse", "SomewhatWorse", "Same", "SomewhatBetter", "MuchBetter"))
df_sesvb$R13VGACTX <- ordered(df_sesvb$R13VGACTX, levels = c(5, 4, 3, 2, 1), labels = c("Never", "1-3/month", "1/week", ">1/week", "Daily"))
df_sesvb$R13MDACTX <- ordered(df_sesvb$R13MDACTX, levels = c(5, 4, 3, 2, 1), labels = c("Never", "1-3/month", "1/week", ">1/week", "Daily"))
df_sesvb$R13LTACTX <- ordered(df_sesvb$R13LTACTX, levels = c(5, 4, 3, 2, 1), labels = c("Never", "1-3/month", "1/week", ">1/week", "Daily"))
df_sesvb$R13BACK <- factor(df_sesvb$R13BACK, levels = c(0, 1), labels = c("No", "Yes"))
df_sesvb$R13SMOKEV <- factor(df_sesvb$R13SMOKEV, levels = c(0, 1), labels = c("No", "Yes"))
df_sesvb$R13SMOKEN <- factor(df_sesvb$R13SMOKEN, levels = c(0, 1), labels = c("No", "Yes"))
df_sesvb$R13DRINK <- factor(df_sesvb$R13DRINK, levels = c(0, 1), labels = c("No", "Yes"))
df_sesvb$R13SLEEPE <- factor(df_sesvb$R13SLEEPE, levels = c(0, 1), labels = c("No", "Yes"))
df_sesvb$R13HIBPE <- factor(df_sesvb$R13HIBPE, levels = c(0, 1), labels = c("No", "Yes"))
df_sesvb$R13DIABE <- factor(df_sesvb$R13DIABE, levels = c(0, 1), labels = c("No", "Yes"))
df_sesvb$R13CANCRE <- factor(df_sesvb$R13CANCRE, levels = c(0, 1), labels = c("No", "Yes"))
df_sesvb$R13LUNGE <- factor(df_sesvb$R13LUNGE, levels = c(0, 1), labels = c("No", "Yes"))
df_sesvb$R13HEARTE <- factor(df_sesvb$R13HEARTE, levels = c(0, 1), labels = c("No", "Yes"))
df_sesvb$R13STROKE <- factor(df_sesvb$R13STROKE, levels = c(0, 1), labels = c("No", "Yes"))
df_sesvb$R13PSYCHE <- factor(df_sesvb$R13PSYCHE, levels = c(0, 1), labels = c("No", "Yes"))
df_sesvb$R13ARTHRE <- factor(df_sesvb$R13ARTHRE, levels = c(0, 1), labels = c("No", "Yes"))
df_sesvb$R13ALZHEE <- factor(df_sesvb$R13ALZHEE, levels = c(0, 1), labels = c("No", "Yes"))
df_sesvb$R13DEMENE <- factor(df_sesvb$R13DEMENE, levels = c(0, 1), labels = c("No", "Yes"))
df_sesvb$R13SLFMEM <- ordered(df_sesvb$R13SLFMEM, levels = c(5, 4, 3, 2, 1), labels = c("Poor", "Fair", "Good", "Very Good", "Excellent"))
df_sesvb$R13PSTMEM <- ordered(df_sesvb$R13PSTMEM, levels = c(3, 2, 1), labels = c("Worse", "Same", "Better"))
df_sesvb$R13PRMEM <- ordered(df_sesvb$R13PRMEM, levels = c(5, 4, 3, 2, 1), labels = c("Poor", "Fair", "Good", "Very Good", "Excellent"))
df_sesvb$H13INPOVA <- factor(df_sesvb$H13INPOVA, levels = c(0, 1), labels = c("Above", "Below"))
df_sesvb$H13INPOV <- factor(df_sesvb$H13INPOV, levels = c(0, 1), labels = c("Above", "Below"))
df_sesvb$RAGENDER <- factor(df_sesvb$RAGENDER, levels = c(1, 2), labels = c("Male", "Female"))
df_sesvb$RAHISPAN <- factor(df_sesvb$RAHISPAN, levels = c(0, 1), labels = c("NonHisp", "Hispanic"))
df_sesvb$RARACEM <- factor(df_sesvb$RARACEM, levels = c(1, 2, 3), labels = c("White", "Black", "Other"))
df_sesvb$RAEDEGRM <- ordered(df_sesvb$RAEDEGRM, levels = c(0, 1, 2, 3, 4, 5, 6, 7, 8), labels = c("NoDeg", "GED", "HS", "HS/GED", "AA", "BA", "MA", "PhD", "Other"))
df_sesvb$RAEDUC <- ordered(df_sesvb$RAEDUC, levels = c(1, 2, 3, 4, 5), labels = c("

[...]

    t() %>%
    as.data.frame() %>%
    cbind(model$data) %>%
    gather("draw", "ppe", starts_with("V"))

  yrep <- posterior_predict(model, subset = subset) %>%
    t() %>%
    as.data.frame() %>%
    cbind(df_long$Resp) %>%
    gather("draw", "yrep", starts_with("V"))

  ppe %>%
    mutate(yrep = yrep$yrep) %>%
    mutate(
      crit = criterion(as.numeric(Resp) - 1, ppe),
      crit_rep = criterion(yrep, ppe)
    ) %>%
    group_by(!!group, draw) %>%
    summarise(
      crit = sum(crit),
      crit_rep = sum(crit_rep),
      crit_diff = crit_rep - crit
    ) %>%
    mutate(draw = as.numeric(sub("^V", "", draw))) %>%
    arrange(!!group, draw) %>%
    identity()
}

theme_hist <- function(...) {
  bayesplot::theme_default() +
    theme(
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank(),
      axis.title.y = element_blank(),
      ...
    )
}

########################
## COMBINE MODEL COMPARISONS
########################

#combine all the model comparisons
ModelComparisons <- list("LOOIC" = list("Comparison.1" = LooCom1, "Comparison.2" = LooCom2, "Comparison.3" = LooCom3, "Comparison.4" = LooCom4, "Comparison.5" = LooCom5),
                         "Weights" = list("Comparison.1" = ModWgt1, "Comparison.2" = ModWgt2, "Comparison.3" = ModWgt3, "Comparison.4" = ModWgt4, "Comparison.5" = ModWgt5))

rm(LooCom1, LooCom2, LooCom3, LooCom4, LooCom5,
   ModWgt1, ModWgt2, ModWgt3, ModWgt4, ModWgt5)

#save data for use in .rmd
saveRDS(ModelComparisons, "ModelComparisons.rds")

########################
## SUMMARIZE MODEL RESULTS
########################

describe_posterior(TwoPL_deptr, centrality = "median", ci = 0.95, ci_method = "hdi",
                   test = c("p_direction", "p_MAP"))
rope(TwoPL_deptr, ci = c(0.95, 1.00), ci_method = "HDI")

describe_posterior(TwoPL_itmex, centrality = "median", ci = 0.95, ci_method = "hdi",
                   test = c("p_direction", "p_MAP"))
rope(TwoPL_itmex, ci = c(0.95, 1.00), ci_method = "HDI")

########################
## SUMMARIZE ITEM COVARIATES
########################

#describe item covariates
item_cov <- data.frame(matrix(0, nrow = 10, ncol = 6, dimnames = list(c("Butter", "Arm", "Shore", "Letter", "Queen", "Cabin", "Pole", "Ticket", "Grass", "Engine"),
                                                                      c("FreqSTX", "Concrete", "Diversity", "AoA", "BOI", "Phonemes"))))
item_cov$Item <- c("Butter", "Arm", "Shore", "Letter", "Queen", "Cabin", "Pole", "Ticket", "Grass", "Engine")
item_cov$FreqSTX <- ifelse(item_cov$Item == "Arm", 3.523, ifelse(item_cov$Item == "Butter", 3.018, ifelse(item_cov$Item == "Cabin", 3.001, ifelse(item_cov$Item == "Engine", 3.211, ifelse(item_cov$Item == "Grass", 2.933, ifelse(item_cov$Item == "Letter", 3.625, ifelse(item_cov$Item == "Pole", 2.808, ifelse(item_cov$Item == "Queen", 3.446, ifelse(item_cov$Item == "Shore", 3.006, 3.366))))))))) #log scale, frequency/1,000,000 words in SUBTLXus corpus
item_cov$Concrete <- ifelse(item_cov$Item == "Arm", 4.960, ifelse(item_cov$Item == "Butter", 4.900, ifelse(item_cov$Item == "Cabin", 4.920, ifelse(item_cov$Item == "Engine", 4.860, ifelse(item_cov$Item == "Grass", 4.930, ifelse(item_cov$Item == "Letter", 4.700, ifelse(item_cov$Item == "Pole", 4.660, ifelse(item_cov$Item == "Queen", 4.450, ifelse(item_cov$Item == "Shore", 4.790, 4.700)))))))))
item_cov$Diversity <- ifelse(item_cov$Item == "Arm", 1.657, ifelse(item_cov$Item == "Butter", 1.302, ifelse(item_cov$Item == "Cabin", 1.259, ifelse(item_cov$Item == "Engine", 1.334, ifelse(item_cov$Item == "Grass", 1.565, ifelse(item_cov$Item == "Letter", 1.639, ifelse(item_cov$Item == "Pole", 1.694, ifelse(item_cov$Item == "Queen", 1.544, ifelse(item_cov$Item == "Shore", 1.384, 1.658)))))))))
item_cov$AoA <- ifelse(item_cov$Item == "Arm", 3.260, ifelse(item_cov$Item == "Butter", 5.780, ifelse(item_cov$Item == "Cabin", 6.390, ifelse(item_cov$Item == "Engine", 6.280, ifelse(item_cov$Item == "Grass", 3.940, ifelse(item_cov$Item == "Letter", 4.740, ifelse(item_cov$Item == "Pole", 5.630, ifelse(item_cov$Item == "Queen", 4.420, ifelse(item_cov$Item == "Shore", 6.925, 5.320)))))))))
item_cov$BOI <- ifelse(item_cov$Item == "Arm", 6.478, ifelse(item_cov$Item == "Butter", 6.217, ifelse(item_cov$Item == "Cabin", 4.560, ifelse(item_cov$Item == "Engine", 5.333, ifelse(item_cov$Item == "Grass", 5.455, ifelse(item_cov$Item == "Letter", 5.259, ifelse(item_cov$Item == "Pole", 5.320, ifelse(item_cov$Item == "Queen", 4.083, ifelse(item_cov$Item == "Shore", 4.333,
5.333))))))))) -item_cov$Phonemes <- ifelse(item_cov$Item == "Arm", 3, ifelse(item_cov$Item == "Butter", 4, ifelse(item_cov$Item == "Cabin", 5, ifelse(item_cov$Item == "Engine", 5, ifelse(item_cov$Item == "Grass", 4, ifelse(item_cov$Item == "Letter", 4, ifelse(item_cov$Item == "Pole", 3, ifelse(item_cov$Item == "Queen", 4, ifelse(item_cov$Item == "Shore", 3, 5))))))))) -item_cov$Item <- NULL - -item_cov %>% - summarize_all(list(mean = mean, sd = sd, range = range)) -#FreqSTX = 3.19 (0.28), 2.81-3.63 -#Concrete = 4.79 (0.16), 4.45-4.96 -#Diversity = 1.50 (0.17), 1.26-1.69 -#AoA = 5.27 (1.17), 3.26-6.93 -#BOI = 5.24 (0.76), 4.08-6.48 -#Phonemes = 4.00 (0.82), 3.00-5.00 - -cor(item_cov, method = "kendall") -cor(item_cov, method = "spearman") - -######################## -## GET FIT STATISTICS -######################## - -#check item fit -item_fit <- fit_statistic( - TwoPL_deptr, criterion = ll, group = Item, - ndraws = 1000 -) #run test statistic -item_diff <- item_fit %>% - group_by(Item) %>% - summarise(bp = mean(crit_diff > 0)) #get results from test statistic -item_fit %>% - ggplot(aes(crit_diff)) + - geom_histogram() + - facet_wrap("Item", scales = "free") + - theme_hist() + - theme(text=element_text(family="Times", face="bold", size=12)) #plot test statistic results -saveRDS(item_fit, "ItemFit.rds") - -#get person fit and summaries -person_pars <- ranef(TwoPL_deptr, summary = FALSE)$ID[, , "theta_Intercept"] #get latent ability estimates -person_sds <- apply(person_pars, 1, sd) #add standard deviations -person_pars <- person_pars %>% - sweep(1, person_sds, "/") %>% - posterior_summary() %>% - as_tibble() %>% - rownames_to_column(var = "person") %>% - mutate(person = as.numeric(person)) #put together a table with estimate, SD, and credible intervals -person_pars %>% - arrange(Estimate) %>% - mutate(id2 = seq_len(n())) %>% - ggplot(aes(id2, Estimate, ymin = Q2.5, ymax = Q97.5)) + - geom_pointrange(alpha = 0.7) + - coord_flip() + - labs(x = "Person Number (sorted after Estimate)") + - theme( - axis.text.y = element_blank(), - axis.ticks.y = element_blank() - ) + - theme(text=element_text(family="Times", face="bold", size=12)) #plot table results with credible intervals shown -saveRDS(person_pars, "PersonEstimates.rds") - -#check person fit -person_fit <- fit_statistic( - TwoPL_deptr, criterion = ll, group = ID, - ndraws = 1000 -) #run test statistic -person_diff <- person_fit %>% - group_by(ID) %>% - summarise(bp = mean(crit_diff > 0)) #get results from test statistic -person_max <- which.max(person_diff$bp) #find maximum statistic value -person_fit %>% - filter(ID == person_max) %>% - ggplot(aes(crit_diff)) + - geom_histogram() + - theme_hist() + - theme(text=element_text(family="Times", face="bold", size=12)) #show result from maximum statistic -sum(ifelse(person_diff[, 2] > .95, 1, 0)) #count number of person misfits using a Bayesian p-value of < .05 -#51 > .95 or 4.18% of total sample -saveRDS(person_fit, "PersonFit.rds") diff --git a/eIRT-CERAD-main/ItemCovStudy/R Script Files/5 - Result Plots.R b/eIRT-CERAD-main/ItemCovStudy/R Script Files/5 - Result Plots.R deleted file mode 100644 index 6bc0a7a1..00000000 --- a/eIRT-CERAD-main/ItemCovStudy/R Script Files/5 - Result Plots.R +++ /dev/null @@ -1,289 +0,0 @@ -################################################# -## PLOTTING FUNCTIONS FOR RESULTS -################################################# - -#load needed packages -library(tidyverse) #needed for data wrangling and plotting -library(ggExtra) #needed for marginal plots - 
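-#editorial note: this plotting script assumes the fitted brms model object TwoPL_deptr (created earlier in the workflow) is still in the workspace, since fixef(), ranef(), and VarCorr() are all called on it below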
-######################## -## PLOT ITEM CHARACTERISTIC CURVES -######################## - -fixed <- as.data.frame(fixef(TwoPL_deptr, summary = FALSE)) - -temp <- fixed %>% - select(contains("Item")) %>% - mutate(iter = 1:n()) %>% - pivot_longer(starts_with(c("beta_", "logalpha_"))) %>% - mutate(Item = str_remove(name, "^beta_Item|^logalpha_Item"), - Parameter = ifelse(str_detect(name, "beta"), "Difficulty", "Discrimination")) %>% - select(-name) %>% - pivot_wider(names_from = Parameter, values_from = value) %>% - expand(nesting(iter, Item, Difficulty, Discrimination), - Theta = seq(-6, 6, length.out = 100)) %>% - mutate(p = inv_logit_scaled(Difficulty + exp(Discrimination) * Theta)) %>% - group_by(Theta, Item) %>% - summarise(ci = list(as.data.frame(posterior_summary(p)) %>% - rename(p = Estimate, ymin = Q2.5, ymax = Q97.5))) %>% - unnest(cols = c(ci)) %>% - select(-Est.Error) %>% - mutate(Trial = rep(1:3, times = 10), - Item = str_remove(Item, "\\d$")) - -fixed1 <- fixed -fixed1[, 13:22] <- fixed[, 13:22]+fixed[, 1] -fixed1[, 23:32] <- fixed[, 23:32]+fixed[, 2] -fixed1[, 45:54] <- fixed[, 45:54]+fixed[, 33] -fixed1[, 55:64] <- fixed[, 55:64]+fixed[, 34] - -temp1 <- fixed1 %>% - select(contains("Item")) %>% - mutate(iter = 1:n()) %>% - pivot_longer(starts_with(c("beta_", "logalpha_"))) %>% - mutate(Item = str_remove(name, "^beta_Item|^logalpha_Item"), - Parameter = ifelse(str_detect(name, "beta"), "Difficulty", "Discrimination")) %>% - select(-name) %>% - pivot_wider(names_from = Parameter, values_from = value) %>% - expand(nesting(iter, Item, Difficulty, Discrimination), - Theta = seq(-6, 6, length.out = 100)) %>% - mutate(p = inv_logit_scaled(Difficulty + exp(Discrimination) * Theta)) %>% - group_by(Theta, Item) %>% - summarise(ci = list(as.data.frame(posterior_summary(p)) %>% - rename(p = Estimate, ymin = Q2.5, ymax = Q97.5))) %>% - unnest(cols = c(ci)) %>% - select(-Est.Error) %>% - mutate(Trial = rep(1:3, times = 10), - Item = str_remove(Item, "\\d$")) - -ICCpars <- rbind(temp, temp1) -ICCpars$Dependency <- rep(c("No Prior Recall", "Previous Recall"), each = 3000) -ICCpars <- ICCpars[, c(6, 2, 1, 3, 4:5, 7)] - -tiff("TrialOneICCs.tiff", width = 6.5, height = 3, units = 'in', res = 1200) -ICCpars[ICCpars$Trial == 1 & ICCpars$Dependency == "No Prior Recall", ] %>% - mutate(Item = factor(Item, levels = c("Butter", "Arm", "Shore", "Letter", "Queen", "Cabin", "Pole", "Ticket", "Grass", "Engine"))) %>% - ggplot(aes(x = Theta, y = p)) + - geom_line(size = 0.4) + - geom_ribbon(aes(ymin = ymin, ymax = ymax), alpha = 0.25) + - facet_wrap(~ Item, ncol = 5) + - labs(x = expression(theta~('ability on the logit scale')), - y = expression(italic(p)(y==1))) + - theme_bw() + - theme(text=element_text(family="serif", face="bold", size=12)) -dev.off() - -tiff("TrialTwoICCs.tiff", width = 6.5, height = 3, units = 'in', res = 1200) -ICCpars[ICCpars$Trial == 2, ] %>% - mutate(Item = factor(Item, levels = c("Ticket", "Cabin", "Butter", "Shore", "Engine", "Arm", "Queen", "Letter", "Pole", "Grass"))) %>% - ggplot(aes(x = Theta, y = p, linetype = Dependency)) + - geom_line(size = 0.4) + - geom_ribbon(aes(ymin = ymin, ymax = ymax), alpha = 0.25) + - facet_wrap(~ Item, ncol = 5) + - labs(x = expression(theta~('ability on the logit scale')), - y = expression(italic(p)(y==1))) + - theme_bw() + - theme(text=element_text(family="serif", face="bold", size=12), - legend.position = "bottom") -dev.off() - -tiff("TrialThreeICCs.tiff", width = 6.5, height = 3, units = 'in', res = 1200) -ICCpars[ICCpars$Trial == 3, ] %>% - 
mutate(Item = factor(Item, levels = c("Queen", "Grass", "Arm", "Cabin", "Pole", "Shore", "Butter", "Engine", "Ticket", "Letter"))) %>% - ggplot(aes(x = Theta, y = p, linetype = Dependency)) + - geom_line(size = 0.4) + - geom_ribbon(aes(ymin = ymin, ymax = ymax), alpha = 0.25) + - facet_wrap(~ Item, ncol = 5) + - labs(x = expression(theta~('ability on the logit scale')), - y = expression(italic(p)(y==1))) + - theme_bw() + - theme(text=element_text(family="serif", face="bold", size=12), - legend.position = "bottom") -dev.off() - -######################## -## PLOT ITEM INFORMATION CURVES -######################## - -temp <- fixed %>% - select(contains("Item")) %>% - mutate(iter = 1:n()) %>% - pivot_longer(starts_with(c("beta_", "logalpha_"))) %>% - mutate(Item = str_remove(name, "^beta_Item|^logalpha_Item"), - Parameter = ifelse(str_detect(name, "beta"), "Difficulty", "Discrimination")) %>% - select(-name) %>% - pivot_wider(names_from = Parameter, values_from = value) %>% - expand(nesting(iter, Item, Difficulty, Discrimination), - Theta = seq(-6, 6, length.out = 100)) %>% - mutate(p = inv_logit_scaled(Difficulty + exp(Discrimination) * Theta)) %>% - mutate(i = p * (1 - p)) %>% - group_by(Theta, Item) %>% - summarise(ci = list(as.data.frame(posterior_summary(i)) %>% - rename(i = Estimate, ymin = Q2.5, ymax = Q97.5))) %>% - unnest(cols = c(ci)) %>% - select(-Est.Error) %>% - mutate(Trial = rep(1:3, times = 10), - Item = str_remove(Item, "\\d$")) - -temp1 <- fixed1 %>% - select(contains("Item")) %>% - mutate(iter = 1:n()) %>% - pivot_longer(starts_with(c("beta_", "logalpha_"))) %>% - mutate(Item = str_remove(name, "^beta_Item|^logalpha_Item"), - Parameter = ifelse(str_detect(name, "beta"), "Difficulty", "Discrimination")) %>% - select(-name) %>% - pivot_wider(names_from = Parameter, values_from = value) %>% - expand(nesting(iter, Item, Difficulty, Discrimination), - Theta = seq(-6, 6, length.out = 100)) %>% - mutate(p = inv_logit_scaled(Difficulty + exp(Discrimination) * Theta)) %>% - mutate(i = p * (1 - p)) %>% - group_by(Theta, Item) %>% - summarise(ci = list(as.data.frame(posterior_summary(i)) %>% - rename(i = Estimate, ymin = Q2.5, ymax = Q97.5))) %>% - unnest(cols = c(ci)) %>% - select(-Est.Error) %>% - mutate(Trial = rep(1:3, times = 10), - Item = str_remove(Item, "\\d$")) - -IICpars <- rbind(temp, temp1) -IICpars$Dependency <- rep(c("No Prior Recall", "Previous Recall"), each = 3000) -IICpars <- IICpars[, c(6, 2, 1, 3, 4:5, 7)] - -IICpars[IICpars$Trial == 1 & IICpars$Dependency == "No Prior Recall", ] %>% - mutate(Item = factor(Item, levels = c("Butter", "Arm", "Shore", "Letter", "Queen", "Cabin", "Pole", "Ticket", "Grass", "Engine"))) %>% - ggplot(aes(x = Theta, y = i)) + - geom_line(size = 1.05) + - geom_ribbon(aes(ymin = ymin, ymax = ymax), alpha = 0.25) + - facet_wrap(~ Item, ncol = 5) + - labs(title = "IICs for the 2PL", - subtitle = "Each curve is based on the posterior median.", - x = expression(theta~('ability on the logit scale')), - y = "Information") + - theme_classic() - -IICpars[IICpars$Trial == 2, ] %>% - mutate(Item = factor(Item, levels = c("Ticket", "Cabin", "Butter", "Shore", "Engine", "Arm", "Queen", "Letter", "Pole", "Grass"))) %>% - ggplot(aes(x = Theta, y = i, linetype = Dependency)) + - geom_line(size = 1.05) + - geom_ribbon(aes(ymin = ymin, ymax = ymax), alpha = 0.25) + - facet_wrap(~ Item, ncol = 5) + - labs(title = "IICs for the 2PL", - subtitle = "Each curve is based on the posterior median.", - x = expression(theta~('ability on the logit scale')), - y = 
"Information") + - theme_classic() - -IICpars[IICpars$Trial == 3, ] %>% - mutate(Item = factor(Item, levels = c("Queen", "Grass", "Arm", "Cabin", "Pole", "Shore", "Butter", "Engine", "Ticket", "Letter"))) %>% - ggplot(aes(x = Theta, y = i, linetype = Dependency)) + - geom_line(size = 1.05) + - geom_ribbon(aes(ymin = ymin, ymax = ymax), alpha = 0.25) + - facet_wrap(~ Item, ncol = 5) + - labs(title = "IICs for the 2PL", - subtitle = "Each curve is based on the posterior median.", - x = expression(theta~('ability on the logit scale')), - y = "Information") + - theme_classic() - -######################## -## TESTWISE INFORMATION CURVE -######################## - -temp <- fixed %>% - select(contains("Item")) %>% - mutate(iter = 1:n()) %>% - pivot_longer(starts_with(c("beta_", "logalpha_"))) %>% - mutate(Item = str_remove(name, "^beta_Item|^logalpha_Item"), - Parameter = ifelse(str_detect(name, "beta"), "Difficulty", "Discrimination")) %>% - select(-name) %>% - pivot_wider(names_from = Parameter, values_from = value) %>% - expand(nesting(iter, Item, Difficulty, Discrimination), - Theta = seq(-6, 6, length.out = 100)) %>% - mutate(p = inv_logit_scaled(Difficulty + exp(Discrimination) * Theta)) %>% - mutate(i = p * (1 - p)) %>% - group_by(Theta, iter) %>% - summarise(sum_i = sum(i)) %>% - group_by(Theta) %>% - summarise(ci = list(as.data.frame(posterior_summary(sum_i)) %>% - rename(i = Estimate, ymin = Q2.5, ymax = Q97.5))) %>% - unnest(cols = c(ci)) %>% - select(-Est.Error) - -temp1 <- fixed1 %>% - select(contains("Item")) %>% - mutate(iter = 1:n()) %>% - pivot_longer(starts_with(c("beta_", "logalpha_"))) %>% - mutate(Item = str_remove(name, "^beta_Item|^logalpha_Item"), - Parameter = ifelse(str_detect(name, "beta"), "Difficulty", "Discrimination")) %>% - select(-name) %>% - pivot_wider(names_from = Parameter, values_from = value) %>% - expand(nesting(iter, Item, Difficulty, Discrimination), - Theta = seq(-6, 6, length.out = 100)) %>% - mutate(p = inv_logit_scaled(Difficulty + exp(Discrimination) * Theta)) %>% - mutate(i = p * (1 - p)) %>% - group_by(Theta, iter) %>% - summarise(sum_i = sum(i)) %>% - group_by(Theta) %>% - summarise(ci = list(as.data.frame(posterior_summary(sum_i)) %>% - rename(i = Estimate, ymin = Q2.5, ymax = Q97.5))) %>% - unnest(cols = c(ci)) %>% - select(-Est.Error) - -IICtest <- rbind(temp, temp1) -IICtest$Dependency <- rep(c("No Prior Recall", "Previous Recall"), each = 100) - -IICtest %>% - ggplot(aes(x = Theta, y = i, linetype = Dependency)) + - geom_line(size = 1.05) + - geom_ribbon(aes(ymin = ymin, ymax = ymax), alpha = 0.25) + - labs(title = "IIC for the 2PL", - subtitle = "Curve is based on the posterior median.", - x = expression(theta~('ability on the logit scale')), - y = "Information") + - theme_classic() - -######################## -## PLOT RELIABILITY OF THE TEST -######################## - -theta <- as.data.frame(ranef(TwoPL_deptr)$ID) -var_theta <- as.numeric(VarCorr(TwoPL_deptr)[[1]][[1]][1]) -reliability_est <- var_theta/(var_theta + theta[, 2]^2) -rel_dat <- data.frame(Theta = theta[, 1], Rxx = reliability_est) -rm(theta_est, var_theta, reliability_est) #keep environment tidy -ggplot(data = rel_dat, aes(x = Theta, y = Rxx)) + - geom_smooth(color = "black", size = 1.10) + - ylab("Reliability Estimate") + - xlab("Person Ability Estimate") + - theme_bw() + - theme(text=element_text(family="Times", face="bold", size=12)) - - -######################## -## SCATTER PLOT OF RAW TO THETA SCORES -######################## - -df_wide <- reshape(TwoPL_deptr$data[, 
c("ID", "Item", "Resp")], direction = "wide", idvar = "ID", timevar = "Item") -df_wide <- as.data.frame(sapply(df_wide, function(x) as.numeric(x)-1)) -df_wide$ID <- df_wide$ID+1 -df_wide$RawScore <- rowSums(df_wide[, 2:31]) - -cor(df_wide$RawScore, theta$Estimate.theta_Intercept) -cor(df_wide$RawScore, theta$Estimate.theta_Intercept, method = "kendall") - -ScatterDat <- as.data.frame(cbind(df_wide$RawScore, theta$Estimate.theta_Intercept)) -colnames(ScatterDat) <- c("CERAD Immediate Recall Raw Score", "Latent Trait (Theta)") - -p <- ggplot(ScatterDat, aes(x=ScatterDat[, 1], y=ScatterDat[, 2])) + - geom_jitter(size = 0.8) + - geom_smooth(method = "lm", se = TRUE) + - theme_classic() + - labs( - x = "CERAD Immediate Recall Raw Score", - y = "Latent Trait (Theta)" - ) + - theme(text = element_text(family = "serif", size = 12)) - -tiff("CERADScoreScatterplot.tiff", width = 6.5, height = 3, units = 'in', res = 1200) -ggMarginal(p, type = "densigram") -dev.off() diff --git a/eIRT-CERAD-main/ItemCovStudy/R Script Files/Supplemental Inclusion Criteria File.Rmd b/eIRT-CERAD-main/ItemCovStudy/R Script Files/Supplemental Inclusion Criteria File.Rmd deleted file mode 100644 index f49ad759..00000000 --- a/eIRT-CERAD-main/ItemCovStudy/R Script Files/Supplemental Inclusion Criteria File.Rmd +++ /dev/null @@ -1,388 +0,0 @@ ---- -title: "Imputation and Inclusion Criteria File" -output: - rmdformats::robobook: - self_contained: true - thumbnails: false - lightbox: true ---- - -```{r setup, include=FALSE} -knitr::opts_chunk$set(echo = TRUE) -``` - -# Purpose of the Document - -This document serves to provide further detail regarding the multiple imputation and multivariate, multivariable regression methods used to define the sample's inclusion criteria. The HRS and HCAP data are rich with many important clinical data points; however, they do not include clinical diagnoses of cognitive status. As the aim of the CERAD eIRT study was to model memory, it was important to consider only those cases whose cognitive status was normal. Differences in response styles between cognitively normal controls and various dementia etiologies are well-known, so accidental inclusion of individuals with a neurocognitive disorder could affect the parameters of the models. In other words, it is reasonable to expect, based on existing literature, that the models predicting and describing item-level responses to the CERAD list learning test are different for those with and without neurocognitive disorders. - -Based on this *a priori* consideration, the inclusion criteria were an important starting point. Readers who have already reviewed the AsPredicted pre-registration of the study's hypotheses may have realized that the inclusion criteria outlined there and those discussed in the manuscript are different. Indeed, all of the models were fit on two different datasets because the initial criteria (based on a uniform cutoff of the MMSE) resulted in over-exclusion of Black participants ($\chi^2(2) = 29.57, p < .001, V = 0.13$, with Black-identifying participants having a Pearson residual of -2.76 for being labeled non-impaired compared to 3.15 for being labeled impaired). The fact that racial and ethnic minorities tend to score, on average, lower than non-Hispanic White individuals is well documented and should have been a consideration in the initial inclusion criteria.
Regardless, it was determined that an alternative, more methodologically robust inclusion criterion ought to be established to satisfy the need to exclude individuals with possible neurocognitive impairments without under-representing individuals of minority backgrounds by over-emphasizing raw, unadjusted cognitive test scores. - -In order to pursue a more equitable inclusion criterion, multiple regression was considered as a basic starting place. Within neuropsychology, regression-based norms are relatively commonplace, with these equations often including race, age, education, and sex as predictors of cognitive test performance. The benefit of the HRS and HCAP database is that many more socioeconomic and sociocontextual variables are available to inform the regression model. A further modeling hurdle was the fact that only raw scores were available for analysis. Raw scores on testing can often violate assumptions of OLS regression as they are bounded count variables with (often inherent) skew. Additionally, it is well-known that neuropsychological/cognitive tests tend to be correlated with one another even if they measure purportedly different cognitive domains/skills. A solution to these problems was to use multivariate, multiple regression that allowed for correlation among the residuals of each cognitive test. In order to allow for this correlation among residuals within the `brms` syntax, the likelihood of the regression had to be either multivariate normal or Student *t*. Since the raw data are unlikely to be multivariate normal and may be subject to outliers, a more robust multivariate Student *t* distribution was fit. The advantage of this specification is that the degrees of freedom for the Student *t* distribution can be specified as an unknown parameter with a weakly informative prior and thus estimated directly from the data themselves. The ability to directly model the shape of the Student *t* distribution does help somewhat in capturing potentially skewed distributions where the central tendency of the distribution is heavily left or right shifted but then fades rapidly into narrower tails. - -While there is very little missing data in the HRS and HCAP data sources, it was important to be able to estimate cognitive testing scores for every participant so as not to introduce potential selection bias by omitting individuals with less complete data. Since a goal was to include a large amount of sociocontextual demographic information, this meant using multiple imputation in order to get reasonable guesses for the missing data and then propagating this uncertainty to the Bayesian-estimated posteriors. Once this information was obtained, there was then a need to use the regression to identify those with questionable cognitive status. This was done by comparing the observed scores to their predictions, standardizing this difference, and then dichotomizing this difference into either "normal" (no more than a standard deviation below the expected score) or "abnormal" (a standard deviation or more below the expected score). The result of this procedure is a series of dichotomous 0/1 variables for each cognitive test. In order to maintain an empirically-based understanding of the sample, these dichotomous variables were then subjected to latent class analysis to identify homogeneous groups of individuals based on their relative cognitive performances across all tests.
Inclusion was then based on membership in the latent class that appears to correspond to a cognitively normal group of individuals. - -This document provides the details of each of these steps since there was not space in the manuscript to describe this process in adequate detail. Accompanying this document is also the R script file that was created during the study process. Someone who acquires access to the appropriate files through the HRS should be able to run the script file (with some minimal alterations to ensure that the files are read from the correct locations) and then get results that match those shown here (within some margin of error due to randomness in some of the methods, though the use of the same seed in R and `brms` should reduce the impact of this). - -Note that, by default, the code used to generate the results in this document is hidden. As some readers may be interested in this code, it is possible to turn on its display by toggling the switch in the top right of this page. Every output from the document also has a toggle button (a small button with the word "code") that, when pressed, will show the code used to produce the output. Additionally, the raw Markdown file (*.rmd) is available on the GitHub repository with the R script. - -## Overview of Sections - -While the preceding section describes the general motivations for each step of the inclusion criterion analyses, it is still expedient to spell out what each section of this document will do so that readers can jump to whichever section is of interest and anticipate what to find there. - -The first section overviews the multiple imputation process and results. Shown in this section are general descriptions of the data's missingness as well as diagnostic analyses of the multiple imputation results. These diagnostic plots include visual inspection of convergence, overlay of imputed and observed values, and propensity score plots for missing versus observed cases as a function of each variable. - -The second section then provides details on the multivariate, multiple regression. This includes discussion of the included predictors, their influence in prediction of each outcome, posterior predictive checks of the model, and general model performance (i.e., $R^2$). - -Finally, the last section provides the results of the latent class analysis. This includes the number of classes identified, the response patterns (i.e., patterns of 0/1 across cognitive tests) that correspond to each latent class, and the reasoning for choosing the latent class that we did as the "cognitively normal" group. - -## What's Being Done - -Since the corresponding R script for running all the models and analyses of this study is included, it can be helpful to see what objects from that script are being used here in case someone wants to replicate the results. The code below shows the objects being called for this document, and they are named in accordance with the fit and saved objects in the R script. Readers may need to toggle the "code" button below to see this output. Alternatively, all the raw code for this Markdown document is also uploaded on GitHub, so every code block can be seen from that RMD file.
- -```{r SetupData} -#read in needed data -df_resp <- readRDS("Data_wide.rds") -df_imp <- readRDS("Data_imputed.rds") -CogRegression <- readRDS("Fitted Models/cognitiveDiagnosisRegression.rds") -LCAres <- readRDS("LatentClassModel.rds") - -#load in required packages -#wrapping in suppressPackageStartupMessages() done just to reduce print out in document -suppressPackageStartupMessages(library(brms)) -suppressPackageStartupMessages(library(mice)) -suppressPackageStartupMessages(library(tidyverse)) -suppressPackageStartupMessages(library(ggplot2)) -suppressPackageStartupMessages(library(kableExtra)) -``` - -# Multiple Imputation Results - -Multiple imputation is a robust method for estimating unobserved/missing data points by capitalizing on the relationship between missing values and the observed data. In the simplest case, imputation involves substituting some general value (e.g., the mean or median) for each missing value. This is a highly conservative method for accounting for missing data, as it assumes that the best guess for a missing value is simply one near the central tendency of the observed data. Regression methods are a logical next step for making more informed guesses for missing data by drawing on the fact that variables are correlated and thus information about one can be approximated by looking at other observed data points. These simple imputation methods still have significant methodological challenges. The largest challenge to overcome is that simply estimating a value for a missing variable and treating this estimation as a "close enough" observation will introduce bias. While powerful, regression-based methods have multiple sources of error: there is uncertainty in the prediction itself (i.e., prediction error or residuals) and there is uncertainty about the true values of the regression predictors (i.e., estimation error or sampling error). One method for addressing this bias is to iteratively impute missing values so that there is a range of different estimates given for a single missing value. This range of estimated values can then be passed to statistical analyses in order to empirically introduce the uncertainty about the true unobserved value. This study utilized the `mice` package for *R* in order to perform multiple imputation of missing data points. The `mice` package performs multiple imputation by chained equations, and details of the underlying methodology can be found in [van Buuren and Groothuis-Oudshoorn (2011)](https://www.jstatsoft.org/article/view/v045i03) and [van Buuren (2018)](https://stefvanbuuren.name/fimd/). The remainder of this section describes the missingness of the data of interest and the imputation methods and results. - -The first issue with missing data is that only some missing data matters. In the case of the regressions to predict missing cognitive raw scores, we were interested only in those variables that might be usefully predictive of most cognitive tests. Historically in neuropsychology, these variables have been demographic variables like age, sex, race, and years of education. Given more recent research on racial and ethnic differences in raw scores across neuropsychological tests, it also made sense to extend these variables to broader socioeconomic and contextual factors like rurality, annual income, and parental educational status.
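-For orientation, the sketch below shows the general shape of a chained-equations imputation call in `mice`. This is a minimal illustration only: the data frame name `df_sub` (standing in for the variable subset described next) and the seed are hypothetical placeholders, not the exact call used in the accompanying script.
-
-```r
-library(mice)
-
-#minimal sketch: m = 5 chained-equations imputations of a hypothetical
-#data frame df_sub holding the analysis variables
-df_imp <- mice(df_sub, m = 5, seed = 42, printFlag = FALSE)
-
-#any one completed dataset can then be extracted for inspection
-head(complete(df_imp, action = 1))
-```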
While the HRS data resources include thousands of possible variables to choose from, the general recommendation in multiple imputation is to choose between 15 and 25 variables, including all variables that will appear in the complete-data model (van Buuren, 2018). We make no effort to say that the variables chosen were the best variables to include in the model; however, we believe them to be superior to standard demographic regression-based models, and they achieved the intended goal of producing a classification standard that did not systematically over-exclude racial minorities. Further research on what variables are most useful is certainly needed, and hopefully, this serves as a starting place for researchers. - -The data variables selected for the regression model and thus the multiple imputation were all of the cognitive tests, standard demographic variables, and then sociocontextual variables. The list below summarizes these groupings of variables: - -**Cognitive Tests** - -* MMSE -* Animals -* CERAD List Learning (immediate recall, delayed recall, and discriminability) -* WMS Logical Memory Story B (immediate recall, delayed recall, and recognition) -* CERAD Constructional Praxis (immediate and delayed recall) -* Symbol Digit Modalities Test (total correct) -* Trail Making Test Parts A and B (seconds to complete) -* Raven's Progressive Matrices (Blocks A and B) - -**Demographic Variables** - -* Age -* Sex -* Ethnicity -* Race -* Years of education -* Impaired (by informant report of prior cognitive diagnosis or on the Blessed Dementia Rating Scale) - -**Sociocontextual Variables** - -* Rurality (urban, suburban, or exurban) -* Total value of assets -* Financial wealth -* Total annual income -* Ratio of financial means to household's federal poverty line -* Maternal years of education -* Paternal years of education - -The resulting list includes 27 variables. With a reduced dataset ready for the multiple imputation, the next consideration is the number of imputations needed to reasonably approximate the missing data. While a greater number of imputations is always better, there is a sharp computational trade-off. The imputation process itself can be fairly intensive, meaning that running more iterations or imputations can require a sizable amount of computing time and power. Furthermore, each imputed dataset has to be run through the model. Since the model is a Bayesian model, this means that each dataset must be run over four chains for a set number of warmup and post-warmup samples. Those familiar with Bayesian methods will recognize the computational demand potentially inherent in running the same model multiple times for different imputed datasets. For general illustration purposes, some of the IRT models run for this study required 15 hours to fit, so if this had to be done on imputed data, then fitting a single model would take 15 hours times the number of imputations. To determine a reasonable number of imputations, it is useful to evaluate the overall level of missingness. Shown below is a figure to visualize the overall level and pattern of missingness.
- -```{r MissingnessPattern} -empty <- md.pattern(df_resp[, c("MMSE", "Animals", "CERADdel", "LMIM", "CPIM", "SDMT", "CPDM", "LMDM", "LMrcg", "TMTA", "TMTB", "CERADimm", "CERADdisc", - "Ravens", "UrbanRural", "Age", "TotalAssets", "FinancialWealth", "TotalIncome", "PovertyRatioInst", "Gender", "Ethnicity", - "Race", "Education", "MaternalEdu", "PaternalEdu", "Impaired")]) -``` - -Due to the size of the dataset, the missingness plot can be challenging to read. Each pattern of missing data is listed on a unique row and each column corresponds to one of the variables of interest. The cell produced from an intersecting row and column is colored blue if that participant has an observation for that variable, and it is colored red if that observation is missing. The numbers on the left count the number of participants with that row's pattern of missing data while the numbers on the right count the number of missing variables in that pattern. Numbers at the bottom of the table correspond to the number of cases missing the respective variable. As can be seen from a visual gestalt of the figure, most participants have data for the majority of variables. Still, the figure does not tell us much about the level of missingness since it is very cluttered. One potentially useful value to know for each variable is the number of missing-data patterns that occur when that variable is missing. The following table shows the number of patterns of missing data corresponding to each variable: - -```{r MissingPatterns} -matrix(apply(empty[, 1:27], 2, function(x) {sum(x == 0)}), - dimnames = list(c("MMSE", "Rurality", "Age", "Total Assets", "Financial Wealth", "Total Income", "Ratio to Poverty Line", "Sex", "Impaired", "Animals", "Race", "Ethnicity", "Education", "CERAD Immediate", "CERAD Discriminability", "CERAD Delayed", "Constructional Praxis Immediate", "Logical Memory Immediate", "Ravens", "Constructional Praxis Delayed", "Logical Memory Delayed", "Logical Memory Recognition", "Trails A", "Symbol Digit Modality", "Trails B", "Maternal Education", "Paternal Education"), "Number of Patterns")) %>% - kable(caption = "Patterns of Missing Data by Variable", align = rep('c', 27)) %>% - kable_classic(full_width = FALSE, position = "center") -``` - -Just as a quick point of clarification, the `md.pattern()` function from `mice` does alter the order of the variables, which is why there is no clear logical order to the variables in the table above. Rather than re-arrange these rows, however, the order was retained as it corresponds (from top to bottom) to the columns in the preceding figure (from left to right). As can be seen from this simple table, missing Trails B and Symbol Digit Modalities scores were associated with the most patterns of missingness. The next variables that, when missing, were associated with a fair number of missingness patterns were the delayed trials of Logical Memory. These patterns generally suggest that those with greater cognitive impairment were more likely to have missing data elsewhere since these tests are among the most challenging included in the battery. Another metric to consider is which pattern(s) correspond to the most missing data. The table below shows the missingness patterns that involve the largest number of missing variables.
- -```{r MostMissingPattern} -matrix(empty[which(empty[, 28] == max(empty[1:104, 28])), 1:27], - nrow = 2, ncol = 27, byrow = FALSE, - dimnames = list(c("Pattern 1", "Pattern 2"), c("MMSE", "Rurality", "Age", "Total Assets", "Financial Wealth", "Total Income", "Ratio to Poverty Line", "Sex", "Impaired", "Animals", "Race", "Ethnicity", "Education", "CERAD Immediate", "CERAD Discriminability", "CERAD Delayed", "Constructional Praxis Immediate", "Logical Memory Immediate", "Ravens", "Constructional Praxis Delayed", "Logical Memory Delayed", "Logical Memory Recognition", "Trails A", "Symbol Digit Modality", "Trails B", "Maternal Education", "Paternal Education"))) %>% - kable(caption = "Patterns with the Most Missing Data", align = rep('c', 27)) %>% - kable_classic(full_width = FALSE, position = "center") -``` - -Consistent with the previous observation, it appears that a likely explanation for missingness is the level of cognitive impairment. As demonstrated in the table above, there are two different patterns that have the most missing variables (missing 12 variables out of the 27 examined), and these patterns clearly occur among the cognitive tests. While not certain, a reasonable guess for why this would be the case is that a participant either declined the cognitive tests or was considered too impaired to complete or attempt them. This indicates that it's important to include the informant-based report of impairment in the imputation step since this seems likely to be related to the missingness. Up until this point, however, we have not actually looked at the overall rate of missingness in the dataset, just patterns of it. The following table summarizes the percentages of missing data points for each variable of interest: - -```{r MissingPercentages} -Missingness <- function(x) { - sum(is.na(x)) / length(x) * 100 -} #function computes percent of missing data - -matrix(sapply(df_resp[, c("MMSE", "Animals", "CERADimm", "CERADdel", "CERADdisc", "LMIM", "LMDM", "LMrcg", "CPIM", "CPDM", "SDMT", "TMTA", "TMTB", "Ravens", "Gender", "Ethnicity", "Race", "Education", "Age", "UrbanRural", "TotalAssets", "FinancialWealth", "TotalIncome", "PovertyRatioInst", "MaternalEdu", "PaternalEdu", "Impaired")], function(x) Missingness(x)), nrow = 27, ncol = 1, - dimnames = list(c("MMSE", "Animals", "CERAD Immediate", "CERAD Delayed", "CERAD Discriminability", "Logical Memory Immediate", "Logical Memory Delayed", "Logical Memory Recognition", "Constructional Praxis Immediate", "Constructional Praxis Delayed", "Symbol Digits Modality", "Trails A", "Trails B", "Ravens", "Sex", "Ethnicity", "Race", "Education", "Age", "Rurality", "Total Assets", "Financial Wealth", "Total Income", "Ratio to Poverty Line", "Maternal Education", "Paternal Education", "Impaired"), "Percent Missing")) %>% - kable(caption = "Percentage of Missing Data by Variable", digits = 2, align = rep('c', 27)) %>% - kable_classic(full_width = FALSE, position = "center") -``` - -As is clear from the table above, missingness in the database itself is quite rare considering the total number of individuals in the database. The overall average percentage of missing data is less than 2% for the whole sample, and of the 3167 participants in the sample, a total of 2431 (76.76%) have no missing data on any of these variables. To further clarify the rate of completeness in this particular dataset, 3075 participants (97.10% of the total sample) are missing data for just 3 or fewer variables.
van Buuren (2018) notes that between 5 and 20 imputations are sufficient for "moderate missingness." While moderate missingness is not precisely defined, it is likely fair to conclude that the current dataset has no worse than moderate missingness and may even have "mild" missingness. While not a hard rule, van Buuren (2018) does suggest the following guidance: "if calculation is not prohibitive, we may set *m* [number of imputations] to the average percentage of missing data." In this case, that average is less than 2%, so the total number of imputations was set to 5 to be on the low end of the recommended imputations for moderate missingness. - -Multiple imputation was then conducted, generating 5 different datasets with unique imputed values for every missing value. There are many different ways of imputing reasonable values, and the `mice` package offers a wide array of these methods. In this case, the default methods provided by the `mice` package made sense for the missing data types. To make this step more transparent to readers, the table below shows the imputation method used for each variable that required imputation. - -```{r ImputationMethods} -methods <- df_imp$method -methods <- methods[which(methods != "")] -methods <- methods[c(1, 11, 2, 12, 3, 7:8, 4, 6, 5, 9:10, 13:18)] -methods <- ifelse(methods == "pmm", "Predictive Mean Matching", ifelse(methods == "logreg", "Logistic Regression", "Polytomous Logistic Regression")) -matrix(methods, nrow = 18, ncol = 1, dimnames = list(c("Animals", "CERAD Immediate", "CERAD Delayed", "CERAD Discriminability", "Logical Memory Immediate", "Logical Memory Delayed", "Logical Memory Recognition", "Constructional Praxis Immediate", "Constructional Praxis Delayed", "Symbol Digits Modality", "Trails A", "Trails B", "Ravens", "Ethnicity", "Race", "Education", "Maternal Education", "Paternal Education"), "Method")) %>% - kable(caption = "Imputation Method by Missing Variable", align = 'c') %>% - kable_classic(full_width = FALSE, position = "center") -``` - -The methods are most easily separated into those for continuous versus discrete variables. For all of the continuous variables (all variables except race and ethnicity), the method used is predictive mean matching. This method is a particularly flexible and useful imputation method that is well-suited to this particular application. A common issue with standard regression methods for count variables (e.g., total number of correct responses on a test) is that it is easy to estimate impossible values. For example, a regression might predict a score that is less than 0 or that exceeds the maximum number of points possible. Additionally, a regression-based predicted score is almost always some fractional value that is not possible in count data. Predictive mean matching essentially restricts estimates to observed data: it takes the regression prediction for a missing observation and then creates a subset of candidate values from the complete cases whose predicted values are closest to that prediction. From this subset of candidate values, one is randomly selected to replace the missing value. In this way, the imputed values are only ever values that are actually observable. - -The categorical variables with missing data points were just race and ethnicity. In the case of ethnicity, a simple logistic regression is useful for predicting the missing value since the variable is dichotomous (Hispanic or non-Hispanic).
In the case of race, there are three levels of the factor (White, Black, and Other), so a multinomial/polytomous expansion of the logistic regression is needed. Importantly, these methods assume no ordering of the categories and thus are appropriate for purely nominal data types. - -With these methods and the number of imputations in mind, it is important to visually inspect the resulting imputations. The first step is to consider whether the imputation chains mixed. The trace plots of the chains are shown below. - -```{r ImputationChains} -plot(df_imp) -``` - -There are some points of clarification to make for understanding the plots. First, the color of each line corresponds to one of the imputations, so each plot has 5 different colored lines. Second, the standard deviations of the imputed values are also plotted. In cases where there was only 1 case with a missing value (i.e., Animals and race), it is not possible to estimate the standard deviation of the imputed values. Third, interpretation of the categorical variables needs to be done cautiously since they are different from the continuous variables. The plots can look very jagged and seem like they mix poorly until we recall that these can only be values of 0 or 1 for ethnicity (0, 1, or 2 for race). Thus, it may seem like there is poor mixing of chains, but this is potentially just an artifact of the limited ranges of estimates. Interpretation of the trace plots for multiple imputation is similar to that of MCMC chains; however, there are far fewer iterations to examine. Overall, these plots do not suggest any issues with estimates getting caught at certain values, and the chains generally converge despite the relatively small number of iterations. - -Once we are comfortable with the mixing of the imputation chains, it is important to also examine what those imputed distributions look like in comparison to the observed data. The plots below overlay the observed and missing distributions of each variable that had missing data. - -```{r MissingDistributions} -densityplot(df_imp) -``` - -Consistent with our earlier hypotheses, the missing data distributions are all generally left shifted, with a few notable exceptions. In the case of the Trail Making Tests, there is a bimodal distribution of the missing data (red) with one mode slightly above the mean time to complete and then another mode at the discontinuation time (300 seconds or 5 minutes). Unlike all the other cognitive tests, worse scores on the Trail Making Test correspond to higher values, so these right-shifted distributions reflect poorer performance. With that exception in mind, the imputed data distributions for all of the cognitive tests suggest that those with missing values were also the most impaired individuals in the sample. The other important exception to the left-shifted rule is the demographic variables that were imputed. In these plots, there is little difference between the observed and missing data distributions, suggesting that the educational achievement of the participant or their parents was not related to missingness. - -The missingness patterns can be examined further by visualizing the probability that a participant has missing data as a function of each variable. This is done by computing the propensity score for having missing data (i.e., the probability that a case is missing data) and then plotting that propensity score against each of the imputed variables. For clarity, propensity scores are the estimated probability of group membership as predicted from a logistic regression.
In this case, the logistic regression is based on all the same predictors as the multiple imputation. The plots below provide this visual description of missingness propensity. - -```{r PropensityScores} -#run model for propensity scores of missing vs. complete observations -prop <- with(df_imp, glm(ici(df_imp) ~ MMSE + Animals + CERADimm + CERADdel + CERADdisc + LMIM + LMDM + LMrcg + CPIM + CPDM + SDMT + TMTA + TMTB + Ravens + Gender + Ethnicity + Race + Education + Age + UrbanRural + TotalAssets + FinancialWealth + TotalIncome + PovertyRatioInst + MaternalEdu + PaternalEdu + Impaired, - family = binomial)) - -#generate the propensity scores -ps <- rep(rowMeans(sapply(prop$analyses, fitted.values)), - df_imp$m + 1) - -#generate plots -xyplot(df_imp, Animals ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2), - xlab = "Probability of Missing Observation", - ylab = "Animal Fluency Raw Score") -xyplot(df_imp, CERADimm ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2), - xlab = "Probability of Missing Observation", - ylab = "CERAD Words Recalled (Immediate)") -xyplot(df_imp, CERADdel ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2), - xlab = "Probability of Missing Observation", - ylab = "CERAD Words Recalled (Delayed)") -xyplot(df_imp, CERADdisc ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2), - xlab = "Probability of Missing Observation", - ylab = "CERAD Recognition Discriminability") -xyplot(df_imp, LMIM ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2), - xlab = "Probability of Missing Observation", - ylab = "Logical Memory Recall (Immediate)") -xyplot(df_imp, LMDM ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2), - xlab = "Probability of Missing Observation", - ylab = "Logical Memory Recall (Delayed)") -xyplot(df_imp, LMrcg ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2), - xlab = "Probability of Missing Observation", - ylab = "Logical Memory Recognition") -xyplot(df_imp, CPIM ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2), - xlab = "Probability of Missing Observation", - ylab = "Constructional Praxis (Immediate)") -xyplot(df_imp, CPDM ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2), - xlab = "Probability of Missing Observation", - ylab = "Constructional Praxis (Delayed)") -xyplot(df_imp, SDMT ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2), - xlab = "Probability of Missing Observation", - ylab = "Symbol Digits Modality Test") -xyplot(df_imp, TMTA ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2), - xlab = "Probability of Missing Observation", - ylab = "Trail Making Test Part A") -xyplot(df_imp, TMTB ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2), - xlab = "Probability of Missing Observation", - ylab = "Trail Making Test Part B") -xyplot(df_imp, Education ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2), - xlab = "Probability of Missing Observation", - ylab = "Years of Education") -xyplot(df_imp, MaternalEdu ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2), - xlab = "Probability of Missing Observation", - ylab = "Maternal Educational Attainment") -xyplot(df_imp, PaternalEdu ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2), - xlab = "Probability of Missing Observation", - ylab = "Paternal Educational Attainment") 
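-#the remaining two panels repeat the same display for the categorical variables (race and ethnicity): observed values in blue, imputed values in red, plotted against the missingness propensity score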
-xyplot(df_imp, Race ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2), - xlab = "Probability of Missing Observation", - ylab = "Race") -xyplot(df_imp, Ethnicity ~ ps | as.factor(.imp), pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2), - xlab = "Probability of Missing Observation", - ylab = "Ethnicity") -``` - -To clarify the observed plots, each grid (numbered 0-5) corresponds to either the observed data (0) or one of the imputed datasets (1-5). The blue diamonds all correspond to an observed data point while the red circles each represent an imputed data point. Note that the blue diamonds across plots 0-5 are all identical. Potentially the easiest plot to examine first is the one for Animal fluency since there is just a single missing data point that was imputed. In this plot, it is relatively clear that this observation had a low propensity score for being missing, and its imputed estimate was that the person would have scored somewhere in the average range compared to the rest of the sample. The 0 grid for Animal fluency is also useful to examine in the context of the working hypothesis that missingness was related to cognitive impairment. Looking at the left-hand side of the plot (i.e., the lowest propensity for having missing observations), there is a good overall range of possible Animal fluency scores, but as we go to the right (i.e., higher probabilities of missing data), we see that the range of Animal fluency scores becomes progressively restricted. This overall shape is mirrored in the other cognitive test plots and suggests that one reason for missing data is certainly cognitive impairment; however, there appear to be other factors as well since there are plenty of cases with very poor scores who are still given low propensity estimates for being missing. - -Another important takeaway from the above plots is the fact that there doesn't seem to be a clear pattern of missingness by race or ethnicity. Examination of these plots demonstrates that, broadly speaking, individuals of all racial and ethnic groups had similarly wide propensity scores estimated. Similarly, there was no heavy concentration of high propensity scores that separated the racial or ethnic groups. Some additional exploration is possible to further inspect the reason for missing data in the sample. The plots below show the propensity scores against MMSE raw scores and informant-based report of impairment. - -```{r ExtraPropensities} -xyplot(df_imp, MMSE ~ ps, pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2), - xlab = "Probability of Missing Observation", - ylab = "MMSE Raw Score") -xyplot(df_imp, Impaired ~ ps, pch = c(1, 19), cex = c(0.8, 1.2), col = mdc(1:2), - xlab = "Probability of Missing Observation", - ylab = "Informant Report of Impairment") -``` - -Again, the plots suggest there is a relationship between having impairment and missing data; however, this is clearly not the only cause of missingness. If this were the case, then the differences in propensity would be much clearer. In the case of informant report in particular, there is clear evidence that those without impairment are still sometimes estimated as having a high probability of missing data. It's possible that this reflects a multivariate dependency between objective cognitive testing and functional impairment; however, it's also possible that there are other sociodemographic variables or neuropsychiatric factors that might explain missingness.
Regardless of the cause of any missingness not at random, the multiple imputation methods used and described here appear to be robust and reasonable approximations.

# Cognitive Prediction Regression

With the imputed data prepared, the regression model to predict raw scores and serve as a standard for "abnormal" cognitive test performance could be fit. As with the regression models fit in the main study, these models were fit in `brms`. The regression model included all of the same variables used in the multiple imputation model. The multivariate outcomes were all of the cognitive tests with the exception of Raven's Progressive Matrices. The remaining variables (including Raven's Progressive Matrices) were used as predictors.

Priors for the regression were specified per outcome with generic priors for the multivariate component. All regression coefficient priors were specified as normal distributions with a mean of 0 and standard deviation of 1. Under this parameterization, we are placing approximately 68% prior probability on regression coefficients being no larger than |1| and approximately 95% probability on them being no larger than |2|. The data in this case are large, so the priors can easily be dominated by the observations in the case where these priors are incorrect; however, given the scales of the predictors and outcomes, these priors on the regression coefficients appeared reasonable. For priors on the intercepts, each outcome variable was considered separately. These intercept priors were specified as Student *t* distributions with 3 degrees of freedom, centered on the approximate mean of the raw distribution, and with a standard deviation 3x the magnitude of the raw standard deviation. In cases where the regression coefficients are all 0, the intercept of a regression will be the mean of the outcome variable, so centering the prior around this "null" value is a reasonable first guess for its true value. The scale of the intercept prior is then multiplied by 3 relative to the raw distribution to ensure that the prior is wide and thus weakly informative. Again, the data are very large and could overpower even fairly strong priors, but this was still undertaken as a disciplined prior specification for these purposes.
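To make this scheme concrete, the chunk below sketches what such priors might look like in `brms` syntax. The intercept locations and scales are placeholders rather than the actual sample statistics, and only two outcomes are written out; readers should consult the accompanying R script for the exact specification used.

```{r RegressionPriorSketch, eval = FALSE}
#illustrative sketch only: placeholder values, not the actual sample statistics
reg_priors <-
  prior("normal(0, 1)", class = "b") +                                 #skeptical slopes
  prior("student_t(3, 27, 8)", class = "Intercept", resp = "MMSE") +   #~raw mean, 3x raw SD
  prior("student_t(3, 17, 18)", class = "Intercept", resp = "Animals")
  #...and so on, with one intercept prior per outcome
```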
A special caveat and warning should be provided for readers and those who intend to reproduce the analyses. In order to ultimately compute the residuals of the model, the number of iterations and warmup samples had to be left at the default of 1,000 each per chain. Running more iterations or more warmups ultimately caused the array of predictive errors to become too large to handle within R. As currently specified, the resulting array of residuals was 6.6 GB in size. To explain why this is the case, one needs to remember that each model is run on 4 chains (1,000 warmup + 1,000 post-warmup iterations). Each model then has to be run on the 5 imputed datasets (resulting in 40,000 total iterations). In the end, a total of 20,000 posterior samples are saved for inference, so when residuals need to be computed, the 13 outcome variables all produce a combined multivariate predictive distribution based on these 20,000 posterior samples. In short, the model had to be fit on fewer than ideal iterations. While so few iterations are not a problem for *Stan*'s adaptive Hamiltonian Monte Carlo estimator with respect to convergence or performance, they do limit the ability to reliably estimate the tails of distributions. Fortunately, the primary aim of this particular model is to characterize the central tendency since the interest is ultimately in below-average performances. For any reader seeking to replicate the methods here, it is important to consider computing resources before running the model and before trying to run the model for more iterations. The personal laptop used for these analyses has an Intel i7-9750H CPU @ 2.60GHz, 32GB of RAM, and runs all 64-bit applications.

With this background in mind, it is now appropriate to examine the model. First, we examine the validity of the modeling process by ensuring that the chains mixed and that the posterior was adequately explored and sampled. This is done by analyzing the $\hat{R}$, effective sample sizes, trace plots, and diagnostic plots for the model. Details about these plots are provided in the other supplementary file addressing the model fits, so readers are encouraged to review that document for explanations of the following plots.

```{r RegressionDiagnostics}
mcmc_plot(CogRegression, type = "rhat_hist", binwidth = 0.0001)
mcmc_plot(CogRegression, type = "neff_hist", binwidth = 0.1)
mcmc_plot(CogRegression, type = "trace", regex_pars = "b_MMSE")
mcmc_plot(CogRegression, type = "trace", regex_pars = "b_Animals")
mcmc_plot(CogRegression, type = "trace", regex_pars = "b_CERADimm")
mcmc_plot(CogRegression, type = "trace", regex_pars = "b_CERADdel")
mcmc_plot(CogRegression, type = "trace", regex_pars = "b_CERADdisc")
mcmc_plot(CogRegression, type = "trace", regex_pars = "b_LMIM")
mcmc_plot(CogRegression, type = "trace", regex_pars = "b_LMDM")
mcmc_plot(CogRegression, type = "trace", regex_pars = "b_LMrcg")
mcmc_plot(CogRegression, type = "trace", regex_pars = "b_CPIM")
mcmc_plot(CogRegression, type = "trace", regex_pars = "b_CPDM")
mcmc_plot(CogRegression, type = "trace", regex_pars = "b_SDMT")
mcmc_plot(CogRegression, type = "trace", regex_pars = "b_TMTA")
mcmc_plot(CogRegression, type = "trace", regex_pars = "b_TMTB")
mcmc_plot(CogRegression, type = "trace", regex_pars = "sigma_")
mcmc_plot(CogRegression, type = "nuts_divergence")
mcmc_plot(CogRegression, type = "nuts_treedepth")
```

One issue that comes from fitting multiple datasets with the same model is that there can be false positives on the $\hat{R}$ statistic (see [here](https://cran.r-project.org/web/packages/brms/vignettes/brms_missings.html) for details). We can confirm that there is no issue with chain mixing by visually inspecting the trace plots and by calling the $\hat{R}$ statistics directly from the model itself, as done in the table below.

```{r TrueRhats}
CogRegression$rhats %>%
  select(starts_with("b_")) %>%
  kable(caption = "Chain Convergence by Imputed Dataset", align = 'c', digits = 4) %>%
  kable_classic(full_width = FALSE, position = "center")
```

Note that, for convenience, not all of the values are shown above. Since trace plots for every estimated predictor are shown above, a full table summarizing the $\hat{R}$ for every parameter across the five imputed datasets was not included. For readers interested in the values of the various predictors, the following provides a description of the posterior estimates of each predictor in the model.
```{r PredictorSummaries}
matrix(fixef(CogRegression), ncol = 4,
       dimnames = list(c("MMSE", "Animals", "CERAD Delayed", "Logical Memory Immediate", "Constructional Praxis Immediate", "Symbol Digits Modality", "Constructional Praxis Delayed", "Logical Memory Delayed", "Logical Memory Recognition", "Trails A", "Trails B", "CERAD Immediate", "CERAD Discriminability", rep(c("Ravens", "Rurality: Suburban", "Rurality: Exurban", "Rurality: Not Classified", "Age", "Total Assets", "Financial Wealth", "Total Income", "Poverty Line Ratio", "Sex: Female", "Ethnicity: Hispanic", "Race: Black", "Race: Other", "Education", "Maternal Education", "Paternal Education", "Impaired: Yes"), 13)), c("Estimate", "Std. Error", "95% CI LB", "95% CI UB"))) %>%
  kable(caption = "Regression Coefficients Estimated in the Multivariate Regression", align = 'c', digits = 2) %>%
  kable_classic(full_width = FALSE, position = "center") %>%
  pack_rows("Intercepts", 1, 13) %>%
  pack_rows("MMSE", 14, 30) %>%
  pack_rows("Animals", 31, 47) %>%
  pack_rows("CERAD Delayed", 48, 64) %>%
  pack_rows("Logical Memory Immediate", 65, 81) %>%
  pack_rows("Constructional Praxis Immediate", 82, 98) %>%
  pack_rows("Symbol Digits Modality", 99, 115) %>%
  pack_rows("Constructional Praxis Delayed", 116, 132) %>%
  pack_rows("Logical Memory Delayed", 133, 149) %>%
  pack_rows("Logical Memory Recognition", 150, 166) %>%
  pack_rows("Trails A", 167, 183) %>%
  pack_rows("Trails B", 184, 200) %>%
  pack_rows("CERAD Immediate", 201, 217) %>%
  pack_rows("CERAD Discriminability", 218, 234)
```

Readers are cautioned to avoid over-interpreting the results of the table above. The objective of this analysis was to produce a means of reasonably estimating cognitive test performance in a way that reduces the risk of biased over-exclusion of individuals from minority backgrounds. As a result, the scales of some variables make results seem meaningless. For example, when rounded to two decimal places, the coefficients for the various financial variables are essentially zero and thus seemingly irrelevant. In reality, however, these very small coefficients are sensible considering that, for example, the average financial wealth in the sample is $178,415.80 (SD = 678,850.10). Considering that the largest raw score being predicted is the 300-second cutoff for the Trail Making Test, the regression coefficient scale is necessarily very small. If the intention were to present this model and test whether certain variables mediated, for example, the effects of race, then there would need to be appropriate transformation of the data to facilitate adequate communication of these variables' importance. In this case, such transformations do not change final predictions, so they were not made *a priori*. While this reduced some data cleaning efforts (e.g., no need to consider how to scale data across multiply imputed datasets), the downside is that it is difficult to understand what the variables might convey for applications like clinical practice. Thus, again, readers are encouraged to cautiously evaluate the above results with this in mind.
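As a quick illustration of the scaling issue, consider converting a dollar-denominated slope to a more readable unit; the value below is a hypothetical placeholder, not an estimate from the model.

```{r CoefficientRescaling, eval = FALSE}
#hypothetical slope expressed per dollar of financial wealth
b_per_dollar <- 2e-06

#the same effect expressed per $100,000 is far easier to read
b_per_dollar * 100000 #0.2 raw-score points per $100,000
```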
Another aspect to consider from the regression is how well it recovers the data generation process that led to the observed data in the first place. A particularly useful way of checking whether a model may be misspecified is via posterior predictive checks. Again, details of posterior predictive checks are discussed in the manuscript and also in the other supplementary material. Readers less familiar with these plots are encouraged to review those documents. Shown below are the posterior predictive checks for each of the outcomes in the multivariate regression.

```{r CogPPCs}
#note that functions are wrapped in suppressWarnings() to avoid reprinting the same default warning from 'brms'
#Warning reads: "Using only the first imputed data set. Please interpret the results with caution until a more principled approach has been implemented"
#Warning stems from 'brms' storing just one of the imputed datasets, which then gets used for the post-processing (meaning the other 4 datasets used to derive the posterior are not stored and thus unused in post-processing)

suppressWarnings(pp_check(CogRegression, resp = "MMSE", nsamples = 25) +
                   xlim(-5, 35)) #used to zoom into actual MMSE range
suppressWarnings(pp_check(CogRegression, resp = "CERADimm", nsamples = 25) +
                   xlim(-5, 35))
suppressWarnings(pp_check(CogRegression, resp = "CERADdel", nsamples = 25) +
                   xlim(-5, 15))
suppressWarnings(pp_check(CogRegression, resp = "CERADdisc", nsamples = 25) +
                   xlim(-15, 15))
suppressWarnings(pp_check(CogRegression, resp = "LMIM", nsamples = 25) +
                   xlim(-5, 28))
suppressWarnings(pp_check(CogRegression, resp = "LMDM", nsamples = 25) +
                   xlim(-5, 30))
suppressWarnings(pp_check(CogRegression, resp = "LMrcg", nsamples = 25) +
                   xlim(-5, 20))
suppressWarnings(pp_check(CogRegression, resp = "CPIM", nsamples = 25) +
                   xlim(-5, 16))
suppressWarnings(pp_check(CogRegression, resp = "CPDM", nsamples = 25) +
                   xlim(-5, 26))
suppressWarnings(pp_check(CogRegression, resp = "Animals", nsamples = 25) +
                   xlim(-5, 50))
suppressWarnings(pp_check(CogRegression, resp = "SDMT", nsamples = 25) +
                   xlim(-5, 75))
suppressWarnings(pp_check(CogRegression, resp = "TMTA", nsamples = 25) +
                   xlim(-5, 305))
suppressWarnings(pp_check(CogRegression, resp = "TMTB", nsamples = 25) +
                   xlim(-5, 305))
```

As would be expected, the model predicts some outcomes better than others. This said, the model does a generally good job of capturing the rough shape of most of the outcomes and approximates the central tendency of each test fairly well. The central tendency is the more valuable aspect of this regression as the ultimate goal is to identify individuals who are below average rather than assigning or estimating some quantile at which a performance falls. The model fit could likely be improved in a few ways. First, some outcomes clearly arise from a mixture of distributions. These mixtures might be as simple as two classes: a "can do the task" group and a "can't do the task" group. This might be the case for Trails B and CERAD Delayed Recall. This being said, mixture models in `brms` are challenging to run, and finding the correct number of mixture components to evaluate is non-trivial. Second, the models could benefit from being respecified away from a Student *t* likelihood. While better than a standard normal distribution for these data, the Student *t* distribution still makes assumptions that are not appropriate for the bounded count data that clearly dominate the majority of these outcomes. Some likelihoods that may be more useful would be the binomial, skew normal, and potentially the log-normal. Alternative likelihoods were not explored because `brms` currently only supports estimation of the residual correlations when the data are fit with a multivariate normal or Student *t* likelihood.
Third, a distributional model that predicts the residual standard deviation (sigma) of each outcome from covariates may lead to even better, more individualized prediction. Again, the current results are sufficient for the purposes needed, but it is important that readers recognize that these methods should not be used as a basis for inference, as this was not the intent of the methods.
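Although these alternatives were not pursued, the sketch below illustrates how the mixture and distributional ideas above might be written in `brms`. The formulas and data object are placeholders for illustration, not the model fit in this study.

```{r AlternativeSpecifications, eval = FALSE}
#illustrative sketches only: placeholder formulas and data, not the study model

#(1) a two-component mixture, e.g., a "can do the task" vs. "can't do the task" split
fit_mix <- brm(bf(TMTB ~ Age + Education),
               family = mixture(gaussian, gaussian),
               data = df, chains = 4, cores = 4)

#(2) a distributional model in which the residual scale also depends on predictors
fit_dist <- brm(bf(TMTB ~ Age + Education, sigma ~ Age + Education),
                family = student(),
                data = df, chains = 4, cores = 4)
```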
As has been highlighted in this document, one of the benefits of the multivariate multiple regression is the ability to estimate the correlation among residuals. This multivariate nature is important for analyzing cognitive test performance in a battery rather than based on a single test. Having that information about the correlation among residuals helps improve accuracy given the desired aim to base cognitive diagnostic classification on the residuals from the model's predictions. Toward this end, the correlation matrix for the residuals is shown here:

```{r ResidCors}
matrix(VarCorr(CogRegression)$residual__$cor[, 1, ],
       nrow = 13, ncol = 13, byrow = FALSE, dimnames = list(c("MMSE", "Animals", "CERAD Delayed", "Logical Memory Immediate", "Constructional Praxis Immediate", "Symbol Digits Modality", "Constructional Praxis Delayed", "Logical Memory Delayed", "Logical Memory Recognition", "Trails A", "Trails B", "CERAD Immediate", "CERAD Discriminability"), c("MMSE", "Animals", "CERAD Delayed", "Logical Memory Immediate", "Constructional Praxis Immediate", "Symbol Digits Modality", "Constructional Praxis Delayed", "Logical Memory Delayed", "Logical Memory Recognition", "Trails A", "Trails B", "CERAD Immediate", "CERAD Discriminability"))) %>%
  kable(caption = "Residual Correlations from the Multivariate Regression", align = 'c', digits = 2) %>%
  kable_classic(full_width = FALSE, position = "center")
```

The final detail to provide regarding the regression model is a gross summary of its performance. The following table summarizes the Bayesian $R^2$ statistic for each outcome estimated.

```{r ModPerf}
matrix(posterior_summary(CogRegression$criteria$bayes_R2, probs = c(0.025, 0.05, 0.95, 0.975)),
       nrow = 13, ncol = 6, byrow = FALSE,
       dimnames = list(c("MMSE", "Animals", "CERAD Delayed", "Logical Memory Immediate", "Constructional Praxis Immediate", "Symbol Digits Modality", "Constructional Praxis Delayed", "Logical Memory Delayed", "Logical Memory Recognition", "Trails A", "Trails B", "CERAD Immediate", "CERAD Discriminability"), c("Estimate", "Std. Error", "95% CI LB", "90% CI LB", "90% CI UB", "95% CI UB"))) %>%
  kable(caption = "Model Performance from the Multivariate Regression", align = 'c', digits = c(2, 2, 4, 4, 4, 4)) %>%
  kable_classic(full_width = FALSE, position = "center")
```

As is relatively apparent from the table, the predictors do not explain large proportions of variance in expected scores. This is particularly true for the Trail Making Test. On the one hand, this is good and makes sense. We do not expect cognitive test performance to be entirely due (or even primarily due) to demographic and background factors. On the other hand, this is not ideal for producing highly accurate predictions of performance. Regardless, the residual distributions are ultimately the point of interest here as they are used as the metric for inferring cognitive impairment on a given test.

# Latent Class Analysis Results

The latent class analysis was the ultimate end goal of the entire process. In the absence of any clinically defined diagnostic status, there needed to be a method for detecting these unobserved groups of individuals. There are two primary methods for doing this: latent profile analysis and latent class analysis. Both are identical in their basic purpose: utilize assumed parametric statistical properties to classify individuals into an arbitrary number of homogeneous classes. The distinction between the two methods is that latent profile analysis takes continuous variables (e.g., raw scores) as input while latent class analysis takes dichotomous variables as input. A special note is made here that we deliberately excluded methods like cluster analyses, as these impose no statistical assumptions and instead group observed data based on differences within the data. There is no statistical framework there for making inferences as to whether the clusters extend to other samples or settings. While clustering methods are ideal for many large-data and machine learning settings where the available data either represent the true population or are so large that sampling error is negligible, a deliberate effort to rely on inferential methods was made for this particular application as the HRS very carefully samples a population of interest.

Latent class analysis was selected over latent profile analysis because the interest was in clinically defined diagnostic groups. Arguably, using latent profile analysis could be considered easier since it would not have required the development of a separate regression equation to predict performance. With latent profile analysis, however, there may be many different latent profiles identified, but there is no guarantee that these profiles correspond to individuals with cognitive impairment; instead, latent profiles may correspond to groups of individuals with relative strengths in certain areas. Relying only on dichotomous indicators of impaired vs. not impaired ensures that the model is only making decisions about that aspect of the testing performance. Additionally, recent work by Jak, Bondi, and colleagues has emphasized the utility of a simple >1SD below expectations dichotomizing criterion of cognitive data for detecting mild cognitive impairment (MCI). Detection of individuals with likely dementia is not necessarily difficult given the battery of tests administered; however, discriminating between MCI and normal cognition is trickier, so a method built on existing literature to detect MCI was preferred as well.

The primary challenge of any latent class analysis is the number of latent classes to model. The `BayesLCA` package provides a convenient method for deriving the number of latent classes via a variational Bayes estimator. Specifically, the Dirichlet prior for class membership/mixture can be set to some value less than 1 and then the number of classes to be fit can be set to some value larger than what is expected (i.e., the model can be overfit). The result is that any unused classes are returned as empty and only the number of classes needed to describe the data are filled. In this case, the Dirichlet mixing prior was specified as 1/10 and 10 latent classes were fit.
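A minimal sketch of this overfitting approach is provided below using simulated stand-in data; the `delta` argument is taken to set the Dirichlet mixing prior in `BayesLCA`, and readers should consult the accompanying R script for the actual call.

```{r LCASketch, eval = FALSE}
#illustrative sketch only: simulated stand-in for the dichotomous impairment matrix
library(BayesLCA)

set.seed(1)
z <- matrix(rnorm(500 * 13), nrow = 500, ncol = 13) #stand-in for standardized residuals
impaired <- ifelse(z < -1, 1, 0)                    #1 = >1SD below expected performance

#overfit 10 classes with a sparse Dirichlet mixing prior (1/10) so that any
#unused classes are returned (approximately) empty
LCAdemo <- blca(impaired, 10, method = "vb", delta = 1/10)
round(LCAdemo$classprob, 3) #near-zero mixture weights flag unused classes
```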
This procedure resulted in 8 uniquely estimated latent classes. Based on these 8 different classes, class 1 was used as the "cognitively normal" group as this was the most clearly non-impaired group identified. To help understand this decision-making process, we can look at the probability of group membership as a function of each test being scored as either a 0 (not impaired relative to expected performance) or 1 (impaired relative to expected performance).

```{r GroupProbabilities}
matrix(LCAres$itemprob[1:8, ], nrow = 8, ncol = 13, byrow = FALSE,
       dimnames = list(paste("Class", 1:8), c("MMSE", "Animals", "CERAD Delayed", "Logical Memory Immediate", "Constructional Praxis Immediate", "Symbol Digits Modality", "Constructional Praxis Delayed", "Logical Memory Delayed", "Logical Memory Recognition", "Trails A", "Trails B", "CERAD Immediate", "CERAD Discriminability"))) %>%
  kable(caption = "Probabilities of Class Membership by Dichotomous Test Performance", align = 'c', digits = 2) %>%
  kable_classic(full_width = FALSE, position = "center")
```

As can be seen from these probabilities, Class 1 is the only one for which the probability of membership remains small whenever any test performance is below expectations. The clearest contrast from above is that of Class 8, where the probability of membership increases dramatically with each abnormal test result (though most strongly among the memory tests). All of the other classes demonstrate an increase in probability of membership with each successive abnormal test result. In the case of Class 2, it seems that abnormal CERAD performance defines its membership. In Class 3, it is abnormal Logical Memory performance that describes its members. Classes 4 and 5 are both more likely in cases where the MMSE is abnormal, but Class 5 also seems to have more gross impairment across all memory tests. Class 6 is defined by an amnestic profile but intact general cognition (i.e., MMSE). Class 7 seems to be more visuospatial in nature as the Constructional Praxis test defines its membership. Based on inspection of these results, it was determined to consider any case grouped into Class 1 as being cognitively normal and all other individuals as being impaired. As a final note of clarification, a cognitively normal assignment did not override informant report with regard to inclusion. In other words, individuals who were found to be cognitively normal using the methods described throughout this document but whose informant reported any impairment were still classified as impaired and thus excluded from the study.

diff --git a/eIRT-CERAD-main/ItemCovStudy/R Script Files/Supplemental Models File.Rmd b/eIRT-CERAD-main/ItemCovStudy/R Script Files/Supplemental Models File.Rmd
deleted file mode 100644
index 8f928e45..00000000
--- a/eIRT-CERAD-main/ItemCovStudy/R Script Files/Supplemental Models File.Rmd
+++ /dev/null
@@ -1,721 +0,0 @@
---
title: "Modeling and Results Supplement"
output:
  rmdformats::robobook:
    self_contained: true
    thumbnails: false
    lightbox: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Purpose of the Document

The aim of these supplementary details is to provide readers with better documentation of the complete modeling and model selection process. There were a total of 11 models fit as part of this study, and the analyses and supporting figures needed to understand this process are simply too extensive for a single document. This document therefore provides relevant plots and summaries to help readers understand the steps taken to arrive at the final model, and perhaps more importantly, the document provides commentary on how each modeling choice was made.
The goal of this document is therefore to ensure complete transparency in the data analysis process. Accompanying this document is also the R script file that was created during the study process. Someone who acquires access to the appropriate files through the HRS should be able to run the script file (with some minimal alterations to ensure that the files are read from the correct locations) and then get results that match those shown here (within some margin of error due to randomness in some of the methods, though the use of the same seed in R and `brms` should reduce the impact of this).

## Overview of Sections

In an attempt to facilitate readability of this document, the modeling process is broken down into a few different sections. Additionally, there is a table of contents that readers can use to jump to any section, so for those interested in specific aspects of this supplementary material, it is useful to know what exists in each section.

First, a flowchart showing the order in which models were built and the model against which each was tested is provided. Each candidate model was compared to the best-fitting model up to that point. The aim of this flowchart is to help familiarize readers with the models and modeling steps quickly and efficiently as the remaining sections add greater detail.

Second, the priors used in the models are explored. As the data analyses were done in the Bayesian framework, the analysis of the priors is useful. The priors are shown as well as the prior predictive checks for a fixed item model.

Third, the details of each model are provided. This is the largest section as each model includes a variety of explorations. To help reduce the overall length of the document, each model is given its own tab so that readers can select one model at a time. Details for each model can be broken down into 3 primary sections: model validity, model performance, and model estimates.

* Model validity refers to tests of whether the estimation process converged and was not subject to any issues that would make estimates from the model entirely invalid or unstable. These tests include visual inspection of chain mixing, $\hat{R}$, effective sample size, and maximum treedepth.
* Model performance refers to the posterior predictive checks that mirror those shown in the manuscript for the final model: predictions of all responses for all items, responses to each item, and responses to all items for a subset of participants. As the space of this document is less limited than space in the final manuscript, the random subset of participants is increased from 12 to 20.
* Model estimates refers to the summary of model parameters like fixed and random effects estimates. This summary is different from the one presented in the manuscript for the final paper as the objective of these intermediary models is not to summarize effect sizes or the probability of these effects; instead, the goal is to get a general idea of what the model is estimating and how it is performing. Toward this end, conditional effects plots for each model are also included. Note that these plots may not be very informative for the majority of the models tested because only a subset included many covariates.

Finally, as discussed in the manuscript, some additional details regarding the final model are also included here.
These details include summaries of the item parameters in traditional IRT metrics (i.e., as difficulty and discrimination), the reliability plot, expected score functions (total test), and information functions (total test and all trials). As in the manuscript, these item-specific plots are provided with and without assuming previous learning of the words. The test is best understood as a dynamic one in which a person's performance on each trial changes our expectation for how they will perform on the next.

## What's Being Done

As mentioned earlier, this document shows all the code used to generate the results. Since there is an accompanying R script, it may be useful for readers to know the objects being called in this Markdown document, as those objects can be connected back to the R script. The hope is that this will create a reasonable sense of cohesion between the supplementary materials, and it should mean that all the results here are also fully reproducible. Toward that end, the objects and packages used in this document are shown below (note that the R objects are read in as RDS files whose names are consistent with those listed in the R script file).

```{r SetupData}
#read in needed data
df_long <- readRDS("Data_long.rds")
Rasch_prior <- readRDS("Fitted Models/1PL_prior_check.rds")
TwoPL_prior <- readRDS("Fitted Models/2PL_prior_check.rds")
Rasch_inter <- readRDS("Fitted Models/1PL_intercept.rds")
TwoPL_inter <- readRDS("Fitted Models/2PL_intercept.rds")
TwoPL_learn <- readRDS("Fitted Models/2PL_growthModel.rds")
TwoPL_multi <- readRDS("Fitted Models/2PL_multidimensional.rds")
TwoPL_chnge <- readRDS("Fitted Models/2PL_changeModel.rds")
TwoPL_depmd <- readRDS("Fitted Models/2PL_dependencyModel.rds")
TwoPL_depun <- readRDS("Fitted Models/2PL_dependencyUniqueModel.rds")
TwoPL_deptr <- readRDS("Fitted Models/2PL_dependencyTrialModel.rds")
TwoPL_itmex <- readRDS("Fitted Models/2PL_itemCovariates.rds")

#load in required packages
#wrapping in suppressPackageStartupMessages() done just to reduce print out in document
suppressPackageStartupMessages(library(brms))
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(kableExtra))
```

# Modeling Flowchart

As can be inferred from the figure below, the modeling process involved several iterations of fitting similar models and considering results from old models. As there were specific hypotheses regarding the model and its covariates, the models needed to test these hypotheses were naturally created as part of the study. At the same time, alternative models had to be specified against which to test these models. Additionally, while there were hypotheses regarding what covariates (and their signs) would ultimately be in the final model, the important result of the study is the final model itself. Some readers may see the number of models examined and the variations in their specification and become concerned about two potential issues: multiple comparisons and inflated error rates, and/or model fishing/*p*-hacking.

The objective of this supporting document is to help clarify all the modeling choices so that readers do not need to question whether model specifications were made to try to improve the performance of the final result.
With respect to possible concerns regarding multiple comparisons, Bayesian methods do not suffer from these concerns (Gelman et al., 2013; [Gelman, Hill, & Yajima, 2012](http://www.stat.columbia.edu/~gelman/research/published/multiple2f.pdf); [Neath, Flores, & Cavanaugh, 2017](https://onlinelibrary.wiley.com/doi/abs/10.1002/wics.1420); [Sjölander & Vansteelandt, 2019](https://link.springer.com/article/10.1007/s10654-019-00517-2)). While there are several reasons that this is the case for Bayesian methods, it is sufficient to speak to three. First, we do not use null hypothesis testing in this study. Model comparisons are completed using a formal comparison of information criteria to select models with better out-of-sample performance. Coefficients are not interpreted as significant or not but instead are summarized in terms of their probability of existing. Since the correction to *p*-values for multiple comparisons is meant to control the risk of falsely rejecting the null hypothesis, this is not a concern when we are not rejecting null hypotheses. Second, we utilize skeptical priors for our effect estimates. This means that we are *a priori* placing greater probability on the effects being 0 (or practically equivalent to 0). This is the inverse of frequentist decision-making practices where the null hypothesis is very easy to reject since it is constrained to (usually) a nil point value, which is rarely a realistic threshold for any model specification. Finally, because the models include these skeptical priors, all effect estimates are pulled closer to those *a priori* small effects.

```{r Flowchart, echo = FALSE, fig.cap = "Flowchart of all models fit and compared in order (click to make larger)"}
knitr::include_graphics("C:\\Users/billy/Desktop/Psych Articles/HCAP/CERAD IRT Update/Figures/ItemCovariateModels_Flowchart.bmp")
```

# Prior Specifications and Inspection

As discussed in the manuscript for this study, prior specification came from documentation on using `brms` for IRT (i.e., [Bürkner, 2020a](https://arxiv.org/abs/1905.09501) and [Bürkner, 2020b](https://www.mdpi.com/2079-3200/8/1/5/htm)). As a general note, the non-linear specification for the 2PL model used in this study comes from the Bürkner (2020b) study published in the *Journal of Intelligence*. Also as discussed in the manuscript, the specification of priors follows the recommendations of other typical multilevel regression guides (e.g., Gelman & Hill, 2007). Specifically, the priors are normal distributions with wide variances relative to the scale of the outcome data. As these are priors on regression coefficients, normal distributions are appropriate prior distributions. While these distributions are centered on 0, they are made wide and thus only weakly informative to the final parameter estimates. This specification helps regularize estimation (i.e., pull estimates toward zero and away from more extreme values) while imparting little *a priori* influence on the estimates. Additionally, by making the priors skeptical (i.e., they place the greatest probability on very small to non-existent effects), there is a reduction in the risk of experimenter bias; however, with 1219 participants each observed 30 times, the data will dominate the prior anyway.
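As a quick check on what "wide" means on this scale, the prior mass implied by the normal(0, 5) coefficient prior used below can be computed directly in base R:

```{r ImpliedPriorMass, eval = FALSE}
#prior mass implied by a normal(0, 5) coefficient prior
pnorm(5, mean = 0, sd = 5) - pnorm(-5, mean = 0, sd = 5)   #~0.68 within +/- 1SD
pnorm(10, mean = 0, sd = 5) - pnorm(-10, mean = 0, sd = 5) #~0.95 within +/- 2SD
```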
The priors for the Rasch and 2PL models are shown below:

```{r PriorSpecifications, eval = FALSE}
Rasch_priors <-
  prior("normal(0, 5)", class = "b") +
  prior("normal(0, 3)", class = "sd", group = "ID")

TwoPL_priors <-
  prior("normal(0, 5)", class = "b", nlpar = "beta") +
  prior("normal(0, 1)", class = "b", nlpar = "logalpha") +
  prior("normal(0, 1)", class = "sd", group = "ID", nlpar = "theta")
```

Readers following the R script file will recognize that the above are repeated in that document ("3 - Measurement and Explanatory Model Fit", lines 19-27). To read these priors, it can be helpful to look at a couple of examples. Starting with the priors for the Rasch model, the prior for the coefficients (`class = "b"`) is specified as a normal distribution with a mean of zero and standard deviation of five. This means that, before looking at the data, we are guessing that there is a 68% probability of a coefficient being between -5 and 5 (i.e., +/- 1SD) and about a 95% probability of a coefficient being between -10 and 10 (i.e., +/- 2SD). In these models, the coefficients correspond to the actual item parameters (for the Rasch, this is item easiness), so these are the ranges of values that we are saying we might observe. Take another example, this time from the 2PL model and a random effect. The random effect priors are all labeled as `class = "sd"` since we are putting a prior belief on the plausible values of the standard deviation of the random effects. For the random person effect (i.e., the latent trait of each participant), we look for the variable that defines this group (`group = "ID"`, where ID is a number indexing each participant) and the non-linear element it is estimating (`nlpar = "theta"`, where $\theta$ is the traditional IRT symbol for the latent trait). The prior for the latent trait is therefore the standard normal distribution with a mean of zero and standard deviation of 1. This specification is consistent with the treatment of the latent trait in IRT as normally distributed and z-scaled, though generally in IRT models the variance is constrained to be 1 for identifiability purposes (see Bürkner, 2020a for details).

While it's comforting to be able to go through each prior specification and think about what it means, it is perhaps more efficient to examine some plots. The first set of plots shown are the parameter estimates returned when the model samples only from the prior. In other words, these are the estimated effects implied by the priors. If the prior specifications are truly skeptical and weakly informative, then they will give the greatest probability to effects of very small size and cover a wide range of plausible values. The prior distribution for the Rasch easiness parameter of "Butter" on trial one is shown below.

```{r RaschPriorEstimates}
plot(Rasch_prior, combo = c("dens", "intervals"), variable = "b_ItemButter1", ask = FALSE)
```

Consistent with expectations, these plots show wide ranges of plausible values with the greatest probabilities being placed on small effects. In the left column are the density plots of the estimates while the right column shows the interval estimates with the circle = mean estimate, bold line = 50% credible interval, and thin line = 90% credible interval. The prior density very clearly follows a normal distribution that allows admissible values, with more extreme estimates being given small overall probability of being true.
These wide priors allow the data to dominate the posterior estimates for the parameters, though again this would likely be the case with even more informative priors due to the size of the sample available.

```{r 2PLPriorEstimates}
plot(TwoPL_prior, combo = c("dens", "intervals"), variable = "b_beta_ItemButter1", ask = FALSE)
```

The plot above mirrors what was shown for the Rasch model but is for the 2PL model; unsurprisingly, it is essentially the same since the same prior is used for this parameter. Though all the prior distributions for the model could be shown, this would result in many different plots that can be overwhelming to interpret, and these plots correspond only to the priors for those specific parameters. Since researchers may not often have explicit expectations for what each parameter will or should be, these plots are not always directly helpful.

Another related graphical output is the prior predictive check. The prior predictive check runs the model using only the priors rather than the observed data (note that this model would be similar to a null model). If the priors are specified well, then they should return reasonable, albeit wide, estimates of the observed data. This plot is more intuitive for understanding the effect of the priors than the parameter-by-parameter plots above. The prior predictive checks for the Rasch and 2PL models are shown below following the same layout as the posterior predictive checks in the manuscript and for the other models.

```{r RaschPriorPredictiveCheck}
pp_check(Rasch_prior, ndraws = 25, type = "bars")
pp_check(Rasch_prior, ndraws = 25, type = "bars_grouped", group = "Item")
pp_check(Rasch_prior, ndraws = 25, type = "bars_grouped", group = "ID",
         newdata = subset(df_long, df_long$ID %in% as.factor(sample.int(n = 1219, size = 20, replace = FALSE))))
```

The Rasch prior predictive checks above demonstrate that the prior specifications are adequately wide to provide coverage for the observed data. The estimates themselves are expectedly poor, which is a product of the skeptical priors. The same plots are now repeated but for the 2PL model.

```{r 2PLPriorPredictiveChecks}
pp_check(TwoPL_prior, ndraws = 25, type = "bars")
pp_check(TwoPL_prior, ndraws = 25, type = "bars_grouped", group = "Item")
pp_check(TwoPL_prior, ndraws = 25, type = "bars_grouped", group = "ID",
         newdata = subset(df_long, df_long$ID %in% as.factor(sample.int(n = 1219, size = 20, replace = FALSE))))
```

Performance of the 2PL priors is similar to that of the Rasch priors, suggesting that these priors are also appropriately specified.

While all the above plots and theoretical justifications suggest that the priors are specified consistent with the wishes for the model, it can also be helpful to perform a post-hoc test of whether a model's priors were influential on its final estimates. As has been mentioned multiple times in this section, due to the sample size of this study, it is expected that the data and not the prior will dominate the posterior estimates, meaning that even with more informative priors the data would have more influence on the final estimates.
One such comparison discussed by Andrew Gelman (link [here](https://statmodeling.stat.columbia.edu/2019/08/10/for-each-parameter-or-other-qoi-compare-the-posterior-sd-to-the-prior-sd-if-the-posterior-sd-for-any-parameter-or-qoi-is-more-than-0-1-times-the-prior-sd-then-print-out-a-note-the-prior-dist/)) is to compare the posterior standard deviation (i.e., the precision of the effect estimate after looking at the data) to the prior standard deviation (i.e., the uncertainty of the effect estimate before looking at the data). In the case that a prior is influential, the ratio of the posterior standard deviation to the prior standard deviation will be large. Put another way, we learn little more about the parameter from observing the data because the prior was already highly informative. Gelman's recommended threshold for determining whether a prior is informative is if the posterior standard deviation for an effect is more than 0.1 times the prior standard deviation. The table below provides this metric for each predictor from the final model reported in the study.

```{r PriorSensitivityCheck}
#get the posterior samples from the final model
posteriors <- as_draws_df(TwoPL_itmex)

#get the fixed effects for the item easiness
beta <- posteriors %>%
  select(starts_with("b_beta")) %>%
  apply(., 2, function(x) sd(x)/sd(posteriors$prior_b_beta)) %>%
  as.matrix()

#do the same for item discrimination
alpha <- posteriors %>%
  select(starts_with("b_logalpha")) %>%
  apply(., 2, function(x) sd(x)/sd(posteriors$prior_b_logalpha)) %>%
  as.matrix()

#combine into a single result
result <- rbind(beta, alpha) %>%
  as.data.frame() %>%
  add_column("Prior Influence" = ifelse(.[, 1] >= 0.1, "Informative", "Uninformative")) %>%
  rename("Ratio (Posterior:Prior)" = V1)
row.names(result) <- paste(rep(c("Easiness:", "Discrimination:"), each = 11), rep(c("Intercept", "Trial 2", "Trial 3", "Item Pos. (linear)", "Item Pos. (quadratic)", "Word Frequency", "Concreteness", "Semantic Diversity", "Age of Acquisition", "Body-Object Integration", "Phonemes"), 2))
rm(posteriors, beta, alpha)

#get resulting table
result %>%
  kable(caption = "Comparison of the Posterior to Prior Distribution Standard Deviations", digits = 4, align = 'cc') %>%
  column_spec(1:3, bold = ifelse(result$`Ratio (Posterior:Prior)` >= 0.10, TRUE, FALSE)) %>%
  kable_classic(full_width = FALSE, position = "float_right")
```

The table of these posterior and prior comparison results is shown here. For convenience, only those whose ratio exceeds the recommended 0.10 threshold are bolded. Generally, the findings suggest that the priors performed as expected: they were weakly informative and did not seemingly have undue influence on the posterior estimates. Notably, the exceptions are the discrimination intercept, which had a 95% credible interval that included zero, and the item position coefficients, which had relatively larger standard errors. This is not unexpected as the result essentially indicates that even after looking at the data we do not change our prior beliefs, which were that the effects were either zero or highly uncertain. The influence of the prior on these null findings reflects a point made earlier regarding how Bayesian methods are generally unaffected by multiple comparisons.

An important point to emphasize at this juncture is the implication of a "significant" finding in Bayesian methods.
As discussed throughout this section on priors, the priors here are skeptical of an effect in the sense that they place the greatest weight on an effect estimate of zero or close to zero and are ambivalent regarding the direction of the effect (i.e., it is equally probable that the sign is positive or negative). In the context of the current study, this means that, despite the hypotheses regarding the presence and direction of specific effects, the priors for these predictors are specified in this skeptical way so as to avoid the introduction of experimenter bias. In regard to the robustness of the effects observed, the fact that they emerge from the information provided by the data despite these skeptical priors also helps build confidence in the presence of these effects.

# Model Details {.tabset .tabset-fade}

The details in this section highlight the model fitting and results. These details speak to the validity of the model results and then also the actual results (i.e., parameter estimates) of the model. Model validity is particularly important in Bayesian methods because the parameter estimates are based on Markov chain Monte Carlo (MCMC) methods (or Hamiltonian Monte Carlo (HMC) in the case of these models run using *Stan*). In cases where a model fails to converge or throws errors under the estimator, the validity of the model results is questionable or even completely compromised (e.g., in the case of divergent transitions). To reflect this need to first confirm the validity of the results, various diagnostics of the model fit are provided first before then presenting the model results.

For readers unfamiliar with these model checks, a brief overview of each is provided here. The largest threat to model results in HMC is arguably the presence of divergent transitions. HMC explores the posterior distribution by simulating the evolution of a Hamiltonian system, and in order to do this efficiently, the sampler finds a reasonable step size with which to explore that space. A divergent transition occurs when the trajectory of the system is lost due to too large of a step size. Another important model check is the treedepth of the chains. Again, to improve the efficiency of the posterior sampling, a maximum treedepth is set to prevent the estimator from spending excessive time in certain steps and spaces. Since this treedepth may artificially limit the estimator in exploring the posterior space, it is important to check whether any of these treedepths were actually hit during estimation (the default treedepth is 10). Another important Bayesian model indicator is $\hat{R}$ because multiple HMC (and MCMC) chains are needed to ensure that the posterior is sampled appropriately. If a single chain is run, then it is not possible to determine whether the random starting values of this chain may have led to a specific set of parameter estimates. Running multiple independent chains that each have different random starting values helps ensure that the parameter estimates are not biased by exploration of only certain posterior values. In well-behaved models, these chains will mix together without any clear indications of one chain producing a specific set of parameter estimates that differ from what the other chains are estimating. While this mixing of chains can be visually inspected via the trace plot (also provided here), the $\hat{R}$ statistic is a simple indicator of this with the conservative recommendation of treating estimates as valid only if the $\hat{R}$ for the parameter is less than 1.01.
A final model validity check shown here is the effective sample size. Because multiple chains are run for many samples of the posterior, it is expected that some of those samples are autocorrelated and thus dependent on previous samples. The effective sample size informs us of the precision of the model estimates in MCMC and HMC methods. When samples are independent, the central limit theorem indicates that the precision with which a parameter can be estimated is proportional to the size of the sample (e.g., $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}}$). The same proportionality can be obtained when samples are dependent but requires replacing $N$ with $N_{ESS}$, the effective sample size. Due to the dependence of the sampling, $N_{ESS} < N$, and thus the precision of the estimate is less than it would be if it could be estimated from the total sample. Running the chains for more iterations will necessarily increase $N_{ESS}$, but there is a practical tradeoff between computational effort and marginal increases in precision. The recommendation of the *Stan* developers is to run enough iterations of the sampler to obtain $N_{ESS} \geq 100 \times N_{chains}$. All models were run using 4 independent chains, so the minimally acceptable ESS is 400 (i.e., 4 * 100).
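To make the relationship between effective sample size and precision concrete, the sketch below computes the bulk ESS and the implied Monte Carlo standard error for a single parameter. The parameter name is assumed from the prior plots shown earlier, and the `posterior` package is required.

```{r EssSketch, eval = FALSE}
#illustrative only: ESS and Monte Carlo standard error for one parameter
library(posterior)

draws <- as_draws_df(Rasch_inter)
ess <- ess_bulk(draws$b_ItemButter1)          #effective sample size for this parameter
mcse <- sd(draws$b_ItemButter1) / sqrt(ess)   #precision of the posterior mean
c(ESS = ess, MCSE = mcse, Minimum = 100 * 4)  #compare against the 100 x chains rule
```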
In the case that the model checks pass, it is appropriate to examine the posterior distribution and begin inference based on the results. While there are more parameters estimated by the models, posterior summaries of the coefficients in each model are shown. The posterior is summarized as a density plot that reflects the probability distribution of the parameter based on integrating our prior knowledge and the observed data. The density plot shows the 95% credible interval with the 80% credible interval shown as a shaded area. Unlike frequentist confidence intervals, these credible intervals can be interpreted as the probability of the parameter having a specific value. For example, if the 95% credible interval ranges from 0.50 to 1.00, then this means that there is a probability of 0.95 that the parameter has a value somewhere within this interval. This is in contrast to frequentist confidence intervals where the same interval would be interpreted as meaning that 95% of point estimates based on the same statistical test applied to an infinite number of random samples of the same population would be within this interval. Thus, where the credible interval directly summarizes our beliefs about the parameter and our uncertainty about its true value, the confidence interval only reflects point estimates that we would expect to observe if the study and statistical methods were repeated an infinite number of times. Posterior predictive checks of the models are then also presented as they were for the final model in the corresponding manuscript. There is one additional model exploration plot provided in this section that has not been addressed before in this document: the conditional effects plot. As the predictors in these models are correlated and have their effects estimated on the logit scale, it can be challenging to look at the model estimates and understand the implication of these values in an intuitive manner. One way to address this is to visualize how the predicted outcome of the model changes as a function of each predictor while holding all other predictors constant (e.g., at their mean value). The resulting plot is the conditional effects plot. In the case of these models, this plot shows, for each predictor, what happens to the predicted probability of a correct response as the value of the predictor changes and all other model values are held constant. These plots are not realistic as it is not reasonable to assume that there exist words whose traits can vary on only one property at a time; however, they do provide a quick method of understanding the relative effect of each predictor by showing its linear trend as implied by the model. As a result, these plots should not be used for prediction or extrapolation in any regard; instead, if the goal is prediction of responses, then the entire model should be used, and extrapolation of these predictions to values not observed in this study should be avoided. These plots are simply to help contextualize the meaning of the effect estimates in the model.

## Rasch Fixed Items

```{r RaschPlots}
mcmc_plot(Rasch_inter, type = "nuts_divergence")
mcmc_plot(Rasch_inter, type = "nuts_treedepth")
mcmc_plot(Rasch_inter, type = "trace", variable = "b_Item", regex = TRUE)
mcmc_plot(Rasch_inter, type = "trace", variable = "sd_", regex = TRUE)
mcmc_plot(Rasch_inter, type = "rhat_hist", binwidth = 0.0001)
mcmc_plot(Rasch_inter, type = "neff_hist", binwidth = 0.1)
```

The fixed items Rasch model demonstrated no evidence of estimation problems that would call the validity of the results into question. As a result, we can look at the results from the model overall.

```{r RaschResults}
mcmc_plot(Rasch_inter, type = "areas_ridges", prob = 0.80, prob_outer = 0.95, variable = "b_Item", regex = TRUE)
pp_check(Rasch_inter, ndraws = 50, type = "bars")
pp_check(Rasch_inter, ndraws = 50, type = "bars_grouped", group = "Item")
pp_check(Rasch_inter, ndraws = 50, type = "bars_grouped", group = "ID", newdata = subset(df_long, df_long$ID %in% as.factor(sample.int(n = 1219, size = 20, replace = FALSE))))
```

Even with no predictors and fewer parameters in the Rasch model, the model does very well in predicting responses. The density plot of the item easiness estimates demonstrates that most items are fairly easy. The following general model summary integrates basic model validity statistics and posterior summaries for additional parameters.

```{r RaschSummary}
summary(Rasch_inter)
```

To help clarify the meaning of some major elements of the above output, consider the following guide:

1. "Estimate" refers to the average posterior value for the parameter,
2. "Est. Error" is the standard deviation of the posterior distribution,
3. "l-95% CI" is the lower bound of the 95% credible interval,
4. "u-95% CI" is the upper bound of the 95% credible interval,
5. "Rhat" is the $\hat{R}$ value for that parameter (rounded to two decimal places),
6. "Bulk_ESS" is the effective sample size based on rank-normalized draws and estimates the sampling efficiency of the mean of the posterior, and
7. "Tail_ESS" is the minimum of the effective sample sizes in the 5% and 95% quantiles.
## 2PL Fixed Items

```{r 2PLInterPlots}
mcmc_plot(TwoPL_inter, type = "nuts_divergence")
mcmc_plot(TwoPL_inter, type = "nuts_treedepth")
mcmc_plot(TwoPL_inter, type = "trace", variable = "b_beta_Item", regex = TRUE)
mcmc_plot(TwoPL_inter, type = "trace", variable = "b_logalpha_Item", regex = TRUE)
mcmc_plot(TwoPL_inter, type = "trace", variable = "sd_", regex = TRUE)
mcmc_plot(TwoPL_inter, type = "rhat_hist", binwidth = 0.0001)
mcmc_plot(TwoPL_inter, type = "neff_hist", binwidth = 0.1)
```

The fixed item 2PL model demonstrated no evidence of estimation problems that would call the validity of the results into question. As a result, we can look at the results from the model overall.

```{r 2PLInterResults}
mcmc_plot(TwoPL_inter, type = "areas_ridges", prob = 0.80, prob_outer = 0.95, variable = "b_beta_Item", regex = TRUE)
mcmc_plot(TwoPL_inter, type = "areas_ridges", prob = 0.80, prob_outer = 0.95, variable = "b_logalpha_Item", regex = TRUE)
pp_check(TwoPL_inter, ndraws = 50, type = "bars")
pp_check(TwoPL_inter, ndraws = 50, type = "bars_grouped", group = "Item")
pp_check(TwoPL_inter, ndraws = 50, type = "bars_grouped", group = "ID", newdata = subset(df_long, df_long$ID %in% as.factor(sample.int(n = 1219, size = 20, replace = FALSE))))
```

Much like the Rasch model, the 2PL fixed item model does very well in predicting responses. There are two parameters (one for easiness/beta and one for discrimination/alpha) being estimated since the 2PL model is a non-linear model. Note that the model's name for the discrimination parameter is "logalpha." This name reflects the fact that the alpha parameter was log-transformed to ensure that it was constrained in estimation to positive values. The following general model summary integrates basic model validity statistics and posterior summaries for additional parameters. See the guide at the end of the Rasch intercept tab for details regarding the meaning of each value.

```{r 2PLInterSummary}
summary(TwoPL_inter)
```

## Multidimensional 2PL Model

```{r 2PLMultiPlots}
mcmc_plot(TwoPL_multi, type = "nuts_divergence")
mcmc_plot(TwoPL_multi, type = "nuts_treedepth")
mcmc_plot(TwoPL_multi, type = "trace", variable = "b_beta_Item", regex = TRUE)
mcmc_plot(TwoPL_multi, type = "trace", variable = "b_logalpha_Item", regex = TRUE)
mcmc_plot(TwoPL_multi, type = "trace", variable = "sd_", regex = TRUE)
mcmc_plot(TwoPL_multi, type = "rhat_hist", binwidth = 0.0001)
mcmc_plot(TwoPL_multi, type = "neff_hist", binwidth = 0.1)
```

The 2PL model with a unique factor per trial shows no significant modeling concerns. Since there are no validity concerns, we can look at the results from the model overall.

```{r 2PLMultiResults}
mcmc_plot(TwoPL_multi, type = "areas_ridges", prob = 0.80, prob_outer = 0.95, variable = "b_beta_Item", regex = TRUE)
mcmc_plot(TwoPL_multi, type = "areas_ridges", prob = 0.80, prob_outer = 0.95, variable = "b_logalpha_Item", regex = TRUE)
pp_check(TwoPL_multi, ndraws = 50, type = "bars")
pp_check(TwoPL_multi, ndraws = 50, type = "bars_grouped", group = "Item")
pp_check(TwoPL_multi, ndraws = 50, type = "bars_grouped", group = "ID", newdata = subset(df_long, df_long$ID %in% as.factor(sample.int(n = 1219, size = 20, replace = FALSE))))
```

Like the other 2PL models, this model estimates responses with high accuracy. The following general model summary integrates basic model validity statistics and posterior summaries for additional parameters.
```{r 2PLMultiSummary}
summary(TwoPL_multi)
```

## Latent Change Model

```{r 2PLChangePlots}
mcmc_plot(TwoPL_chnge, type = "nuts_divergence")
mcmc_plot(TwoPL_chnge, type = "nuts_treedepth")
mcmc_plot(TwoPL_chnge, type = "trace", variable = "b_beta_Item", regex = TRUE)
mcmc_plot(TwoPL_chnge, type = "trace", variable = "b_logalpha_Item", regex = TRUE)
mcmc_plot(TwoPL_chnge, type = "trace", variable = "sd_", regex = TRUE)
mcmc_plot(TwoPL_chnge, type = "rhat_hist", binwidth = 0.0001)
mcmc_plot(TwoPL_chnge, type = "neff_hist", binwidth = 0.1)
```

The 2PL model with factors for change across trials likewise shows no significant modeling concerns. Since there are no validity concerns, we can look at the results from the model overall.

```{r 2PLChangeResults}
mcmc_plot(TwoPL_chnge, type = "areas_ridges", prob = 0.80, prob_outer = 0.95, variable = "b_beta_Item", regex = TRUE)
mcmc_plot(TwoPL_chnge, type = "areas_ridges", prob = 0.80, prob_outer = 0.95, variable = "b_logalpha_Item", regex = TRUE)
pp_check(TwoPL_chnge, ndraws = 50, type = "bars")
pp_check(TwoPL_chnge, ndraws = 50, type = "bars_grouped", group = "Item")
pp_check(TwoPL_chnge, ndraws = 50, type = "bars_grouped", group = "ID", newdata = subset(df_long, df_long$ID %in% as.factor(sample.int(n = 1219, size = 20, replace = FALSE))))
```

Consistent with the other 2PL models so far, this model appears to capture the data-generating process closely. See the guide at the end of the Rasch intercept tab for details regarding the meaning of each value.

```{r 2PLChangeSummary}
summary(TwoPL_chnge)
```

## Latent Growth (Learning) Model

```{r 2PLLearnPlots}
mcmc_plot(TwoPL_learn, type = "nuts_divergence")
mcmc_plot(TwoPL_learn, type = "nuts_treedepth")
mcmc_plot(TwoPL_learn, type = "trace", variable = "b_beta_Item", regex = TRUE)
mcmc_plot(TwoPL_learn, type = "trace", variable = "b_logalpha_Item", regex = TRUE)
mcmc_plot(TwoPL_learn, type = "trace", variable = "sd_", regex = TRUE)
mcmc_plot(TwoPL_learn, type = "rhat_hist", binwidth = 0.0001)
mcmc_plot(TwoPL_learn, type = "neff_hist", binwidth = 0.1)
```

The 2PL model with a growth/learning factor also yields no estimation errors. Since there are no validity concerns for the model, we can look at its results.

```{r 2PLLearnResults}
mcmc_plot(TwoPL_learn, type = "areas_ridges", prob = 0.80, prob_outer = 0.95, variable = "b_beta_Item", regex = TRUE)
mcmc_plot(TwoPL_learn, type = "areas_ridges", prob = 0.80, prob_outer = 0.95, variable = "b_logalpha_Item", regex = TRUE)
pp_check(TwoPL_learn, ndraws = 50, type = "bars")
pp_check(TwoPL_learn, ndraws = 50, type = "bars_grouped", group = "Item")
pp_check(TwoPL_learn, ndraws = 50, type = "bars_grouped", group = "ID", newdata = subset(df_long, df_long$ID %in% as.factor(sample.int(n = 1219, size = 20, replace = FALSE))))
```

Continuing the trend of the other models, these posterior checks closely align with the observed data. See the guide at the end of the Rasch intercept tab for details regarding the meaning of each value.
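A small reproducibility note on the person-grouped posterior predictive checks used throughout: the 20 displayed participants are drawn at random each time the document is knitted. Readers re-running the code may wish to fix the random seed first, as in this illustrative sketch (the seed value is arbitrary):

```{r SeededPPCheck, eval = FALSE}
# Illustrative only: fix the seed so the same 20 participants are drawn each run
set.seed(1234)  # arbitrary seed chosen purely for illustration
ids <- sample.int(n = 1219, size = 20, replace = FALSE)
pp_check(TwoPL_learn, ndraws = 50, type = "bars_grouped", group = "ID",
         newdata = subset(df_long, ID %in% as.factor(ids)))
```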
```{r 2PLLearnSummary}
summary(TwoPL_learn)
```

## Local Dependency Model

```{r 2PLDependPlots}
mcmc_plot(TwoPL_depmd, type = "nuts_divergence")
mcmc_plot(TwoPL_depmd, type = "nuts_treedepth")
mcmc_plot(TwoPL_depmd, type = "trace", variable = "b_beta_Item", regex = TRUE)
mcmc_plot(TwoPL_depmd, type = "trace", variable = "b_logalpha_Item", regex = TRUE)
mcmc_plot(TwoPL_depmd, type = "trace", variable = "sd_", regex = TRUE)
mcmc_plot(TwoPL_depmd, type = "rhat_hist", binwidth = 0.0001)
mcmc_plot(TwoPL_depmd, type = "neff_hist", binwidth = 0.1)
```

The 2PL model with a uniform local dependency effect fit without any modeling concerns. Given the lack of validity issues, the overall results of the model are shown next.

```{r 2PLDependResults}
mcmc_plot(TwoPL_depmd, type = "areas_ridges", prob = 0.80, prob_outer = 0.95, variable = "b_beta_Item", regex = TRUE)
mcmc_plot(TwoPL_depmd, type = "areas_ridges", prob = 0.80, prob_outer = 0.95, variable = "b_logalpha_Item", regex = TRUE)
pp_check(TwoPL_depmd, ndraws = 50, type = "bars")
pp_check(TwoPL_depmd, ndraws = 50, type = "bars_grouped", group = "Item")
pp_check(TwoPL_depmd, ndraws = 50, type = "bars_grouped", group = "ID", newdata = subset(df_long, df_long$ID %in% as.factor(sample.int(n = 1219, size = 20, replace = FALSE))))
```

As with the other models, the predictive checks closely align with the data. See the guide at the end of the Rasch intercept tab for details regarding the meaning of each value.

```{r 2PLDependSummary}
summary(TwoPL_depmd)
```

## Local Dependency by Item Model

```{r 2PLUniquePlots}
mcmc_plot(TwoPL_depun, type = "nuts_divergence")
mcmc_plot(TwoPL_depun, type = "nuts_treedepth")
mcmc_plot(TwoPL_depun, type = "trace", variable = "b_beta_Item", regex = TRUE)
mcmc_plot(TwoPL_depun, type = "trace", variable = "b_logalpha_Item", regex = TRUE)
mcmc_plot(TwoPL_depun, type = "trace", variable = "sd_", regex = TRUE)
mcmc_plot(TwoPL_depun, type = "rhat_hist", binwidth = 0.0001)
mcmc_plot(TwoPL_depun, type = "neff_hist", binwidth = 0.1)
```

As with the model with a uniform local dependency effect, the model with a unique dependency effect per item demonstrated no estimation errors, so the overall results of the model are shown next.

```{r 2PLUniqueResults}
mcmc_plot(TwoPL_depun, type = "areas_ridges", prob = 0.80, prob_outer = 0.95, variable = "b_beta_Item", regex = TRUE)
mcmc_plot(TwoPL_depun, type = "areas_ridges", prob = 0.80, prob_outer = 0.95, variable = "b_logalpha_Item", regex = TRUE)
pp_check(TwoPL_depun, ndraws = 50, type = "bars")
pp_check(TwoPL_depun, ndraws = 50, type = "bars_grouped", group = "Item")
pp_check(TwoPL_depun, ndraws = 50, type = "bars_grouped", group = "ID", newdata = subset(df_long, df_long$ID %in% as.factor(sample.int(n = 1219, size = 20, replace = FALSE))))
```

These posterior predictive checks also align strongly with the observed data. See the guide at the end of the Rasch intercept tab for details regarding the meaning of each value.
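Both local dependency models condition on whether a word was recalled on the preceding trial. The exact construction of that covariate is in the accompanying R script files; the sketch below shows one common way to build such a lagged indicator, with the column names (ID, Item, Trial, Response) assumed here purely for illustration:

```{r LaggedRecallSketch, eval = FALSE}
# Illustrative only: a lagged "previous recall" indicator; column names assumed
library(dplyr)

df_long <- df_long %>%
  group_by(ID, Item) %>%                               # one trajectory per person-word pair
  arrange(Trial, .by_group = TRUE) %>%
  mutate(PrevRecall = lag(Response, default = 0)) %>%  # 1 if recalled on the prior trial
  ungroup()
```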
```{r 2PLUniqueSummary}
summary(TwoPL_depun)
```

## Local Dependency by Trial Model

```{r 2PLTrialPlots}
mcmc_plot(TwoPL_deptr, type = "nuts_divergence")
mcmc_plot(TwoPL_deptr, type = "nuts_treedepth")
mcmc_plot(TwoPL_deptr, type = "trace", variable = "b_beta_Item", regex = TRUE)
mcmc_plot(TwoPL_deptr, type = "trace", variable = "b_logalpha_Item", regex = TRUE)
mcmc_plot(TwoPL_deptr, type = "trace", variable = "sd_", regex = TRUE)
mcmc_plot(TwoPL_deptr, type = "rhat_hist", binwidth = 0.0001)
mcmc_plot(TwoPL_deptr, type = "neff_hist", binwidth = 0.1)
```

Decomposing the dependency matrix into unique trial effects, rather than unique item effects, also raises no modeling validity concerns. As with the other models, the overall results are shown next.

```{r 2PLTrialResults}
mcmc_plot(TwoPL_deptr, type = "areas_ridges", prob = 0.80, prob_outer = 0.95, variable = "b_beta_Item", regex = TRUE)
mcmc_plot(TwoPL_deptr, type = "areas_ridges", prob = 0.80, prob_outer = 0.95, variable = "b_logalpha_Item", regex = TRUE)
pp_check(TwoPL_deptr, ndraws = 50, type = "bars")
pp_check(TwoPL_deptr, ndraws = 50, type = "bars_grouped", group = "Item")
pp_check(TwoPL_deptr, ndraws = 50, type = "bars_grouped", group = "ID", newdata = subset(df_long, df_long$ID %in% as.factor(sample.int(n = 1219, size = 20, replace = FALSE))))
```

The posterior predictive checks are consistent with those of the other models in that the model predictions and observed data closely align. See the guide at the end of the Rasch intercept tab for details regarding the meaning of each value.

```{r 2PLTrialSummary}
summary(TwoPL_deptr)
```

## Item Covariates

```{r 2PLItemCovPlots}
mcmc_plot(TwoPL_itmex, type = "nuts_divergence")
mcmc_plot(TwoPL_itmex, type = "nuts_treedepth")
mcmc_plot(TwoPL_itmex, type = "trace", variable = "b_beta", regex = TRUE)
mcmc_plot(TwoPL_itmex, type = "trace", variable = "b_logalpha", regex = TRUE)
mcmc_plot(TwoPL_itmex, type = "trace", variable = "sd_", regex = TRUE)
mcmc_plot(TwoPL_itmex, type = "rhat_hist", binwidth = 0.0001)
mcmc_plot(TwoPL_itmex, type = "neff_hist", binwidth = 0.1)
```

The 2PL model with all item covariates demonstrated no evidence of estimation problems that would call the validity of the results into question. Since there are no validity concerns, we can look at the results from the model overall.

```{r 2PLItemCovResults}
mcmc_plot(TwoPL_itmex, type = "areas_ridges", prob = 0.80, prob_outer = 0.95, variable = "b_beta", regex = TRUE)
mcmc_plot(TwoPL_itmex, type = "areas_ridges", prob = 0.80, prob_outer = 0.95, variable = "b_logalpha", regex = TRUE)
pp_check(TwoPL_itmex, ndraws = 50, type = "bars")
pp_check(TwoPL_itmex, ndraws = 50, type = "bars_grouped", group = "ID", newdata = subset(df_long, df_long$ID %in% as.factor(sample.int(n = 1219, size = 20, replace = FALSE))))
```

Like the other 2PL models, this model estimates responses with high accuracy. The following general model summary integrates basic model validity statistics with posterior summaries for the remaining parameters. See the guide at the end of the Rasch intercept tab for details regarding the meaning of each value.

```{r 2PLItemCovSummary}
summary(TwoPL_itmex)
```

# Final Model Details

This final section provides further details on the final item covariate model identified in this study. To begin, some additional details regarding the model's performance and fit are provided for readers.
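Regarding comparative fit, a standard way to compare brms models like those fit above is approximate leave-one-out cross-validation through the loo package. The sketch below illustrates the mechanics only; it is not necessarily the comparison procedure reported in the manuscript:

```{r LooSketch, eval = FALSE}
# Illustrative only: approximate leave-one-out cross-validation for model comparison
loo_rasch <- loo(Rasch_inter)
loo_2pl   <- loo(TwoPL_inter)
loo_final <- loo(TwoPL_itmex)
loo_compare(loo_rasch, loo_2pl, loo_final)  # higher elpd indicates better expected fit
```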
The manuscript provides summaries of coefficients as well as item and person fit statistics. In a previous section, this supplemental document provided visual summaries of the coefficients as well. To further enrich the visual understanding of the final model, the following plots describe item and person fit in more detail:

```{r FitStats, echo = FALSE}
item_fit <- readRDS("ItemFit.rds")
person_fit <- readRDS("PersonFit.rds")

# Register Times New Roman for plotting (Windows-specific; adjust on other platforms)
windowsFonts(Times=windowsFont("Times New Roman"))

theme_hist <- function(...) {
  bayesplot::theme_default() +
    theme(
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank(),
      axis.title.y = element_blank(),
      ...
    )
}

item_fit %>%
  ggplot(aes(crit_diff)) +
  geom_histogram() +
  facet_wrap("Item", scales = "free") +
  theme_hist() +
  theme(text=element_text(family="Times", face="bold", size=12))

# Identify the person with the worst fit (largest share of positive differences)
person_diff <- person_fit %>%
  group_by(ID) %>%
  summarise(bp = mean(crit_diff > 0))
person_max <- which.max(person_diff$bp)

person_fit %>%
  filter(ID == person_max) %>%
  ggplot(aes(crit_diff)) +
  geom_histogram() +
  theme_hist() +
  theme(text=element_text(family="Times", face="bold", size=12))
```

These two plots show the empirical distribution of the differences in log-likelihood between the predicted and observed data under the model. The first plot is grouped by item and reflects item fit. Under an ideal model, these differences are normally distributed with a mean of 0. Because plotting such a distribution for each of the 1,219 participants is impractical, only the most extreme case (the individual with the worst fit) is shown. While the aim of this study was to examine predictors of item traits, it is still possible to visualize person traits (i.e., memory). The following plot shows the ordered distribution of latent trait estimates and their 95% credible intervals for each participant in the study:

```{r PersonPlots, echo = FALSE}
person_pars <- readRDS("PersonEstimates.rds")

person_pars %>%
  arrange(Estimate) %>%
  mutate(id2 = seq_len(n())) %>%
  ggplot(aes(id2, Estimate, ymin = Q2.5, ymax = Q97.5)) +
  geom_pointrange(alpha = 0.7) +
  coord_flip() +
  labs(x = "Person Number (sorted by Estimate)") +
  theme(
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank()
  ) +
  theme(text=element_text(family="Times", face="bold", size=12))
```

The plot demonstrates that, on the whole, the credible intervals around each person's latent trait estimate are fairly wide. This is not entirely surprising given that the latent trait is estimated with nothing more than a random intercept for each person. The explanatory item response theory model will be expanded in a future study to include person-level predictors of the latent trait, which may reduce uncertainty in the latent trait measurement. That said, the largest limit on the precision of the latent trait estimates is that memory is measured with just 10 items repeated over only 3 trials, so person predictors are unlikely to be a major source of error reduction. It is also potentially useful to visualize all of the conditional effects from the final model. These plots are shown next:

```{r FinalConditionalEffects, echo = FALSE}
plot(conditional_effects(TwoPL_itmex), ask = FALSE)
```

There are also some general test descriptives that are usually helpful to examine. These include test reliability, the expected score function, and test information.
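As a point of reference for the reliability plot that follows, the conditional reliability in the next code chunk is computed from the estimated latent trait variance and each person's posterior standard error:

$$r_{xx}(\hat{\theta}) = \frac{\sigma^2_{\theta}}{\sigma^2_{\theta} + SE(\hat{\theta})^2}$$

where $\sigma^2_{\theta}$ is the variance of the latent trait distribution and $SE(\hat{\theta})$ is the posterior standard error of a given person's ability estimate.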
The following plots present these visualizations in that order:

```{r ExtraPlots, echo = FALSE}
# Conditional reliability: latent trait variance relative to each person's
# posterior uncertainty
theta_est <- as.data.frame(ranef(TwoPL_deptr, summary = TRUE)$ID)[, 1:2]
var_theta <- as.numeric(VarCorr(TwoPL_deptr)[[1]][[1]][1])
reliability_est <- var_theta/(var_theta + theta_est[, 2]^2)
rel_dat <- data.frame(Theta = theta_est[, 1], Rxx = reliability_est)
rm(theta_est, var_theta, reliability_est)

ggplot(data = rel_dat, aes(x = Theta, y = Rxx)) +
  geom_smooth(color = "black", size = 1.10) +
  ylab("Reliability Estimate") +
  xlab("Person Ability Estimate") +
  theme_bw() +
  theme(text=element_text(family="Times", face="bold", size=12))

fixed <- as.data.frame(fixef(TwoPL_deptr, summary = FALSE))

# Item characteristic curves for words without prior recall
temp <- fixed %>%
  select(contains("Item")) %>%
  mutate(iter = 1:n()) %>%
  pivot_longer(starts_with(c("beta_", "logalpha_"))) %>%
  mutate(Item = str_remove(name, "^beta_Item|^logalpha_Item"),
         Parameter = ifelse(str_detect(name, "beta"), "Difficulty", "Discrimination")) %>%
  select(-name) %>%
  pivot_wider(names_from = Parameter, values_from = value) %>%
  expand(nesting(iter, Item, Difficulty, Discrimination),
         Theta = seq(-6, 6, length.out = 100)) %>%
  mutate(p = inv_logit_scaled(Difficulty + exp(Discrimination) * Theta)) %>%
  group_by(Theta, Item) %>%
  summarise(ci = list(as.data.frame(posterior_summary(p)) %>%
                        rename(p = Estimate, ymin = Q2.5, ymax = Q97.5))) %>%
  unnest(cols = c(ci)) %>%
  select(-Est.Error) %>%
  mutate(Trial = rep(1:3, times = 10),
         Item = str_remove(Item, "\\d$"))

# Shift the item parameters by the recall-dependency effects (the column
# positions below are specific to this model's fixef() output)
fixed1 <- fixed
fixed1[, 13:22] <- fixed[, 13:22]+fixed[, 1]
fixed1[, 23:32] <- fixed[, 23:32]+fixed[, 2]
fixed1[, 45:54] <- fixed[, 45:54]+fixed[, 33]
fixed1[, 55:64] <- fixed[, 55:64]+fixed[, 34]

# Item characteristic curves for words recalled on the prior trial
temp1 <- fixed1 %>%
  select(contains("Item")) %>%
  mutate(iter = 1:n()) %>%
  pivot_longer(starts_with(c("beta_", "logalpha_"))) %>%
  mutate(Item = str_remove(name, "^beta_Item|^logalpha_Item"),
         Parameter = ifelse(str_detect(name, "beta"), "Difficulty", "Discrimination")) %>%
  select(-name) %>%
  pivot_wider(names_from = Parameter, values_from = value) %>%
  expand(nesting(iter, Item, Difficulty, Discrimination),
         Theta = seq(-6, 6, length.out = 100)) %>%
  mutate(p = inv_logit_scaled(Difficulty + exp(Discrimination) * Theta)) %>%
  group_by(Theta, Item) %>%
  summarise(ci = list(as.data.frame(posterior_summary(p)) %>%
                        rename(p = Estimate, ymin = Q2.5, ymax = Q97.5))) %>%
  unnest(cols = c(ci)) %>%
  select(-Est.Error) %>%
  mutate(Trial = rep(1:3, times = 10),
         Item = str_remove(Item, "\\d$"))

ICCpars <- rbind(temp, temp1)
ICCpars$Dependency <- rep(c("No Prior Recall", "Previous Recall"), each = 3000)
ICCpars <- ICCpars[, c(6, 2, 1, 3, 4:5, 7)]

# Expected score functions: sum of the item response probabilities per trial
TESpars <- ICCpars %>%
  group_by(Theta, Trial, Dependency) %>%
  summarise(Expected = sum(p))

TESpars[TESpars$Trial == 1 & TESpars$Dependency == "No Prior Recall", ] %>%
  ggplot(aes(x = Theta, y = Expected)) +
  geom_line(size = 1.05) +
  labs(title = "Expected Score Function for Trial 1",
       x = expression(theta~('ability on the logit scale')),
       y = "Expected Raw Score") +
  ylim(0, 10) +
  theme_classic()

TESpars[TESpars$Trial == 2, ] %>%
  ggplot(aes(x = Theta, y = Expected, linetype = Dependency)) +
  geom_line(size = 1.05) +
  labs(title = "Expected Score Function for Trial 2",
       x = expression(theta~('ability on the logit scale')),
       y = "Expected Raw Score") +
  ylim(0, 10) +
  theme_classic()

TESpars[TESpars$Trial == 3, ] %>%
  ggplot(aes(x = Theta, y = Expected, linetype = Dependency)) +
  geom_line(size = 1.05) +
  labs(title = "Expected Score Function for Trial 3",
       x = expression(theta~('ability on the logit scale')),
       y = "Expected Raw Score") +
  ylim(0, 10) +
  theme_classic()

# Item information curves, computed at each posterior draw as p * (1 - p)
temp <- fixed %>%
  select(contains("Item")) %>%
  mutate(iter = 1:n()) %>%
  pivot_longer(starts_with(c("beta_", "logalpha_"))) %>%
  mutate(Item = str_remove(name, "^beta_Item|^logalpha_Item"),
         Parameter = ifelse(str_detect(name, "beta"), "Difficulty", "Discrimination")) %>%
  select(-name) %>%
  pivot_wider(names_from = Parameter, values_from = value) %>%
  expand(nesting(iter, Item, Difficulty, Discrimination),
         Theta = seq(-6, 6, length.out = 100)) %>%
  mutate(p = inv_logit_scaled(Difficulty + exp(Discrimination) * Theta)) %>%
  mutate(i = p * (1 - p)) %>%
  group_by(Theta, Item) %>%
  summarise(ci = list(as.data.frame(posterior_summary(i)) %>%
                        rename(i = Estimate, ymin = Q2.5, ymax = Q97.5))) %>%
  unnest(cols = c(ci)) %>%
  select(-Est.Error) %>%
  mutate(Trial = rep(1:3, times = 10),
         Item = str_remove(Item, "\\d$"))

temp1 <- fixed1 %>%
  select(contains("Item")) %>%
  mutate(iter = 1:n()) %>%
  pivot_longer(starts_with(c("beta_", "logalpha_"))) %>%
  mutate(Item = str_remove(name, "^beta_Item|^logalpha_Item"),
         Parameter = ifelse(str_detect(name, "beta"), "Difficulty", "Discrimination")) %>%
  select(-name) %>%
  pivot_wider(names_from = Parameter, values_from = value) %>%
  expand(nesting(iter, Item, Difficulty, Discrimination),
         Theta = seq(-6, 6, length.out = 100)) %>%
  mutate(p = inv_logit_scaled(Difficulty + exp(Discrimination) * Theta)) %>%
  mutate(i = p * (1 - p)) %>%
  group_by(Theta, Item) %>%
  summarise(ci = list(as.data.frame(posterior_summary(i)) %>%
                        rename(i = Estimate, ymin = Q2.5, ymax = Q97.5))) %>%
  unnest(cols = c(ci)) %>%
  select(-Est.Error) %>%
  mutate(Trial = rep(1:3, times = 10),
         Item = str_remove(Item, "\\d$"))

IICpars <- rbind(temp, temp1)
IICpars$Dependency <- rep(c("No Prior Recall", "Previous Recall"), each = 3000)
IICpars <- IICpars[, c(6, 2, 1, 3, 4:5, 7)]

IICpars[IICpars$Trial == 1 & IICpars$Dependency == "No Prior Recall", ] %>%
  mutate(Item = factor(Item, levels = c("Butter", "Arm", "Shore", "Letter", "Queen", "Cabin", "Pole", "Ticket", "Grass", "Engine"))) %>%
  ggplot(aes(x = Theta, y = i)) +
  geom_line(size = 1.05) +
  geom_ribbon(aes(ymin = ymin, ymax = ymax), alpha = 0.25) +
  facet_wrap(~ Item, ncol = 5) +
  labs(title = "IICs for the 2PL",
       subtitle = "Each curve is based on the posterior median.",
       x = expression(theta~('ability on the logit scale')),
       y = "Information") +
  theme_classic()

IICpars[IICpars$Trial == 2, ] %>%
  mutate(Item = factor(Item, levels = c("Ticket", "Cabin", "Butter", "Shore", "Engine", "Arm", "Queen", "Letter", "Pole", "Grass"))) %>%
  ggplot(aes(x = Theta, y = i, linetype = Dependency)) +
  geom_line(size = 1.05) +
  geom_ribbon(aes(ymin = ymin, ymax = ymax), alpha = 0.25) +
  facet_wrap(~ Item, ncol = 5) +
  labs(title = "IICs for the 2PL",
       subtitle = "Each curve is based on the posterior median.",
       x = expression(theta~('ability on the logit scale')),
       y = "Information") +
  theme_classic()

IICpars[IICpars$Trial == 3, ] %>%
  mutate(Item = factor(Item, levels = c("Queen", "Grass", "Arm", "Cabin", "Pole", "Shore", "Butter", "Engine", "Ticket", "Letter"))) %>%
  ggplot(aes(x = Theta, y = i, linetype = Dependency)) +
  geom_line(size = 1.05) +
  geom_ribbon(aes(ymin = ymin, ymax = ymax), alpha = 0.25) +
  facet_wrap(~ Item, ncol = 5) +
  labs(title = "IICs for the 2PL",
       subtitle = "Each curve is based on the posterior median.",
       x = expression(theta~('ability on the logit scale')),
       y = "Information") +
  theme_classic()

# Test information: sum the item information values within each posterior draw,
# then summarise across draws
temp <- fixed %>%
  select(contains("Item")) %>%
  mutate(iter = 1:n()) %>%
  pivot_longer(starts_with(c("beta_", "logalpha_"))) %>%
  mutate(Item = str_remove(name, "^beta_Item|^logalpha_Item"),
         Parameter = ifelse(str_detect(name, "beta"), "Difficulty", "Discrimination")) %>%
  select(-name) %>%
  pivot_wider(names_from = Parameter, values_from = value) %>%
  expand(nesting(iter, Item, Difficulty, Discrimination),
         Theta = seq(-6, 6, length.out = 100)) %>%
  mutate(p = inv_logit_scaled(Difficulty + exp(Discrimination) * Theta)) %>%
  mutate(i = p * (1 - p)) %>%
  group_by(Theta, iter) %>%
  summarise(sum_i = sum(i)) %>%
  group_by(Theta) %>%
  summarise(ci = list(as.data.frame(posterior_summary(sum_i)) %>%
                        rename(i = Estimate, ymin = Q2.5, ymax = Q97.5))) %>%
  unnest(cols = c(ci)) %>%
  select(-Est.Error)

temp1 <- fixed1 %>%
  select(contains("Item")) %>%
  mutate(iter = 1:n()) %>%
  pivot_longer(starts_with(c("beta_", "logalpha_"))) %>%
  mutate(Item = str_remove(name, "^beta_Item|^logalpha_Item"),
         Parameter = ifelse(str_detect(name, "beta"), "Difficulty", "Discrimination")) %>%
  select(-name) %>%
  pivot_wider(names_from = Parameter, values_from = value) %>%
  expand(nesting(iter, Item, Difficulty, Discrimination),
         Theta = seq(-6, 6, length.out = 100)) %>%
  mutate(p = inv_logit_scaled(Difficulty + exp(Discrimination) * Theta)) %>%
  mutate(i = p * (1 - p)) %>%
  group_by(Theta, iter) %>%
  summarise(sum_i = sum(i)) %>%
  group_by(Theta) %>%
  summarise(ci = list(as.data.frame(posterior_summary(sum_i)) %>%
                        rename(i = Estimate, ymin = Q2.5, ymax = Q97.5))) %>%
  unnest(cols = c(ci)) %>%
  select(-Est.Error)

IICtest <- rbind(temp, temp1)
IICtest$Dependency <- rep(c("No Prior Recall", "Previous Recall"), each = 100)

IICtest %>%
  ggplot(aes(x = Theta, y = i, linetype = Dependency)) +
  geom_line(size = 1.05) +
  geom_ribbon(aes(ymin = ymin, ymax = ymax), alpha = 0.25) +
  labs(title = "Test Information for the 2PL",
       subtitle = "Curve is based on the posterior median.",
       x = expression(theta~('ability on the logit scale')),
       y = "Information") +
  theme_classic()
```

Within IRT, information and reliability are closely related: the conditional standard error of measurement is the inverse square root of the test information, so it is no surprise that the reliability and information plots share the same general shape, peaking slightly below average ability. As both sets of plots make clear, the CERAD word list is not ideal for measuring individuals with either very low or very high memory ability. In dementia and mild cognitive impairment research, this property may actually be desirable, as the goal is usually to characterize mildly impaired memory with relatively high precision and there is comparatively little need to measure the extremes carefully. The expected score function is a useful visual analogue for the relationship between raw score and latent trait estimate. That relationship is clearly monotonic and overall quite gradual, which means the raw scores span a relatively wide range of latent trait values.
diff --git a/eIRT-CERAD-main/README.md b/eIRT-CERAD-main/README.md
deleted file mode 100644
index 0e0a998b..00000000
--- a/eIRT-CERAD-main/README.md
+++ /dev/null
@@ -1,15 +0,0 @@

# Explanatory Item Response Theory Models of the CERAD List Learning Test

This repository serves as a central place to keep R scripts and supplementary materials for publications related to the application of explanatory item response theory models to the CERAD List Learning test. As projects are started and publications are made, this material will be updated to help readers keep track of files and related materials.

## Examining Word List Selection and Performance: An Explanatory Item Analysis of the CERAD Word List Learning Test

|Resource Available|Link|
|---|---|
|Primary Data Source|[Here](https://hrsdata.isr.umich.edu/data-products/2016-harmonized-cognitive-assessment-protocol-hcap?_ga=2.27455356.1307685611.1614482368-1012243465.1597251037)|
|Pre-registration|[Here](https://osf.io/pyd63)|
|Supplementary Material|[Here](https://github.com/w-goette/eIRT-CERAD/tree/main/ItemCovStudy/Markdown%20Documents)|
|R code|[Here](https://github.com/w-goette/eIRT-CERAD/tree/main/ItemCovStudy/R%20Script%20Files)|
|Open Science Framework Page|[Here](https://osf.io/bd8s9/)|
|Final Publication|TBD|

These materials correspond to the application of explanatory item response theory to the immediate recall trials of the CERAD List Learning test. Included among these materials are pre-registration and scientific transparency resources. Unfortunately, the data used for these analyses cannot be released without appropriate data requests and the signing of data use agreements; the link above provides access to the primary data source for those interested in obtaining the data. With this exception, the goal of all other materials is to ensure reproducibility of all results and full transparency of all modeling methods and results. Toward that end, Markdown documents (saved as .html files) are available under Supplementary Material to provide an extensive review of all results and methods. Additionally, all R code used to clean the original data files, code and test the models, and perform post-processing is provided through the table above. That same section also includes the .rmd files used to create the Markdown documents so that readers can access that information directly. The OSF link is an alternative storage location for the R files that is slightly more harmonious with the AsPredicted pre-registration but is otherwise redundant with this GitHub page.