---
title: "Data wrangling for psychometrics in R"
subtitle: "R code examples for psychometricians"
author: 
  name: 'Magnus Johansson'
  affiliation: 'RISE Research Institutes of Sweden'
  affiliation-url: 'https://www.ri.se/en/shic'
  orcid: '0000-0003-1669-592X'
date: 2024-11-01
date-format: iso
execute: 
  cache: false
  warning: false
  message: false
editor_options: 
  chunk_output_type: console
---

While the [`easyRasch` R package](https://pgmj.github.io/easyRasch/) simplifies the process of doing Rasch analysis in R, users still need to be able to import their data and often modify it. This demands some knowledge of data wrangling in R. For a comprehensive treatment of this topic, please see the ["R for data science"](https://r4ds.had.co.nz/) book. This guide focuses on common data wrangling issues that occur in psychometric analyses.

I will rely heavily on `library(tidyverse)` functions in most examples.

This page will be updated intermittently. The planned content includes:

- dividing a set of items into two or more dataframes in order to run separate analyses (for instance based on an exploratory analysis of a large item set, or based on a demographic variable)
- recoding/merging response categories
- splitting an item into two based on a DIF variable
- merging two items into a super-item/testlet

## Removing respondents with missing data

There may be situations where you want to specify a minimum number of items that a respondent must have answered in order to be included in the analysis. This is done by using the `filter()` function from `library(dplyr)`. The syntax is as follows:

``` r
min.responses <- 5 # set the minimum number of responses

df2 <- df %>% 
  filter(length(itemlabels$itemnr) - rowSums(is.na(.[itemlabels$itemnr])) >= min.responses)
```

The object `itemlabels$itemnr` is a vector (a column in a dataframe) holding the short item labels as they are used in the dataframe `df`, which contains the data with items as columns named by the corresponding labels.

The `length()` function counts the number of items in the vector, and `rowSums(is.na())` counts the number of missing values in each row. Subtracting the second from the first gives the number of items each respondent has answered. The `>=` operator compares that number to the minimum specified in the object `min.responses`. The `filter()` function removes all rows that do not meet the criterion, and the new dataset is saved to `df2`.
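
To make the logic concrete, here is a minimal sketch on a made-up dataset. The data, the `id` column, and the helper object `n.responses` are assumptions for illustration only. It counts answered items directly with `!is.na()`, which gives the same number as the subtraction described above.

``` r
library(dplyr)

# Hypothetical toy data: 4 respondents, 3 items (q1-q3), plus an id column
df <- tibble(
  id = 1:4,
  q1 = c(0, 1, NA, NA),
  q2 = c(2, NA, NA, 1),
  q3 = c(1, 0, NA, 2)
)
itemlabels <- tibble(itemnr = c("q1", "q2", "q3"))

min.responses <- 2 # set the minimum number of responses

# Number of answered items per respondent, using only the item columns
n.responses <- unname(rowSums(!is.na(df[itemlabels$itemnr])))

# Keep respondents who answered at least min.responses items
df2 <- df %>%
  filter(n.responses >= min.responses)
```

Respondent 3, who has no responses at all, is dropped, while respondents with exactly two answered items are kept.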

## Dividing a dataset
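
A minimal sketch of two ways a dataset could be divided, assuming a made-up dataframe `df` with four items and a gender vector `dif.gender` of the same length as the number of rows. The object names `df.a`, `df.b`, and `df.by.gender` are hypothetical.

``` r
library(dplyr)

# Hypothetical toy data: 4 respondents, 4 items
df <- tibble(
  q1 = c(0, 1, 2, 1),
  q2 = c(1, 1, 0, 2),
  q3 = c(2, 0, 1, 1),
  q4 = c(0, 2, 2, 1)
)
dif.gender <- factor(c("Female", "Male", "Female", "Male"))

# Divide by items (columns), e.g. based on an exploratory analysis
df.a <- df %>% select(q1, q2)
df.b <- df %>% select(q3:q4)

# Divide by respondents (rows) based on a demographic variable;
# the result is a named list with one dataframe per factor level
df.by.gender <- split(df, dif.gender)
```

Each element of the list can then be analyzed separately, for instance `df.by.gender$Female`.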

## Recoding response categories

Many options are available. I have settled on primarily using `car::recode()` from `library(car)` since I find the syntax to be logical and consistent. Another option to consider is `dplyr::case_when()`.

The basic syntax of `car::recode()` (henceforth referred to as only `recode()`) is this:

``` r
recode(variable_to_recode, "oldvalue1=newvalue1;oldvalue2=newvalue2", as.factor = FALSE/TRUE)
```

As you can see, a semicolon `;` separates the recodings, and each recoding is written as old value `=` new value. When numbers are recoded, you just write them out as `1=0`. When characters are involved, you need to enclose the character string in single quotes, e.g. `'Never'=0` for recoding to numerics (which also necessitates the option `as.factor = FALSE` if you want the recoded variable to be numeric), or `'Nevver'='Never'` when you need to correct misspelled response options.
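
For comparison, here is a sketch of the same kind of recoding with `dplyr::case_when()`, the alternative mentioned above. The response vector is made up for illustration, and the misspelled label is folded into the correct category in the same step.

``` r
library(dplyr)

# Hypothetical response vector with character labels, one misspelling, one NA
responses <- c("Never", "Sometimes", "Often", "Nevver", NA)

recoded <- case_when(
  responses %in% c("Never", "Nevver") ~ 0, # misspelling treated as "Never"
  responses == "Sometimes" ~ 1,
  responses == "Often" ~ 2,
  TRUE ~ NA_real_ # anything else, including NA, becomes missing
)
```

Unlike `recode()`, values that match no condition do not keep their original value here. They become `NA` because of the final catch-all line.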

::: {.callout-tip}
Remember to define your preferred recoding function in your script to avoid headaches due to unexpected error messages. This is done by using the assignment operator `<-`. Example below.

`recode <- car::recode`
:::

There are some more or less clever ways to use `recode()`. You can simply copy and paste code for each variable, such as:

``` r
df$q1 <- recode(df$q1, "1=0;2=1;3=2", as.factor = FALSE)
df$q2 <- recode(df$q2, "1=0;2=1;3=2", as.factor = FALSE)
df$q3 <- recode(df$q3, "1=0;2=1;3=2", as.factor = FALSE)
df$q4 <- recode(df$q4, "1=0;2=1;3=2", as.factor = FALSE)
df$q5 <- recode(df$q5, "1=0;2=1;3=2", as.factor = FALSE)
```

Such an approach might actually be sensible if each variable needed different recodings. But when you want to make the same changes to many variables, there are of course more efficient strategies. The code above could be rewritten as:

``` r
df <- df %>% 
  mutate(across(q1:q5, ~ car::recode(.x, "1=0;2=1;3=2", as.factor = FALSE)))
```

Or, in this case, where we only want to recode our scale from the range of 1-3 to 0-2, i.e. subtract one:

``` r
df <- df %>% 
  mutate(across(q1:q5, ~ .x - 1))
```

The syntax in these two examples relies on how `tidyverse` writes anonymous functions with the combination of `~` and `.x`, where the latter is a placeholder for each of the variables selected in the first argument of `across()`.
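
As a small illustration on made-up data, the `~ .x` shorthand gives the same result as an ordinary anonymous function written with `function(x)`. The dataframe and the object names `a` and `b` are assumptions for this sketch.

``` r
library(dplyr)

# Hypothetical toy data: two items scored 1-3 and one non-item column
df <- tibble(
  q1 = c(1, 2, 3),
  q2 = c(3, 3, 1),
  age = c(25, 40, 31)
)

# Formula shorthand: .x stands for each selected column in turn
a <- df %>% mutate(across(q1:q2, ~ .x - 1))

# The same operation with an explicit anonymous function
b <- df %>% mutate(across(q1:q2, function(x) x - 1))
```

Only the columns selected by `q1:q2` are changed. The `age` column is left as it was.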

It is good practice to check that your recoding worked as intended. The first step I recommend is `RItileplot()`.

``` r
df %>% 
  mutate(across(q1:q5, ~ car::recode(.x, "1=0;2=1;3=2", as.factor = FALSE))) %>% 
  RItileplot()
```

If you are trying out different ways to merge response categories to resolve issues with disordered thresholds, you may also want to review the probability curves before committing your recode to a new data object.

``` r
df %>% 
  mutate(across(q1:q5, ~ car::recode(.x, "1=0;2=1;3=2", as.factor = FALSE))) %>% 
  RIitemCats(legend = "left")
```

Below is an example where we have multiple cell contents that we want to recode to `NA` to ensure that R interprets them as missing data. We want to recode three different things:

- all the numbers from 990 to 999 (usually a way to differentiate between types of missing data)
- blank cells
- "Don't know" responses

``` r
# "Don't know" contains an apostrophe, so it is enclosed in escaped double quotes
df$q45 <- recode(df$q45, "990:999=NA;''=NA;\"Don't know\"=NA")
```

The `:` means that all numbers from 990 to 999 will be recoded into `NA`.
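
If you want to double-check the result, or prefer to avoid `recode()` for this step, here is a sketch of a tidyverse alternative on a made-up character vector. It does not use `car::recode()`, and the object names `q45`, `missing.codes`, `q45.clean`, and `q45.num` are assumptions for illustration.

``` r
library(dplyr)

# Hypothetical raw item as imported: everything is read as character
q45 <- c("1", "3", "995", "", "Don't know", "2")

# All values that should be treated as missing
missing.codes <- c(as.character(990:999), "", "Don't know")

# Replace the missing codes with NA, then convert to numeric
q45.clean <- if_else(q45 %in% missing.codes, NA_character_, q45)
q45.num <- as.numeric(q45.clean)

# Tabulate the result, including the number of missing values
table(q45.num, useNA = "ifany")
```

Tabulating with `useNA = "ifany"` is a quick way to confirm that only valid response categories remain.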

## Item split

It can be desirable to split an item due to issues with DIF. This refers to taking a variable and creating two (or more) replacement variables, one for each demographic group. For each variable, data will be missing for the other group. Often this is relevant for gender DIF, which we will use in this example.

This example assumes that there is a (DIF) vector variable `dif.gender` with the gender data, which has the same length as the number of rows in the dataset `df`. We'll create a new dataframe to store the dataset with the item split. Item q10 is the one we want to split.

``` r
df.q10split <- df %>% 
  add_column(gender = dif.gender) %>% 
  mutate(q10f = if_else(gender == "Female", q10, NA), # create variable q10f when gender is "Female"
         q10m = if_else(gender == "Male", q10, NA)
         ) %>% 
  select(!gender) %>% # remove gender and q10 variables from the dataset
  select(!q10)

# check the data
RItileplot(df.q10split)
```

The `if_else()` function used within `mutate()` takes three arguments in the example above:

- the condition (a logical statement)
- the value to assign if the condition is true
- the value to assign if the condition is false
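
A minimal sketch on made-up data shows what the split produces. The toy dataframe and gender vector are assumptions, and `NA_real_` is used instead of `NA` only so that the sketch also runs on older `dplyr` versions that require matching types.

``` r
library(dplyr)

# Hypothetical toy data: 4 respondents, items q9 and q10
df <- tibble(
  q9 = c(0, 1, 2, 1),
  q10 = c(2, 0, 1, 1)
)
dif.gender <- factor(c("Female", "Male", "Male", "Female"))

df.q10split <- df %>%
  mutate(
    q10f = if_else(dif.gender == "Female", q10, NA_real_), # Female responses only
    q10m = if_else(dif.gender == "Male", q10, NA_real_)    # Male responses only
  ) %>%
  select(!q10) # remove the original item
```

Each respondent now has a value in exactly one of `q10f` and `q10m`, and a missing value in the other.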

## Item merge

Sometimes it is desirable to merge two items into one, typically when there is a high residual correlation between them. The merged item is called a super-item or testlet. The items are merged into a new variable, and the original items are removed from the dataset.

It is usually a good idea to create a new dataframe with the merged variable, in case you need to go back to the original dataset.

``` r
df2 <- df %>% 
  mutate(sdq2_15 = sdq2 + sdq15) %>% # create variable by adding them
  select(!sdq2) %>% 
  select(!sdq15)
```

Then you should check the category probability curves for the merged item:

``` r
RIitemCats(df2, item = "sdq2_15")
```
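
To see what the merge does to the scores, here is a sketch on made-up data, assuming both items are scored 0-2. The dataframe is hypothetical. The two `select()` calls are combined into one here, which is only a stylistic choice.

``` r
library(dplyr)

# Hypothetical toy data: two items to merge and one unrelated item
df <- tibble(
  sdq2 = c(0, 1, 2, NA),
  sdq15 = c(2, 2, 0, 1),
  sdq3 = c(1, 0, 1, 2)
)

df2 <- df %>%
  mutate(sdq2_15 = sdq2 + sdq15) %>% # sum score of the two items
  select(!c(sdq2, sdq15))            # drop the original items

# Inspect the distribution of the merged item, including missing values
table(df2$sdq2_15, useNA = "ifany")
```

Two things are worth noting: the merged item has more response categories than either original item (0-4 when both are scored 0-2), and a respondent with a missing value on either item gets `NA` on the merged item.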

## Checking response distribution prior to DIF analysis

This example assumes that you have previously created a DIF variable (vector) for gender with two groups. The variable is a factor with the labels "Female" and "Male".

The purpose is to make sure that there are no empty cells (particularly in lower response categories) in either group prior to running the DIF analysis, since this could lead to DIF being indicated incorrectly.

``` r
difGenderTileplots <- df %>% 
  add_column(gender = dif.gender) %>% 
  split(.$gender) %>%
  imap(~ RItileplot(.x %>% select(!gender)) + labs(title = .y)) # .y is the group label
  
library(patchwork)
difGenderTileplots$Female + difGenderTileplots$Male
```

The last part makes the tile plots show up side by side, each labeled with the gender variable as it is coded in the dataset. You will need to adapt the code to the factor labels in your own dataset.
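
As a numeric complement to the tile plots, here is a sketch that counts responses per item, category, and group and lists the empty cells. The toy data and the object names `cell.counts` and `empty.cells` are assumptions, and it uses `tidyr` functions rather than anything from `easyRasch`.

``` r
library(dplyr)
library(tidyr)

# Hypothetical toy data: 6 respondents, 2 items scored 0-2
df <- tibble(
  q1 = c(0, 1, 2, 1, 0, 2),
  q2 = c(1, 1, 2, 2, 1, 2)
)
dif.gender <- factor(c("Female", "Female", "Female", "Male", "Male", "Male"))

cell.counts <- df %>%
  mutate(gender = dif.gender) %>%
  pivot_longer(!gender, names_to = "item", values_to = "category") %>%
  count(item, category, gender) %>%
  # add the combinations that never occur, with a count of zero
  complete(item, category, gender, fill = list(n = 0L))

# Item/category/group combinations without any responses
empty.cells <- cell.counts %>%
  filter(n == 0)
```

In this toy example, category 0 of `q2` is empty in both groups. Note that `complete()` only knows about categories that occur somewhere in the data, so a category that no one used on any item will not show up.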

## Links

<https://github.com/Cghlewis/data-wrangling-functions/wiki>