---
title: "Statistical assignment 3"
author: "[add your name here] [add your candidate number here - mandatory]"
date: "[add date here]"
output: github_document
---
```{r setup, include=FALSE}
# Please note these options.
# This tells R Markdown that we want to show code in the output document.
knitr::opts_chunk$set(echo = TRUE)
# Switching off messages in the output document.
knitr::opts_chunk$set(message = FALSE)
# Switching on caching to make things faster (don't commit cache files to GitHub).
knitr::opts_chunk$set(cache = TRUE)
```
In this assignment we will explore political interest (*vote6*) and how it changes over time.
## Read data
First we want to read and join the data for the first 7 waves of Understanding Society. (Wave 8 does not have a variable for political interest.) We only want five variables: personal identifier, sample origin, sex, age and political interest. It is tedious to join all seven waves manually, so it makes sense to use a loop in this case. Since you don't yet know about iteration, I'll provide the code for you; please see the explanation of the code here: http://abessudnov.net/dataanalysis3/iteration.html.
The only thing you need to do for this code to work on your computer is to provide a path to the directory where the data are stored on your computer.
```{r}
library(tidyverse)
library(data.table)
# data.table is faster than readr, so we use its fread() function here.
# You need to install the package before you can run this code.
# create a vector with the file names and paths
files <- dir(
  # Select the folder where the files are stored.
  "[add your path here]",
  # Tell R which pattern you want present in the files it will display.
  pattern = "indresp",
  # We want this process to repeat through the entire folder.
  recursive = TRUE,
  # And finally want R to show us the entire file path, rather than just
  # the names of the individual files.
  full.names = TRUE)
# Select only files from the UKHLS.
files <- files[stringr::str_detect(files, "ukhls")]
files
# create a vector of variable names
vars <- c("memorig", "sex_dv", "age_dv", "vote6")
for (i in 1:7) {
  # Create a vector of the variables with the correct prefix.
  varsToSelect <- paste(letters[i], vars, sep = "_")
  # Add pidp to this vector (no prefix for pidp)
  varsToSelect <- c("pidp", varsToSelect)
  # Now read the data.
  data <- fread(files[i], select = varsToSelect)
  if (i == 1) {
    all7 <- data
  } else {
    all7 <- full_join(all7, data, by = "pidp")
  }
  # Now we can remove data to free up the memory.
  rm(data)
}
```
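As a small illustration of the naming step inside the loop, the sketch below shows how `paste()` and the built-in `letters` constant build the prefixed variable names for one wave. The object name `wave2_vars` is introduced here only for the example.

```r
# Illustration only: how the wave prefix is attached to the variable names.
vars <- c("memorig", "sex_dv", "age_dv", "vote6")

# letters is a built-in constant; letters[2] is "b", the prefix for wave 2.
# pidp has no prefix, so it is added separately.
wave2_vars <- c("pidp", paste(letters[2], vars, sep = "_"))
wave2_vars
```

The loop does the same for `i` from 1 to 7, so wave 1 uses the prefix `a_` and wave 7 uses `g_`.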
## Reshape data (20 points)
Now we have got the data from all 7 waves in the same data frame **all7** in the wide format. Note that the panel is unbalanced, i.e. we included all people who participated in at least one wave of the survey. Reshape the data to the long format. The resulting data frame should have six columns for six variables.
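One possible way to think about this reshape is shown below on a toy data frame. This is only a sketch: `toy_wide` and its values are made up for illustration, and the approach (a single `pivot_longer()` call with `names_pattern`) is one option among several, not necessarily the intended solution.

```r
library(tidyr)

# Hypothetical toy data: two people, two waves, two wave-specific variables.
toy_wide <- data.frame(
  pidp     = c(1, 2),
  a_sex_dv = c(1, 2),
  a_vote6  = c(3, 1),
  b_sex_dv = c(1, 2),
  b_vote6  = c(4, NA)
)

# Split each column name at the first underscore: the first piece becomes the
# wave, and the rest (".value") becomes the name of the output column.
toy_long <- pivot_longer(
  toy_wide,
  cols = -pidp,
  names_to = c("wave", ".value"),
  names_pattern = "^([a-z])_(.+)$"
)
toy_long
```

Because variable names such as `sex_dv` contain an underscore themselves, a plain `names_sep = "_"` would split the names into too many pieces; the pattern above splits only at the first underscore.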
```{r}
Long <- all7 %>%
...
Long
```
## Filter and recode (20 points)
Now we want to filter the data, keeping only respondents from the original UKHLS sample for Great Britain (memorig == 1). We also want to clean the variables for sex (recoding it to "male" or "female") and political interest (keeping the values from 1 to 4 and coding all negative values as missing). Tabulate *sex_dv* and *vote6* to make sure your recodings were correct.
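The sketch below shows the general pattern on made-up data. The numeric codes used for sex (1 = male, 2 = female) are an assumption that you should check against the Understanding Society codebook, and `toy` and `toy_clean` are hypothetical names.

```r
library(dplyr)

# Hypothetical toy data; the codes are assumptions to check against the
# codebook (1 = male, 2 = female; negative values = missing).
toy <- data.frame(
  memorig = c(1, 1, 2, 1),
  sex_dv  = c(1, 2, 1, 0),
  vote6   = c(2, -9, 3, 4)
)

toy_clean <- toy %>%
  filter(memorig == 1) %>%
  mutate(
    sex_dv = case_when(
      sex_dv == 1 ~ "male",
      sex_dv == 2 ~ "female",
      TRUE ~ NA_character_
    ),
    # Negative codes become NA; the valid values 1 to 4 are kept.
    vote6 = ifelse(vote6 < 0, NA, vote6)
  )

# Tabulate (missing values are shown as NA) to check the recoding.
count(toy_clean, sex_dv)
count(toy_clean, vote6)
```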
```{r}
Long <- Long %>%
  filter(...) %>%
  mutate(sex_dv = ...) %>%
  mutate(vote6 = ...)
Long %>%
...
Long %>%
...
```
## Calculate mean political interest by sex and wave (10 points)
Political interest is an ordinal variable, but we will treat it as interval and calculate mean political interest for men and women in each wave.
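A grouped summary of this kind can be sketched as follows on toy data. The data frame `toy_long` and its values are invented for the example; note that `na.rm = TRUE` is needed because the mean of a vector containing `NA` is `NA`.

```r
library(dplyr)

# Hypothetical long-format data with one missing value.
toy_long <- data.frame(
  wave   = c("a", "a", "a", "b", "b", "b"),
  sex_dv = c("male", "female", "female", "male", "female", "female"),
  vote6  = c(2, 3, NA, 4, 1, 3)
)

toy_means <- toy_long %>%
  filter(!is.na(sex_dv)) %>%
  group_by(sex_dv, wave) %>%
  summarise(meanVote6 = mean(vote6, na.rm = TRUE), .groups = "drop")
toy_means
```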
```{r}
meanVote6 <- Long %>%
...
meanVote6
```
## Reshape the data frame with summary statistics (20 points)
Your resulting data frame with the means is in the long format. Reshape it to the wide format. It should look like this:
| sex_dv | a | b | c | d | e | f | g |
|--- |--- |--- |--- |--- |--- |--- |--- |
| female | | | | | | | |
| male | | | | | | | |
In the cells of this table you should have mean political interest by sex and wave.
Write a short interpretation of your findings.
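As a rough sketch of the long-to-wide step, the toy example below spreads the wave letters into columns with `pivot_wider()`. The numbers are arbitrary placeholders, not results from the survey, and `toy_means` is a hypothetical object.

```r
library(tidyr)

# Hypothetical summary table in the long format (values are placeholders).
toy_means <- data.frame(
  sex_dv    = c("female", "female", "male", "male"),
  wave      = c("a", "b", "a", "b"),
  meanVote6 = c(2.6, 2.7, 2.4, 2.5)
)

# One row per sex, one column per wave.
toy_wide <- pivot_wider(toy_means, names_from = wave, values_from = meanVote6)
toy_wide
```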
```{r}
meanVote6 %>%
...
```
## Estimate stability of political interest (30 points)
Political scientists have been arguing about how stable the level of political interest is over the life course. Imagine someone who is not interested in politics at all so that their value of *vote6* is always 4. Their level of political interest is very stable over time, as stable as the level of political interest of someone who is always very interested in politics (*vote6* = 1). On the other hand, imagine someone who changes their value of *vote6* from 1 to 4 and back every other wave. Their level of political interest is very unstable.
Let us introduce a measure of the stability of political interest, equal to the sum of the absolute values of the changes in political interest from wave to wave. Let us call this measure Delta. It is difficult for me to typeset a mathematical formula in Markdown, so I'll explain this informally.
Imagine a person with the level of political interest that is constant over time: {1, 1, 1, 1, 1, 1, 1}. For this person, Delta is zero.
Now imagine a person who changes once from "very interested in politics" to "fairly interested in politics": {1, 1, 1, 1, 2, 2, 2}. For them, Delta = abs(1 - 1) + abs(1 - 1) + abs(1 - 1) + abs(2 - 1) + abs(2 - 2) + abs(2 - 2) = 1.
Now imagine someone who changes from "very interested in politics" to "not at all interested" every other wave: {1, 4, 1, 4, 1, 4, 1}. Delta = abs(4 - 1) + abs(1 - 4) + abs(4 - 1) + abs(1 - 4) + abs(4 - 1) + abs(1 - 4) = 6 * 3 = 18.
Large Delta indicates unstable political interest. Delta = 0 indicates a constant level of political interest.
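For readers who prefer notation, the informal definition can be written as below. The symbol $x_{i,t}$ (the value of *vote6* for person $i$ in wave $t$, with $t = 1, \dots, 7$) is notation introduced here for illustration only.

$$
\Delta_i = \sum_{t=2}^{7} \left| x_{i,t} - x_{i,t-1} \right|
$$

With seven waves there are six wave-to-wave changes, and each change is at most 3 on the 1 to 4 scale, so $0 \le \Delta_i \le 18$.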
Write the R code that does the following.
1. To simplify interpretation, keep only the respondents with non-missing values for political interest in all seven waves.
2. Calculate Delta for each person in the data set.
3. Calculate mean Delta for men and women.
4. Calculate mean Delta by age (at wave 1) and plot the local polynomial curve showing the association between age at wave 1 and mean Delta. You can use either **ggplot2** or the *scatter.smooth()* function from base R.
5. Write a short interpretation of your findings.
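The calculation of Delta for a single sequence can be sketched with base R's `diff()`, and then applied per person with **dplyr**. This is a minimal illustration on made-up data, not the full solution; `delta()`, `toy` and `toy_delta` are hypothetical names, and the function assumes the values are ordered by wave with no missing values (step 1).

```r
library(dplyr)

# Delta for one person's wave-ordered sequence of vote6 values:
# the sum of absolute wave-to-wave changes.
delta <- function(x) sum(abs(diff(x)))

# The three examples from the text.
delta(c(1, 1, 1, 1, 1, 1, 1))  # 0
delta(c(1, 1, 1, 1, 2, 2, 2))  # 1
delta(c(1, 4, 1, 4, 1, 4, 1))  # 18

# Hypothetical long-format data: two people, three waves each.
toy <- data.frame(
  pidp  = rep(c(1, 2), each = 3),
  wave  = rep(c("a", "b", "c"), times = 2),
  vote6 = c(1, 1, 2, 4, 1, 4)
)

# Sort by wave within person before taking differences.
toy_delta <- toy %>%
  arrange(pidp, wave) %>%
  group_by(pidp) %>%
  summarise(Delta = delta(vote6), .groups = "drop")
toy_delta
```

Sorting by wave before calling `delta()` matters, because `diff()` takes differences between neighbouring rows in whatever order they appear.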
```{r}
...
```