---
title: "Statistical assignment 4"
author: "Jessica Ledger -- 660013603"
date: "26/02/2020"
output: github_document
---
```{r setup, include=FALSE}
# Please note these options.
# This tells R Markdown that we want to show code in the output document.
knitr::opts_chunk$set(echo = TRUE)
# Switching off messages in the output document.
knitr::opts_chunk$set(message = FALSE)
knitr::opts_chunk$set(warning = FALSE)
# Switching on caching to make things faster (don't commit cache files on Github).
knitr::opts_chunk$set(cache = TRUE)
```
In this assignment you will need to reproduce 5 ggplot graphs. I supply graphs as images; you need to write the ggplot2 code to reproduce them and knit and submit a Markdown document with the reproduced graphs (as well as your .Rmd file).
First we will need to open and recode the data. I supply the code for this; you only need to change the file paths.
```{r}
library(tidyverse)
Data8 <- read_tsv("/Users/jessicaledger/Desktop/WORK/THIRD YEAR/Data Analysis III/data2020/data/UKDA-6614-tab/tab/ukhls_w8/h_indresp.tab")
Data8 <- Data8 %>%
  select(pidp, h_age_dv, h_payn_dv, h_gor_dv)
Stable <- read_tsv("/Users/jessicaledger/Desktop/WORK/THIRD YEAR/Data Analysis III/data2020/data/UKDA-6614-tab/tab/ukhls_wx/xwavedat.tab")
Stable <- Stable %>%
  select(pidp, sex_dv, ukborn, plbornc)
Data <- Data8 %>% left_join(Stable, "pidp")
rm(Data8, Stable)
Data <- Data %>%
  mutate(sex_dv = ifelse(sex_dv == 1, "male",
                         ifelse(sex_dv == 2, "female", NA))) %>%
  mutate(h_payn_dv = ifelse(h_payn_dv < 0, NA, h_payn_dv)) %>%
  mutate(h_gor_dv = recode(h_gor_dv,
                           `-9` = NA_character_,
                           `1` = "North East",
                           `2` = "North West",
                           `3` = "Yorkshire",
                           `4` = "East Midlands",
                           `5` = "West Midlands",
                           `6` = "East of England",
                           `7` = "London",
                           `8` = "South East",
                           `9` = "South West",
                           `10` = "Wales",
                           `11` = "Scotland",
                           `12` = "Northern Ireland")) %>%
  mutate(placeBorn = case_when(
    ukborn == -9 ~ NA_character_,
    ukborn < 5 ~ "UK",
    plbornc == 5 ~ "Ireland",
    plbornc == 18 ~ "India",
    plbornc == 19 ~ "Pakistan",
    plbornc == 20 ~ "Bangladesh",
    plbornc == 10 ~ "Poland",
    plbornc == 27 ~ "Jamaica",
    plbornc == 24 ~ "Nigeria",
    TRUE ~ "other")
  )
```
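As a quick sanity check of the recoding logic above, the sketch below applies the same `case_when()` ordering to a small made-up table (`toy` and its values are hypothetical, not taken from the survey, and only two of the country codes are included). Because `case_when()` uses the first condition that matches, the `-9` check has to come before `ukborn < 5`, which would otherwise also match `-9`.

```{r}
library(dplyr)

# Hypothetical rows for illustration only; not real survey data.
toy <- data.frame(
  ukborn  = c(1, 5, 5, -9, 5),
  plbornc = c(-8, 18, 10, -9, 99)
)

toy <- toy %>%
  mutate(placeBorn = case_when(
    ukborn == -9  ~ NA_character_,  # missing value code, checked first
    ukborn < 5    ~ "UK",           # same rule as in the chunk above
    plbornc == 18 ~ "India",
    plbornc == 10 ~ "Poland",
    TRUE          ~ "other"         # everything else falls through
  ))

toy
```

Swapping the first two conditions would label the `-9` row as "UK", which is why the order matters.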
Reproduce the following graphs as closely as you can. For each graph, write two sentences (not more!) describing its main message.
1. Univariate distribution (20 points).
```{r}
Data %>%
  ggplot(mapping = aes(x = h_payn_dv)) +
  geom_freqpoly() +
  xlab("Net Monthly Pay") +
  ylab("No. of Respondents")
```
This graph shows that the most common net monthly pay among respondents is around £1,400. Beyond this peak the number of respondents falls sharply as pay rises, with a small secondary peak at about £5,500.
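One way to check the location of the main peak numerically is to bin the values and find the fullest bin. The sketch below does this in base R on made-up pay values (`toy_pay` and the £500 bin width are assumptions for illustration, not survey results).

```{r}
# Hypothetical pay values for illustration only.
toy_pay <- c(900, 1200, 1350, 1400, 1450, 1500, 1600, 2100, 2800, 5500)

# Count values in £500-wide bins without drawing a plot.
bins <- hist(toy_pay, breaks = seq(0, 6000, by = 500), plot = FALSE)

# Midpoint of the bin holding the most observations.
modal_mid <- bins$mids[which.max(bins$counts)]
modal_mid
```

The reported peak depends on the bin width, so it is an approximate location rather than an exact value.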
2. Line chart (20 points). The lines show the non-parametric association between age and monthly earnings for men and women.
```{r}
Data %>%
  ggplot(mapping = aes(x = h_age_dv, y = h_payn_dv, linetype = sex_dv)) +
  geom_smooth(colour = "black") +
  xlim(15, 65) +
  xlab("Age") +
  ylab("Monthly earnings") +
  labs(linetype = "Sex")
```
This graph shows that until about age 22 both sexes follow the same upward trajectory in net monthly earnings. After this point, earnings growth slows dramatically for women compared with men, widening the gap, and earnings decline for both sexes after the age of 50.
3. Faceted bar chart (20 points).
```{r}
BySex <- Data %>%
  group_by(sex_dv, placeBorn) %>%
  summarise(medianpay = median(h_payn_dv, na.rm = TRUE)) %>%
  filter(!is.na(sex_dv)) %>%
  filter(!is.na(placeBorn))
BySex %>%
  ggplot(mapping = aes(x = sex_dv, y = medianpay)) +
  # geom_col() draws bars whose heights are the values already in the data
  geom_col() +
  facet_wrap(~ placeBorn, ncol = 3) +
  ylim(0, 2000) +
  xlab("Sex") +
  ylab("Median Monthly Net Pay")
```
This chart shows that, for every country of birth, men have a higher median net monthly pay than women. The size of the gap varies by country of birth: the largest difference is seen for Ireland and, interestingly, the smallest for Bangladesh.
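The size of the gap can also be computed directly instead of being read off the bars. The sketch below reshapes a made-up table of medians so that each country of birth has one row, then subtracts (`toy_medians` and its pay values are hypothetical and are not the survey medians).

```{r}
library(dplyr)
library(tidyr)

# Hypothetical medians for illustration only.
toy_medians <- data.frame(
  sex_dv    = rep(c("female", "male"), times = 2),
  placeBorn = rep(c("Ireland", "Bangladesh"), each = 2),
  medianpay = c(1300, 1900, 1000, 1100)
)

gaps <- toy_medians %>%
  # one column per sex, one row per country of birth
  pivot_wider(names_from = sex_dv, values_from = medianpay) %>%
  mutate(gap = male - female) %>%
  arrange(desc(gap))

gaps
```

With the real data the same steps could be applied to a summary table such as `BySex`.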
4. Heat map (20 points).
```{r}
library(tidyr)
ByRegion <- Data %>%
  group_by(h_gor_dv, placeBorn) %>%
  summarise(Meanage = mean(h_age_dv, na.rm = TRUE)) %>%
  filter(!is.na(h_gor_dv)) %>%
  filter(!is.na(placeBorn))
ByRegion %>%
  ggplot(mapping = aes(x = h_gor_dv, y = placeBorn, fill = Meanage)) +
  geom_tile() +
  xlab("Region") +
  ylab("Country of birth") +
  labs(fill = "Mean age") +
  theme(axis.text.x = element_text(angle = 90))
```
This heat map shows the mean age of respondents in each UK region by country of birth. The map indicates that the mean age across the regions is in the 60-70 range. There is a noticeably low mean age for Nigerian-born respondents in Scotland and Yorkshire, but this may be due to a small sample size. There are also quite a few NAs across the map, especially in Northern Ireland.
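The small-sample caveat can be checked by counting respondents in each tile alongside the mean. The sketch below does this on a made-up table (`toy_cells`, its values, and the cut-off of 2 are assumptions for illustration only).

```{r}
library(dplyr)

# Hypothetical respondents for illustration only.
toy_cells <- data.frame(
  h_gor_dv  = c("Scotland", "Scotland", "London", "London", "London"),
  placeBorn = c("Nigeria", "UK", "Nigeria", "Nigeria", "UK"),
  h_age_dv  = c(24, 61, 35, 42, 58)
)

cell_sizes <- toy_cells %>%
  group_by(h_gor_dv, placeBorn) %>%
  summarise(Meanage = mean(h_age_dv, na.rm = TRUE),
            n = n()) %>%   # number of respondents behind each tile
  ungroup()

# Tiles whose mean rests on fewer than 2 respondents.
small_cells <- cell_sizes %>% filter(n < 2)
small_cells
```

A tile based on only a handful of respondents can show an extreme mean by chance, so a larger cut-off would be appropriate for the real data.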
5. Population pyramid (20 points).
```{r}
byPop <- Data %>%
  group_by(sex_dv, h_age_dv) %>%
  filter(!is.na(sex_dv)) %>%
  filter(!is.na(h_age_dv)) %>%
  count(sex_dv, h_age_dv) # creates n column
# Make male counts negative so they are drawn on the opposite side of zero.
byPop$n <- ifelse(byPop$sex_dv == "male", -1 * byPop$n, byPop$n)
byPop %>%
  ggplot(mapping = aes(x = h_age_dv, y = n, fill = sex_dv)) +
  geom_bar(data = subset(byPop, sex_dv == "female"), stat = "identity", colour = "red") +
  geom_bar(data = subset(byPop, sex_dv == "male"), stat = "identity", colour = "blue") +
  coord_flip() +
  xlab("Age") +
  labs(fill = "Sex")
```
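Because the male counts are stored as negative numbers, the count axis shows negative tick labels. A minimal sketch of one way to display them as positive on both sides, using made-up counts (`toy_pop` and its values are hypothetical), is to pass `abs` as the label function:

```{r}
library(ggplot2)

# Hypothetical counts for illustration only.
toy_pop <- data.frame(
  h_age_dv = rep(c(20, 40, 60), times = 2),
  sex_dv   = rep(c("female", "male"), each = 3),
  n        = c(30, 45, 25, 28, 40, 20)
)

# Mirror the male counts to the other side of zero, as in the chunk above.
toy_pop$n <- ifelse(toy_pop$sex_dv == "male", -toy_pop$n, toy_pop$n)

pyramid <- ggplot(toy_pop, aes(x = h_age_dv, y = n, fill = sex_dv)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = abs) +  # label ticks with absolute values
  xlab("Age") +
  ylab("n") +
  labs(fill = "Sex")

pyramid
```

Only the tick labels change; the underlying values stay negative for men, so the bars still mirror each other.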