---
title: 'Data Analysis 3: Class 2'
author: "Alexey Bessudnov"
date: "22 January 2020"
output: ioslides_presentation
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(message = FALSE)
```
## Plan for today
- Working with Github
- Reproducible research, project workflow, R Markdown
- Understanding Society Data structure
- Test assignment
## Github
- Allows you to keep track of changes in your project and share it with other people
- We'll work with Github from R Studio
- Stage -> Commit -> Push your files
- List the files you don't want to track in the .gitignore file
## Reproducible research
- It should be possible to reproduce every statistical analysis you've done (for other researchers or future you)
- Always keep your code in scripts
- Have a folder for each project (R Studio project) and create subfolders with a clear structure
- Always write comments for your code so that other people (or future you) can understand what you have done
- Organise your code in a clear way and follow style conventions (see, for example, Google's R Style Guide: https://google.github.io/styleguide/Rguide.xml)
- Use Github
## R Markdown
- Combine code and output in one document
- Easily change output formats (html, LaTeX, Word, etc.)
- See [R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/)
- Also see http://abessudnov.net/dataanalysis3/rmarkdown.html
- Alternative: Jupyter Notebook
## Exercise 1
- Create an .Rmd file with the text "Hello world"
- Sample 100 observations from the standard normal distribution; calculate the mean value in your sample and create a density plot
- Experiment with different output formats (Markdown, Word, LaTeX)
- Experiment with different Markdown options (show everything, show only output and not code)
- Commit and push the .Rmd and .md files on Github (do not commit files in other output formats)
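
A minimal sketch of the R code for the sampling step in this exercise, using base R only. The seed value and the object names `x` and `sample_mean` are arbitrary choices made for illustration, not part of the exercise.

```r
# Fix the seed so that the same sample is drawn every time (the value is arbitrary)
set.seed(1)

# Draw 100 observations from the standard normal distribution
x <- rnorm(100)

# Sample mean
sample_mean <- mean(x)
sample_mean

# Kernel density estimate of the sample
plot(density(x), main = "Density of 100 standard normal draws")
```

In an .Rmd file this code would go inside an R chunk, so that the mean and the plot appear in the knitted output.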
## Understanding Society data structure
- Read the User Guide, pp. 57-64: https://www.understandingsociety.ac.uk/sites/default/files/downloads/documentation/mainstage/user-guides/mainstage-user-guide.pdf
- Let us explore data files
- **mrdoc** contains all the documentation
- **tab** contains data files in the tabulated format
## Data structure 2
- 18 waves of the BHPS data
- 9 waves of UndSoc data + cross-wave data
- Codebooks available here: https://www.understandingsociety.ac.uk/documentation/mainstage/dataset-documentation
- Let us look at the latest wave (wave 9) and explore the data files
## Reading data in R
- You can read data into R in several ways
- Base R (*read.csv()* etc.)
- We will use the **readr** package (part of **tidyverse**)
- For tab separated files we want the function *read_tsv()*
- At a later point, we will use the *fread()* function from the **data.table** package
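
A minimal sketch comparing base R and **readr** on a tab-separated file. The file is a small temporary one created on the fly; its column names and values are invented for illustration and are not Understanding Society variables.

```r
library(readr)

# Write a tiny tab-separated file to a temporary location
# (column names and values are made up for this example)
tmp <- tempfile(fileext = ".tab")
writeLines(c("id\tage\tsex", "1\t34\t2", "2\t51\t1", "3\t27\t2"), tmp)

# Base R: read.delim() expects tab-separated input with a header row
base_df <- read.delim(tmp)

# readr: read_tsv() returns a tibble and reports the guessed column types
readr_df <- read_tsv(tmp)

dim(base_df)
dim(readr_df)
```

Both calls read the same rows and columns; the main practical differences are the class of the returned object (data frame vs. tibble) and how column types are guessed.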
## Wave 9: data from the individual adult questionnaire (i_indresp)
```{r}
# attaching the tidyverse library quietly
suppressMessages(library(tidyverse))
indresp <- read_tsv("data/UKDA-6614-tab/tab/ukhls_w9/i_indresp.tab")
# remove the data to free up memory
rm(indresp)
```
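
Files such as i_indresp can take up a lot of memory (hence the `rm()` call above), so it can help to preview a few rows and columns before reading everything. A sketch, assuming **readr** 2.0 or later (needed for `col_select`); the temporary file and its columns are invented stand-ins, not the real i_indresp data.

```r
library(readr)

# Stand-in for a large tab-separated file; names and values are invented
tmp <- tempfile(fileext = ".tab")
writeLines(c("id\tage\tsex\tregion",
             "1\t34\t2\t7", "2\t51\t1\t3", "3\t27\t2\t7", "4\t45\t1\t1"), tmp)

# n_max limits the number of data rows read;
# col_select keeps only the named columns
preview <- read_tsv(tmp, n_max = 2, col_select = c(id, age))

dim(preview)
names(preview)
```

The same two arguments could be passed when reading a real data file, with the path and column names replaced accordingly.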
## Household-level substantive file (i_hhresp)
```{r}
hhresp <- read_tsv("data/UKDA-6614-tab/tab/ukhls_w9/i_hhresp.tab")
rm(hhresp)
```
## Household roster data (i_egoalt)
```{r}
egoalt <- read_tsv("data/UKDA-6614-tab/tab/ukhls_w9/i_egoalt.tab")
rm(egoalt)
```
## Stable individual characteristics
```{r}
xwavedat <- read_tsv("data/UKDA-6614-tab/tab/ukhls_wx/xwavedat.tab")
rm(xwavedat)
```
## Other data files
- i_child: data on children
- i_youth: data from self-completion questionnaire for children
- i_indall: all people in the household, inc. children and non-respondents
- i_indsamp: technical individual-level data
- i_hhsamp: technical and observational data on households
- i_callrec: call records
- i_income: individual-level data on income and payments
- i_newborn: data on newborn children
- i_parstyle: parenting style data
## Assignments
- Test assignment (deadline 28 Jan, 2pm): https://github.com/datan3-2020/testAssignment
- Accept the invitation that I'll send by email and complete the assignment using Github Classroom
- Once completed, do not forget to submit a link to your Github repo on Bart
- If you have any questions, raise an issue here: https://github.com/datan3-2020/datan3/issues
## Homework for next class
- Read ch.5 (Data transformation) from R for Data Science: https://r4ds.had.co.nz/transform.html
- Do exercises 5.2.4, 5.3.1, 5.6.7 (and others if you have time)