class2.Rmd

---
title: 'Data Analysis 3: Class 2'
author: "Alexey Bessudnov"
date: "22 January 2020"
output:
  ioslides_presentation
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(message = FALSE)
```

## Plan for today

- Working with Github
- Reproducible research, project workflow, R Markdown
- Understanding Society Data structure
- Test assignment

## Github

- Allows you to keep track of changes in your project and share it with other people
- We'll work with Github from R Studio
- Stage -> Commit -> Push your files
- Include the files you don't want to track in the .gitignore file

## Reproducible research

- It should be possible to reproduce every statistical analysis you've done  (for other researchers or future you)
- Always keep your code in scripts
- Have a folder for each project (R Studio project) and create subfolders with a clear structure
- Always write comments for your code so that other people (or future you) can understand what you have done
- Organise your code in a clear way and follow style conventions (see, for example, Google's R Style Guide: https://google.github.io/styleguide/Rguide.xml)
- Use Github

## R Markdown

- Combine code and output in one document
- Easily change output formats (html, LaTeX, Word, etc.)
- See [R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/)
- Also see http://abessudnov.net/dataanalysis3/rmarkdown.html
- Alternative: Jupyter Notebook

## Exercise 1

- Create an .Rmd file with the text "Hello world"
- Sample 100 observations from the standard normal distribution; calculate the mean value in your sample and create a density plot
- Experiment with different output formats (Markdown, Word, LaTeX)
- Experiment with different Markdown options (show everything, show only output and not code)
- Commit and push the .Rmd and .md files on Github (do not commit files in other output formats)

## Understanding Society data structure

- Read the User Guide, pp. 57-64: https://www.understandingsociety.ac.uk/sites/default/files/downloads/documentation/mainstage/user-guides/mainstage-user-guide.pdf
- Let us explore data files
- **mrdoc** contains all the documentation 
- **tab** contains data files in the tabulated format

## Data structure 2

- 18 waves of the BHPS data
- 9 waves of UndSoc data + cross-wave data
- Codebooks available here: https://www.understandingsociety.ac.uk/documentation/mainstage/dataset-documentation
- Let us look at the latest wave 9 and explore the data files

## Reading data in R

- You can read data in R in several ways
- Base R (*read.csv()* etc.)
- We will use the **readr** package (part of **tidyverse**)
- For tab separated files we want the function *read_tsv()*
- At a later point, we will use the *fread()* function from the **data.table** package


## Wave 9: data from the individual adult questionnaire (i_indresp)

```{r}
# attaching the tidyverse library quietly
suppressMessages(library(tidyverse))

indresp <- read_tsv("data/UKDA-6614-tab/tab/ukhls_w9/i_indresp.tab")

# remove the data to free up memory
rm(indresp)

```

## Household-level substantive file (h_hhresp)


```{r}
hhresp <- read_tsv("data/UKDA-6614-tab/tab/ukhls_w9/i_hhresp.tab")

rm(hhresp)

```

## Household roster data (h_egoalt)

```{r}
egoalt <- read_tsv("data/UKDA-6614-tab/tab/ukhls_w9/i_egoalt.tab")

rm(egoalt)

```

## Stable individual characteristics

```{r}
xwavedat <- read_tsv("data/UKDA-6614-tab/tab/ukhls_wx/xwavedat.tab")

rm(xwavedat)

```

## Other data files

- i_child: data on children
- i_youth: data from self-completion questionnaire for children
- i_indall: all people in the household, inc. children and non-responses
- i_indsamp: technical individual-level data
- i_hhsamp: technical and observational data on hoseholds
- i_callrec: call records
- i_income: income and payments individual level data
- i_newborn: data on newborn children
- i_parstyle: parenting style data

## Assignments

- Test assignment (deadline 28 Jan, 2pm): https://github.com/datan3-2020/testAssignment
- Accept the invitation that I'll send by email and complete the assignment using Github Classroom
- Once completed do not forget to submit a link to your Github repo on Bart
- Any questions raise an issue here: https://github.com/datan3-2020/datan3/issues

## Homework for next class

- Read ch.5 (Data transformation) from R for Data Science: https://r4ds.had.co.nz/transform.html
- Do exercises 5.2.4, 5.3.1, 5.6.7 (and others if you have time)