---
output: pdf_document
---
```{r, echo=FALSE}
cat(paste("(C) (cc by-sa) Wouter van Atteveldt, file generated", format(Sys.Date(), format="%B %d %Y")))
```
> Note on the data used in this howto:
> This data can be downloaded from http://piketty.pse.ens.fr/files/capital21c/en/xls/,
> but the Excel format is a bit difficult to parse, as it is meant to be human-readable, with multiple header rows etc.
> For that reason, I've extracted CSV files for some interesting tables and uploaded them to
> https://github.com/vanatteveldt/learningr/tree/master/data.
> If you're accessing this tutorial from the GitHub project, these files should already be in your 'data' subfolder.
Playing with data in R
========================================================
To demonstrate R, we will use the data from Piketty's 'Capital in the 21st Century':
```{r}
income = read.csv("data/income_topdecile.csv")
```
We've read a CSV file into a new variable `income`, which should appear in your environment list.
You can click on the variable to inspect it visually, but we can also use the `head` command:
```{r}
head(income, n=10)
```
As you can see, the values are NA (missing) for most rows, especially in the earlier period.
Let's throw out all data containing missing values using the `na.omit` function:
```{r}
income = na.omit(income)
head(income)
```
Much better.
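To see what `na.omit` does on a small scale, here is a minimal sketch on a made-up data frame (the name `toy` and its values are hypothetical, not taken from the Piketty data). Note that a row is dropped as soon as any one of its columns is missing:

```{r}
# Hypothetical toy data, not taken from the Piketty tables
toy = data.frame(Year = c(1900, 1910, 1920, 1930),
                 A = c(NA, 0.40, 0.42, 0.45),
                 B = c(0.45, NA, 0.40, 0.38))
# na.omit keeps only the rows without any missing value
toy_complete = na.omit(toy)
toy_complete
# complete.cases marks the rows that na.omit would keep as TRUE
complete.cases(toy)
```

If only some columns matter for an analysis, it can be worth selecting those columns first, so that rows are not discarded because of missing values elsewhere.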
Now, we can list the variables in the file using `names` and get the numbers of rows or columns with `nrow` and `ncol`, respectively:
```{r}
names(income)
nrow(income)
ncol(income)
```
We can also ask for a summary of each of the variables in the file using the `summary` command:
```{r}
summary(income)
```
This lists the range, mean, etc. for each variable.
We can select any column from a data frame using `variable$column`:
```{r}
income$U.S.
```
This gives a vector of numbers representing the different cells in that column.
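The `$` notation is one of several equivalent ways to extract a column. A small sketch, using a hypothetical data frame `toy_cols` invented for illustration:

```{r}
# Hypothetical toy data frame for illustration
toy_cols = data.frame(Year = c(2000, 2010), Share = c(0.30, 0.35))
# Three ways to extract the same column as a vector
by_dollar = toy_cols$Share
by_name = toy_cols[["Share"]]
by_index = toy_cols[, "Share"]
# All three give the same numeric vector
identical(by_dollar, by_name)
identical(by_dollar, by_index)
```

The `[[` form is convenient when the column name is stored in a variable rather than typed literally.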
We can use various functions such as `mean`, `sum`, and `length` to get information about a vector:
```{r}
length(income$U.S.)
mean(income$U.S.)
mean(income$Europe)
```
As perhaps expected, the mean income inequality in Europe is lower than in the U.S.
Let's do a t-test to see if the difference is significant:
```{r}
t.test(income$U.S., income$Europe, paired=TRUE)
```
So, with p<.05 we can conclude that the income distribution in the U.S. is more unequal than in Europe.
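A paired t-test compares the two series observation by observation (here, year by year): it is equivalent to a one-sample t-test on the differences. The sketch below illustrates this with hypothetical values (`a` and `b` are invented for illustration and are not the Piketty series):

```{r}
# Hypothetical values, chosen only for illustration
a = c(0.45, 0.47, 0.44, 0.48, 0.46)
b = c(0.40, 0.45, 0.41, 0.42, 0.43)
res_paired = t.test(a, b, paired=TRUE)
# The same test expressed as a one-sample t-test on the differences
res_diff = t.test(a - b, mu=0)
# Both approaches yield the same p-value
c(paired=res_paired$p.value, differences=res_diff$p.value)
```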
Let's make a simple plot of the income inequality in the U.S. and Europe
(reproducing fig. 9.8 on page 324):
```{r}
plot(x=income$Year, y=income$U.S., type="l", ylab="Top decile income share", xlab="Year", ylim=c(0, 0.5))
lines(x=income$Year, y=income$Europe, col="red")
```
As you can see, income distribution in pre-WWI Europe is actually more unequal than in the U.S.,
but this is reversed during the 1910s and inequality diverges after the 1970s.
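When a plot contains more than one line, a legend helps the reader tell them apart. A minimal sketch using base R's `legend` function, with hypothetical series (`yr`, `s1` and `s2` are invented for illustration):

```{r}
# Hypothetical series, invented for illustration only
yr = c(1900, 1925, 1950, 1975, 2000)
s1 = c(0.40, 0.45, 0.35, 0.33, 0.45)
s2 = c(0.45, 0.40, 0.33, 0.30, 0.35)
plot(x=yr, y=s1, type="l", ylab="Share", xlab="Year", ylim=c(0, 0.5))
lines(x=yr, y=s2, col="red")
# One legend entry per line; lty=1 draws a solid line sample next to each label
legend("bottomright", legend=c("Series 1", "Series 2"), col=c("black", "red"), lty=1)
```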
Still, the lines are probably correlated:
```{r}
cor.test(income$U.S., income$Europe)
```
So, although the correlation is moderate at 0.43, it is not significant (probably due to the small number of data points).
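The role of the number of data points can be made explicit. For a Pearson correlation r computed on n pairs, the test statistic is t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom, so the same r becomes more significant as n grows. A sketch, where `cor_p` is a helper defined here for illustration and the sample sizes are hypothetical:

```{r}
# Two-sided p-value of a Pearson correlation r observed on n pairs
cor_p = function(r, n) {
  t = r * sqrt(n - 2) / sqrt(1 - r^2)
  2 * pt(-abs(t), df = n - 2)
}
# The sample sizes below are hypothetical, chosen only for illustration
cor_p(0.43, n=10)
cor_p(0.43, n=50)
```

With these assumed sample sizes, the first p-value is above .05 and the second is below it, even though the correlation itself is unchanged.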