---
title: "Wages, Gender and other factors"
output:
  word_document: default
  html_document:
    df_print: paged
  pdf_document: default
editor_options:
  chunk_output_type: console
---
In this notebook we make use of a small data set ("CPS1985.xlsx") concerning employees and examine:

- the composition of the work force with respect to various employee characteristics (variables),
- whether there is any sign of a possible wage difference between men and women, and
- whether there is any pronounced correlation among the variables.

We first import the relevant data, which are available as an Excel file.
```{r}
setwd("D:/data/Econometrics and Applied Statistics")
library(readxl)
cps <- read_excel("CPS1985.xlsx")
```
We then take a look at the data
```{r}
summary(cps)
```
from which we get the basic descriptive statistics for each numerical variable and the length for each character variable.
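Since summary() only reports the length for character columns, one way to see category counts already at this stage is to convert those columns to factors first; a minimal sketch using base R (cps_fct is just a throwaway copy of the data):
```{r}
# convert character columns to factors so that summary() also reports level counts
cps_fct <- as.data.frame(lapply(cps, function(x) if (is.character(x)) factor(x) else x))
summary(cps_fct)
```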
Additionally, let us draw a histogram and a boxplot for each numerical variable. From these diagrams we can check for skewness and look for outliers (which we may or may not decide to get rid of).
```{r}
hist(cps$wage,
     xlab = "Hourly Wage in $",
     main = "Histogram of wage",
     col = "steelblue", breaks = 20)
boxplot(cps$wage,
        ylab = "Hourly Wage in $",
        main = "Boxplot of wage",
        col = "steelblue")
hist(cps$education,
     xlab = "Education in years",
     main = "Histogram of education",
     col = "steelblue", breaks = 20)
boxplot(cps$education,
        ylab = "Education in years",
        main = "Boxplot of education",
        col = "steelblue")
hist(cps$experience,
     xlab = "Experience in years",
     main = "Histogram of experience",
     col = "steelblue", breaks = 20)
boxplot(cps$experience,
        ylab = "Experience in years",
        main = "Boxplot of experience",
        col = "steelblue")
hist(cps$age,
     xlab = "Age in years",
     main = "Histogram of age",
     col = "steelblue", breaks = 20)
boxplot(cps$age,
        ylab = "Age in years",
        main = "Boxplot of age",
        col = "steelblue")
```
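The boxplots flag points beyond the whiskers as potential outliers. If we also want those observations as numbers rather than as dots in a plot, one option is boxplot.stats(), which applies the same 1.5 × IQR rule; a minimal sketch for the wage variable (wage_out is just a throwaway name):
```{r}
# wage values flagged as outliers by the 1.5 * IQR rule used by the boxplot
wage_out <- boxplot.stats(cps$wage)$out
length(wage_out)   # how many observations are flagged
summary(wage_out)  # how extreme they are
```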
We also calculate the frequency distribution of each categorical variable
```{r}
table(cps$ethnicity)
table(cps$region)
table(cps$gender)
table(cps$occupation)
table(cps$sector)
table(cps$union)
table(cps$married)
```
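Raw counts are easier to compare when expressed as shares of the sample. A minimal sketch using prop.table(), shown here for gender and occupation (any of the tables above works the same way):
```{r}
# relative frequencies instead of raw counts
prop.table(table(cps$gender))
round(prop.table(table(cps$occupation)), 3)
```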
We can also cross-tabulate these distributions against the values of another categorical variable. For example, statistics of interest would be the distribution of working sector, marital status and wage across gender
```{r}
table(cps$gender, cps$sector)
table(cps$gender, cps$married)
tapply(cps$wage, cps$gender, summary)
```
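A graphical complement to these tables is a grouped boxplot of wage by gender, which shows the whole wage distribution of each group rather than only its summary statistics; a minimal sketch in the style of the plots above:
```{r}
# wage distribution for each gender, side by side
boxplot(wage ~ gender, data = cps,
        ylab = "Hourly Wage in $",
        main = "Wage by gender",
        col = "steelblue")
```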
Even more specifically, we are interested in the mean wage, its standard deviation and the total number of observations for each gender (male or female). To retrieve this information we group the data by gender and carry out the aforementioned calculations
```{r message=FALSE}
library(dplyr)
avgs <- cps %>%
  group_by(gender) %>%
  summarise(mean(wage),
            sd(wage),
            n())
print(avgs)
```
A naive, first-glance conclusion from the above table would be that the average wage for women is about \$2 less than the average wage for men. But is this difference statistically meaningful?
To compare the mean wage of women and men and test the statistical significance of the difference between them, we split the data into two subgroups (men, women) and perform a t-test on the variable "wage"
```{r}
male_obs <- cps %>% dplyr::filter(gender == "male")
female_obs <- cps %>% dplyr::filter(gender == "female")
t.test(male_obs$wage, female_obs$wage)
```
The above result confirms that the difference in means is not equal to 0 (the null hypothesis of equal means is rejected).
To illuminate the procedure, we also perform the above calculation manually. To do so we return to the table "avgs", which gives us the [estimated]{.underline} E(wage), sd(wage) and number of observations for each gender.
```{r}
# split the dataset by gender
male <- avgs %>% dplyr::filter(gender == "male")
female <- avgs %>% dplyr::filter(gender == "female")
# rename columns of both splits
colnames(male) <- c("Gender", "Y_bar_m", "s_m", "n_m")
colnames(female) <- c("Gender", "Y_bar_f", "s_f", "n_f")
male
female
```
Now, considering wage_male and wage_female as independent variables, the difference gap = wage_male - wage_female satisfies

E(gap) = E(wage_male) - E(wage_female) and var(gap) = var(wage_male) + var(wage_female), thus

- the [estimated E(gap)]{.underline} is gap_bar = Y_bar_m - Y_bar_f
- the [estimated se(gap)]{.underline} is gap_se = (s_m^2^/n_m + s_f^2^/n_f)^1/2^

and the standardized difference gap_bar/gap_se asymptotically follows a t distribution.
```{r}
gap <- male$Y_bar_m - female$Y_bar_f
gap_se <- sqrt(male$s_m^2 / male$n_m + female$s_f^2 / female$n_f)
```
So, we finally calculate the 95% confidence interval, using the normal critical value 1.96, as follows
```{r}
gap_ci_l <- gap - 1.96 * gap_se
gap_ci_u <- gap + 1.96 * gap_se
result <- cbind(gap, gap_se, gap_ci_l, gap_ci_u)
print(result, digits = 3)
```
Our result closely matches the confidence interval reported by the automated t-test we performed above.
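Any tiny remaining discrepancy comes from using 1.96 instead of the exact t critical value. As a sketch of how one could reproduce t.test()'s interval more precisely (by default t.test() uses the Welch–Satterthwaite degrees of freedom; the variable names below are introduced here just for illustration):
```{r}
# Welch-Satterthwaite degrees of freedom, as used by t.test() with unequal variances
v_m <- male$s_m^2 / male$n_m
v_f <- female$s_f^2 / female$n_f
df_welch <- (v_m + v_f)^2 / (v_m^2 / (male$n_m - 1) + v_f^2 / (female$n_f - 1))
t_crit <- qt(0.975, df_welch)   # exact 97.5% quantile instead of 1.96
c(gap - t_crit * gap_se, gap + t_crit * gap_se)
```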
As a final task we examine the correlations between the continuous numerical variables of the data set. To that end we calculate the correlation matrix and create the corresponding scatterplot for each pair
```{r}
library(corrplot)
# keep the continuous numerical variables (here columns 2 to 5)
num_vars <- cps[, 2:5]
cor1 <- cor(num_vars)
# visualise the correlation matrix: coefficients in the lower panel, circles in the upper
corrplot.mixed(cor1, lower.col = "black", number.cex = 0.7)
pairs(num_vars)
```
Among other things, we observe a very high positive correlation between experience and age, a relatively high positive correlation between wage and education, and a relatively high negative correlation between education and experience.
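If we want to check whether one of these pairwise correlations is statistically distinguishable from zero, cor.test() provides the corresponding test; a minimal sketch for the wage/education pair:
```{r}
# Pearson correlation test for wage vs. education
cor.test(cps$wage, cps$education)
```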