-
Notifications
You must be signed in to change notification settings - Fork 7
/
README.Rmd
106 lines (77 loc) · 3.66 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
---
title: "Atlanta City Employee Salaries 2015"
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, fig.width = 7)
```
## Overview
This [Shiny app](https://tmasjc.shinyapps.io/atl_salary_data/) explores the Atlanta city employee salaries data for the Year 2015. In particular, the relationship between employee ``salary`` and variables such as ``gender``, ``age``, and ``ethnic group``.
Dataset obtained from [data.world](https://data.world/brentbrewington/atlanta-city-employee-salaries), contributed by [Brent Brewington](https://github.com/bbrewington/atlanta-salary-data).
## Components
- **Selector Panel** which allows user to select points using cursor (click-and-drag)
- **Plot I** displays the relationship of employee ``median salary`` by ``age``, seperated by ``gender``
- **Plot II** displays the relationship of employee ``median salary`` by ``ethnic group``, seperated by ``gender``
- **Table I** which tabulates count at each combination of ``ethnic group`` and ``gender``
![Screen Shot](screen.gif)
## Motivation
This Shiny app is meant as a simple demonstration of how unbalanced dataset can be misleading at times. In statistics, this phenomenon is known as **[Simpson's paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox)** or the Yule–Simpson effect.
If we look at male and female ``median salary`` at various ``age group``, we might observe that *in general, male earns higher pay than female*.
```{r, warning=FALSE, message=FALSE}
library(dplyr)
library(stringr)
library(ggplot2)
raw <- readRDS("Data/atl_2015.RDS") %>%
rename(ethnic = ethnic.origin, job = job.title, salary = annual.salary) %>%
mutate(gender = factor(sex), ethnic = factor(ethnic)) %>%
as_tibble()
which_less_than <- function(vec, n) {
x <- table(vec) < n
sort(unique(vec))[x]
}
sub_age <- with(raw, age %>% which_less_than(10))
sub_ethnic <- with(raw, ethnic %>% which_less_than(30))
dat <- raw %>%
filter(!(age %in% sub_age), !(ethnic %in% sub_ethnic)) %>%
mutate(ethnic = factor(str_extract(ethnic, "^[A-Z]?[a-z]+")))
```
```{r}
old <- theme_set(theme_light() + theme(plot.title = element_text(size = 11), legend.position = "none"))
```
```{r}
dat %>%
group_by(age, sex) %>%
summarise(salary = median(salary)) %>%
ggplot(aes(age, salary, col = sex, group = sex)) +
scale_x_continuous(breaks = seq(20, 100, 10), limits = c(20, 70)) +
geom_line() +
labs(x = "Age", y = "Median Salary", col = "Sex") +
theme(legend.position = "bottom")
```
However, if we look at male and female ``median salary`` by ``ethnic groups``, we notice that female's median salary does not seem to lag behind. As a matter of fact, *both Asian and White women median salary is actually higher than men by an obvious margin*.
```{r}
dat %>%
group_by(ethnic, sex) %>%
summarise(salary = median(salary)) %>%
ggplot(aes(ethnic, salary, col = sex)) +
geom_line(col = "gray") +
geom_point() +
labs(x = "Ethnic Group", y = "Median Salary", col = "Sex") +
theme(legend.position = "bottom")
```
Tinker around the Shiny app to explore the data and understand the logic behind.
## Deployment
This Shiny app is shipped with Docker, built on [rocker/tidyverse](https://hub.docker.com/r/rocker/tidyverse/) image.
```bash
# from this repo
git clone [email protected]:tmasjc/atl-salary-data.git
# move inside directory
cd atl-salary-data
# name your image
# it may take a while to build
docker build -t atl_salary_data .
# start your container
docker run -dp 3838:3838 atl_salary_data
## You are set. Go to localhost:3838/atl_salary_data to view application. ##
```
Or you can simply deploy it to Shiny server as per normal.