# Results
## Job Count by Category
Because a single job can belong to several categories, we extract every category attached to a job, separate them, and store the counts per category in a new data frame called `popular_category`, keeping the 25 most common categories. To visualize the number of job postings per category, we then draw a horizontal bar chart, sorted in descending order, from this new data frame.
```{r}
# Split each posting's category string into its individual categories.
categoryList <- job %>%
  filter(!is.na(Job.Category)) %>%
  select(Job.Category, Job.ID) %>%
  mutate(Job.Category = as.character(Job.Category),
         Job.Category = str_split(Job.Category, ",|&"))

# Count postings per category and keep the 25 most common ones.
popular_category <-
  as.data.frame(unlist(categoryList["Job.Category"], use.names = FALSE)) %>%
  set_colnames("Category") %>%
  mutate(Category = trimws(Category, "both")) %>%
  filter(!is.na(Category), Category != "") %>%
  group_by(Category) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  slice(1:25)

ggplot(popular_category, aes(x = fct_reorder(Category, count), y = count)) +
  geom_col(color = "black", fill = "orange") +
  ggtitle("Job Count by Category") +
  labs(x = "Category", y = "Count") +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_flip()
```
From the graph above, we can tell that **Architecture** and **Engineering** have the most job postings, while **Procurement Policy** and **Social Services** have the fewest among the 25 categories shown.
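The splitting-and-counting step can be sketched in base R on a few made-up category strings. The `toy_categories` vector and its values are illustrative assumptions, not rows from the data set, and base R functions stand in for the `stringr`/`dplyr` calls used in the chunk:

```r
# Toy category strings; hypothetical values for illustration only.
toy_categories <- c(
  "Engineering, Architecture, & Planning",
  "Health",
  "Health, Social Services",
  NA
)

# Drop missing values, then split on commas and ampersands.
pieces <- unlist(strsplit(toy_categories[!is.na(toy_categories)], ",|&"))

# Trim surrounding whitespace and discard empty fragments
# (", &" leaves a blank piece between the two separators).
pieces <- trimws(pieces)
pieces <- pieces[pieces != ""]

# Count postings per category and sort in descending order.
category_counts <- sort(table(pieces), decreasing = TRUE)
print(category_counts)
```

Because a posting with several categories contributes one count to each of them, the counts can add up to more than the number of postings.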
## Distributions of Salaries
We also want to study how salaries are distributed for each payroll type. Since our data set contains three payroll types, **Annual**, **Daily** and **Hourly**, we draw three histograms. For each posting, the salary plotted on the x-axis is the midpoint (mean) of `Salary Range From` and `Salary Range To`.
```{r}
# salary is the midpoint of the posted range
job <- job %>%
  mutate(salary = Salary.Range.From + (Salary.Range.To - Salary.Range.From) / 2)
Annual <- job[job$Salary.Frequency == "Annual", ]
ggplot(Annual, aes(salary)) +
  geom_histogram(bins = 40, color = "black", fill = "orange") +
  ggtitle("Salary Distribution (Annual)") +
  labs(x = "Salary", y = "Count") +
  theme(plot.title = element_text(hjust = 0.5))
```
```{r}
Daily <- job[job$Salary.Frequency == "Daily", ]
# Refer to the column by name inside aes() instead of using Daily$salary.
ggplot(data = Daily, aes(salary)) +
  geom_histogram(bins = 20, color = "black", fill = "orange") +
  ggtitle("Salary Distribution (Daily)") +
  labs(x = "Salary", y = "Count") +
  theme(plot.title = element_text(hjust = 0.5))
```
```{r}
Hourly <- job[job$Salary.Frequency == "Hourly", ]
# Refer to the column by name inside aes() instead of using Hourly$salary.
ggplot(data = Hourly, aes(salary)) +
  geom_histogram(bins = 40, color = "black", fill = "orange") +
  ggtitle("Salary Distribution (Hourly)") +
  labs(x = "Salary", y = "Count") +
  theme(plot.title = element_text(hjust = 0.5))
```
From the three plots above, we make the following observations:
* Most jobs state their salary annually. Some jobs are paid hourly, and only a few are paid daily.
* Annual salaries follow an approximately right-skewed distribution, which means that most jobs sit at the lower end of the range and relatively few offer a high salary.
* Daily salaries show no specific pattern. Some jobs have relatively low daily salaries, while others have much higher ones.
* Most hourly salaries are relatively low, but a few jobs have relatively high hourly salaries.
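
One quick way to check a skewness claim like the one for annual salaries is to compare the mean with the median: in a right-skewed sample, the long upper tail usually pulls the mean above the median. This is a heuristic, not a formal test. A minimal sketch with made-up values (the `toy_salary` vector is an assumption for illustration, not taken from the data set):

```r
# Hypothetical annual salaries with a long upper tail.
toy_salary <- c(42000, 48000, 51000, 55000, 58000, 62000, 70000, 95000, 180000)

salary_mean   <- mean(toy_salary)
salary_median <- median(toy_salary)

# A mean above the median is consistent with right skew.
is_right_skewed <- salary_mean > salary_median
print(c(mean = salary_mean, median = salary_median))
print(is_right_skewed)
```

The same comparison could be run on the `salary` column of each payroll subset.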
```{r}
temp = Hourly %>%
filter(salary < 10)
```
We also look into the data to learn more about the salary distribution. For instance, among hourly paid jobs, Stationary Engineer and City Medical Specialist have extremely high hourly salaries, while College Aide has a low hourly salary.
```{r}
# Convert salaries on the hourly and daily scales to a yearly scale.
# Number of working days in the US in a year: 261 source:
# Number of working hours in the US in a day: 8.4 hours
job <- job %>%
  mutate(Annual_salary = if_else(
    Salary.Frequency == "Annual",
    round((Salary.Range.From + Salary.Range.To) / 2, 2),
    if_else(
      Salary.Frequency == "Daily",
      round((Salary.Range.From + Salary.Range.To) * 261 / 2, 2),
      round((Salary.Range.From + Salary.Range.To) * 261 * 8.4 / 2, 2)
    )
  ))

# Expand the category list so that each (job ID, category) pair is one observation.
df <- unnest(categoryList, cols = c(Job.Category)) %>%
  mutate(Job.Category = trimws(Job.Category, "both")) %>%
  filter(Job.Category != "")
df_all <- left_join(df, job, by = "Job.ID")

# Keep the popular categories and attach annual salary and posting month.
df_popular <- df %>%
  filter(Job.Category %in% popular_category$Category) %>%
  merge(., job[c("Job.ID", "Annual_salary", "Posting.Date")], by = "Job.ID") %>%
  unique() %>%
  mutate(month = lubridate::month(mdy(Posting.Date))) %>%
  group_by(Job.Category, month) %>%
  mutate(count = n())
```
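
The annualization rule used in the chunk above can be checked on a few hand-made rows. This sketch assumes the same conversion factors as the chunk's comments (261 working days per year, 8.4 working hours per day). The `toy` data frame and the `annualize()` helper are hypothetical names introduced for illustration:

```r
# Conversion factors taken from the comments in the chunk above.
days_per_year <- 261
hours_per_day <- 8.4

# Hypothetical helper: midpoint of the range, scaled to a yearly figure.
# As in the chunk above, anything that is not "Annual" or "Daily"
# is treated as hourly.
annualize <- function(from, to, frequency) {
  midpoint <- (from + to) / 2
  multiplier <- ifelse(frequency == "Annual", 1,
                ifelse(frequency == "Daily", days_per_year,
                       days_per_year * hours_per_day))
  round(midpoint * multiplier, 2)
}

# Made-up rows, one per payroll type.
toy <- data.frame(
  from = c(50000, 250, 18),
  to = c(70000, 350, 22),
  frequency = c("Annual", "Daily", "Hourly")
)
toy$annual_salary <- annualize(toy$from, toy$to, toy$frequency)
print(toy)
```

Putting all three payroll types on one yearly scale is what allows the later boxplot to compare categories on a single axis.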
```{r}
ggplot(df_popular, aes(x = month, y = Job.Category, fill = Job.Category)) +
  geom_density_ridges(scale = 3, show.legend = FALSE) +
  theme_ridges() +
  labs(x = "Month", y = "Category") +
  ggtitle("Posting Month Distribution by Category") +
  theme(plot.title = element_text(hjust = 0.5))
```
```{r}
# All job postings, using only category, annual salary and job ID
ggplot(df_popular, aes(x = reorder(Job.Category, Annual_salary, FUN = mean), y = Annual_salary)) +
  geom_boxplot(color = "black", fill = "orange") +
  ggtitle("Distribution of Salaries w.r.t. Different Categories") +
  labs(x = "Category", y = "Annual Salary") +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_flip()
```
## Word Clouds for Text Information
### How we get started
Meanwhile, we also want to study the minimum qualification requirements and preferred skills for the available jobs in NYC. We want to find out whether these two columns show any patterns and whether we can extract useful information from them. To illustrate our findings graphically, we use Word Clouds to show the most frequent words in these texts.
So what is a Word Cloud? Word Clouds are visual representations of text data. They are useful for quickly perceiving the most prominent terms, which makes them widely used in media and well understood by the public. A Word Cloud is a collection of words depicted in different sizes. The bigger and bolder a word appears, the more frequently it occurs in the given text and the more important it is.
To extract meaningful vocabulary from the text descriptions, we take advantage of the text mining package `tm` in R. This package is based on ideas from Natural Language Processing (NLP). It has methods that can transform all words to lowercase, remove words that are uninformative in English, such as "a" and "the", and get rid of extra whitespace and punctuation.
After these manipulations on the text data, we can create a new data frame of word frequencies. We can also sort it by frequency and find out the most frequent words under minimum qualification requirements and preferred skills for all jobs or for any particular category of jobs that we are interested in.
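
The cleaning and counting steps described above can be sketched in base R on two made-up sentences. This is a stand-in for the `tm` pipeline, not a reproduction of it: the `toy_text` strings and the short `toy_stopwords` list are assumptions for illustration, whereas `tm` provides a much longer English stop-word list.

```r
# Hypothetical requirement texts.
toy_text <- c(
  "A baccalaureate degree and two years of experience.",
  "Experience with SQL; a degree is preferred."
)
# Tiny illustrative stop-word list (an assumption, not tm's list).
toy_stopwords <- c("a", "and", "of", "with", "is", "the", "two")

clean <- tolower(toy_text)                 # lowercase everything
clean <- gsub("[[:punct:]]", " ", clean)   # replace punctuation with spaces
clean <- gsub("[[:digit:]]", " ", clean)   # drop numbers
words <- unlist(strsplit(clean, "\\s+"))   # split on runs of whitespace
words <- words[words != "" & !(words %in% toy_stopwords)]

# Word-frequency data frame, most frequent first.
freq <- sort(table(words), decreasing = TRUE)
word_freq <- data.frame(word = names(freq), freq = as.integer(freq))
print(word_freq)
```

The resulting `word`/`freq` layout matches the shape of the frequency data frames built from the term-document matrices in the chunks below.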
### Results
Because `wordcloud2` shows only one Word Cloud graph after knitting to Bookdown or HTML, we save all our graphs to four separate HTML files that are rendered automatically every time they are opened in a browser. Here is the link to those files in my GitHub repo: https://github.com/ju-chengyou/5702_Final_Word_Cloud.
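
A minimal sketch of saving a Word Cloud to a standalone HTML file, assuming the `wordcloud2` and `htmlwidgets` packages are installed. The `toy_freq` data frame and the output file name are made up for illustration, and this is not necessarily the exact script used to produce the files in the linked repo:

```r
# Hypothetical word-frequency table in the word/freq layout wordcloud2 expects.
toy_freq <- data.frame(
  word = c("experience", "degree", "management"),
  freq = c(120, 80, 45)
)

# Write to a temporary directory so the sketch leaves the project untouched.
out_file <- file.path(tempdir(), "toy_wordcloud.html")

# Only run the widget code when both packages are available.
if (requireNamespace("wordcloud2", quietly = TRUE) &&
    requireNamespace("htmlwidgets", quietly = TRUE)) {
  cloud <- wordcloud2::wordcloud2(
    data = toy_freq, color = "random-light", backgroundColor = "black"
  )
  # selfcontained = FALSE writes the JS/CSS dependencies to a side folder.
  htmlwidgets::saveWidget(cloud, out_file, selfcontained = FALSE)
}
```

Opening the saved file in a browser redraws the cloud each time, which is why the layout can differ between views.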
Here, we will show the Word Cloud of the most frequent words in Minimum Qual Requirements among all jobs in our dataset.
#### Minimum Qual Requirements @ All Jobs
```{r}
job_docs <- VCorpus(VectorSource(job)) # Whole dataset
# inspect(job_docs)
job_mini_req <- VCorpus(VectorSource(job$Minimum.Qual.Requirements)) # Minimum Qual Requirements
# inspect(job_mini_req)
job_pref_skil <- VCorpus(VectorSource(job$Preferred.Skills)) # Preferred Skills
# inspect(job_pref_skil)
```
```{r}
# Tech Jobs
tech_jobs <- subset(job, Job.Category == "Technology, Data & Innovation")
# dim(tech_jobs) # There should be 28 jobs related to technology
job_tech_mini_req <- VCorpus(VectorSource(tech_jobs$Minimum.Qual.Requirements))
# inspect(job_tech_mini_req)
job_tech_pref_skil <- VCorpus(VectorSource(tech_jobs$Preferred.Skills))
# inspect(job_tech_pref_skil)
```
```{r}
# All Jobs cross Minimum Qual Requirements
# toSpace <- content_transformer(function (x , pattern) gsub(pattern, " ", x))
# job_mini_req <- tm_map(job_mini_req, toSpace, "/")
# job_mini_req <- tm_map(job_mini_req, toSpace, "@")
# job_mini_req <- tm_map(job_mini_req, toSpace, "\\|")
job_mini_req <- tm_map(job_mini_req, content_transformer(tolower))
job_mini_req <- tm_map(job_mini_req, removeNumbers)
job_mini_req <- tm_map(job_mini_req, removeWords, stopwords("english"))
job_mini_req <- tm_map(job_mini_req, removeWords, c("the", "one", "two", "for", "must", "year", "including"))
job_mini_req <- tm_map(job_mini_req, removePunctuation)
job_mini_req <- tm_map(job_mini_req, stripWhitespace)
# job_mini_req <- tm_map(job_mini_req, stemDocument)
```
```{r}
mini_req_matrix <- TermDocumentMatrix(job_mini_req)
mini_freq_m <- as.matrix(mini_req_matrix)
mini_freq_v <- sort(rowSums(mini_freq_m), decreasing=TRUE)
mini_freq <- data.frame(word = names(mini_freq_v), freq=mini_freq_v)
# head(mini_freq, 20)
htmlTable(head(mini_freq, 20), caption="Minimum Qual Requirements in All Jobs Word Frequency", header=c("Word", "Frequency"), rnames=FALSE)
```
<!-- ```{r} -->
<!-- library(wordcloud2) -->
<!-- library(webshot) -->
<!-- webshot::install_phantomjs(force = TRUE) -->
<!-- mini_freq_graph <- wordcloud2(data=mini_freq, color='random-light', backgroundColor='black') -->
<!-- library("htmlwidgets") -->
<!-- saveWidget(mini_freq_graph,"mini_freq_graph.html", selfcontained = F) -->
<!-- ``` -->
```{r}
library(wordcloud2)
wordcloud2(data=mini_freq, color='random-light', backgroundColor='black', size=0.8)
```
<!-- ```{r showChoro1} -->
<!-- htmltools::includeHTML("~/Documents/Columbia_Fall_2019/5702_Projects/5702-final-project/mini_freq_graph.html") -->
<!-- ``` -->
#### Preferred Skills @ All Jobs
```{r}
# All Jobs cross Preferred Skills
job_pref_skil <- tm_map(job_pref_skil, content_transformer(tolower))
job_pref_skil <- tm_map(job_pref_skil, removeNumbers)
job_pref_skil <- tm_map(job_pref_skil, removeWords, stopwords("english"))
job_pref_skil <- tm_map(job_pref_skil, removeWords, c("the", "one", "two", "for", "must", "year", "including"))
job_pref_skil <- tm_map(job_pref_skil, removePunctuation)
job_pref_skil <- tm_map(job_pref_skil, stripWhitespace)
```
```{r}
pref_skil_matrix <- TermDocumentMatrix(job_pref_skil)
pref_freq_m <- as.matrix(pref_skil_matrix)
pref_freq_v <- sort(rowSums(pref_freq_m), decreasing=TRUE)
pref_freq <- data.frame(word = names(pref_freq_v), freq=pref_freq_v)
pref_freq <- pref_freq[-1,]
htmlTable(head(pref_freq, 20), caption="Preferred Skills in All Jobs Word Frequency", header=c("Word", "Frequency"), rnames=FALSE)
```
<!-- ```{r} -->
<!-- library(wordcloud2) -->
<!-- wordcloud2(data=pref_freq, color='random-light', backgroundColor='black', size=0.8) -->
<!-- ``` -->
<!-- ```{r showChoro1} -->
<!-- htmltools::includeHTML("~/Documents/Columbia_Fall_2019/5702_Projects/5702-final-project/pref_freq_graph.html") -->
<!-- ``` -->
#### Minimum Qual Requirements @ Tech Jobs
```{r}
# Tech Jobs cross Minimum Qual Requirements
job_tech_mini_req <- tm_map(job_tech_mini_req, content_transformer(tolower))
job_tech_mini_req <- tm_map(job_tech_mini_req, removeNumbers)
job_tech_mini_req <- tm_map(job_tech_mini_req, removeWords, stopwords("english"))
job_tech_mini_req <- tm_map(job_tech_mini_req, removeWords, c("the", "one", "two", "for", "must", "year", "including"))
job_tech_mini_req <- tm_map(job_tech_mini_req, removePunctuation)
job_tech_mini_req <- tm_map(job_tech_mini_req, stripWhitespace)
```
```{r}
tech_mini_matrix <- TermDocumentMatrix(job_tech_mini_req)
tech_mini_freq_m <- as.matrix(tech_mini_matrix)
tech_mini_freq_v <- sort(rowSums(tech_mini_freq_m), decreasing=TRUE)
tech_mini_freq <- data.frame(word = names(tech_mini_freq_v), freq=tech_mini_freq_v)
htmlTable(head(tech_mini_freq, 20), caption="Minimum Qual Requirements in Technology Related Jobs Word Frequency", header=c("Word", "Frequency"), rnames=FALSE)
```
```{r}
# library(wordcloud2)
# wordcloud2(data=tech_mini_freq, color='random-light', backgroundColor='black', size=0.8)
```
#### Preferred Skills @ Tech Jobs
```{r}
# Tech Jobs cross Preferred Skills
job_tech_pref_skil <- tm_map(job_tech_pref_skil, content_transformer(tolower))
job_tech_pref_skil <- tm_map(job_tech_pref_skil, removeNumbers)
job_tech_pref_skil <- tm_map(job_tech_pref_skil, removeWords, stopwords("english"))
job_tech_pref_skil <- tm_map(job_tech_pref_skil, removeWords, c("the", "one", "two", "for", "must", "year", "including"))
job_tech_pref_skil <- tm_map(job_tech_pref_skil, removePunctuation)
job_tech_pref_skil <- tm_map(job_tech_pref_skil, stripWhitespace)
```
```{r}
tech_pref_matrix <- TermDocumentMatrix(job_tech_pref_skil)
tech_pref_freq_m <- as.matrix(tech_pref_matrix)
tech_pref_freq_v <- sort(rowSums(tech_pref_freq_m), decreasing=TRUE)
tech_pref_freq <- data.frame(word = names(tech_pref_freq_v), freq=tech_pref_freq_v)
tech_pref_freq <- tech_pref_freq[-1,]
htmlTable(head(tech_pref_freq, 20), caption="Preferred Skills in Technology Related Jobs Word Frequency",
header=c("Word", "Frequency"), rnames=FALSE)
```
<!-- ```{r} -->
<!-- library(wordcloud2) -->
<!-- wordcloud2(data = tech_pref_freq, color='random-light', backgroundColor='black', size=0.8) -->
<!-- ``` -->
### Observations
The four Word Clouds support several observations. For instance, for both Minimum Qual Requirements and Preferred Skills, *experience* is the most frequent word in all four graphs, which makes sense, since previous working experience is indeed very important for applicants.
Also, when comparing all jobs with technology jobs, we notice that tech jobs prefer to hire employees with technology-related skills, since words like *computer* and *programming* appear a lot in these texts. Even words for specific skills, such as *sql*, appear in our list of most frequent words.
Meanwhile, in all four graphs, words like *skills*, *knowledge*, *management* and *communication* appear many times. This makes sense, since all employers want to hire people who have solid skills and are good at communication and cooperation.
Finally, we find that the minimum-requirements graphs for all jobs and for tech jobs share almost the same set of frequent words, which we believe is because **minimum** requirements are similar for all kinds of jobs.
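
One way to quantify how much two frequent-word lists overlap is to intersect their top-n words. A minimal sketch with made-up frequency tables: the `toy_all` and `toy_tech` data frames and the `top_words()` helper are hypothetical, standing in for tables such as `mini_freq` and `tech_mini_freq`.

```r
# Made-up word-frequency tables in the same word/freq layout.
toy_all <- data.frame(
  word = c("experience", "degree", "college", "education", "management"),
  freq = c(500, 320, 300, 280, 150)
)
toy_tech <- data.frame(
  word = c("experience", "computer", "degree", "programming", "college"),
  freq = c(60, 45, 40, 30, 25)
)

# Hypothetical helper: the n most frequent words of a word/freq table.
top_words <- function(freq_table, n) {
  ordered <- freq_table[order(freq_table$freq, decreasing = TRUE), ]
  head(as.character(ordered$word), n)
}

# Words that appear in both top-5 lists, and the share they represent.
shared <- intersect(top_words(toy_all, 5), top_words(toy_tech, 5))
overlap_share <- length(shared) / 5
print(shared)
print(overlap_share)
```

A share close to 1 would back up the claim that two lists contain almost the same frequent words, while a lower share would point to category-specific vocabulary.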