05-results.Rmd

# Results
## Job Count by Category

Since we want to study the total number of jobs for each job category and one particular job could belong to multiple categories, we extract all the categories related to a job, seperate them, and create a new data frame called `popular_category`, which stores the counts of different job categories. Then, in order to visualize the numbers of job postings among different categories, we draw a descending horizontal bar chart based on his new data frame. 

```{r}
categoryList <- job %>%
  filter(!is.na(Job.Category)) %>%
  select(Job.Category, Job.ID) %>%
  mutate(Job.Category = as.character(Job.Category),
         Job.Category = str_split(Job.Category, ",|&|,&"))

popular_category <-
  as.data.frame(unlist(categoryList["Job.Category"],use.names=FALSE)) %>%
  set_colnames("Category") %>%
  mutate(Category = trimws(Category,"both")) %>%
  filter(!is.na(Category)) %>%
  filter(Category !="") %>%
  filter(is.character(Category ))  %>%
  group_by(Category) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  slice(1:25)
  
ggplot(popular_category, aes(x = fct_reorder(Category,count), y = count)) +
  geom_col(color = "black", fill = "orange") +
  ggtitle("Job Count by Category") +
  labs(x = "Category", y = "Count") + 
  theme(plot.title = element_text(hjust = 0.5)) + 
  coord_flip()
```

From the graphs below, we can tell that **Architecture** and **Engineering** have the most job postings, while **Procurement Policy** and **Social Services** have the fewest.


## Distributions of Salaries

We also want to study the distributions of salaries among different types of payroll. Since there are three payroll types in our data set, which are **Annual**, **Daily** and **Hourly**, we will draw three histograms to visualize the distributions. We take the mean of `Salary Range From` and `Salary Range To` as our salary for the histogram at the x-axis.

```{r}
job <- job %>%
  mutate(salary = Salary.Range.From+(Salary.Range.To-Salary.Range.From)/2)

Annual = job[job$Salary.Frequency=="Annual",]
ggplot(Annual, aes(salary)) +
  geom_histogram(bins = 40, color = "black", fill = "orange") +
  ggtitle("Salary Distribution (Annual)") + 
  labs(x = "Salary", y = "Count") + 
  theme(plot.title = element_text(hjust = 0.5))
```
```{r}
Daily = job[job$Salary.Frequency=="Daily",]
ggplot(data = Daily, aes(Daily$salary)) +
  geom_histogram(bins = 20, color = "black", fill = "orange") +
  ggtitle("Salary Distribution (Daily)") + 
  labs(x = "Salary", y = "Count") + 
  theme(plot.title = element_text(hjust = 0.5))
```
```{r}
Hourly = job[job$Salary.Frequency=="Hourly",]
ggplot(data = Hourly, aes(Hourly$salary)) +
  geom_histogram(bins = 40, color = "black", fill = "orange") +
  ggtitle("Salary Distribution (Hourly)") + 
  labs(x = "Salary", y = "Count") + 
  theme(plot.title = element_text(hjust = 0.5))
```

From these three plots above, we have the following obeservations: 
* For most of the jobs, the salaries are given annually. There are also some jobs which have hourly salaries. Only a few of those jobs have daily salaries. 
* For salaries calculated annually, it has approximately right-skewed normal distribution, which means that most jobs do not have a relatively high salaries.
* For salaries calculated daily, there is no specific pattern regarding the distribution. Some jobs have relatively low daily salaries, while others have much higher salaries. 
* For salaries calculated hourly, most of them has a relatively low value, but there are still some jobs have relatively high hourly salaries. 

```{r}
temp = Hourly %>% 
  filter(salary < 10)
```

Then, We also look into our data and find out more information about our salary distribution. For insace, for houly paied jobs, Stationary Engineer and City Medical Specialist have extremly high hourly salaries, while College Aide has low hourly salaries.

```{r}
#converting salary on hourly scale anddaily scale to yearly scale
#no of working days in US in a year: 261 source: 
#no of working hours in US in a day: 8.4 hours 
job <- job %>% mutate(Annual_salary = if_else( Salary.Frequency == "Annual", round((Salary.Range.From + Salary.Range.To)/2,2),
                                 if_else(Salary.Frequency == "Daily", round((Salary.Range.From + Salary.Range.To)*261/2,2),
                                         round((Salary.Range.From + Salary.Range.To)*261*8.4/2,2))
                                 )
                               )

##make the list of category of each job id as a single observations 
df<-unnest(categoryList, cols = c(Job.Category))%>%
  mutate(Job.Category = trimws(Job.Category,"both"))%>%
  filter(Job.Category!="")

df_all<-left_join(df, job, by = "Job.ID")

df_popular<-df%>%
  filter(Job.Category %in% popular_category$Category)%>%
  merge(.,job[c("Job.ID","Annual_salary","Posting.Date")], by = "Job.ID")%>%
  unique()%>%
  mutate(month = lubridate::month(mdy(Posting.Date)))%>%
  group_by(Job.Category,month)%>%
  mutate(count = n())
```
```{r}
ggplot(df_popular, aes(x=month,y=Job.Category,fill=Job.Category))+
  geom_density_ridges(scale = 3, show.legend = FALSE) + theme_ridges()+
  labs(x="month",y="count")+
  ggtitle("The counts of insects treated with different insecticides.")+
  theme(plot.title = element_text(hjust = 0.5))
```
```{r}
## all job posting with only category, anuual salary and job id
ggplot(df_popular,aes(x = reorder(Job.Category,Annual_salary,FUN=mean), y = Annual_salary)) +
geom_boxplot(color = "black", fill = "orange") +
ggtitle("Distribution of Salaries w.r.t Different Categories") + 
labs(x = "Category", y = "Anual Salary") + 
theme(plot.title = element_text(hjust = 0.5)) +
  coord_flip()
```


## Word Clouds for Text Information

### How we get started
Meanwhile, we also want to study the minimum qualification requirements and preferred skills for the available jobs in NYC. We want to find if there are any patterns in these two columns and if we can extract any useful information from them. In order to illustrate our findings graphfically, we decide to use Word Clouds to show the most frequent words in these texts.

So what is Word Clouds? Word Clouds is visual representations of text data. They are useful for quickly perceiving the most prominent terms, which makes them widely used in media and well understood by the public. A Word Cloud is a collection of words depicted in different sizes. The bigger and bolder the word appears, the greater frequency within a given text and the more important it is.

In order to extract meaningful vocabularies from the text descriptions, we take advantage of the text mining package `tm` in R. This package is based on the ideas of Natural Language Processing (NLP). It have methods that can tranform all words to lowercases, remove words that are uninformative in Enlighs such as "a" and "the", and get rid of whitespaces and punctuations.

After these manipulations on the text data, we can create a new data frame of word frequencies. We can also sort it by frequency and find out the most frequent words under minimum qualification requirements and preferred skills for all jobs or for any particular category of jobs that we are interested in.

### Results

Due to the problem of `wordcloud2` that only one Word Cloud graph appears after knitting to Bookdown or HTML, we save all our graphs to four seperate html files that can be automatically rendered everytime they are opened in a browser. Here are the link to those files in my GitHub repo: https://github.com/ju-chengyou/5702_Final_Word_Cloud.

Here, we will show the Word Cloud of the most frequent words in Minimum Qual Requirements among all jobs in our dataset.

#### Minium Qual Requirements @ All Jobs
```{r}
job_docs <- VCorpus(VectorSource(job)) # Whole dataset
# inspect(job_docs)
job_mini_req <- VCorpus(VectorSource(job$Minimum.Qual.Requirements)) # Minimum Qual Requirements
# inspect(job_mini_req)
job_pref_skil <- VCorpus(VectorSource(job$Preferred.Skills)) # Preferred Skills
# inspect(job_pref_skil)
```
```{r}
# Tech Jobs
tech_jobs <- subset(job, Job.Category == "Technology, Data & Innovation")
# dim(tech_jobs) # There should be 28 jobs related to technology
job_tech_mini_req <- VCorpus(VectorSource(tech_jobs$Minimum.Qual.Requirements))
# inspect(job_tech_mini_req)
job_tech_pref_skil <- VCorpus(VectorSource(tech_jobs$Preferred.Skills))
# inspect(job_tech_pref_skil)
```
```{r}
# All Jobs cross Minimum Qual Requirements
# toSpace <- content_transformer(function (x , pattern) gsub(pattern, " ", x))
# job_mini_req <- tm_map(job_mini_req, toSpace, "/")
# job_mini_req <- tm_map(job_mini_req, toSpace, "@")
# job_mini_req <- tm_map(job_mini_req, toSpace, "\\|")
job_mini_req <- tm_map(job_mini_req, content_transformer(tolower))
job_mini_req <- tm_map(job_mini_req, removeNumbers)
job_mini_req <- tm_map(job_mini_req, removeWords, stopwords("english"))
job_mini_req <- tm_map(job_mini_req, removeWords, c("the", "one", "two", "for", "must", "year", "including")) 
job_mini_req <- tm_map(job_mini_req, removePunctuation)
job_mini_req <- tm_map(job_mini_req, stripWhitespace)
# job_mini_req <- tm_map(job_mini_req, stemDocument)
```
```{r}
mini_req_matrix <- TermDocumentMatrix(job_mini_req)
mini_freq_m <- as.matrix(mini_req_matrix)
mini_freq_v <- sort(rowSums(mini_freq_m), decreasing=TRUE)
mini_freq <- data.frame(word = names(mini_freq_v), freq=mini_freq_v)
# head(mini_freq, 20)
htmlTable(head(mini_freq, 20), caption="Minimum Qual Requirements in All Jobs Word Frequency", header=c("Word", "Frequency"), rnames=FALSE)
```
<!-- ```{r} -->
<!-- library(wordcloud2) -->
<!-- library(webshot) -->
<!-- webshot::install_phantomjs(force = TRUE) -->
<!-- mini_freq_graph <- wordcloud2(data=mini_freq, color='random-light', backgroundColor='black') -->
<!-- library("htmlwidgets") -->
<!-- saveWidget(mini_freq_graph,"mini_freq_graph.html", selfcontained = F) -->
<!-- ``` -->
```{r}
library(wordcloud2)
wordcloud2(data=mini_freq, color='random-light', backgroundColor='black', size=0.8)
```
<!-- ```{r showChoro1} -->
<!-- htmltools::includeHTML("~/Documents/Columbia_Fall_2019/5702_Projects/5702-final-project/mini_freq_graph.html") -->
<!-- ``` -->

#### Preferred Skills @ All Jobs
```{r}
# All Jobs cross Preferred Skills
job_pref_skil <- tm_map(job_pref_skil, content_transformer(tolower))
job_pref_skil <- tm_map(job_pref_skil, removeNumbers)
job_pref_skil <- tm_map(job_pref_skil, removeWords, stopwords("english"))
job_pref_skil <- tm_map(job_pref_skil, removeWords, c("the", "one", "two", "for", "must", "year", "including")) 
job_pref_skil <- tm_map(job_pref_skil, removePunctuation)
job_pref_skil <- tm_map(job_pref_skil, stripWhitespace)
```
```{r}
pref_skil_matrix <- TermDocumentMatrix(job_pref_skil)
pref_freq_m <- as.matrix(pref_skil_matrix)
pref_freq_v <- sort(rowSums(pref_freq_m), decreasing=TRUE)
pref_freq <- data.frame(word = names(pref_freq_v), freq=pref_freq_v)
pref_freq <- pref_freq[-1,]
htmlTable(head(pref_freq, 20), caption="Preferred Skills in All Jobs Word Frequency", header=c("Word", "Frequency"), rnames=FALSE)
```
<!-- ```{r} -->
<!-- library(wordcloud2) -->
<!-- wordcloud2(data=pref_freq, color='random-light', backgroundColor='black', size=0.8) -->
<!-- ``` -->

<!-- ```{r showChoro1} -->
<!-- htmltools::includeHTML("~/Documents/Columbia_Fall_2019/5702_Projects/5702-final-project/pref_freq_graph.html") -->
<!-- ``` -->

#### Minium Qual Requirements @ Tech Jobs
```{r}
# Tech Jobs cross Minimum Qual Requirements
job_tech_mini_req <- tm_map(job_tech_mini_req, content_transformer(tolower))
job_tech_mini_req <- tm_map(job_tech_mini_req, removeNumbers)
job_tech_mini_req <- tm_map(job_tech_mini_req, removeWords, stopwords("english"))
job_tech_mini_req <- tm_map(job_tech_mini_req, removeWords, c("the", "one", "two", "for", "must", "year", "including")) 
job_tech_mini_req <- tm_map(job_tech_mini_req, removePunctuation)
job_tech_mini_req <- tm_map(job_tech_mini_req, stripWhitespace)
```
```{r}
tech_mini_matrix <- TermDocumentMatrix(job_tech_mini_req)
tech_mini_freq_m <- as.matrix(tech_mini_matrix)
tech_mini_freq_v <- sort(rowSums(tech_mini_freq_m), decreasing=TRUE)
tech_mini_freq <- data.frame(word = names(tech_mini_freq_v), freq=tech_mini_freq_v)
htmlTable(head(tech_mini_freq, 20), caption="Minimum Qual Requirements in Technology Related Jobs Word Frequency", header=c("Word", "Frequency"), rnames=FALSE)
```
```{r}
# library(wordcloud2)
# wordcloud2(data=tech_mini_freq, color='random-light', backgroundColor='black', size=0.8)
```

#### Preferred Jobs @ Tech Jobs
```{r}
# Tech Jobs cross Preferred Skills
job_tech_pref_skil <- tm_map(job_tech_pref_skil, content_transformer(tolower))
job_tech_pref_skil <- tm_map(job_tech_pref_skil, removeNumbers)
job_tech_pref_skil <- tm_map(job_tech_pref_skil, removeWords, stopwords("english"))
job_tech_pref_skil <- tm_map(job_tech_pref_skil, removeWords, c("the", "one", "two", "for", "must", "year", "including")) 
job_tech_pref_skil <- tm_map(job_tech_pref_skil, removePunctuation)
job_tech_pref_skil <- tm_map(job_tech_pref_skil, stripWhitespace)
```
```{r}
tech_pref_matrix <- TermDocumentMatrix(job_tech_pref_skil)
tech_pref_freq_m <- as.matrix(tech_pref_matrix)
tech_pref_freq_v <- sort(rowSums(tech_pref_freq_m), decreasing=TRUE)
tech_pref_freq <- data.frame(word = names(tech_pref_freq_v), freq=tech_pref_freq_v)
tech_pref_freq <- tech_pref_freq[-1,]
htmlTable(head(tech_pref_freq, 20), caption="Preferred Skills in Technology Related Jobs Word Frequency", 
          header=c("Word", "Frequency"), rnames=FALSE)
```
<!-- ```{r} -->
<!-- library(wordcloud2) -->
<!-- wordcloud2(data = tech_pref_freq, color='random-light', backgroundColor='black', size=0.8) -->
<!-- ``` -->

### Obervations
We can have plenty of observations from the four Word Clouds. For instance, we can see that for both Minimum Qual Requirements and Preferred Skills, *experience* is the most frequent word in all these four graphs, which makes sense, since previous working experience is indeed very important for applicants.

Also, when comparing all jobs with technological jobs, we notice that for tech jobs prefer to hire employees with skills related to technology, since vocabularies like *computer* and *programming* appears a lot in these texts. Even some words about specific skills, such as *sql*, appear in our most frequent word list.

Meanwhile, in all these four graphs, vocabularies like *skills*, *knowledge*, *management*, *communication* appear plenty of times. This makes sense since all employers want to hire people who have solid skills and are good at communication and cooperation.

Finally, in general, we find that minimum requirements of all jobs and tech jobs graphs share almost the same set of frequent words, which we believe is due to the fact that **minimum** requirements are similar for all kinds of jobs.