Skip to content

Latest commit

 

History

History
 
 

2019-02-19

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 

PhDs Awarded by Field

Dr. Ellie Murray proposed a DataViz challenge for the #epibookclub based around the number of PhD degrees awraded in the USA. As an epidemiology postdoc she was especially interested in how others would approach DataViz for Epidemiology PhDs. There are additional fields within this dataset, so take a crack at whatever looks interesting!

Happy Saturday #datavizbook #epibookclub! For this week, since we've learned a bit about #Rstats, I thought we'd try something new -- a #dataviz challenge!

The US gov collects data on all doctoral degree graduates every year. Let's see what we can learn! https://t.co/fjIBhq8Hkr

— Ellie Murray (@EpiEllie) February 16, 2019

The data comes from the NSF - where there are at least 72 different datasets if you wanted to approach the data from a different angle. They are primarily summary tables stored as .xlsx files. Cleaning these can be a bit awkward so if you are interested, it would be a cool project to try to do fully in R!

Alternatively - I have cleaned the data for you and saved as a .csv for you to read directly into R. There was a lot of duplication, ie totals were represented within the broader table, and there wasn't a nice separation of the fields (simply indentation in the Excel sheet).

To get at the details for broad or major fields, dplyr::summarize is your friend!

phd_field <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-02-19/phd_by_field.csv")

Data Dictionary

phd_by_field.csv

variable class description
broad_field character The parent field (highest delineator)
major_field character A sub-field below the broad field
field character The fine-field, most detailed
year integer Year PhD awarded
n_phds double Total number of PhDs awarded

Spoilers - Cleaning Script

# Because all the things
library(tidyverse)
library(tidyverse)

# Grab just the sub-major (major field) titles for separating the columns
sub_major_fields <- readxl::read_excel("sed17-sr-tab012.xlsx", skip = 3) %>%
  rename(field = `Field of study`) %>%
  filter(!is.na(field)) %>%
  pull(field)

# Manually grabbed the broad fields (based off indentation)
major_fields <- c(
  "Life sciences",
  "Physical sciences and earth sciences",
  "Mathematics and computer sciences",
  "Psychology and social sciences",
  "Engineering",
  "Education",
  "Humanities and arts",
  "Other"
)

# read in fiend field dataset
# create new columns based off the matching of additional major or broad fields

df <- readxl::read_excel("sed17-sr-tab013.xlsx", skip = 3) %>%
  rename(field = `Fine field of study`) %>%
  mutate(
    field = case_when(
      field == "Othero" ~ "Other",
      TRUE ~ field
    ),
    sub_major_field = case_when(
      field %in% sub_major_fields ~ field,
      TRUE ~ NA_character_
    ),
    major_field = case_when(
      field %in% major_fields ~ field,
      TRUE ~ NA_character_
    )
  )


# Use tidyr::fill() to fill in the repeats of each major/broad field
df_field <- df %>%
  fill(major_field, .direction = "down") %>%
  fill(sub_major_field, .direction = "down") %>%
  filter(!field %in% major_fields) %>%
  filter(!field %in% sub_major_fields)

# gather the years, remove the commas, and rename to appropriate columns
df_clean <- df_field %>%
  gather(year, n_phds, `2008.0`:`2017.0`) %>%
  mutate(
    year = factor(as.integer(year)),
    n_phds = parse_number(n_phds)
  ) %>%
  rename(field = field) %>%
  select(broad_field = major_field, major_field = sub_major_field, field, year, n_phds)

df_clean %>% View()

# Check to confirm numbers match (they do!)
df_clean %>% 
  group_by(major_field, year) %>% 
  summarize(sum(n_phds, na.rm = TRUE))
# Write to .csv for posting
df_clean  %>% 
  write_csv("phd_by_field.csv")