PhDs in the US

I have finally joined #tidytuesday. This project, promoted by R for Data Science, aims to build data manipulation and visualisation skills in the R community through the exploratory analysis of a new raw dataset that is posted every week. Apart from improving #RStats skills, the idea of the project is to foster connections within the #RStats community, so that people can explore others' work and get feedback.

The data for this week consist of a sample of PhDs awarded by field in the US. The dataset is relatively small, with just 5 variables: the broad field, the major field and the field itself, the year of the award and the number of PhDs awarded.

library(tidyverse)
library(gganimate)

grads = read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-02-19/phd_by_field.csv")

glimpse(grads)
## Observations: 3,370
## Variables: 5
## $ broad_field <chr> "Life sciences", "Life sciences", "Life sciences", "…
## $ major_field <chr> "Agricultural sciences and natural resources", "Agri…
## $ field       <chr> "Agricultural economics", "Agricultural and horticul…
## $ year        <dbl> 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008…
## $ n_phds      <dbl> 111, 28, 3, 68, 41, 18, 77, 182, 52, 96, 41, 32, 44,…

For my plot I want the total number of PhDs awarded in each broad_field for each year. I also create a variable (lab_clean) with shorter names for the broad fields, which will be useful for labelling the lines in the plot afterwards.

check = grads %>%
  group_by(broad_field, year) %>%
  tally(n_phds) %>%
  mutate(lab_clean = case_when(broad_field == "Education" ~ "Edu.",
                               broad_field == "Humanities and arts" ~ "Hum.",
                               broad_field == "Mathematics and computer sciences" ~ "Sci.",
                               broad_field == "Engineering" ~ "Eng.",
                               broad_field == "Life sciences" ~ "Lif.",
                               broad_field == "Psychology and social sciences" ~ "Soc.", 
                               broad_field == "Other" ~ "Oth."))

head(check)
## # A tibble: 6 x 4
## # Groups:   broad_field [1]
##   broad_field  year     n lab_clean
##   <chr>       <dbl> <dbl> <chr>    
## 1 Education    2008  6561 Edu.     
## 2 Education    2009  6528 Edu.     
## 3 Education    2010  5287 Edu.     
## 4 Education    2011  4670 Edu.     
## 5 Education    2012  4803 Edu.     
## 6 Education    2013  4934 Edu.
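
As a side note, here is a minimal sketch of what group_by() followed by tally(n_phds) does. The toy tibble and the names toy, by_tally and by_summarise are hypothetical and are not part of the real dataset. As far as I understand dplyr, a weighted tally sums the weight column with missing values removed and drops the last grouping level, so it should match an explicit summarise() call:

```r
library(dplyr)

# Hypothetical toy data (not taken from the real dataset)
toy <- tibble(
  broad_field = c("A", "A", "A", "B", "B"),
  year        = c(2008, 2008, 2009, 2008, 2008),
  n_phds      = c(10, 5, NA, 7, 3)
)

# Weighted tally: sums n_phds within each broad_field/year group
by_tally <- toy %>%
  group_by(broad_field, year) %>%
  tally(n_phds)

# Explicit version of the same aggregation
by_summarise <- toy %>%
  group_by(broad_field, year) %>%
  summarise(n = sum(n_phds, na.rm = TRUE), .groups = "drop_last")

by_tally
```

Note that the group whose only value is NA ends up with a total of 0 rather than NA. That is worth keeping in mind if the real data contain missing counts.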

The visualisation is just a combination of geom_line() and gganimate.

ggplot(check, aes(year, n, group = broad_field, colour = broad_field)) + 
  geom_line() + 
  geom_segment(aes(xend = 2017, yend = n), linetype = 2, colour = 'grey') + 
  geom_point(size = 2) + 
  scale_x_continuous(breaks = c(2008:2017)) +
  geom_text(aes(x = 2017.5, label = lab_clean), hjust = 0) + 
  transition_reveal(year) + 
  coord_cartesian(clip = 'off') + 
  labs(title = 'US PhDs Awarded by Broad Field since 2008',
       y = 'Total number of PhDs awarded', 
       x = " ",
       caption = "@EdudinGonzalo") + 
  theme_minimal() + 
  theme(legend.position = "bottom", 
        plot.margin = margin(1.5, 1, 1, 1.5), 
        legend.title = element_blank())
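
Printing the plot object renders the animation with default settings. If you want more control, or want to save a GIF, something like the sketch below should work. It uses a hypothetical toy data frame in place of check, assumes the gifski package is available as the renderer, and the file name phds.gif and the frame settings are arbitrary choices of mine:

```r
library(ggplot2)
library(gganimate)

# Hypothetical toy data standing in for `check`
toy <- data.frame(
  year        = rep(2008:2010, times = 2),
  n           = c(10, 12, 15, 8, 8, 9),
  broad_field = rep(c("A", "B"), each = 3)
)

# Same idea as the plot above, reduced to the essentials
anim <- ggplot(toy, aes(year, n, group = broad_field, colour = broad_field)) +
  geom_line() +
  transition_reveal(year)

# Rendering needs a renderer backend; gifski is an assumption here
if (requireNamespace("gifski", quietly = TRUE)) {
  gif <- animate(anim, nframes = 20, fps = 10,
                 width = 600, height = 400,
                 renderer = gifski_renderer())
  # Written to a temporary directory so nothing in the project is touched
  anim_save(file.path(tempdir(), "phds.gif"), animation = gif)
}
```

More frames give a smoother animation at the cost of a longer render and a larger file.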

It seems that the number of PhD graduates in Humanities, Education, Sciences and Engineering has remained stable over time. In social sciences and especially life sciences, however, the number of PhDs awarded has increased since 2008.

Find the code on my GitHub repository.

