Crowd flow confounder example in ML and sensitivity chapters #241

malcolmbarrett · 2024-06-15T16:49:46Z

Working on the sensitivity chapter has led me to a bit of a deep dive into using some variables from touringplans::parks_metadata_raw to better capture variables related to the crowd flow at Magic Kingdom. After spending some time with it, I think I'm overloading this section in the sensitivity chapter. This started as me creating two new variables for an alternative DAG (is_weekend and is_holiday) but is now getting a bit too nuanced for this section.

I think a better approach would be to present this more complex confounding structure in the ML chapter, partly to help justify using more flexible modeling approaches. Then, we can pick up that thread later without spending so much time on the idea in the sensitivity chapter.

So, for now, I'm going to stick with the two examples above and rework this in a few months. Here's some copy and code related to that. (I think this should be expanded to be even more sophisticated, e.g. happenings at other parks)

In particular, we want to capture baseline crowd flow. We'll use a few new variables to approximate it: the previous day's wait time at the same hour; the number of schools in session; whether the day is a weekend or a holiday and, for holidays, how the holiday relates to crowd size; events around the Magic Kingdom, like fireworks and parades; and a marker of the ride capacity lost due to attraction shutdowns in the park.

Consider this expanded DAG in @fig-dag-extra-days. For simplicity, we're presenting all of these confounders in a single supernode called crowd flow. We're assuming that each of them is a cause of both whether there is an Extra Magic Morning and wait times.
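To make the role of the crowd flow supernode concrete, here is a minimal base R simulation. It is only a sketch: the data are simulated rather than taken from touringplans, and the variable names (`crowd_flow`, `emm`, `wait`) and every coefficient are made up for illustration. It shows that when crowd flow drives both Extra Magic Mornings and wait times, leaving it out biases the estimated effect.

```r
# Toy simulation, not the touringplans data: all numbers are assumptions.
set.seed(1)
n <- 5000

# `crowd_flow` stands in for the whole supernode of confounders
crowd_flow <- rnorm(n)

# Assumed: busier days are more likely to get an Extra Magic Morning
emm <- rbinom(n, 1, plogis(0.8 * crowd_flow))

# The true effect of EMM on wait time is set to 5 minutes
wait <- 40 + 5 * emm + 10 * crowd_flow + rnorm(n, sd = 5)

# Ignoring crowd flow mixes its effect into the EMM coefficient
naive <- coef(lm(wait ~ emm))[["emm"]]

# Adjusting for crowd flow closes the backdoor path
adjusted <- coef(lm(wait ~ emm + crowd_flow))[["emm"]]

c(naive = naive, adjusted = adjusted)
```

In this setup the adjusted estimate should land near the simulated effect of 5, while the naive estimate is pulled upward by the open path through crowd flow.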

# packages used by the code in this issue
library(dplyr)
library(readr)
library(ggplot2)
library(touringplans)

metadata <- parks_metadata_raw |>
  filter(year == 2018) |>
  select(
    # some of these are precision variables
    date, insession, insession_sqrt_dlr, mkevent, holiday, holidaym, mkprdday,
    mkprddt1, mkprddt2, mkfirewk, mkfiret1, mkfiret2,
    # technically this may be a post-outcome variable; should it be lagged?
    capacitylost_mk
  )

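seven_dwarfs_with_days <- seven_dwarfs_train_2018 |>
  filter(wait_hour == 9) |>
  mutate(
    # `holidays` is not defined in this snippet; it is assumed to be a
    # vector of holiday dates created earlier in the chapter
    is_holiday = park_date %in% holidays,
    is_weekend = timeDate::isWeekend(park_date),
    # lag() takes the previous row, which is the previous day only if
    # no dates are missing
    prev_wait = lag(wait_minutes_posted_avg, order_by = park_date)
  ) |>
  left_join(metadata, by = c("park_date" = "date")) |>
  mutate(
    # convert the in-session variables from character to numeric
    insession = parse_number(insession),
    insession_sqrt_dlr = parse_number(insession_sqrt_dlr)
  )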
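One thing to check with `prev_wait`: a row-based lag returns the previous row, so if a date is missing from the data, the "previous day" is really an earlier day. The sketch below uses made-up toy data in base R (not the touringplans data; whether the real data have gaps is not something this issue establishes) to contrast a row-based lag with a calendar-based lookup, alongside a base R weekend flag.

```r
# Toy data: 2018-01-07 is deliberately missing
waits <- data.frame(
  park_date = as.Date(c("2018-01-05", "2018-01-06", "2018-01-08", "2018-01-09")),
  wait_minutes_posted_avg = c(30, 45, 60, 50)
)
waits <- waits[order(waits$park_date), ]

# Row-based lag: the previous *row*, whatever its date is
waits$prev_wait_row <- c(NA, head(waits$wait_minutes_posted_avg, -1))

# Calendar-based lag: look up the row dated exactly one day earlier;
# match() returns NA when that date is absent
day_number <- as.numeric(waits$park_date)
idx <- match(day_number - 1, day_number)
waits$prev_wait_day <- waits$wait_minutes_posted_avg[idx]

# Weekend flag without extra packages: wday is 0 for Sunday, 6 for Saturday
waits$is_weekend <- as.POSIXlt(waits$park_date)$wday %in% c(0, 6)

waits
```

For 2018-01-08, the row-based lag silently uses the value from 2018-01-06, while the calendar-based version returns `NA`.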

fit_ipw_effect(
  park_extra_magic_morning ~ park_temperature_high +
    park_close + park_ticket_season + is_weekend + insession + insession_sqrt_dlr + 
    mkevent + holiday + holidaym + mkprdday + mkfirewk + capacitylost_mk,
  .data = seven_dwarfs_with_days
)
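`fit_ipw_effect()` is defined elsewhere in the book, not in this issue. For readers without that context, here is a hypothetical base R stand-in, `fit_ipw_sketch()`, showing the general inverse probability weighting recipe such a helper might follow: fit a propensity model, build ATE weights, and fit a weighted outcome model. The argument names mirror the calls in this issue, but the body is an assumption and may differ from the real helper; it also assumes the treatment is coded 0/1. The data are simulated.

```r
# Hypothetical stand-in, NOT the book's fit_ipw_effect()
fit_ipw_sketch <- function(.fmla, .data, .trt, .outcome_fmla) {
  # 1. Propensity score model: P(treatment | confounders)
  ps_model <- glm(.fmla, data = .data, family = binomial())
  ps <- predict(ps_model, newdata = .data, type = "response")

  # 2. ATE weights: 1 / ps for treated, 1 / (1 - ps) for untreated
  trt <- .data[[.trt]]
  .data$w_ate <- trt / ps + (1 - trt) / (1 - ps)

  # 3. Weighted outcome model; return the treatment coefficient
  outcome_model <- lm(.outcome_fmla, data = .data, weights = w_ate)
  coef(outcome_model)[[.trt]]
}

# Simulated example with a known effect of 5 minutes (all values made up)
set.seed(1)
n <- 20000
toy <- data.frame(crowd_flow = rnorm(n))
toy$emm <- rbinom(n, 1, plogis(0.8 * toy$crowd_flow))
toy$wait <- 40 + 5 * toy$emm + 10 * toy$crowd_flow + rnorm(n, sd = 5)

estimate <- fit_ipw_sketch(
  emm ~ crowd_flow,
  .data = toy,
  .trt = "emm",
  .outcome_fmla = wait ~ emm
)
estimate
```

The point estimate is all this sketch returns; valid standard errors for a weighted model would need a robust or bootstrap approach, which is left out here.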

# Estimate the effect of an Extra Magic Morning `n_days_lag` days earlier
# on today's wait time, adjusting for that earlier day's confounders
calculate_coef2 <- function(n_days_lag) {
  distinct_emm <- seven_dwarfs_with_days |>
    arrange(park_date) |>
    transmute(
      park_date,
      prev_park_extra_magic_morning = lag(park_extra_magic_morning, n = n_days_lag),
      prev_park_temperature_high = lag(park_temperature_high, n = n_days_lag),
      prev_park_close = lag(park_close, n = n_days_lag),
      prev_park_ticket_season = lag(park_ticket_season, n = n_days_lag),
      # fixed: this previously lagged `insession` instead of `is_weekend`
      prev_is_weekend = lag(is_weekend, n = n_days_lag),
      prev_insession = lag(insession, n = n_days_lag),
      prev_insession_sqrt_dlr = lag(insession_sqrt_dlr, n = n_days_lag),
      prev_mkevent = lag(mkevent, n = n_days_lag),
      prev_holiday = lag(holiday, n = n_days_lag),
      prev_holidaym = lag(holidaym, n = n_days_lag),
      prev_mkprdday = lag(mkprdday, n = n_days_lag),
      prev_mkfirewk = lag(mkfirewk, n = n_days_lag),
      prev_capacitylost_mk = lag(capacitylost_mk, n = n_days_lag)
    )

  # drop the first `n_days_lag` days, which have no lagged exposure
  seven_dwarfs_with_days_lag <- seven_dwarfs_with_days |>
    left_join(distinct_emm, by = "park_date") |>
    filter(!is.na(prev_park_extra_magic_morning))

  fit_ipw_effect(
    prev_park_extra_magic_morning ~ prev_park_temperature_high +
      prev_park_close + prev_park_ticket_season + prev_is_weekend +
      prev_insession + prev_insession_sqrt_dlr + prev_mkevent +
      prev_holiday + prev_holidaym + prev_mkprdday + prev_mkfirewk +
      prev_capacitylost_mk,
    .data = seven_dwarfs_with_days_lag,
    .trt = "prev_park_extra_magic_morning",
    # the outcome model also adjusts for the current day's EMM
    .outcome_fmla = wait_minutes_posted_avg ~ prev_park_extra_magic_morning + park_extra_magic_morning
  )
}

calculate_coef2(63)
coefs <- purrr::map_dbl(1:63, calculate_coef2)

ggplot(data.frame(coefs = coefs, x = 1:63), aes(x = x, y = coefs)) + 
  geom_hline(yintercept = 0) + 
  geom_point() + 
  geom_smooth() + 
  labs(y = "difference in wait times (minutes)\n on day (i) for EMM on day (i - n)", x = "day (i - n)")

malcolmbarrett modified the milestones: Chapter 22: Machine learning and causal inference, Chapter 21: Sensitivity analysis Jun 15, 2024

malcolmbarrett added the ⚖️ sensitivity analysis label Jun 15, 2024

malcolmbarrett self-assigned this Jun 15, 2024
