Crowd flow confounder example in ML and sensitivity chapters #241

Open
malcolmbarrett opened this issue Jun 15, 2024 · 0 comments

Working on the sensitivity chapter has led me to a bit of a deep dive into using some variables from touringplans::parks_metadata_raw to better capture variables related to the crowd flow at Magic Kingdom. After spending some time with it, I think I'm overloading this section in the sensitivity chapter. This started as me creating two new variables for an alternative DAG (is_weekend and is_holiday) but is now getting a bit too nuanced for this section.

I think a better approach would be to present this more complex confounding structure in the ML chapter, partly to help justify using more flexible modeling approaches. Then, we can pick up that thread later without spending so much time on the idea in the sensitivity chapter.

So, for now, I'm going to stick with the two examples above and rework this in a few months. Here's some copy and code related to that. (I think this should be expanded to be even more sophisticated, e.g. happenings at other parks)

In particular, we want to capture baseline crowd flow. We'll use a few new variables to approximate it: the previous day's wait time at the same hour; the number of schools in session; whether it's a weekend or holiday and, if it's a holiday, how that holiday relates to crowd size; events around the Magic Kingdom, such as fireworks and parades; and a marker of the ride capacity lost due to attraction shutdowns in the park.

Consider the expanded DAG in @fig-dag-extra-days. For simplicity, we present all of these confounders as a single supernode called crowd flow. We assume that each of them is a cause of both whether there is an Extra Magic Morning and wait times.
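As a rough sketch of that structure (not the code behind @fig-dag-extra-days), the supernode version of the DAG could be written with ggdag as below. The node name `crowd_flow` is a hypothetical stand-in for the whole set of crowd-flow variables, and the edges for the other three confounders are an assumption based on the propensity score formula used in this issue.

```r
library(ggdag)

# Hypothetical simplified DAG: `crowd_flow` is one supernode standing in for
# all of the crowd-flow variables (school sessions, holidays, events, etc.).
# The remaining confounders mirror the propensity score formula in this issue.
emm_dag <- dagify(
  wait_minutes_posted_avg ~ park_extra_magic_morning + crowd_flow +
    park_close + park_temperature_high + park_ticket_season,
  park_extra_magic_morning ~ crowd_flow + park_close +
    park_temperature_high + park_ticket_season,
  exposure = "park_extra_magic_morning",
  outcome = "wait_minutes_posted_avg"
)

# Which variables need adjustment under this assumed DAG?
adjustment_sets <- dagitty::adjustmentSets(emm_dag)
adjustment_sets

ggdag(emm_dag) + theme_dag()
```

Under these assumptions, the supernode has to be in the adjustment set, which in practice means adjusting for every variable it stands in for.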

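library(dplyr)
library(readr) # parse_number()
library(ggplot2)
library(touringplans)

# `holidays` and `fit_ipw_effect()` are assumed to be defined earlier in the chapter
metadata <- parks_metadata_raw |> 
  filter(year == 2018) |> 
  select(
    # some of these are precision variables
    date, insession, insession_sqrt_dlr, mkevent, holiday, holidaym, mkprdday, 
    mkprddt1, mkprddt2, mkfirewk, mkfiret1, mkfiret2, 
    # capacitylost_mk may technically be a post-outcome variable;
    # should it be lagged?
    capacitylost_mk
  )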

seven_dwarfs_with_days <- seven_dwarfs_train_2018 |> 
  filter(wait_hour == 9) |> 
  mutate(
    is_holiday = park_date %in% holidays, 
    is_weekend = timeDate::isWeekend(park_date),
    prev_wait = lag(wait_minutes_posted_avg, order_by = park_date)
  ) |> 
  left_join(metadata, by = c("park_date" = "date")) |> 
  mutate(
    insession = parse_number(insession),
    insession_sqrt_dlr = parse_number(insession_sqrt_dlr)
  )

fit_ipw_effect(
  park_extra_magic_morning ~ park_temperature_high +
    park_close + park_ticket_season + is_weekend + insession + insession_sqrt_dlr + 
    mkevent + holiday + holidaym + mkprdday + mkfirewk + capacitylost_mk,
  .data = seven_dwarfs_with_days
)

calculate_coef2 <- function(n_days_lag) {
  distinct_emm <- seven_dwarfs_with_days |> 
    arrange(park_date) |> 
    transmute(
      park_date, 
      prev_park_extra_magic_morning = lag(park_extra_magic_morning, n = n_days_lag),
      prev_park_temperature_high = lag(park_temperature_high, n = n_days_lag),
      prev_park_close = lag(park_close, n = n_days_lag),
      prev_park_ticket_season = lag(park_ticket_season, n = n_days_lag),
      prev_is_weekend = lag(is_weekend, n = n_days_lag),
      prev_insession = lag(insession, n = n_days_lag),
      prev_insession_sqrt_dlr = lag(insession_sqrt_dlr, n = n_days_lag),
      prev_mkevent = lag(mkevent, n = n_days_lag),
      prev_holiday = lag(holiday, n = n_days_lag),
      prev_holidaym = lag(holidaym, n = n_days_lag),
      prev_mkprdday = lag(mkprdday, n = n_days_lag),
      prev_mkfirewk = lag(mkfirewk, n = n_days_lag),
      prev_capacitylost_mk = lag(capacitylost_mk, n = n_days_lag),
    )
  
  seven_dwarfs_with_days_lag <- seven_dwarfs_with_days |> 
    left_join(distinct_emm, by = "park_date") |> 
    filter(!is.na(prev_park_extra_magic_morning))
  
  fit_ipw_effect(
    prev_park_extra_magic_morning ~ prev_park_temperature_high +
      prev_park_close + prev_park_ticket_season + prev_is_weekend +
      prev_insession + prev_insession_sqrt_dlr + prev_mkevent + prev_holiday +
      prev_holidaym + prev_mkprdday + prev_mkfirewk + prev_capacitylost_mk,
    .data = seven_dwarfs_with_days_lag, 
    .trt = "prev_park_extra_magic_morning",
    .outcome_fmla = wait_minutes_posted_avg ~ prev_park_extra_magic_morning + park_extra_magic_morning
  )
}

calculate_coef2(63)
coefs <- purrr::map_dbl(1:63, calculate_coef2)

ggplot(data.frame(coefs = coefs, x = 1:63), aes(x = x, y = coefs)) + 
  geom_hline(yintercept = 0) + 
  geom_point() + 
  geom_smooth() + 
  labs(y = "difference in wait times (minutes)\n on day (i) for EMM on day (i - n)", x = "day (i - n)")
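
The lagged columns above rely on `dplyr::lag()` shifting values by row position after `arrange(park_date)`. A minimal toy example of that pattern follows; the data are made up for illustration and the column names `emm` and `prev_emm` are hypothetical.

```r
library(dplyr)

# Toy data (made up): five consecutive days, deliberately out of order
toy <- tibble(
  park_date = as.Date("2018-01-01") + c(2, 0, 1, 4, 3),
  emm = c(0, 1, 0, 1, 0)
)

lagged <- toy |>
  # lag() works on row position, so sort by date first
  arrange(park_date) |>
  # value of `emm` from two rows (here, two days) earlier
  mutate(prev_emm = lag(emm, n = 2)) |>
  # the first two rows have nothing to look back to
  filter(!is.na(prev_emm))

lagged
```

Because the shift is by rows rather than by calendar days, "n rows back" only means "n days back" if there is exactly one row per date with no gaps, which is worth checking after the `wait_hour == 9` filter.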