Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add yeo-johnson step #179

Closed
wants to merge 20 commits into from
Closed

feat: add yeo-johnson step #179

wants to merge 20 commits into from

Conversation

dshemetov
Copy link
Contributor

@dshemetov dshemetov commented Mar 11, 2025

close cmu-delphi/epipredict#457

  • the lambda parameter fitting is done via 0th order optimization i ripped from recipes::yeo_johnson

TODO

  • layer skeleton in place
  • write inverse yj transform
  • tests
  • slather.layer_epi_YeoJohnson edge cases (see the TODO comment in the function... I'm not clear on what to expect from the terms interface in a layer)
  • get feedback
  • PR into epipredict feat: add step_/layer_ epi_YeoJohnson epipredict#451

@dshemetov dshemetov marked this pull request as ready for review March 14, 2025 20:58
@dshemetov dshemetov requested a review from dsweber2 March 15, 2025 00:56
@dshemetov dshemetov force-pushed the step branch 2 times, most recently from 864474b to c7575a7 Compare March 17, 2025 22:44
@dshemetov dshemetov changed the title wip: add yeo-johnson feat: add yeo-johnson Mar 17, 2025
@dshemetov dshemetov changed the title feat: add yeo-johnson feat: add yeo-johnson step Mar 17, 2025
@dshemetov dshemetov requested a review from dajmcdon March 18, 2025 00:24
Copy link

@dajmcdon dajmcdon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made it partway. It's looking good so far!

...,
role = NA,
trained = FALSE,
lambdas = NULL,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If NULL is the only possible value here, then it should be removed as an arg to this function. (You can set it to NULL in the output and keep it in the constructor function below.)

limits = c(-5, 5),
num_unique = 5,
na_rm = TRUE,
epi_keys_checked = NULL,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are these used for? Probably similar to lambdas that if the user can't set them, they shouldn't be exposed.

Is this intended to allow for other epikeys beside geo_value? Grouping across them? If so, the name and default should match other step_epi_*() functions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're exposed in the recipes step this is based on. Not sure why that exposes λ if you're not supposed to set it https://recipes.tidymodels.org/reference/step_YeoJohnson.html

Copy link
Contributor Author

@dshemetov dshemetov Mar 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I think David's comment was meant for the lambdas = NULL comment above?

But yea,

lambdas = NULL,
  limits = [c](https://rdrr.io/r/base/c.html)(-5, 5),
  num_unique = 5,
  na_rm = TRUE,

are copied from the original step. I don't seriously expect users to want to overwrite with their own lambdas though, so happy to remove that, if we don't care to keep the interface the same.)

epi_keys_checked ... I just lazily copied from step_adjust_ahead for an easy way to get access to recipes$template in prep and bake. I don't really anticipate grouping by anything other than geo_value + other_keys, so I'll remove it and find the epikeys another way.

Comment on lines 74 to 84
filtered_data %>%
mutate(cases = log(cases)) %>%
ggplot(aes(time_value, cases)) +
geom_line(color = "blue") +
geom_line(data = out1 %>% mutate(cases = log(cases)),
aes(time_value, cases), color = "green") +
geom_line(data = out2 %>% mutate(cases = log(cases)),
aes(time_value, cases), color = "red") +
facet_wrap(~geo_value, scales = "free_y") +
theme_minimal() +
labs(title = "Yeo-Johnson transformation", x = "Time", y = "Log Cases")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
filtered_data %>%
mutate(cases = log(cases)) %>%
ggplot(aes(time_value, cases)) +
geom_line(color = "blue") +
geom_line(data = out1 %>% mutate(cases = log(cases)),
aes(time_value, cases), color = "green") +
geom_line(data = out2 %>% mutate(cases = log(cases)),
aes(time_value, cases), color = "red") +
facet_wrap(~geo_value, scales = "free_y") +
theme_minimal() +
labs(title = "Yeo-Johnson transformation", x = "Time", y = "Log Cases")
all_together <- rbind(
filtered_data %>%
mutate(name = "raw"),
out1 %>% mutate(name = "yeo-johnson"),
out2 %>% mutate(name = "quarter-root")
)
all_together %>%
ggplot(aes(time_value, cases, color = name)) +
geom_line() +
facet_grid(~geo_value, scales = "free_y") +
theme_minimal() +
labs(title = "Yeo-Johnson transformation", x = "Time", y = "Log Cases") +
scale_y_log10()

This will generate an actual legend and makes toggling log-scale easier.

As for the result of the transform, the difference in scale between NY and CA is a bit confusing tbh. Not sure why NY is scaled so much more aggressively; is it because of the literal actual zero? For practical purposes we'd probably want to smooth this dataset anyways

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also the blip where it goes literally negative in CA in early July is a bit concerning

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looked into it, it's an actual negative value in the raw signal

filtered_data %>% filter(geo_value == "ca", time_value == "2021-06-29")
An `epi_df` object, 1 x 3 with metadata:
* geo_type  = state
* time_type = day
* as_of     = 2024-03-20

# A tibble: 1 × 3
  geo_value time_value cases
  <chr>     <date>     <dbl>
1 ca        2021-06-29 -3940
> out1 %>% filter(geo_value == "ca", time_value == "2021-06-29")
An `epi_df` object, 1 x 3 with metadata:
* geo_type  = state
* time_type = day
* as_of     = 2024-03-20

# A tibble: 1 × 3
  geo_value time_value   cases
  <chr>     <date>       <dbl>
1 ca        2021-06-29 -11320.
> out2 %>% filter(geo_value == "ca", time_value == "2021-06-29")
An `epi_df` object, 1 x 3 with metadata:
* geo_type  = state
* time_type = day
* as_of     = 2024-03-20

# A tibble: 1 × 3
  geo_value time_value cases
  <chr>     <date>     <dbl>
1 ca        2021-06-29   NaN

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea it's a huge data anomaly. I wonder if we should fix this test dataset. On the one hand, it's educational of data reality, on the other hand, what the hell.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I take as a demo that it will do ~ the right thing in the presence of both positive and negative values, just wanted to check it was legit and not an artifact of implementation problems.

The difference in scaling for NY & CA is a bit confusing. I forget exactly what the optimization routine is minimizing in its choice of lambda; iirc it's literally the variance? Probably worth noting in the description of the step

@dshemetov dshemetov requested a review from dajmcdon March 24, 2025 20:52
@dshemetov
Copy link
Contributor Author

Replaced with cmu-delphi/epipredict#451

@dshemetov dshemetov closed this Mar 31, 2025
@dshemetov dshemetov deleted the step branch March 31, 2025 19:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Yeo-Johnson
3 participants