feat: add yeo-johnson step #179
Conversation
864474b to c7575a7
I made it partway. It's looking good so far!
```r
...,
role = NA,
trained = FALSE,
lambdas = NULL,
```
If `NULL` is the only possible value here, then it should be removed as an arg to this function. (You can set it to `NULL` in the output and keep it in the constructor function below.)
```r
limits = c(-5, 5),
num_unique = 5,
na_rm = TRUE,
epi_keys_checked = NULL,
```
What are these used for? Probably similar to `lambdas`: if the user can't set them, they shouldn't be exposed. Is this intended to allow for other epikeys besides `geo_value`? Grouping across them? If so, the name and default should match other `step_epi_*()` functions.
They're exposed in the recipes step this is based on. Not sure why that exposes λ if you're not supposed to set it: https://recipes.tidymodels.org/reference/step_YeoJohnson.html
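(For reference, the textbook Yeo-Johnson transform is piecewise in the sign of the input, which is what lets it handle negatives. A minimal Python sketch of the definition, separate from this PR's R implementation:)

```python
import math

def yeo_johnson(x: float, lam: float) -> float:
    """Textbook Yeo-Johnson transform of a single value x with parameter lam."""
    if x >= 0:
        if abs(lam) > 1e-12:            # lam != 0
            return ((x + 1.0) ** lam - 1.0) / lam
        return math.log1p(x)            # lam == 0 branch: log(x + 1)
    else:
        if abs(lam - 2.0) > 1e-12:      # lam != 2
            return -(((-x + 1.0) ** (2.0 - lam) - 1.0) / (2.0 - lam))
        return -math.log1p(-x)          # lam == 2 branch: -log(-x + 1)

# lam = 1 recovers the identity on both branches
print(yeo_johnson(3.0, 1.0))      # 3.0
print(yeo_johnson(-3940.0, 1.0))  # -3940.0
```

Note the transform stays finite for negative inputs at any λ, unlike Box-Cox.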
(I think David's comment was meant for the `lambdas = NULL` comment above? But yeah,

```r
lambdas = NULL,
limits = c(-5, 5),
num_unique = 5,
na_rm = TRUE,
```

are copied from the original step. I don't seriously expect users to want to overwrite with their own lambdas, though, so happy to remove that if we don't care to keep the interface the same.)

`epi_keys_checked` ... I just lazily copied from `step_adjust_ahead` for an easy way to get access to `recipes$template` in `prep` and `bake`. I don't really anticipate grouping by anything other than `geo_value` + `other_keys`, so I'll remove it and find the epikeys another way.
test-yeo-johnson.Rmd (Outdated)
```r
filtered_data %>%
  mutate(cases = log(cases)) %>%
  ggplot(aes(time_value, cases)) +
  geom_line(color = "blue") +
  geom_line(data = out1 %>% mutate(cases = log(cases)),
            aes(time_value, cases), color = "green") +
  geom_line(data = out2 %>% mutate(cases = log(cases)),
            aes(time_value, cases), color = "red") +
  facet_wrap(~geo_value, scales = "free_y") +
  theme_minimal() +
  labs(title = "Yeo-Johnson transformation", x = "Time", y = "Log Cases")
```
Suggested change:

```r
all_together <- rbind(
  filtered_data %>%
    mutate(name = "raw"),
  out1 %>% mutate(name = "yeo-johnson"),
  out2 %>% mutate(name = "quarter-root")
)
all_together %>%
  ggplot(aes(time_value, cases, color = name)) +
  geom_line() +
  facet_grid(~geo_value, scales = "free_y") +
  theme_minimal() +
  labs(title = "Yeo-Johnson transformation", x = "Time", y = "Log Cases") +
  scale_y_log10()
```
This will generate an actual legend and makes toggling log-scale easier.
As for the result of the transform, the difference in scale between NY and CA is a bit confusing tbh. Not sure why NY is scaled so much more aggressively; is it because of the literal actual zero? For practical purposes we'd probably want to smooth this dataset anyways
also the blip where it goes literally negative in CA in early July is a bit concerning
looked into it, it's an actual negative value in the raw signal
```r
> filtered_data %>% filter(geo_value == "ca", time_value == "2021-06-29")
An `epi_df` object, 1 x 3 with metadata:
* geo_type  = state
* time_type = day
* as_of     = 2024-03-20
# A tibble: 1 × 3
  geo_value time_value cases
  <chr>     <date>     <dbl>
1 ca        2021-06-29 -3940
> out1 %>% filter(geo_value == "ca", time_value == "2021-06-29")
An `epi_df` object, 1 x 3 with metadata:
* geo_type  = state
* time_type = day
* as_of     = 2024-03-20
# A tibble: 1 × 3
  geo_value time_value  cases
  <chr>     <date>      <dbl>
1 ca        2021-06-29 -11320.
> out2 %>% filter(geo_value == "ca", time_value == "2021-06-29")
An `epi_df` object, 1 x 3 with metadata:
* geo_type  = state
* time_type = day
* as_of     = 2024-03-20
# A tibble: 1 × 3
  geo_value time_value cases
  <chr>     <date>     <dbl>
1 ca        2021-06-29   NaN
```
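(The NaN in out2 is expected rather than a bug: a real fourth root of a negative number is undefined, whereas Yeo-Johnson has a dedicated branch for x < 0. A quick numpy check of both, using a made-up λ = 0.5 rather than the fitted one:)

```python
import numpy as np

x = -3940.0

# Quarter-root: there is no real fourth root of a negative number, so numpy
# returns nan here, matching the NaN in out2 above.
print(np.power(x, 0.25))

# Yeo-Johnson's x < 0 branch (for lam != 2):
#   -(((-x + 1)^(2 - lam) - 1) / (2 - lam))
# which stays finite for any negative input.
lam = 0.5
yj = -(((-x + 1.0) ** (2.0 - lam) - 1.0) / (2.0 - lam))
print(np.isfinite(yj))
```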
Yeah, it's a huge data anomaly. I wonder if we should fix this test dataset. On the one hand, it's educational of data reality; on the other hand, what the hell.
Yeah, I take it as a demo that it will do ~the right thing in the presence of both positive and negative values; I just wanted to check it was legit and not an artifact of implementation problems.

The difference in scaling for NY & CA is a bit confusing. I forget exactly what the optimization routine is minimizing in its choice of lambda; iirc it's literally the variance? Probably worth noting in the description of the step.
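(If it helps: in scipy's implementation, which I'd guess mirrors what recipes does, λ is chosen by maximizing the normal profile log-likelihood of the transformed data. That objective is -n/2 times the log of the transformed variance plus a (λ − 1) Jacobian term, so "literally the variance" is close but not quite the whole story. A sketch with made-up skewed data:)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # made-up right-skewed data

# With lmbda=None, scipy returns the transformed data and the MLE of lambda.
transformed, lam = stats.yeojohnson(x)

# The objective being maximized is the normal log-likelihood of the
# transformed data: -n/2 * log(variance) plus a (lam - 1) Jacobian term.
# The fitted lambda should beat any other candidate value.
llf_at_mle = stats.yeojohnson_llf(lam, x)
print(lam)
print(llf_at_mle >= stats.yeojohnson_llf(lam + 0.5, x))  # True
```

Differently-scaled series (like NY vs. CA here) can land on quite different λ values under this criterion, which would explain the aggressive scaling.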
* step and layer work with a single outcome and `layer_yj(.pred)`
* need to work on multiple outcomes case
Co-authored-by: Daniel McDonald <[email protected]>
Co-authored-by: David Weber <[email protected]>
Replaced with cmu-delphi/epipredict#451
close cmu-delphi/epipredict#457
TODO: `terms` interface in a layer