-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a data dictionary #315
base: 2025-assessment-year
Are you sure you want to change the base?
Changes from 8 commits
8328e9c
65cd2eb
46c163b
7994e4f
3f33931
5fb7b56
69d176e
ffbc73d
2e300ec
95db04c
ad157a1
87ba713
7a88504
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -19,6 +19,7 @@ cache/ | |
*.rds | ||
*.zip | ||
*.csv | ||
!docs/data-dict.csv | ||
*.xlsx | ||
*.xlsm | ||
*.html | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
#!/usr/bin/env Rscript | ||
# Script to check that the data dictionary file is up to date with the | ||
# latest feature set | ||
library(yaml) | ||
|
||
params_filename <- "params.yaml" | ||
data_dict_filename <- "docs/data-dict.csv" | ||
|
||
params <- read_yaml(params_filename) | ||
data_dict <- read.csv(data_dict_filename) | ||
|
||
symmetric_diff <- c( | ||
setdiff(data_dict$variable_name, params$model$predictor$all), | ||
setdiff(params$model$predictor$all, data_dict$variable_name) | ||
) | ||
symmetric_diff_len <- length(symmetric_diff) | ||
|
||
if (symmetric_diff_len > 0) { | ||
err_msg_prefix <- ifelse(symmetric_diff_len == 1, "Param is", "Params are") | ||
err_msg <- paste0( | ||
err_msg_prefix, | ||
" not present in both ", | ||
params_filename, | ||
" and ", | ||
data_dict_filename, | ||
": ", | ||
paste(symmetric_diff, collapse = ", ") | ||
) | ||
stop(err_msg) | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -231,10 +231,13 @@ Model accuracy for each parameter combination is measured on a validation set us | |
|
||
### Features Used | ||
|
||
The residential model uses a variety of individual and aggregate features to determine a property's assessed value. We've tested a long list of possible features over time, including [walk score](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/county_walkscore.html), [crime rate](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/chicago_crimerate.html), [school districts](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/county_school_boundaries_mean_encoded.html), and many others. The features in the table below are the ones that made the cut. They're the right combination of easy to understand and impute, powerfully predictive, and well-behaved. Most of them are in use in the model as of `r Sys.Date()`. | ||
The residential model uses a variety of individual and aggregate features to determine a property's assessed value. We've tested a long list of possible features over time, including [walk score](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/county_walkscore.html), [crime rate](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/chicago_crimerate.html), [school districts](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/county_school_boundaries_mean_encoded.html), and many others. The features in the table below are the ones that made the cut. They're the right combination of easy to understand and impute, powerfully predictive, and well-behaved. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The only change to this paragraph is removing this line:
The first half of this line seems inaccurate to me (all of these features are in use in the model) and not particularly helpful (this table represents the most recent version of the parameters, so the date is not useful). Happy to keep one or both of these pieces of info if you think there's a good reason for them, though. |
||
|
||
For a machine-readable version of this data dictionary, see [`docs/data-dict.csv`](./docs/data-dict.csv). | ||
|
||
```{r feature_guide, message=FALSE, results='asis', echo=FALSE} | ||
library(dplyr) | ||
library(readr) | ||
library(tidyr) | ||
library(yaml) | ||
library(jsonlite) | ||
|
@@ -316,35 +319,50 @@ param_notes <- param_tbl$value %>% | |
)) %>% | ||
unlist() | ||
|
||
ccao::vars_dict %>% | ||
inner_join( | ||
param_tbl %>% mutate(description = param_notes), | ||
by = c("var_name_model" = "value") | ||
param_tbl_fmt <- param_tbl %>% | ||
mutate(description = param_notes) %>% | ||
left_join( | ||
ccao::vars_dict, | ||
by = c("value" = "var_name_model") | ||
Comment on lines
+322
to
+326
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I switched this up so that the params are the left side of the join here, which feels more intuitive to me than |
||
) %>% | ||
group_by(var_name_pretty) %>% | ||
mutate(row = paste0("X", row_number())) %>% | ||
distinct( | ||
`Feature Name` = var_name_pretty, | ||
Category = var_type, | ||
Type = var_data_type, | ||
Notes = description, | ||
feature_name = var_name_pretty, | ||
variable_name = value, | ||
description, | ||
category = var_type, | ||
type = var_data_type, | ||
Comment on lines
+331
to
+335
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it's clearer for the CSV to use lowercase and underscored column names, so we start with those and then reformat them when rendering the table to the README. |
||
var_value, row | ||
) %>% | ||
mutate(Category = recode( | ||
Category, | ||
mutate(category = recode( | ||
category, | ||
char = "Characteristic", acs5 = "ACS5", loc = "Location", | ||
prox = "Proximity", ind = "Indicator", time = "Time", | ||
meta = "Meta", other = "Other", ccao = "Other" | ||
meta = "Meta", other = "Other", ccao = "Other", shp = "Parcel Shape" | ||
)) %>% | ||
pivot_wider( | ||
id_cols = `Feature Name`:`Notes`, | ||
id_cols = `feature_name`:`category`, | ||
names_from = row, | ||
values_from = var_value | ||
) %>% | ||
unite("Possible Values", starts_with("X"), sep = ", ", na.rm = TRUE) %>% | ||
mutate(Notes = replace_na(Notes, "")) %>% | ||
arrange(Category) %>% | ||
relocate(Notes, .after = everything()) %>% | ||
unite("possible_values", starts_with("X"), sep = ", ", na.rm = TRUE) %>% | ||
mutate(description = replace_na(description, "")) %>% | ||
arrange(category) | ||
|
||
# Write machine-readable version of the table to file | ||
param_tbl_fmt %>% | ||
write_csv("docs/data-dict.csv") | ||
Comment on lines
+353
to
+355
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This seemed to me like the simplest way to keep the data dict up to date: Any time we render the README, we'll save the data dict to the file. If model parameters haven't changed, the data dict file won't change, and there won't be a diff; otherwise, there will be a diff and the code author will be prompted to commit it. Not the most airtight system, but I figure it's probably a good enough starting place. Let me know if you have other ideas! Perhaps out of scope for now, but we could also consider adding a pre-commit check similar to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The pre-commit hook was pretty straightforward so I went ahead and implemented it in 46c163b. |
||
|
||
# Render human-readable version of the table to the doc | ||
param_tbl_fmt %>% | ||
rename( | ||
"Feature Name" = "feature_name", | ||
"Variable Name" = "variable_name", | ||
"Description" = "description", | ||
"Category" = "category", | ||
"Possible Values" = "possible_values" | ||
) %>% | ||
knitr::kable(format = "markdown") | ||
``` | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also considered adding a hook to make sure that
ccao::vars_dict
contains every feature in the param file, and that the dbt DAG has a description for every feature. This would help guard against a situation where we neglect to update the table property, but it has the downside of enforcing a commit-level check that we could only resolve by updating external dependencies, which I expect would lower our velocity during modeling. We could instead think about making this a CI check that only runs on tags, but I've skipped that here since it feels out of scope.