Add a data dictionary #315

Draft
wants to merge 13 commits into
base: 2025-assessment-year
Changes from 2 commits
1 change: 1 addition & 0 deletions .gitignore
@@ -19,6 +19,7 @@ cache/
*.rds
*.zip
*.csv
+ !docs/data-dict.csv
*.xlsx
*.xlsm
*.html
50 changes: 34 additions & 16 deletions README.Rmd
@@ -231,10 +231,13 @@ Model accuracy for each parameter combination is measured on a validation set us

### Features Used

- The residential model uses a variety of individual and aggregate features to determine a property's assessed value. We've tested a long list of possible features over time, including [walk score](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/county_walkscore.html), [crime rate](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/chicago_crimerate.html), [school districts](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/county_school_boundaries_mean_encoded.html), and many others. The features in the table below are the ones that made the cut. They're the right combination of easy to understand and impute, powerfully predictive, and well-behaved. Most of them are in use in the model as of `r Sys.Date()`.
+ The residential model uses a variety of individual and aggregate features to determine a property's assessed value. We've tested a long list of possible features over time, including [walk score](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/county_walkscore.html), [crime rate](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/chicago_crimerate.html), [school districts](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/county_school_boundaries_mean_encoded.html), and many others. The features in the table below are the ones that made the cut. They're the right combination of easy to understand and impute, powerfully predictive, and well-behaved.
Contributor Author

The only change to this paragraph is removing this line:

> Most of them are in use in the model as of `r Sys.Date()`.

The first half of this line seems inaccurate to me (all of these features are in use in the model) and not particularly helpful (this table represents the most recent version of the parameters, so the date is not useful). Happy to keep one or both of these pieces of info if you think there's a good reason for them, though.


+ For a machine-readable version of this data dictionary, see [`docs/data-dict.csv`](./docs/data-dict.csv).

```{r feature_guide, message=FALSE, results='asis', echo=FALSE}
library(dplyr)
library(readr)
library(tidyr)
library(yaml)
library(jsonlite)
@@ -316,35 +319,50 @@ param_notes <- param_tbl$value %>%
)) %>%
unlist()

- ccao::vars_dict %>%
- inner_join(
- param_tbl %>% mutate(description = param_notes),
- by = c("var_name_model" = "value")
+ param_tbl_fmt <- param_tbl %>%
+ mutate(description = param_notes) %>%
+ left_join(
+ ccao::vars_dict,
+ by = c("value" = "var_name_model")
Comment on lines +322 to +326
Contributor Author

I switched this up so that the params are the left side of the join here, which feels more intuitive to me than vars_dict being the left side. We also use a left join so that we'll preserve all of the parameters, even if one happens to be misdocumented in vars_dict in the future.
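
To illustrate the difference this comment describes, here is a minimal sketch using made-up stand-ins for `param_tbl` and `ccao::vars_dict`. The two small tables and the name `new_undocumented_feature` are hypothetical; only the join shapes mirror the diff.

```r
library(dplyr)

# Hypothetical stand-in for param_tbl: one row per model parameter
params <- tibble(
  value = c("char_yrblt", "loc_latitude", "new_undocumented_feature")
)

# Hypothetical stand-in for ccao::vars_dict: the third parameter is missing
dict <- tibble(
  var_name_model = c("char_yrblt", "loc_latitude"),
  var_name_pretty = c("Year Built", "Latitude")
)

# Old shape: dictionary on the left, inner join drops unmatched parameters
inner <- inner_join(dict, params, by = c("var_name_model" = "value"))

# New shape: parameters on the left, left join keeps every parameter
left <- left_join(params, dict, by = c("value" = "var_name_model"))

# Parameters that the dictionary does not document show up with NA fields
undocumented <- left %>%
  filter(is.na(var_name_pretty)) %>%
  pull(value)
```

With these inputs, `inner` has two rows while `left` has three, and `undocumented` contains the one parameter that the dictionary lacks, so a documentation gap surfaces as a row with missing fields rather than a silently dropped row.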

) %>%
group_by(var_name_pretty) %>%
mutate(row = paste0("X", row_number())) %>%
distinct(
- `Feature Name` = var_name_pretty,
- Category = var_type,
- Type = var_data_type,
- Notes = description,
+ feature_name = var_name_pretty,
+ variable_name = value,
+ description,
+ category = var_type,
+ type = var_data_type,
Comment on lines +331 to +335
Contributor Author

I think it's clearer for the CSV to use lowercase and underscored column names, so we start with those and then reformat them when rendering the table to the README.
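
As a small sketch of that pattern, assuming a made-up one-row table: the data keeps snake_case names for the machine-readable output, and display names are applied only on the copy that gets rendered.

```r
library(dplyr)

# Hypothetical one-row table using the snake_case names from the diff
dict_tbl <- tibble(
  feature_name = "Year Built",
  variable_name = "char_yrblt"
)

# The snake_case table is what would be written to CSV unchanged

# Rename only the copy that is rendered for human readers
display_tbl <- dict_tbl %>%
  rename(
    "Feature Name" = "feature_name",
    "Variable Name" = "variable_name"
  )

names(dict_tbl)
names(display_tbl)
```

Because `rename()` returns a new table, `dict_tbl` keeps its original column names after the display copy is made.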

var_value, row
) %>%
- mutate(Category = recode(
- Category,
+ mutate(category = recode(
+ category,
char = "Characteristic", acs5 = "ACS5", loc = "Location",
prox = "Proximity", ind = "Indicator", time = "Time",
meta = "Meta", other = "Other", ccao = "Other"
)) %>%
pivot_wider(
- id_cols = `Feature Name`:`Notes`,
+ id_cols = `feature_name`:`category`,
names_from = row,
values_from = var_value
) %>%
- unite("Possible Values", starts_with("X"), sep = ", ", na.rm = TRUE) %>%
- mutate(Notes = replace_na(Notes, "")) %>%
- arrange(Category) %>%
- relocate(Notes, .after = everything()) %>%
+ unite("possible_values", starts_with("X"), sep = ", ", na.rm = TRUE) %>%
+ mutate(description = replace_na(description, "")) %>%
+ arrange(category)
+
+ # Write machine-readable version of the table to file
+ param_tbl_fmt %>%
+ write_csv("docs/data-dict.csv")
Comment on lines +353 to +355
Contributor Author

@jeancochrane, Jan 8, 2025

This seemed to me like the simplest way to keep the data dict up to date: Any time we render the README, we'll save the data dict to the file. If model parameters haven't changed, the data dict file won't change, and there won't be a diff; otherwise, there will be a diff and the code author will be prompted to commit it. Not the most airtight system, but I figure it's probably a good enough starting place. Let me know if you have other ideas!

Perhaps out of scope for now, but we could also consider adding a pre-commit check similar to readme-rmd-rendered that compares the params in params.yml to the params in this file to make sure they match. I'm happy to take a crack at that now if you think it's a good idea.

Contributor Author

The pre-commit hook was pretty straightforward so I went ahead and implemented it in 46c163b.
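
For readers who want a feel for what such a check involves, here is a rough sketch of the comparison step. It is not the hook from that commit: the function name `compare_params_to_dict` and the sample inputs are hypothetical, and how the parameter names are read from the params file is left out because the file's structure is not shown here.

```r
# Hypothetical helper, not the hook from the PR: compares two character
# vectors of variable names and reports any mismatch in either direction
compare_params_to_dict <- function(param_names, dict_names) {
  missing_from_dict <- setdiff(param_names, dict_names)
  extra_in_dict <- setdiff(dict_names, param_names)
  list(
    in_sync = length(missing_from_dict) == 0 && length(extra_in_dict) == 0,
    missing_from_dict = missing_from_dict,
    extra_in_dict = extra_in_dict
  )
}

# Made-up inputs; in practice these would come from the params file and
# from the variable_name column of docs/data-dict.csv
result <- compare_params_to_dict(
  param_names = c("char_yrblt", "loc_latitude", "new_feature"),
  dict_names = c("char_yrblt", "loc_latitude")
)

if (!result$in_sync) {
  message("Data dictionary is out of date; re-render README.Rmd")
  # A real hook would exit with a non-zero status at this point
}
```

Checking both directions catches a parameter that was added without re-rendering as well as a stale dictionary row left behind after a parameter was removed.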


+ # Render human-readable version of the table to the doc
+ param_tbl_fmt %>%
+ rename(
+ "Feature Name" = "feature_name",
+ "Variable Name" = "variable_name",
+ "Description" = "description",
+ "Category" = "category",
+ "Possible Values" = "possible_values"
+ ) %>%
knitr::kable(format = "markdown")
```
