First draft of Reporting Source of Truth™ #496
base: master
Conversation
@wrridgeway This is an awesome first pass at this! All the pieces are here, just needs a bit more abstraction to speed/DRY things up:
- I'd recommend trying out the Spark pandas API on the tables that don't use assesspy. I think you can get it going pretty easily just using the drop-in Spark version of the pandas API (see the sketch after this comment).
- We need to figure out a way to abstract out some of this shared code. The most straightforward option is probably a shared, CCAO-specific Python package.
@jeancochrane I think you should also take a quick look at this to see if there are any obvious design improvements we could make.
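
For reference, a minimal sketch of the drop-in pandas-on-Spark API mentioned in the first bullet. It assumes Spark >= 3.2, and the table and column names are placeholders borrowed from this PR, not the model's actual inputs:

```python
# Minimal pandas-on-Spark sketch; table and column names are illustrative only.
import pyspark.pandas as ps

# Read an existing warehouse table into a pandas-on-Spark DataFrame
psdf = ps.read_table("reporting.sot_assessment_roll_input")

# Familiar pandas idioms, executed on Spark
summary = (
    psdf.groupby(["year", "stage_name"])
    .agg({"av_tot": "mean"})
    .reset_index()
)
```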
Table to feed the Python dbt job that creates the
`reporting.sot_assessment_roll` table. Feeds public reporting assets.

**Primary Key**: `year`, `stage_name`, `geography_id`, `group_id`
issue (blocking): This is the same as the description for `sot_assessment_roll_input`. Let's change the actual `sot_` non-input table descriptions to include the table's purpose, structure, and a short description of how to use the table.
**Primary Key**: `year`, `stage_name`, `geography_id`, `group_id`
{% enddocs %}

# sot_ratio_stats
issue (blocking): No other dbt models have plural names, let's stick to that convention. So:

- `sot_ratio_stats` --> `sot_ratio_stat`
- `sot_sales` --> `sot_sale`
- etc.
print(geography_type, group_type)

group = [geography_type, group_type, "year", "stage_name"]
summary = data.groupby(group).agg(stats).round(2)
nitpick: It's sort of bad practice to use something from the global environment inside a function like this. Just pass the `stats` dictionary in the function arguments.
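
To illustrate, a quick sketch of what that could look like. The function name and example arguments here are hypothetical, not the actual model code:

```python
# Sketch only: take the aggregation spec as an argument instead of reading a
# global. `summarize` and the example values below are illustrative names;
# `data` is assumed to be the model's pandas DataFrame from the surrounding code.
def summarize(data, geography_type, group_type, stats):
    group = [geography_type, group_type, "year", "stage_name"]
    return data.groupby(group).agg(stats).round(2)


# The caller passes in the same dictionary that previously lived in the
# global environment
summary = summarize(
    data, "geography_id", "group_id", stats={"av_tot": ["mean", "median"]}
)
```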
# Define aggregation functions. These are just wrappers for basic Python
# functions that make them easier to use with pandas .agg().
def q10(x):
    return x.quantile(0.1)


def q25(x):
    return x.quantile(0.25)


def q75(x):
    return x.quantile(0.75)


def q90(x):
    return x.quantile(0.9)


def first(x):
    if len(x) >= 1:
        output = x.iloc[0]
    else:
        output = None

    return output


more_stats = [
    "min",
    q10,
    q25,
    "median",
    q75,
    q90,
    "max",
    "mean",
    "sum",
]
thought: It would be nice to abstract out some of the stuff like this into a shared script. Maybe we can move it into a dedicated `ccao` Python package. @jeancochrane What do you think?
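
To make that concrete, here is one hedged sketch of what a shared helper module might look like. The `ccao/stats.py` path and all names are hypothetical; nothing like this is published yet:

```python
# Hypothetical ccao/stats.py -- sketch of the shared module discussed above.
import pandas as pd


def quantile(q):
    """Build a named quantile aggregator that pandas .agg() can label nicely."""
    def agg(x: pd.Series):
        return x.quantile(q)

    # pandas uses __name__ for the output column label, e.g. q10, q25, ...
    agg.__name__ = f"q{round(q * 100)}"
    return agg


# Equivalent of the `more_stats` list currently defined inline in the model
DEFAULT_STATS = [
    "min", quantile(0.1), quantile(0.25), "median",
    quantile(0.75), quantile(0.9), "max", "mean", "sum",
]
```

Each Python dbt model could then import `DEFAULT_STATS` (and the `first` helper) from the package instead of redefining the wrappers.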
output["av_tot_count"] / output["av_tot_size"] | ||
) | ||
|
||
output = output.sort_values("year") |
question: Is it enough to sort by year when doing the lag below? Shouldn't you first sort by whatever grouping is used in the lag?
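
For illustration, a sketch of sorting within the lag's grouping first. The grouping and column names here are assumptions based on the groupings used elsewhere in this PR, not the model's exact code:

```python
# Sketch: sort by the lag's grouping columns plus year, then shift within
# groups. Column names are assumed, not taken from the actual model.
group_cols = ["geography_type", "geography_id", "group_id", "stage_name"]
output = output.sort_values(group_cols + ["year"])
output["av_tot_prior"] = output.groupby(group_cols)["av_tot"].shift(1)
```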
# Remove groups that only have one sale since we can't calculate stats
data = data.dropna(subset=["sale_price"])
data = data[data["sale_n_tot"] >= 20]
issue: There's a disconnect between the comment and code here.
issue: We also drop the distribution tails in `reporting.ratio_stats`. We should discuss whether or not that's something we should continue in this view.
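
For reference, a rough sketch of the kind of tail trimming being discussed. The percentile bounds and column names are illustrative assumptions; check `reporting.ratio_stats` for the actual logic:

```python
# Sketch: drop sales whose ratio falls in the distribution tails within each
# group. Bounds and column names are illustrative, not the view's parameters.
def trim_tails(df, group_cols, ratio_col="ratio", lower=0.05, upper=0.95):
    in_bounds = df.groupby(group_cols)[ratio_col].transform(
        lambda x: x.between(x.quantile(lower), x.quantile(upper))
    )
    return df[in_bounds]
```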
AS
    census_congressional_district,
nitpick: Try to condense cases like this to one line:

AS census_congressional_district,
CAST(sf.char_bldg_sf AS INT) AS sale_char_bldg_sf,
CAST(sf.char_land_sf AS INT) AS sale_char_land_sf,
todo: Need to clarify somehow (column name?) that this uses the total square footage of all cards.
AND NOT sales.is_multisale
AND NOT sales.sale_filter_deed_type
AND NOT sales.sale_filter_less_than_10k
AND NOT sales.sale_filter_same_sale_within_365
issue: We should use the new sales val filters here, BUT we should also try to get a count of outliers in the `sot_sales` model. That means we would include outliers here, but exclude them later in the Python model (after they're counted).
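
A hedged sketch of that count-then-exclude flow. The flag and grouping names are assumptions pulled from elsewhere in this PR, not the model's actual code:

```python
# Sketch: count flagged outliers per group before dropping them so the counts
# can feed sot_sales. Names are assumptions based on this PR.
group = [geography_type, group_type, "year", "stage_name"]

# 1. Count outliers while they are still in the data
outlier_counts = (
    data.groupby(group)["sale_filter_is_outlier"]
    .sum()
    .rename("sale_n_outlier")
    .reset_index()
)

# 2. Only then exclude them from the stats calculations
data = data[~data["sale_filter_is_outlier"]]
```

The counts could then be joined back onto the per-group summary on the same grouping columns.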
Looking good so far! I didn't give the logic a super thorough review, I was mostly looking at infra. Agreed with Dan that we should factor out shared logic and publish it in a Python `ccao` package so that we can share it between models; otherwise I think this approach looks right to me. Let me know if you want any help with the factoring and packaging component.
df = assemble(input, geos=geos, groups=groups)

schema = (
[Thought, non-blocking] Alternatively, if it turns out we do have to manage this schema directly, I think it would be more readable/maintainable if we defined it as a dictionary and then serialized it as a string for Spark, e.g.:

schema = {
    "geography_type": "string",
    "geography_id": "string",
    "geography_data_year": "string",
    # etc.
}

spark_df = spark_session.createDataFrame(
    df,
    schema=", ".join(f"{key}: {val}" for key, val in schema.items())
)
AND NOT sales.is_multisale
AND NOT sales.sale_filter_deed_type
AND NOT sales.sale_filter_less_than_10k
AND NOT sales.sale_filter_same_sale_within_365
@wrridgeway you mentioned being confused about how to interpret the outlier status in `vw_pin_sale`; I think we just want to use `sale_filter_is_outlier` since it also handles pre-2014 sales:

COALESCE(sales_val.sv_is_outlier, FALSE) AS sale_filter_is_outlier,
Notes

- `av_quintile` is excluded as a grouping for now since it doesn't apply to some tables and interacts oddly with class groupings.
- `np.int64` values get cast as a `bigint` (this is at least partly due to issues with `NA`/`nan`/`None`). Nullable booleans are also currently an issue with `reassessment_year`.

Sales

Ratios

Priorities moving forward