Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First draft of Reporting Source of Truth™ #496

Open
wants to merge 56 commits into
base: master
Choose a base branch
from

Conversation

wrridgeway
Copy link
Member

@wrridgeway wrridgeway commented Jun 6, 2024

Notes

  • Ignoring av_quintile as a grouping for now since it doesn't apply to some tables and interacts oddly with class groupings.
  • PySpark is brutal when it comes to column types. Some of these columns are probably not the best types (primarily doubles that should be ints), but it was getting pretty awful rebuilding everything because pysark refused to let a np.int64 get cast as a bigint (this is at least partly due to issues with NA/nan/None). Nullable booleans are also currently an issue with reassessment_year.
  • Not sure what the best way to do delta columns for tables with stages is - right now we compare BOR 2020 to BOR 2019, etc.

Sales

  • Prior AVs seems a little weird since this isn't sale-level?
  • I'm not sure how to classify sales as "Valid" or "Invalid" based on our current "Outlier" schema in vw_pin_sale.

Ratios

  • Need to hammer out exactly what our SOPs should be and what min samples should be without them.

Priorities moving forward

  1. Sort out runner memory issues (or, figure out a way around them, possibly by looping through data by year, though this will take a long time)
  2. Improve column types
  3. Code cleanup. The code isn't awful, but there are some specific portions that could absolutely be consolidated into loops or other more efficient methods. I gave up trying to implement these improvements for the sake of delivering an MVP, but it's pretty low hanging fruit.
  4. Performance improvements

@wrridgeway wrridgeway linked an issue Jun 6, 2024 that may be closed by this pull request
@wrridgeway wrridgeway marked this pull request as ready for review July 8, 2024 13:15
@wrridgeway wrridgeway requested a review from a team as a code owner July 8, 2024 13:15
@wrridgeway wrridgeway changed the title First draft of sales script First draft of source of truth™ Jul 8, 2024
@wrridgeway wrridgeway changed the title First draft of source of truth™ First draft of Source of Truth™ Jul 8, 2024
@wrridgeway wrridgeway self-assigned this Jul 8, 2024
@wrridgeway wrridgeway changed the title First draft of Source of Truth™ First draft of Reporting Source of Truth™ Jul 8, 2024
Copy link
Member

@dfsnow dfsnow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wrridgeway This is an awesome first pass at this! All the pieces are here, just needs a bit more abstraction to speed/DRY things up:

  • I'd recommend trying out the Spark pandas API on the tables that don't use assesspy. I think you can get it going pretty easily just using the drop-in Spark version of the pandas API.
  • We need to figure out a way to abstract out some of this shared code. The most straightforward option is probably a shared, CCAO-specific Python package.

@jeancochrane I think you should also take a quick look at this to see if there are any obvious design improvements we could make.

Comment on lines +36 to +39
Table to feed the Python dbt job that creates the
`reporting.sot_assessment_roll` table. Feeds public reporting assets.

**Primary Key**: `year`, `stage_name`, `geography_id`, `group_id`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (blocking): This is the same as the description for sot_assesssment_roll_input. Let's change the actual sot_ non-input table descriptions to include the table's purpose, structure, and a short description of how to use the table.

**Primary Key**: `year`, `stage_name`, `geography_id`, `group_id`
{% enddocs %}

# sot_ratio_stats
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (blocking): No other dbt models have plural names, let's stick to that convention. So:

  • sot_ratio_stats --> sot_ratio_stat
  • sot_sales --> sot_sale

etc.

print(geography_type, group_type)

group = [geography_type, group_type, "year", "stage_name"]
summary = data.groupby(group).agg(stats).round(2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: It's sort of bad practice to use something from the global environment inside a function like this. Just pass the stats dictionary in the function arguments.

Comment on lines +11 to +48
# Define aggregation functions. These are just wrappers for basic python
# functions that make using them easier to use with pandas.agg().
def q10(x):
return x.quantile(0.1)


def q25(x):
return x.quantile(0.25)


def q75(x):
return x.quantile(0.75)


def q90(x):
return x.quantile(0.9)


def first(x):
if len(x) >= 1:
output = x.iloc[0]
else:
output = None

return output


more_stats = [
"min",
q10,
q25,
"median",
q75,
q90,
"max",
"mean",
"sum",
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: It would be nice to abstract out some of the stuff like this into a shared script. Maybe we can move it into a dedicated ccao Python package. @jeancochrane What do you think?

output["av_tot_count"] / output["av_tot_size"]
)

output = output.sort_values("year")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Is it enough to sort by year when doing the lag below? Shouldn't you first sort by whatever grouping is used in the lag?

Comment on lines +94 to +96
# Remove groups that only have one sale since we can't calculate stats
data = data.dropna(subset=["sale_price"])
data = data[data["sale_n_tot"] >= 20]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: There's a disconnect between the comment and code here.

Comment on lines +94 to +96
# Remove groups that only have one sale since we can't calculate stats
data = data.dropna(subset=["sale_price"])
data = data[data["sale_n_tot"] >= 20]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: We also drop the distribution tails in reporting.ratio_stats. We should discuss whether or not that's something we should continue in this view.

Comment on lines +54 to +55
AS
census_congressional_district,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: Try to condense cases like this to one line:

Suggested change
AS
census_congressional_district,
AS census_congressional_district,

Comment on lines +31 to +32
CAST(sf.char_bldg_sf AS INT) AS sale_char_bldg_sf,
CAST(sf.char_land_sf AS INT) AS sale_char_land_sf,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: Need to clarify somehow (column name?) that this uses the total square footage of all cards.

Comment on lines +99 to +102
AND NOT sales.is_multisale
AND NOT sales.sale_filter_deed_type
AND NOT sales.sale_filter_less_than_10k
AND NOT sales.sale_filter_same_sale_within_365
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: We should use the new sales val filters here, BUT we should also try to get a count of outliers in the sot_sales model. That means we would include outliers here, but exclude them later in the Python model (after they're counted).

@dfsnow dfsnow requested a review from jeancochrane August 3, 2024 04:23
Copy link
Contributor

@jeancochrane jeancochrane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good so far! I didn't give the logic a super thorough review, I was mostly looking at infra. Agreed with Dan that we should factor out shared logic and publish it in a Python ccao package so that we can share it between models; otherwise I think this approach looks right to me. Let me know if you want any help with the factoring and packaging component.


df = assemble(input, geos=geos, groups=groups)

schema = (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Thought, non-blocking] Alternatively, if it turns out we do have to manage this schema directly, I think it would be more readable/maintainable if we defined it as a dictionary and then serialized it as a string for Spark, e.g.:

schema = {
    "geography_type": "string",
    "geography_id": "string",
    "geography_data_year": "string",
    # etc.
    ...
}

spark_df = spark_session.createDataFrame(
    df,
    schema=", ".join(f"{key}: {val}" for key, val in schema.items())
)

Comment on lines +110 to +113
AND NOT sales.is_multisale
AND NOT sales.sale_filter_deed_type
AND NOT sales.sale_filter_less_than_10k
AND NOT sales.sale_filter_same_sale_within_365
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wrridgeway you mentioned being confused about how to interpret the outlier status in vw_pin_sale, I think we just want to use sale_filter_is_outlier since it also handles pre-2014 sales:

COALESCE(sales_val.sv_is_outlier, FALSE) AS sale_filter_is_outlier,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reporting SoT
3 participants