First draft of Reporting Source of Truth™ #496

wrridgeway · 2024-06-06T18:01:33Z

Notes

Ignoring av_quintile as a grouping for now since it doesn't apply to some tables and interacts oddly with class groupings.
PySpark is brutal when it comes to column types. Some of these columns are probably not the best types (primarily doubles that should be ints), but it was getting pretty awful rebuilding everything because PySpark refused to let an np.int64 get cast as a bigint (this is at least partly due to issues with NA/nan/None). Nullable booleans are also currently an issue with reassessment_year.
Not sure what the best way to do delta columns for tables with stages is - right now we compare BOR 2020 to BOR 2019, etc.
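
One possible way to sidestep the NA-related casting issues described above is to settle types on the pandas side before handing anything to Spark. The sketch below is an assumption, not what this PR does: it uses pandas' nullable `Int64` and `boolean` dtypes, then converts rows to plain Python values with `None` for missing entries. `to_python_records` is a hypothetical helper, the data is made up, and whether this resolves the Spark cast errors is untested here.

```python
import numpy as np
import pandas as pd


def to_python_records(df: pd.DataFrame) -> list:
    """Convert rows to dicts of plain Python values, using None for missing."""
    records = df.astype(object).to_dict(orient="records")
    return [
        {col: (None if pd.isna(val) else val) for col, val in row.items()}
        for row in records
    ]


# Illustrative data only: a count stored as a double and a nullable flag
df = pd.DataFrame(
    {
        "pin_n_tot": [10.0, np.nan, 3.0],
        "reassessment_year": [True, None, False],
    }
)

# Nullable extension dtypes keep integers and booleans intact alongside NA
df["pin_n_tot"] = df["pin_n_tot"].astype("Int64")
df["reassessment_year"] = df["reassessment_year"].astype("boolean")

records = to_python_records(df)
```

Records built this way contain only ints, bools, and `None`, which could then be paired with an explicit schema when creating the Spark DataFrame.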

Sales

Prior AVs seem a little weird since this isn't sale-level?
I'm not sure how to classify sales as "Valid" or "Invalid" based on our current "Outlier" schema in vw_pin_sale.

Ratios

Need to hammer out exactly what our SOPs should be and what min samples should be without them.

Priorities moving forward

Sort out runner memory issues (or, figure out a way around them, possibly by looping through data by year, though this will take a long time)
Improve column types
Code cleanup. The code isn't awful, but there are some specific portions that could absolutely be consolidated into loops or other more efficient methods. I gave up trying to implement these improvements for the sake of delivering an MVP, but it's pretty low-hanging fruit.
Performance improvements
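
The by-year loop mentioned in the first priority could look roughly like the sketch below. `summarize_by_year`, `summarize_one_year`, and the `load_year` callable are hypothetical names, and the data is made up. The idea is only that each year's rows are loaded and aggregated separately, so the raw input for all years never has to sit in memory at once.

```python
import pandas as pd


def summarize_one_year(data: pd.DataFrame) -> pd.DataFrame:
    """Aggregate one year's rows (a stand-in for the real summary stats)."""
    return data.groupby(["year", "geography_id"], as_index=False).agg(
        av_tot_median=("av_tot", "median"),
        pin_n_tot=("av_tot", "count"),
    )


def summarize_by_year(years, load_year) -> pd.DataFrame:
    """Load and summarize one year at a time, then stack the small results."""
    pieces = [summarize_one_year(load_year(year)) for year in years]
    return pd.concat(pieces, ignore_index=True)


# Illustrative input; in practice load_year might read one year partition
full = pd.DataFrame(
    {
        "year": [2019, 2019, 2020, 2020, 2020],
        "geography_id": ["a", "a", "a", "b", "b"],
        "av_tot": [100, 200, 150, 50, 70],
    }
)

result = summarize_by_year([2019, 2020], lambda y: full[full["year"] == y])
```

Only the per-year summaries are kept between iterations, which trades run time for a smaller peak memory footprint.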

dfsnow

@wrridgeway This is an awesome first pass at this! All the pieces are here, just needs a bit more abstraction to speed/DRY things up:

I'd recommend trying out the Spark pandas API on the tables that don't use assesspy. I think you can get it going pretty easily just using the drop-in Spark version of the pandas API.
We need to figure out a way to abstract out some of this shared code. The most straightforward option is probably a shared, CCAO-specific Python package.

@jeancochrane I think you should also take a quick look at this to see if there are any obvious design improvements we could make.

dfsnow · 2024-07-16T19:37:59Z

dbt/models/reporting/docs.md

+Table to feed the Python dbt job that creates the
+`reporting.sot_assessment_roll` table. Feeds public reporting assets.
+
+**Primary Key**: `year`, `stage_name`, `geography_id`, `group_id`


issue (blocking): This is the same as the description for sot_assessment_roll_input. Let's change the actual sot_ non-input table descriptions to include the table's purpose, structure, and a short description of how to use the table.

dfsnow · 2024-07-16T19:41:06Z

dbt/models/reporting/docs.md

+**Primary Key**: `year`, `stage_name`, `geography_id`, `group_id`
+{% enddocs %}
+
+# sot_ratio_stats


issue (blocking): No other dbt models have plural names, let's stick to that convention. So:

sot_ratio_stats --> sot_ratio_stat

sot_sales --> sot_sale

etc.

dfsnow · 2024-08-03T01:16:32Z

dbt/models/reporting/reporting.sot_assessment_roll.py

+    print(geography_type, group_type)
+
+    group = [geography_type, group_type, "year", "stage_name"]
+    summary = data.groupby(group).agg(stats).round(2)


nitpick: It's sort of bad practice to use something from the global environment inside a function like this. Just pass the stats dictionary in the function arguments.
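
A minimal sketch of what passing the dictionary explicitly could look like. The function name `aggregate`, the column names `township` and `class_group`, and the contents of `stats` are assumptions for illustration; only the `group` list and the `groupby(...).agg(stats).round(2)` call come from the diff above.

```python
import pandas as pd


def aggregate(data, geography_type, group_type, stats):
    """Summarize `data` by geography and group using the supplied `stats`."""
    group = [geography_type, group_type, "year", "stage_name"]
    return data.groupby(group).agg(stats).round(2)


# Illustrative data and stats mapping
data = pd.DataFrame(
    {
        "township": ["01", "01", "02"],
        "class_group": ["res", "res", "res"],
        "year": [2020, 2020, 2020],
        "stage_name": ["BOR", "BOR", "BOR"],
        "av_tot": [100.0, 200.0, 50.0],
    }
)

summary = aggregate(
    data, "township", "class_group", stats={"av_tot": ["median", "sum"]}
)
```

With `stats` as a parameter, the function no longer depends on module-level state and can be tested with a small mapping like the one here.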

dfsnow · 2024-08-03T01:27:39Z

dbt/models/reporting/reporting.sot_assessment_roll.py

+# Define aggregation functions. These are just wrappers for basic python
+# functions that make using them easier to use with pandas.agg().
+def q10(x):
+    return x.quantile(0.1)
+
+
+def q25(x):
+    return x.quantile(0.25)
+
+
+def q75(x):
+    return x.quantile(0.75)
+
+
+def q90(x):
+    return x.quantile(0.9)
+
+
+def first(x):
+    if len(x) >= 1:
+        output = x.iloc[0]
+    else:
+        output = None
+
+    return output
+
+
+more_stats = [
+    "min",
+    q10,
+    q25,
+    "median",
+    q75,
+    q90,
+    "max",
+    "mean",
+    "sum",
+]


thought: It would be nice to abstract out some of the stuff like this into a shared script. Maybe we can move it into a dedicated ccao Python package. @jeancochrane What do you think?
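
As a rough illustration of what a shared helper might contain, the four quantile wrappers above could be generated by one factory. `make_quantile` is a hypothetical name, not an existing ccao function, and the data is made up.

```python
import pandas as pd


def make_quantile(q: float, name: str):
    """Build a named aggregation function that returns the q-th quantile."""

    def _quantile(x: pd.Series):
        return x.quantile(q)

    # pandas uses __name__ to label the output column in .agg()
    _quantile.__name__ = name
    return _quantile


quantiles = {"q10": 0.1, "q25": 0.25, "q75": 0.75, "q90": 0.9}
q10, q25, q75, q90 = (make_quantile(q, name) for name, q in quantiles.items())

more_stats = ["min", q10, q25, "median", q75, q90, "max", "mean", "sum"]

# Illustrative usage on a single group
df = pd.DataFrame({"g": ["a"] * 5, "v": [1, 2, 3, 4, 5]})
result = df.groupby("g")["v"].agg(more_stats)
```

Setting `__name__` keeps the output column labels identical to the hand-written versions, so downstream column references would not need to change.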

dfsnow · 2024-08-03T01:33:28Z

dbt/models/reporting/reporting.sot_assessment_roll.py

+        output["av_tot_count"] / output["av_tot_size"]
+    )
+
+    output = output.sort_values("year")


question: Is it enough to sort by year when doing the lag below? Shouldn't you first sort by whatever grouping is used in the lag?
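
For reference, a sketch of a lag that is explicit about both grouping and ordering. The helper name and column names are assumptions, and this is not necessarily how the PR computes its delta columns. Sorting by the grouping columns and then by year makes each group's rows contiguous and ordered, which matters if the lag is an ungrouped `shift`; the grouped `shift` used here additionally prevents values from leaking across groups.

```python
import pandas as pd


def add_year_over_year_delta(output, group_cols, value_col):
    """Add `<value_col>_delta`: the change from the prior year within a group."""
    output = output.sort_values(group_cols + ["year"]).reset_index(drop=True)
    prior = output.groupby(group_cols)[value_col].shift(1)
    output[f"{value_col}_delta"] = output[value_col] - prior
    return output


# Illustrative data, deliberately out of order
df = pd.DataFrame(
    {
        "geography_id": ["a", "b", "a", "b"],
        "stage_name": ["BOR", "BOR", "BOR", "BOR"],
        "year": [2020, 2019, 2019, 2020],
        "av_tot_median": [110, 50, 100, 70],
    }
)

result = add_year_over_year_delta(
    df, ["geography_id", "stage_name"], "av_tot_median"
)
```

The first year in each group has no prior row, so its delta is missing rather than borrowed from a neighboring group.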

dfsnow · 2024-08-03T02:16:33Z

dbt/models/reporting/reporting.sot_ratio_stats.py

+    # Remove groups that only have one sale since we can't calculate stats
+    data = data.dropna(subset=["sale_price"])
+    data = data[data["sale_n_tot"] >= 20]


issue: There's a disconnect between the comment and code here.

dfsnow · 2024-08-03T02:17:21Z

dbt/models/reporting/reporting.sot_ratio_stats.py

+    # Remove groups that only have one sale since we can't calculate stats
+    data = data.dropna(subset=["sale_price"])
+    data = data[data["sale_n_tot"] >= 20]


issue: We also drop the distribution tails in reporting.ratio_stats. We should discuss whether or not that's something we should continue in this view.
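
To make the discussion concrete, tail trimming within each group could be sketched as below. The 5th/95th percentile cutoffs, the `ratio` column name, and the `trim_tails` helper are all assumptions for illustration; they are not taken from reporting.ratio_stats.

```python
import pandas as pd


def trim_tails(data, group_cols, col="ratio", lower=0.05, upper=0.95):
    """Keep rows whose `col` lies within its group's [lower, upper] quantiles."""
    grouped = data.groupby(group_cols)[col]
    lo = grouped.transform(lambda x: x.quantile(lower))
    hi = grouped.transform(lambda x: x.quantile(upper))
    return data[data[col].between(lo, hi)]


# Illustrative data: one group with an extreme value at each end
df = pd.DataFrame(
    {
        "geography_id": ["a"] * 7,
        "ratio": [0.1, 0.9, 0.95, 1.0, 1.05, 1.1, 3.0],
    }
)

trimmed = trim_tails(df, ["geography_id"])
```

If trimming is kept, it would need to happen before the minimum sample size check or after it, and that ordering changes which groups survive.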

dfsnow · 2024-08-03T02:24:31Z

dbt/models/reporting/reporting.sot_ratio_stats_input.sql

+        AS
+        census_congressional_district,


nitpick: Try to condense cases like this to one line:

Suggested change

-        AS
-        census_congressional_district,
+        AS census_congressional_district,

dfsnow · 2024-08-03T03:58:09Z

dbt/models/reporting/reporting.sot_sales_input.sql

+    CAST(sf.char_bldg_sf AS INT) AS sale_char_bldg_sf,
+    CAST(sf.char_land_sf AS INT) AS sale_char_land_sf,


todo: Need to clarify somehow (column name?) that this uses the total square footage of all cards.

dfsnow · 2024-08-03T04:02:52Z

dbt/models/reporting/reporting.sot_sales_input.sql

+    AND NOT sales.is_multisale
+    AND NOT sales.sale_filter_deed_type
+    AND NOT sales.sale_filter_less_than_10k
+    AND NOT sales.sale_filter_same_sale_within_365


issue: We should use the new sales val filters here, BUT we should also try to get a count of outliers in the sot_sales model. That means we would include outliers here, but exclude them later in the Python model (after they're counted).
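
A sketch of the count-then-exclude order in the Python model, assuming the input carries a boolean flag such as `sale_filter_is_outlier` from vw_pin_sale. The helper name and the `sale_n_outlier` output column are hypothetical, and the data is made up.

```python
import pandas as pd


def count_then_drop_outliers(sales, group_cols, flag_col="sale_filter_is_outlier"):
    """Count flagged sales per group, then return the unflagged rows."""
    counts = (
        sales.groupby(group_cols)[flag_col]
        .sum()
        .rename("sale_n_outlier")
        .reset_index()
    )
    kept = sales[~sales[flag_col]]
    return kept, counts


# Illustrative data: one flagged sale in 2020
sales = pd.DataFrame(
    {
        "year": [2020, 2020, 2020, 2021],
        "sale_price": [100_000, 200_000, 5, 300_000],
        "sale_filter_is_outlier": [False, False, True, False],
    }
)

kept, counts = count_then_drop_outliers(sales, ["year"])
```

The counts can then be merged back onto the summary table computed from `kept`, so outliers are reported without influencing the sale statistics.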

jeancochrane

Looking good so far! I didn't give the logic a super thorough review, I was mostly looking at infra. Agreed with Dan that we should factor out shared logic and publish it in a Python ccao package so that we can share it between models; otherwise I think this approach looks right to me. Let me know if you want any help with the factoring and packaging component.

jeancochrane · 2024-08-05T16:26:10Z

dbt/models/reporting/reporting.sot_assessment_roll.py

+
+    df = assemble(input, geos=geos, groups=groups)
+
+    schema = (


[Thought, non-blocking] Alternatively, if it turns out we do have to manage this schema directly, I think it would be more readable/maintainable if we defined it as a dictionary and then serialized it as a string for Spark, e.g.:

    schema = {
        "geography_type": "string",
        "geography_id": "string",
        "geography_data_year": "string",
        # etc. ...
    }
    spark_df = spark_session.createDataFrame(
        df,
        schema=", ".join(f"{key}: {val}" for key, val in schema.items())
    )

jeancochrane · 2024-08-05T16:54:16Z

dbt/models/reporting/reporting.sot_ratio_stats_input.sql

+    AND NOT sales.is_multisale
+    AND NOT sales.sale_filter_deed_type
+    AND NOT sales.sale_filter_less_than_10k
+    AND NOT sales.sale_filter_same_sale_within_365


@wrridgeway you mentioned being confused about how to interpret the outlier status in vw_pin_sale, I think we just want to use sale_filter_is_outlier since it also handles pre-2014 sales:

data-architecture/dbt/models/default/default.vw_pin_sale.sql

Line 304 in b48861b

COALESCE(sales_val.sv_is_outlier, FALSE) AS sale_filter_is_outlier,

First draft of sales script

d780d76

wrridgeway linked an issue Jun 6, 2024 that may be closed by this pull request

Reporting SoT #387

Open

wrridgeway added 28 commits June 6, 2024 19:28

File renaming

00909fd

Cleaner for loop

2ac5982

First draft taxes and exemptions table

2107d2a

Wrap assessment_roll

c56aaaf

Correct size, count calculations

6c81308

Wrap sales table

1bf9b9c

Correct stage grouping, counting

0a9e1f3

Fix assessment roll stage grouping

030a7c5

Clean output before writing

1c2adae

Begin dbt building

672bd1e

Merge branch 'master' into 387-reporting-sot

0c42e23

Attempt to build assessment_roll table

3f60a77

Testing build on smaller input

fdff457

Trying to build on limited sample

6abd074

Try to build sales table

fd342b6

Try to build taxes and exemptions table

cccf8e1

Try to build taxes and exemptions table

3656964

Try to build taxes table

8b0f95f

Try to build ratio stats table

9383bdc

Add assesspy to ratio_stats table

08d3bd6

ratio_stats builds in dbt, excluding assesspy funcs

d2cac22

sot_ratio_stats table building in dbt

f559753

Add res_other group

1f8ad1f

Add reassessment year indicator for assessment roll

063591c

Retry assessment_year indicator

a9ffc64

Assessment_roll should run with reassessment year indicator

62dd68e

Add schema to assessment_roll table

c185e81

Correct output from sales and taxes tables

d08bc3d

wrridgeway added 15 commits July 3, 2024 16:53

Clean taxes table columns

20c9bd6

Clean assessment_roll columns

adc16ea

Fix delta columns

f8b87ab

Clean ratio table columns

54ebab8

Attempt to fix pin_n_tot type error that doesn't trigger locally

d2dddab

Try again to fix pin_n_tot

00e790c

Change ass roll sample to be able to compare across stages

408de56

Add commenting for input tables, try to partion assessment_roll table

fd95fcb

Comment python scripts

f296292

Clean up ratio_stats script

a23ff72

Back to fixing pin_n_tot

07f6dfe

Replace nan with None

b78a072

Partition input tables by year

337954e

Fix year partitioning

1031144

Use double for nullable columns

45ea305

wrridgeway marked this pull request as ready for review July 8, 2024 13:15

wrridgeway requested a review from a team as a code owner July 8, 2024 13:15

wrridgeway changed the title ~~First draft of sales script~~ First draft of source of truth™ Jul 8, 2024

wrridgeway changed the title ~~First draft of source of truth™~~ First draft of Source of Truth™ Jul 8, 2024

wrridgeway self-assigned this Jul 8, 2024

wrridgeway changed the title ~~First draft of Source of Truth™~~ First draft of Reporting Source of Truth™ Jul 8, 2024

wrridgeway and others added 6 commits July 9, 2024 17:11

Move data year specification to dbt seed

ca139f3

Formatting

788f971

Merge branch 'master' into 387-reporting-sot

4ea6718

Improve diff and pct_change syntax

5449d8c

Simplify reassessment year syntax

c87713f

More commenting

d1079f0

dfsnow requested changes Aug 3, 2024


dfsnow requested a review from jeancochrane August 3, 2024 04:23

jeancochrane reviewed Aug 5, 2024

