The recently merged #422 has a Python dbt model (ratio_stats.py) that runs on Athena's Spark backend. The model almost exclusively uses pandas for data munging and processing. This works well and is simple, but misses out on some of the benefits of using Spark (chiefly parallelization). We should try a quick refactor of the ratio_stats model using PySpark code to see if we can gain some of those benefits: the current pandas job takes about an hour to finish, while a Spark version is likely to be much faster. A rough sketch of what the refactor could look like follows the list below.
We can also make a few other enhancements here at the same time. Namely:
Change the data types of the ratio_stats table to be slightly more sensible
Possibly factor out the ratio_stats_input table entirely

These will need input from @ccao-jardine and @wrridgeway.
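For concreteness, here is a minimal sketch of what a PySpark version of the model entrypoint might look like. The ref name, grouping keys, and the median-only aggregate are assumptions for illustration; the real model computes the full suite of sales ratio statistics (COD, PRD, PRB, etc.):

```python
import pyspark.sql.functions as F


def model(dbt, session):
    # On Athena's Spark backend, dbt.ref() hands back a Spark DataFrame,
    # so the heavy group-by/aggregate work stays distributed instead of
    # being collected into a single pandas process.
    df = dbt.ref("ratio_stats_input")  # assumed ref name

    # Hypothetical grouping keys and aggregates; the real model computes
    # COD/PRD/PRB and their met flags per group, not just the median.
    return df.groupBy(
        "geography_id", "property_group", "assessment_stage", "sale_year"
    ).agg(
        F.count("*").alias("sale_n"),
        F.percentile_approx("ratio", 0.5).alias("med_ratio"),
    )
```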
@wagnerlmichael This one is yours now. Let's use it to pilot the use of Spark models within dbt, since we may want to convert sales val, source-of-truth, etc. to Spark. Let's also take this opportunity to clean up the ratio_stats table a little bit (get the dtypes corrected, drop extraneous columns, etc.); a sketch of what that cleanup could look like is below.
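For the cleanup, something along these lines could work; the column names below are placeholders, since which dtypes to fix and which columns to drop is still up for discussion:

```python
from pyspark.sql import DataFrame, types as T


def clean_ratio_stats(df: DataFrame) -> DataFrame:
    # Cast a column to a more sensible dtype and drop extraneous ones.
    # "some_flag" and "extraneous_column" are placeholders, not the
    # actual columns slated for cleanup.
    return (
        df.withColumn("some_flag", df["some_flag"].cast(T.BooleanType()))
        .drop("extraneous_column")
    )
```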
ratio_stats is used in production for our public-facing ratio study dashboards, which are published at the mailed stage for each reassessed township when it mails.
This is one dashboard serving all townships, with an extract of the ratio_stats table that is refreshed with each 2024 township mailing. Because of that, I'd very strongly prefer not to make any changes to the production table until after we have mailed the last tri town this year.
If changes must be made now because this issue is blocking other work, please coordinate the schedule with me so that changes aren't pushed close to a town mail date.
If it helps, the current structure of the reporting depends on no changes (data type, etc.) to the following columns in the production table:
geography_id
property_group
assessment_stage
sale_year
sale_n
detect_chasing
med_ratio, cod, prb, prd
ratio_met, cod_met, prb_met, prd_met
The reporting filters this table to geography_type = "Town", so if other geography types are added, it should be robust to those changes (a sketch of a guardrail for this column contract follows below).
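To make the refactor safer against this contract, a guardrail like the following could run before the production table is replaced (a hypothetical check, not an existing test in the repo):

```python
# Protected columns copied from the list above.
PROTECTED_COLUMNS = {
    "geography_id", "property_group", "assessment_stage", "sale_year",
    "sale_n", "detect_chasing", "med_ratio", "cod", "prb", "prd",
    "ratio_met", "cod_met", "prb_met", "prd_met",
}


def check_dashboard_contract(df) -> None:
    # Fail loudly if the refactor renamed or dropped a protected column.
    missing = PROTECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(
            f"ratio_stats is missing protected columns: {sorted(missing)}"
        )
```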
Which extraneous columns are you thinking of getting rid of?