-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor geospatial dedupe #190
Conversation
transform/models/marts/geo_reference/geo_reference__building_footprints_with_tiger.sql
Show resolved
Hide resolved
tests: | ||
- dbt_utils.equal_rowcount: | ||
compare_model: source('building_footprints', 'california_building_footprints') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A shame to lose this test, but with the inner join, the number of footprints is no longer conserved! One approach to restore it would be to do the "in-California" filtering in an intermediate model, then compare the row-count with that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can make that a to do!
'"COUNTYFP20"': '"county_fips"', | ||
'"TRACTCE20"': '"tract"', | ||
'"BLOCKCE20"': '"block"', | ||
'"GEOID20"': '"block_geoid"', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@AeriShan-ODI I'm curious if you have patterns for this that you like: I didn't like duplicating the list of columns in several places as it's annoying to keep them in sync if you move or rename things. But maybe this is too-much-clever-Jjinja?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's making you feel like it may be too much too clever? and what's the alternative (if not going back to doing the join and dedupe in the marts models themselves)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An alternative would be to just list out the column names every time I need them. It would be more repetition, but perhaps a bit easier to read. This is somewhat independent of whether the dedupe macro is a good idea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Caveat: I don't know or really understand geospatial data.
TBH, I do feel like it's "too-much-clever-Jjinja". I mean, it certainly is clever... but in these situations I always try to curb my own desire to be clever with how that gets looked at and maintained down the road. For example, it wasn't immediately obvious what you meant by deduplication - deduping records or columns (seems like both). The sort of abstraction that happened in the macro makes it hard to understand the operation and purpose.
So with the caveat that I think you should keep this for the cleverness-as-interesting-and-educational factor, but with perhaps more descriptive documentation, my own approach in this would likely be to simply spell out the column names and use SELECT... EXCLUDE... or use a macro to only handle column name collisions.
All this said, I'm a bit on the fence - we have the opportunity to push boundaries and be clever so why not? I'm so used to situations where cleverness ends up costing money
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Caveat: I don't know or really understand geospatial data.
I'm a proponent of "spatial isn't special", i.e, we should use the same set of tools for working with spatial data, it's just another data type with some functions that know how to operate on it. All of which is to say, I think any code style and performance notes you have are fair game!
Here I'm writing this jinja-inflected SQL as if it were Python, and it probably doesn't help the legibility for people who are expecting SQL. I'll change it to just list the column names, there is probably enough jinja-weirdness going around that this is a bridge too far.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I appreciate this discussion and the conclusion reached!
California (there are ~20k that are in OR, NV, AZ, or Mexico)
5941912
to
609d710
Compare
@@ -38,7 +38,7 @@ def write_building_footprints(conn): | |||
|
|||
gdf = gdf[gdf.geometry.geom_type != "GeometryCollection"] | |||
|
|||
file_prefix = f"footprints_with_blocks_for_county_fips_{county}" | |||
file_prefix = f"footprints_with_tiger_for_county_fips_{county}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about this name, since the files now include county, tract, block, and place data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense to me as a name change!
Is this still draft now that #182 is in? |
Yeah, I'm still validating one or two things. Probably ready on Tuesday! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
impressive!
transform/models/marts/geo_reference/geo_reference__building_footprints_with_tiger.sql
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is all such complex work. Thank you for breaking it down and explaining it step by step with helpful comments and descriptions.
What's the latest on this PR? @ian-r-rose |
Some counts weren't quite what I expected, so I kept it as a draft so I could investigate further. I should be able to finish it this afternoon! Edit: or tomorrow afternoon :/ |
Okay, this is finally ready! fca5d58 was a tricky bug that took me too long to track down, and was resulting in some unexpected nulls at the ~10% level. |
…all of our columns in the deduplication step.
0873113
to
fca5d58
Compare
This refactors the geospatial deduplication logic from #175 and #178 into a reusable macro. It also consolidates the "blocks" and "places" model into a single model, which uses the macro twice.
Marking as a draft for now because I'll want to update a couple of things once #182 is in, but this should be ready for some review.