Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor geospatial dedupe #190

Merged
merged 11 commits into from
Sep 11, 2023
Merged

Refactor geospatial dedupe #190

merged 11 commits into from
Sep 11, 2023

Conversation

ian-r-rose
Copy link
Member

This refactors the geospatial deduplication logic from #175 and #178 into a reusable macro. It also consolidates the "blocks" and "places" model into a single model, which uses the macro twice.

Marking as a draft for now because I'll want to update a couple of things once #182 is in, but this should be ready for some review.

Comment on lines -42 to -44
tests:
- dbt_utils.equal_rowcount:
compare_model: source('building_footprints', 'california_building_footprints')
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A shame to lose this test, but with the inner join, the number of footprints is no longer conserved! One approach to restore it would be to do the "in-California" filtering in an intermediate model, then compare the row-count with that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can make that a to do!

Comment on lines 7 to 10
'"COUNTYFP20"': '"county_fips"',
'"TRACTCE20"': '"tract"',
'"BLOCKCE20"': '"block"',
'"GEOID20"': '"block_geoid"',
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AeriShan-ODI I'm curious if you have patterns for this that you like: I didn't like duplicating the list of columns in several places as it's annoying to keep them in sync if you move or rename things. But maybe this is too-much-clever-Jjinja?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's making you feel like it may be too much too clever? and what's the alternative (if not going back to doing the join and dedupe in the marts models themselves)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative would be to just list out the column names every time I need them. It would be more repetition, but perhaps a bit easier to read. This is somewhat independent of whether the dedupe macro is a good idea

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caveat: I don't know or really understand geospatial data.

TBH, I do feel like it's "too-much-clever-Jjinja". I mean, it certainly is clever... but in these situations I always try to curb my own desire to be clever with how that gets looked at and maintained down the road. For example, it wasn't immediately obvious what you meant by deduplication - deduping records or columns (seems like both). The sort of abstraction that happened in the macro makes it hard to understand the operation and purpose.
So with the caveat that I think you should keep this for the cleverness-as-interesting-and-educational factor, but with perhaps more descriptive documentation, my own approach in this would likely be to simply spell out the column names and use SELECT... EXCLUDE... or use a macro to only handle column name collisions.

All this said, I'm a bit on the fence - we have the opportunity to push boundaries and be clever so why not? I'm so used to situations where cleverness ends up costing money

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caveat: I don't know or really understand geospatial data.

I'm a proponent of "spatial isn't special", i.e, we should use the same set of tools for working with spatial data, it's just another data type with some functions that know how to operate on it. All of which is to say, I think any code style and performance notes you have are fair game!

Here I'm writing this jinja-inflected SQL as if it were Python, and it probably doesn't help the legibility for people who are expecting SQL. I'll change it to just list the column names, there is probably enough jinja-weirdness going around that this is a bridge too far.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate this discussion and the conclusion reached!

@@ -38,7 +38,7 @@ def write_building_footprints(conn):

gdf = gdf[gdf.geometry.geom_type != "GeometryCollection"]

file_prefix = f"footprints_with_blocks_for_county_fips_{county}"
file_prefix = f"footprints_with_tiger_for_county_fips_{county}"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about this name, since the files now include county, tract, block, and place data?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense to me as a name change!

@ian-r-rose ian-r-rose marked this pull request as ready for review September 1, 2023 20:07
@ian-r-rose ian-r-rose marked this pull request as draft September 1, 2023 21:02
@britt-allen
Copy link
Contributor

Is this still draft now that #182 is in?

@ian-r-rose
Copy link
Member Author

Is this still draft now that #182 is in?

Yeah, I'm still validating one or two things. Probably ready on Tuesday!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

impressive!

Copy link
Contributor

@britt-allen britt-allen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is all such complex work. Thank you for breaking it down and explaining it step by step with helpful comments and descriptions.

@britt-allen
Copy link
Contributor

What's the latest on this PR? @ian-r-rose

@ian-r-rose
Copy link
Member Author

ian-r-rose commented Sep 7, 2023

What's the latest on this PR? @ian-r-rose

Some counts weren't quite what I expected, so I kept it as a draft so I could investigate further. I should be able to finish it this afternoon!

Edit: or tomorrow afternoon :/

@ian-r-rose ian-r-rose marked this pull request as ready for review September 8, 2023 23:28
@ian-r-rose
Copy link
Member Author

ian-r-rose commented Sep 8, 2023

Okay, this is finally ready! fca5d58 was a tricky bug that took me too long to track down, and was resulting in some unexpected nulls at the ~10% level.

…all of our columns in the deduplication step.
@britt-allen britt-allen merged commit 1fa2cd7 into main Sep 11, 2023
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants