Refactor geospatial dedupe #190

ian-r-rose · 2023-09-01T19:13:26Z

This refactors the geospatial deduplication logic from #175 and #178 into a reusable macro. It also consolidates the "blocks" and "places" model into a single model, which uses the macro twice.

Marking as a draft for now because I'll want to update a couple of things once #182 is in, but this should be ready for some review.

transform/models/marts/geo_reference/geo_reference__building_footprints_with_tiger.sql

ian-r-rose · 2023-09-01T19:15:39Z

transform/models/marts/geo_reference/_geo_reference__models.yml

-    tests:
-      - dbt_utils.equal_rowcount:
-          compare_model: source('building_footprints', 'california_building_footprints')


A shame to lose this test, but with the inner join, the number of footprints is no longer conserved! One approach to restore it would be to do the "in-California" filtering in an intermediate model, then compare the row-count with that.

We can make that a to do!
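A minimal sketch of the row-count point above, in plain Python rather than dbt. The data, `inner_join`, and `equal_rowcount` are hypothetical stand-ins (the real join is spatial and the real test is `dbt_utils.equal_rowcount`); the point is only that comparing against a pre-filtered intermediate restores the invariant.

```python
# Hypothetical stand-ins for the source footprints and the TIGER blocks.
# A block_geoid of None marks a footprint that falls outside California.
source_footprints = [
    {"id": 1, "block_geoid": "block_a"},
    {"id": 2, "block_geoid": "block_b"},
    {"id": 3, "block_geoid": None},  # e.g. a footprint just over the state line
]
blocks = [{"geoid": "block_a"}, {"geoid": "block_b"}]


def inner_join(footprints, blocks):
    """Keep only footprints that match a block, as an inner join would."""
    known = {b["geoid"] for b in blocks}
    return [f for f in footprints if f["block_geoid"] in known]


def equal_rowcount(model, compare_model):
    """Plain-Python analogue of a row-count equality test."""
    return len(model) == len(compare_model)


mart = inner_join(source_footprints, blocks)

# Comparing against the raw source now fails, because one row was dropped.
print(equal_rowcount(mart, source_footprints))  # False

# Hypothetical intermediate model: do the in-California filtering first.
in_california = [f for f in source_footprints if f["block_geoid"] is not None]
print(equal_rowcount(mart, in_california))  # True
```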

ian-r-rose · 2023-09-01T19:17:18Z

transform/models/marts/geo_reference/geo_reference__building_footprints_with_tiger.sql

+      '"COUNTYFP20"': '"county_fips"',
+      '"TRACTCE20"': '"tract"',
+      '"BLOCKCE20"': '"block"',
+      '"GEOID20"': '"block_geoid"',


@AeriShan-ODI I'm curious if you have patterns for this that you like. I didn't like duplicating the list of columns in several places, as it's annoying to keep them in sync if you move or rename things. But maybe this is too-much-clever-Jinja?

What's making you feel like it may be too clever? And what's the alternative (if not going back to doing the join and dedupe in the marts models themselves)?

An alternative would be to just list out the column names every time I need them. It would be more repetition, but perhaps a bit easier to read. This is somewhat independent of whether the dedupe macro is a good idea.

Caveat: I don't know or really understand geospatial data.

TBH, I do feel like it's "too-much-clever-Jinja". I mean, it certainly is clever... but in these situations I always try to weigh my own desire to be clever against how the code will be read and maintained down the road. For example, it wasn't immediately obvious what you meant by deduplication: deduping records or columns (seems like both). The abstraction in the macro makes it hard to understand the operation and its purpose.
I think you should keep this for the cleverness-as-interesting-and-educational factor, perhaps with more descriptive documentation. That said, my own approach here would likely be to simply spell out the column names and use SELECT ... EXCLUDE ..., or use a macro only to handle column name collisions.

All this said, I'm a bit on the fence: we have the opportunity to push boundaries and be clever, so why not? I'm so used to situations where cleverness ends up costing money.

Caveat: I don't know or really understand geospatial data.

I'm a proponent of "spatial isn't special", i.e., we should use the same set of tools for working with spatial data. It's just another data type with some functions that know how to operate on it. All of which is to say, I think any code style and performance notes you have are fair game!

Here I'm writing this Jinja-inflected SQL as if it were Python, and it probably doesn't help the legibility for people who are expecting SQL. I'll change it to just list the column names. There is probably enough Jinja weirdness going around that this is a bridge too far.

I appreciate this discussion and the conclusion reached!
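For readers skimming the thread, here is a rough Python sketch of the two styles being compared. It is not the macro from this PR: `select_list` and the `blocks` table alias are hypothetical, and only the four column pairs come from the diff above.

```python
# Column renames taken from the diff; everything else here is illustrative.
column_map = {
    '"COUNTYFP20"': '"county_fips"',
    '"TRACTCE20"': '"tract"',
    '"BLOCKCE20"': '"block"',
    '"GEOID20"': '"block_geoid"',
}


def select_list(mapping, table_alias):
    """Build a SQL select list from a source-to-output column mapping."""
    return ",\n".join(
        f"    {table_alias}.{src} as {dst}" for src, dst in mapping.items()
    )


# "Clever" style: the select list is generated from the single mapping.
generated = select_list(column_map, "blocks")

# Explicit style: the same columns written out by hand wherever needed.
explicit = (
    '    blocks."COUNTYFP20" as "county_fips",\n'
    '    blocks."TRACTCE20" as "tract",\n'
    '    blocks."BLOCKCE20" as "block",\n'
    '    blocks."GEOID20" as "block_geoid"'
)

print(generated == explicit)  # True
```

Both produce the same SQL text. The trade-off is one place to edit versus SQL that reads as SQL, which is the side the thread landed on.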

ian-r-rose · 2023-09-01T19:41:02Z

jobs/geo/write_building_footprints.py

@@ -38,7 +38,7 @@ def write_building_footprints(conn):

        gdf = gdf[gdf.geometry.geom_type != "GeometryCollection"]

-        file_prefix = f"footprints_with_blocks_for_county_fips_{county}"
+        file_prefix = f"footprints_with_tiger_for_county_fips_{county}"


What do you think about this name, since the files now include county, tract, block, and place data?

That makes sense to me as a name change!

britt-allen · 2023-09-01T21:31:22Z

Is this still draft now that #182 is in?

transform/macros/map_class_fp.sql

transform/macros/_macros.yml

ian-r-rose · 2023-09-01T22:13:39Z

Is this still draft now that #182 is in?

Yeah, I'm still validating one or two things. Probably ready on Tuesday!

britt-allen · 2023-09-01T22:16:11Z

transform/macros/spatial_join_with_deduplication.sql

impressive!

transform/models/marts/geo_reference/geo_reference__building_footprints_with_tiger.sql

britt-allen

This is all such complex work. Thank you for breaking it down and explaining it step by step with helpful comments and descriptions.

britt-allen · 2023-09-07T16:57:12Z

What's the latest on this PR? @ian-r-rose

ian-r-rose · 2023-09-07T18:16:31Z

What's the latest on this PR? @ian-r-rose

Some counts weren't quite what I expected, so I kept it as a draft so I could investigate further. I should be able to finish it this afternoon!

Edit: or tomorrow afternoon :/

ian-r-rose · 2023-09-08T23:30:29Z

Okay, this is finally ready! fca5d58 was a tricky bug that took me too long to track down, and was resulting in some unexpected nulls at the ~10% level.
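To illustrate the general idea of guarding the ordering key in a max-by style deduplication, here is a small Python sketch. It is an assumption-laden analogue, not the fix in fca5d58: the `max_by` helper, the `overlap` field, and the choice to coalesce nulls to `0.0` are all hypothetical.

```python
def max_by(rows, value_key, order_key, default=0.0):
    """Return value_key from the row with the largest order_key.

    A None ordering value is treated as `default`, so a group whose
    ordering values are all None still yields one of its rows instead
    of nothing.
    """
    if not rows:
        return None
    best = max(
        rows,
        key=lambda r: default if r[order_key] is None else r[order_key],
    )
    return best[value_key]


# Two candidate places for one footprint; one has a missing overlap value.
candidates = [
    {"place": "A", "overlap": None},
    {"place": "B", "overlap": 0.4},
]
print(max_by(candidates, "place", "overlap"))  # B

# A group where the only candidate has no overlap value still keeps its row.
only_null = [{"place": "C", "overlap": None}]
print(max_by(only_null, "place", "overlap"))  # C
```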

ian-r-rose requested review from britt-allen and AeriShan-ODI September 1, 2023 19:13

ian-r-rose added 9 commits September 1, 2023 12:35

WIP moving join-with-deduplication into a macro.

719ae0e

Nested CTEs are apparently allowed!

8eb9c31

Use macro with places again.

1c77c29

Add customizable prefix to temporary CTEs so it can be used more than once

c48cfb7

Refactor to include places and blocks in the same enriched model

e8e0374

Clean up formatting, documentation

378b84c

Inner join for blocks model to remove footprints that are outside of California (there are ~20k that are in OR, NV, AZ, or Mexico)

aa2795a

Rename model to reflect that it is no longer just blocks

c084eeb

Update name to reflect that it's not just blocks

609d710

ian-r-rose force-pushed the refactor-geospatial-dedupe branch from 5941912 to 609d710 Compare September 1, 2023 19:40

ian-r-rose marked this pull request as ready for review September 1, 2023 20:07

ian-r-rose marked this pull request as draft September 1, 2023 21:02

britt-allen reviewed Sep 1, 2023

transform/macros/map_class_fp.sql

britt-allen reviewed Sep 1, 2023

transform/macros/_macros.yml Outdated

britt-allen reviewed Sep 1, 2023

transform/macros/spatial_join_with_deduplication.sql Outdated

britt-allen reviewed Sep 1, 2023

transform/models/marts/geo_reference/geo_reference__building_footprints_with_tiger.sql

britt-allen approved these changes Sep 1, 2023

britt-allen assigned ian-r-rose Sep 5, 2023

Be less clever with jinja, explicitly list column names

d982dd2

AeriShan-ODI approved these changes Sep 6, 2023

britt-allen mentioned this pull request Sep 7, 2023

DOF - Implementation of dbt models #156

Closed

3 tasks

ian-r-rose marked this pull request as ready for review September 8, 2023 23:28

Guard against nulls in max_by so that we don't accidentally null out all of our columns in the deduplication step.

fca5d58

ian-r-rose force-pushed the refactor-geospatial-dedupe branch from 0873113 to fca5d58 Compare September 8, 2023 23:32

britt-allen approved these changes Sep 11, 2023

britt-allen merged commit 1fa2cd7 into main Sep 11, 2023

This was referenced Sep 11, 2023

Correct mart table name #202

Merged

DOF – Overlapping footprints code implementaion #169

Closed
