AWS 2-5: Ingest Global Basins #3647
Comments
Should use the following script to ingest them:

ogr2ogr -progress -if "Parquet" \
    -f "PostgreSQL" PG:"host=localhost user=mmw dbname=mmw password=XXX" \
    -nln tdxbasins \
    -lco GEOMETRY_NAME=geom \
    -lco FID=LINKNO \
    -append -skipfailures \
    TDX_streamreach_basins_mnsi_*.parquet

Once imported, we should create this index to help with delineation:

CREATE INDEX idx_tdxbasins_root_discover_finish
ON tdxbasins (root_id, discover_time, finish_time);
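After the ingest finishes, a couple of quick sanity checks can confirm the table landed intact. This is a minimal sketch; the expected row count depends on the source Parquet files:

-- Row count should match the total rows across the source Parquet files
SELECT COUNT(*) FROM tdxbasins;

-- Invalid geometries would complicate the ST_Union used for delineation below
SELECT COUNT(*) FROM tdxbasins WHERE NOT ST_IsValid(geom);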
Once that is in place, we can use the following query to delineate a watershed, given a basin clicked by the user:

WITH target AS (SELECT *
                FROM tdxbasins
                WHERE linkno = ?)
SELECT ST_Union(geom)
FROM tdxbasins
WHERE root_id = (SELECT root_id FROM target)
AND discover_time <= (SELECT discover_time FROM target)
AND finish_time >= (SELECT finish_time FROM target);
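For intuition: discover_time and finish_time look like a nested-set (depth-first traversal) encoding of the basin tree, where one basin's traversal interval encloses the intervals of everything connected beneath it, so the interval-containment test above pulls in the whole sub-network with a single index-friendly predicate. A self-contained illustration with made-up values, not real table data:

-- Three nested basins with hypothetical traversal times; selecting
-- linkno 3 returns all three, because each enclosing basin's
-- [discover_time, finish_time] interval contains basin 3's interval.
WITH demo(linkno, root_id, discover_time, finish_time) AS (
    VALUES (1, 1, 1, 6),
           (2, 1, 2, 5),
           (3, 1, 3, 4)
), target AS (SELECT * FROM demo WHERE linkno = 3)
SELECT d.linkno
FROM demo d, target t
WHERE d.root_id = t.root_id
  AND d.discover_time <= t.discover_time
  AND d.finish_time >= t.finish_time;
-- returns linkno 1, 2, and 3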
Data has been ingested on Staging successfully. Now exporting for reuse using:

env PGPASSWORD=xxx pg_dump --clean --if-exists --no-owner \
    --dbname=modelmywatershed --username=modelmywatershed \
    --host=database.service.mmw.internal --table=tdxbasins \
    --compress=9 --verbose > tdxbasins.sql.gz
Data exported to s3://data.mmw.azavea.com/tdxbasins.sql.gz
This data was not imported correctly. In particular, specifying -lco FID=LINKNO made LINKNO the primary key, so rows with duplicate LINKNO values failed to insert and were silently dropped by -skipfailures. Instead, the data must be reimported without that option, letting ogr2ogr assign its default ogc_fid key. After deleting the incorrectly imported table, the data was reimported.
There are many duplicate linkno values in the reimported data. Some of these have very high numbers of duplicates:

SELECT linkno, count
FROM "20241210_tdxbasins_duplicate_linkno.csv"
ORDER BY count DESC
LIMIT 10;
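The CSV above was presumably exported from Postgres; a query along these lines (a sketch) reproduces it directly against the table:

-- Find every linkno with more than one row, largest groups first
SELECT linkno, COUNT(*) AS count
FROM tdxbasins
GROUP BY linkno
HAVING COUNT(*) > 1
ORDER BY count DESC;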
Looking at the full rows for these, they have identical values for every other column except for geom, which is a slightly different square for each case:

SELECT *
FROM tdxbasins
WHERE linkno IN (
    130565963,
    130356958,
    130690836,
    130642990,
    130565970,
    130533444,
    130612035,
    130560773,
    130343248,
    130555645)
ORDER BY linkno;
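That claim can be checked mechanically for any one of the linknos above. This is a sketch; the non-geometry column list is illustrative, since the full schema isn't shown here:

-- For a single duplicated linkno, count distinct attribute tuples vs.
-- distinct geometries; 1 and >1 respectively would confirm the rows
-- differ only in geom.
SELECT COUNT(DISTINCT (root_id, discover_time, finish_time)) AS distinct_attrs,
       COUNT(DISTINCT ST_AsBinary(geom)) AS distinct_geoms
FROM tdxbasins
WHERE linkno = 130565963;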
Currently investigating if this is an ingest or source data issue.
The recommendation from @aufdenkampe was to keep only one of the duplicate rows for each linkno. Since the rows are true duplicates with only minor differences in geom, we could keep any of the duplicates. I decided to keep the first. First I made a companion table of the rows to keep:

CREATE TABLE nondup AS
SELECT MIN(ogc_fid), linkno
FROM tdxbasins
GROUP BY linkno;

Then I joined tdxbasins against it to delete every row that was not a kept minimum:

DELETE FROM tdxbasins
WHERE ogc_fid IN (
    SELECT ogc_fid
    FROM tdxbasins a LEFT JOIN nondup b ON a.ogc_fid = b.min
    WHERE b.min IS NULL
);

This left 15936428 rows in tdxbasins.
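Before swapping the primary key over to linkno, a quick check (a sketch) confirms the delete left exactly one row per linkno:

-- If every linkno is now unique, these two counts will match
SELECT COUNT(*) AS total_rows, COUNT(DISTINCT linkno) AS distinct_linknos
FROM tdxbasins;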
Then I made linkno the primary key and recreated the delineation index:

-- Step 1: Drop the existing primary key constraint
ALTER TABLE tdxbasins DROP CONSTRAINT tdxbasins_pkey;
-- Step 2: Add a new primary key constraint on `linkno`
ALTER TABLE tdxbasins ADD CONSTRAINT tdxbasins_pkey PRIMARY KEY (linkno);
-- Step 3: Drop the `ogc_fid` column
ALTER TABLE tdxbasins DROP COLUMN ogc_fid;
-- Step 4: Add index for root id, discover time, finish time for fast lookup
CREATE INDEX idx_tdxbasins_root_discover_finish
ON tdxbasins (root_id, discover_time, finish_time);

This was all done locally on my developer machine because our staging database ran out of CPU credits. Now I'm going to dump this, upload to S3, and re-import to staging.
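With the index recreated, an EXPLAIN on the delineation predicate should show it being used. This is a sketch with made-up literal values; actual plans vary with table statistics:

-- Expect an index or bitmap index scan on idx_tdxbasins_root_discover_finish
-- rather than a sequential scan (the values below are hypothetical)
EXPLAIN
SELECT ST_Union(geom)
FROM tdxbasins
WHERE root_id = 130000000
  AND discover_time <= 5000
  AND finish_time >= 5000;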
This is now complete and deployed to production.
Ingest the TDX Hydro Global Basins dataset into Postgres. This should match the additional fields added to the streams in #3646.
The source data will be made available by @ptomasula at s3://data.mmw.azavea.com/. Once it is:
- ingest it with ogr2ogr
- export it with pg_dump for reuse