AWS 2-5: Ingest Global Basins #3647
Comments
Should use the following script to ingest them:

ogr2ogr -progress -if "Parquet" \
    -f "PostgreSQL" PG:"host=localhost user=mmw dbname=mmw password=XXX" \
    -nln tdxbasins \
    -lco GEOMETRY_NAME=geom \
    -lco FID=LINKNO \
    -append -skipfailures \
    TDX_streamreach_basins_mnsi_*.parquet

Once imported, we should create this index to help with delineation:

CREATE INDEX idx_tdxbasins_root_discover_finish
ON tdxbasins (root_id, discover_time, finish_time);
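After the ingest finishes, a couple of quick sanity checks can confirm the table landed intact. This is a minimal sketch; the expected row count depends on the source Parquet files:

-- Row count should match the total rows across the source Parquet files
SELECT COUNT(*) FROM tdxbasins;

-- Invalid geometries would complicate the ST_Union used for delineation below
SELECT COUNT(*) FROM tdxbasins WHERE NOT ST_IsValid(geom);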
Once that is in place, we can use the following query to delineate a watershed, given a basin clicked by the user:

WITH target AS (SELECT *
                FROM tdxbasins
                WHERE linkno = ?)
SELECT ST_Union(geom)
FROM tdxbasins
WHERE root_id = (SELECT root_id FROM target)
AND discover_time <= (SELECT discover_time FROM target)
AND finish_time >= (SELECT finish_time FROM target);
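For intuition: discover_time and finish_time look like a nested-set (depth-first traversal) encoding of the basin tree, where one basin's traversal interval encloses the intervals of everything connected beneath it, so the interval-containment test above pulls in the whole sub-network with a single index-friendly predicate. A self-contained illustration with made-up values, not real table data:

-- Three nested basins with hypothetical traversal times; selecting
-- linkno 3 returns all three, because each enclosing basin's
-- [discover_time, finish_time] interval contains basin 3's interval.
WITH demo(linkno, root_id, discover_time, finish_time) AS (
    VALUES (1, 1, 1, 6),
           (2, 1, 2, 5),
           (3, 1, 3, 4)
), target AS (SELECT * FROM demo WHERE linkno = 3)
SELECT d.linkno
FROM demo d, target t
WHERE d.root_id = t.root_id
  AND d.discover_time <= t.discover_time
  AND d.finish_time >= t.finish_time;
-- returns linkno 1, 2, and 3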
Data has been ingested on Staging successfully. Now exporting for reuse using:

env PGPASSWORD=xxx pg_dump --clean --if-exists --no-owner \
    --dbname=modelmywatershed --username=modelmywatershed \
    --host=database.service.mmw.internal --table=tdxbasins \
    --compress=9 --verbose > tdxbasins.sql.gz
Data exported to s3://data.mmw.azavea.com/tdxbasins.sql.gz
This data was not imported correctly. In particular, specifying -lco FID=LINKNO made LINKNO the primary key, so rows with duplicate LINKNO values failed to insert and were silently dropped by -skipfailures. Instead, the data must be reimported without that option, letting ogr2ogr assign its default ogc_fid key. After deleting the incorrectly imported table, the data was reimported.
There are many duplicate linkno values in the reimported data. Some of these have very high numbers of duplicates:

SELECT linkno, count
FROM "20241210_tdxbasins_duplicate_linkno.csv"
ORDER BY count DESC
LIMIT 10;
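The CSV above was presumably exported from Postgres; a query along these lines (a sketch) reproduces it directly against the table:

-- Find every linkno with more than one row, largest groups first
SELECT linkno, COUNT(*) AS count
FROM tdxbasins
GROUP BY linkno
HAVING COUNT(*) > 1
ORDER BY count DESC;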
Looking at the full rows for these, they have identical values for every other column except for geom, which is a slightly different square for each case:

SELECT *
FROM tdxbasins
WHERE linkno IN (
    130565963,
    130356958,
    130690836,
    130642990,
    130565970,
    130533444,
    130612035,
    130560773,
    130343248,
    130555645)
ORDER BY linkno;
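That claim can be checked mechanically for any one of the linknos above. This is a sketch; the non-geometry column list is illustrative, since the full schema isn't shown here:

-- For a single duplicated linkno, count distinct attribute tuples vs.
-- distinct geometries; 1 and >1 respectively would confirm the rows
-- differ only in geom.
SELECT COUNT(DISTINCT (root_id, discover_time, finish_time)) AS distinct_attrs,
       COUNT(DISTINCT ST_AsBinary(geom)) AS distinct_geoms
FROM tdxbasins
WHERE linkno = 130565963;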
Currently investigating if this is an ingest or source data issue.
The recommendation from @aufdenkampe was to keep only one of the duplicate rows for each linkno. Since the rows are true duplicates with only minor differences in geom, we could keep any of the duplicates. I decided to keep the first. First I made a companion table of the rows to keep:

CREATE TABLE nondup AS
SELECT MIN(ogc_fid), linkno
FROM tdxbasins
GROUP BY linkno;

Then I joined tdxbasins against it to delete every row that was not a kept minimum:

DELETE FROM tdxbasins
WHERE ogc_fid IN (
    SELECT ogc_fid
    FROM tdxbasins a LEFT JOIN nondup b ON a.ogc_fid = b.min
    WHERE b.min IS NULL
);

This left 15936428 rows in tdxbasins.
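Before swapping the primary key over to linkno, a quick check (a sketch) confirms the delete left exactly one row per linkno:

-- If every linkno is now unique, these two counts will match
SELECT COUNT(*) AS total_rows, COUNT(DISTINCT linkno) AS distinct_linknos
FROM tdxbasins;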
Then I made linkno the primary key and recreated the delineation index:

-- Step 1: Drop the existing primary key constraint
ALTER TABLE tdxbasins DROP CONSTRAINT tdxbasins_pkey;
-- Step 2: Add a new primary key constraint on `linkno`
ALTER TABLE tdxbasins ADD CONSTRAINT tdxbasins_pkey PRIMARY KEY (linkno);
-- Step 3: Drop the `ogc_fid` column
ALTER TABLE tdxbasins DROP COLUMN ogc_fid;
-- Step 4: Add index for root id, discover time, finish time for fast lookup
CREATE INDEX idx_tdxbasins_root_discover_finish
ON tdxbasins (root_id, discover_time, finish_time);

This was all done locally on my developer machine because our staging database ran out of CPU credits. Now I'm going to dump this, upload to S3, and re-import to staging.
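With the index recreated, an EXPLAIN on the delineation predicate should show it being used. This is a sketch with made-up literal values; actual plans vary with table statistics:

-- Expect an index or bitmap index scan on idx_tdxbasins_root_discover_finish
-- rather than a sequential scan (the values below are hypothetical)
EXPLAIN
SELECT ST_Union(geom)
FROM tdxbasins
WHERE root_id = 130000000
  AND discover_time <= 5000
  AND finish_time >= 5000;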
This is now complete and deployed to production.
Ingest the TDX Hydro Global Basins dataset into Postgres. This should match the additional fields added to the streams in #3646.
The source data will be made available by @ptomasula at s3://data.mmw.azavea.com/. Once it is:
- ingest it with ogr2ogr
- export it with pg_dump for reuse