GeoParquet reader revamp #660

Merged
merged 24 commits into from
Aug 11, 2024

Conversation

kylebarron
Member

The difficulty here is that the output schema has to be known when the iterator is created, before any data is read:

  • if the geometry type is known in the parquet metadata, parse to that type
  • if the geometry type is not known, parse to a MixedArray
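
As a rough illustration of the rule in the two bullets above, here is a minimal sketch. All names (`GeometryType`, `OutputType`, `infer_output_type`) are hypothetical and are not the geoarrow-rs API. It assumes the declared types come from a deduplicated list such as the `geometry_types` field of the GeoParquet column metadata, where an empty list means the type is not known.

```rust
/// Hypothetical geometry type tags, standing in for whatever the
/// GeoParquet column metadata declares.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GeometryType {
    Point,
    LineString,
    Polygon,
    MultiPoint,
    MultiLineString,
    MultiPolygon,
}

/// The array type the reader commits to before reading any data.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OutputType {
    /// Exactly one geometry type is declared, so parse to that type.
    Known(GeometryType),
    /// Zero or several types are declared, so fall back to a mixed array.
    Mixed,
}

/// Decide the output type from metadata alone (assumes `declared` has no
/// duplicate entries).
fn infer_output_type(declared: &[GeometryType]) -> OutputType {
    match declared {
        [only] => OutputType::Known(*only),
        _ => OutputType::Mixed,
    }
}
```

Because the decision depends only on metadata, the schema can be fixed when the iterator is constructed. A real implementation might do more than this sketch, for example by handling combinations of declared types specially.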

Also, we should look more closely at the parquet `ArrowReaderBuilder`, which already has a lot of functionality covering both sync and async readers. Can we put all the geospatial functionality in a `GeoParquetReader<ArrowReaderBuilder<T>>`, and then apply the same batch transforms to each GeoParquet batch for both the async and sync readers?
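
The wrapper idea can be sketched with the standard library only. Everything here is a hypothetical stand-in: `RawBatch` and `GeoBatch` replace Arrow record batches, `parse_batch` replaces the real geometry parsing, and this `GeoParquetReader` wraps a plain iterator rather than the parquet crate's `ArrowReaderBuilder`. The point is only that the per-batch transform lives in one function that any reader can call.

```rust
/// Hypothetical stand-in for a record batch whose geometry column is still WKB.
#[derive(Debug, Clone, PartialEq)]
struct RawBatch {
    wkb: Vec<Vec<u8>>,
}

/// Hypothetical stand-in for a batch with a parsed geometry column.
#[derive(Debug, Clone, PartialEq)]
struct GeoBatch {
    geometry_count: usize,
}

/// The shared per-batch transform. In this sketch it only rejects empty
/// WKB buffers and counts the rest.
fn parse_batch(raw: RawBatch) -> Result<GeoBatch, String> {
    if raw.wkb.iter().any(|buf| buf.is_empty()) {
        return Err("empty WKB buffer".to_string());
    }
    Ok(GeoBatch {
        geometry_count: raw.wkb.len(),
    })
}

/// Generic wrapper: any source of raw batches becomes a source of parsed ones.
struct GeoParquetReader<R> {
    inner: R,
}

impl<R> Iterator for GeoParquetReader<R>
where
    R: Iterator<Item = Result<RawBatch, String>>,
{
    type Item = Result<GeoBatch, String>;

    fn next(&mut self) -> Option<Self::Item> {
        // Errors from the inner reader pass through unchanged;
        // successful batches go through the shared transform.
        self.inner.next().map(|batch| batch.and_then(parse_batch))
    }
}
```

An async variant would presumably wrap a stream instead of an iterator and call the same `parse_batch`, so the geospatial logic is written once.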

@kylebarron changed the title from "Parquet record batch reader" to "GeoParquet reader revamp" on Jul 4, 2024
@H-Plus-Time
Contributor

H-Plus-Time commented Jul 27, 2024

One thing before I dive too deeply into tweaking the request flow: this doesn't happen to cover HEAD request elimination or the metadata size guess stuff, right?

(On the latter point: for every file in the Overture Maps dataset, the size guess undershoots by about 320 kB. Well, it's either that or there's something immediately preceding the FileMetaData region that's always read.)

@kylebarron
Member Author

> this doesn't happen to cover HEAD request elimination or the metadata size guess stuff, right?

No it doesn't

@kylebarron
Member Author

> something immediately preceding the FileMetaData region that's always read

That might be the PageIndex
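
For context on the size guess, a Parquet file ends with a 4-byte little-endian metadata length followed by the magic bytes `PAR1`. A reader that speculatively fetches the last N bytes can check from those 8 bytes whether the guess covered the whole FileMetaData. The sketch below uses hypothetical names (`FooterPlan`, `plan_footer_read`) and is not the parquet crate's API. It only accounts for the FileMetaData itself, so any extra region read before it, such as a page index, would need additional bytes on top of what it reports.

```rust
/// Trailer size: 4-byte little-endian metadata length + 4-byte magic.
const FOOTER_SIZE: usize = 8;
const MAGIC: &[u8; 4] = b"PAR1";

#[derive(Debug, PartialEq, Eq)]
enum FooterPlan {
    /// The prefetched suffix already contains the full metadata,
    /// starting at this offset within the suffix.
    Covered { start_in_suffix: usize },
    /// The guess undershot: this many more bytes are needed before the suffix.
    NeedMore { missing_bytes: usize },
}

/// Inspect the speculatively fetched tail of a file and decide whether a
/// second request is required to read the metadata.
fn plan_footer_read(suffix: &[u8]) -> Result<FooterPlan, String> {
    if suffix.len() < FOOTER_SIZE {
        return Err("suffix shorter than the 8-byte Parquet footer".to_string());
    }
    let tail = &suffix[suffix.len() - FOOTER_SIZE..];
    if tail[4..] != MAGIC[..] {
        return Err("missing PAR1 magic bytes".to_string());
    }
    let metadata_len = u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]) as usize;
    let needed = metadata_len + FOOTER_SIZE;
    if needed <= suffix.len() {
        Ok(FooterPlan::Covered {
            start_in_suffix: suffix.len() - needed,
        })
    } else {
        Ok(FooterPlan::NeedMore {
            missing_bytes: needed - suffix.len(),
        })
    }
}
```

When the plan is `NeedMore`, the reader pays for one extra round trip, which is why a guess that consistently undershoots is worth tuning.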

@kylebarron mentioned this pull request on Aug 10, 2024
@kylebarron enabled auto-merge (squash) on August 11, 2024 16:35
@kylebarron merged commit 2a7f150 into main on Aug 11, 2024
22 checks passed
@kylebarron deleted the kyle/parquet-record-batch-reader branch on August 11, 2024 16:43
kylebarron pushed a commit that referenced this pull request Aug 23, 2024
If it's ok, I'd be stoked to get a v0.3.0 release of **geoarrow** - some of my **stac-geoparquet** is getting close to being release-able.

Here's a checklist of things that have been mentioned as part of a v0.3 (including #628 (comment) and https://github.com/geoarrow/geoarrow-rs/milestone/3):

- #660 is done ✅
- Some (but not all) of the doc updates are done in #696, and I've got a tracking issue for the rest in #689
- "Broader support for 3d geometries" isn't done as far as I know, but I haven't really been touching that at all yet
- #539 is a Python thing, not a Rust crate thing

As a part of this release PR I've updated our deps when possible (`sqlx` will require code change to support an update so I haven't done that one).