PostGIS to partitioned GeoParquet #955
Replies: 1 comment
-
Hello 👋 I've been exploring spatially partitioned GeoParquet for a while. And it's something I'm motivated to get integrated here, but in the short term the answer to your question is no, this isn't fully ready yet via this library. I'm getting ready for a new Rust and Python release in January, and then focusing more on integrating spatial support for DataFusion, an extensible query engine built on Rust, Arrow, and Parquet. I'm hoping that this integration will enable spatial partitioning of large datasets easy to do out of the box. I'd also point you to https://github.com/kylebarron/geo-index, which is a very memory efficient way to create a KD tree and which has Python bindings. So it may be possible to create the KDTree in memory, even if you can't put the full dataset in memory. In the near term I'd like to add APIs to that which let you extract the spatial partitioning information from the generated kdtree. (Also, I'm not sure why but in that linked discussion, they seemed to say that generating the KDTree partitioning information took a few minutes with the recursive SQL approach, but with geo-index it only took 13 seconds from Python). I'd like to have better spatial partitioning functionality in place here (and ideally easy to use) in Q1-Q2 2025. Contributions are also welcome! If your data is already in PostGIS, I'd start by trying to construct the output partition information using PostGIS and then grouping by each of those output partitions and writing those to disk directly from PostGIS. |
Beta Was this translation helpful? Give feedback.
-
I've been following the discussion on GeoParquet partitioning strategies and came across @kylebarron's comment about this library which might be of use.
We're looking to convert versions of somewhat large datasets to spatially partitioned GeoParquets. Each vary in size, for example the point dataset is ~150GB in postgres, whereas the polygons are ~1-3TB.
The datasets are available in annual snapshots going back ~20 years. The way people query it is usually temporally then spatially (i.e. Find me data in 2018 for this area).
We've been looking at various approaches such as KD-Trees with DuckDB mentioned in the above discussion.
I'm wondering if there's a way to construct the partitions straight from PostGIS using the PostGIS and GeoParquet functions?
Any thoughts or pointers would be amazing, in the meantime I'll look to construct a reproducible example if anyone is interested.
Beta Was this translation helpful? Give feedback.
All reactions