Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preserve Spatial Partitioning From RDD to Dataframe #1268

Open
jwass opened this issue Mar 5, 2024 · 2 comments
Open

Preserve Spatial Partitioning From RDD to Dataframe #1268

jwass opened this issue Mar 5, 2024 · 2 comments

Comments

@jwass
Copy link

jwass commented Mar 5, 2024

Is there a way to spatially partition a dataframe and write it out using that partitioning scheme (presumably by converting to/from a spatial rdd)? This is my guess as to how to accomplish this but I'm not sure if I'm misunderstanding things... I'm also relatively new to working with Spark and Sedona.

Expected behavior

Loading a dataframe, converting to rdd, spatially partition it, convert back to dataframe, and save the result - I'd expect the final dataframe partitioning to be preserved from the rdd.

Actual behavior

Adapter.toDf() does not preserve partitioning - or I'm doing something else wrong.

Steps to reproduce the problem

df =  sedona.read.format("geoparquet").load(path)
rdd = Adapter.toSpatialRdd(df, "geometry")
rdd.analyze()
rdd.spatialPartitioning(GridType.KDBTREE, num_partitions=6)

df2 = Adapter.toDf(rdd, spark)
df2.write.format("geoparquet").save(output_path)

But it looked like that doesn't work - number of partitions written in df2 was far greater than 6.

Settings

Sedona version = 1.5.1

Apache Spark version = ?

API type = Python

Python version = ?

Environment = Databricks

@jiayuasu
Copy link
Member

jiayuasu commented Mar 6, 2024

@jwass Is there a reason why you want to use the Sedona rdd-based spatial partitioning? This is considered as low-level API and only used for spatial join.

Most importantly, given polygon data, the spatial partitioned RDD will have duplicates because some polygons will cross the boundaries of multiple partitions and we duplicate those to overlapping partitions. Our spatial join algorithm will automatically de-dup after getting the join result.

@jwass
Copy link
Author

jwass commented Mar 6, 2024

@jwass Is there a reason why you want to use the Sedona rdd-based spatial partitioning? This is considered as low-level API and only used for spatial join.

Most importantly, given polygon data, the spatial partitioned RDD will have duplicates because some polygons will cross the boundaries of multiple partitions and we duplicate those to overlapping partitions. Our spatial join algorithm will automatically de-dup after getting the join result.

@jiayuasu What I really want to do is write out a large geoparquet dataset where the individual parquet files are spatially partitioned intelligently. This will improve performance of remote spatial queries by bounding box. We have some solutions now to split by geohash/quadkey, but a partitioning scheme backed by a kdb-tree / r-tree / etc would be better. The fact that polygons' extents will cause overlaps of the spatial partitions is fine but we do need to assign each row to only one partition. I was hoping there was a way to use df.repartition with the spatial rdd's partitioner to make it all work. But let me know if this is not the right use for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants