[SEDONA-630] Improve ST_Union_Aggr performance #1526

Merged: 5 commits, Jul 22, 2024

zhangfengcdt
Contributor

Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

What changes were proposed in this PR?

Switch to JTS OverlayNGRobust.union function to perform geometry union and add geometry cache capability.
https://locationtech.github.io/jts/javadoc/org/locationtech/jts/operation/overlayng/OverlayNGRobust.html
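
For reference, a minimal sketch of calling the JTS routine directly, outside of Sedona. It assumes JTS 1.18 or later on the classpath; the object name `UnionSketch`, the helper `unionAll`, and the two unit squares are illustrative and not taken from the patch, which may invoke the function differently.

```scala
import org.locationtech.jts.geom.{Geometry, GeometryFactory}
import org.locationtech.jts.io.WKTReader
import org.locationtech.jts.operation.overlayng.OverlayNGRobust

object UnionSketch {
  private val factory = new GeometryFactory()
  private val reader = new WKTReader(factory)

  // Unions all inputs in one call by wrapping them in a GeometryCollection
  // (hypothetical helper, not part of Sedona).
  def unionAll(wkts: Seq[String]): Geometry = {
    val geoms: Array[Geometry] = wkts.map(w => reader.read(w)).toArray
    OverlayNGRobust.union(factory.createGeometryCollection(geoms))
  }

  def main(args: Array[String]): Unit = {
    // Two unit squares that share the edge x = 1.
    val merged = unionAll(Seq(
      "POLYGON ((0 0, 0 1, 1 1, 1 0, 0 0))",
      "POLYGON ((1 0, 1 1, 2 1, 2 0, 1 0))"))
    println(merged.toText)
    println(merged.getArea)
  }
}
```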

How was this patch tested?

All existing unit tests should pass.

Did this PR include necessary documentation updates?

  • No, this PR does not affect any public API so no need to change the documentation.

@zhangfengcdt zhangfengcdt marked this pull request as ready for review July 18, 2024 20:56
@zhangfengcdt
Contributor Author

@jiayuasu I noticed that after switching from geo.buffer to OverlayNGRobust.union, the representation of complex geometries returned by ST_Union_Aggr might change because polygon/polyline vertices are reordered. For example:

New: POLYGON ((1 0, 0 0, 0 1, 1 1, 2 1, 2 0, 1 0))

Old: POLYGON ((0 0, 0 1, 1 1, 2 1, 2 0, 1 0, 0 0))

They represent the same polygon; the ring simply starts at a different vertex.
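
A small sketch of how a test could compare the two representations without depending on the starting vertex. It assumes JTS is on the classpath; the object name `RingOrderCheck` is illustrative, and this is a suggestion rather than what the existing tests do.

```scala
import org.locationtech.jts.io.WKTReader

object RingOrderCheck {
  private val reader = new WKTReader()

  // The two WKT strings quoted in the comment above.
  val oldWkt = "POLYGON ((0 0, 0 1, 1 1, 2 1, 2 0, 1 0, 0 0))"
  val newWkt = "POLYGON ((1 0, 0 0, 0 1, 1 1, 2 1, 2 0, 1 0))"

  def main(args: Array[String]): Unit = {
    val oldGeom = reader.read(oldWkt)
    val newGeom = reader.read(newWkt)

    // Vertex-by-vertex comparison is sensitive to where the ring starts.
    println(oldGeom.equalsExact(newGeom))
    // Topological comparison ignores the starting vertex.
    println(oldGeom.equalsTopo(newGeom))
    // Normalizing both sides gives each ring a canonical starting vertex.
    println(oldGeom.norm().equalsExact(newGeom.norm()))
  }
}
```

Comparing with `equalsTopo`, or comparing normalized copies, should keep assertions stable if the vertex order shifts again in a later JTS release.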

@jiayuasu
Member

I think this is fine.

In addition, we want to make sure the behavior of ST_Union_Aggr is similar to PostGIS ST_Union (array variant): https://postgis.net/docs/ST_Union.html

Member

@jiayuasu jiayuasu left a comment

@zhangfengcdt Did you see performance improvement using this implementation, compared to the previous one?

@zhangfengcdt
Contributor Author

Yes, I am adding some tests that report performance measurements, so we can see the improvement for different cases there.

@jiayuasu jiayuasu added this to the sedona-1.6.1 milestone Jul 20, 2024
@zhangfengcdt
Contributor Author

zhangfengcdt commented Jul 22, 2024

@jiayuasu I used the newly added test to measure the old and new runtimes for different numbers of geometries. Here are the results:

 Number of Polygons   |  OLD ST_Union_Aggr (in ms) |  NEW ST_Union_Aggr (in ms)
 -------------------------------------------------------------------------------
 100                  |             297            |              354     
 500                  |             750            |              386
 1,000                |           2,231            |              430
 5,000                |          53,870            |            1,400
 10,000               |         243,465            |            3,474

I think this clearly shows that the new method is much more efficient and scalable: it is slightly slower at 100 polygons but roughly 70 times faster at 10,000.
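
To make the comparison concrete, here is a short sketch that derives the speedup factor (old runtime divided by new runtime) for each row. The numbers are copied from the table above; the object name `SpeedupTable` is illustrative.

```scala
object SpeedupTable {
  // (number of polygons, old runtime in ms, new runtime in ms), copied from the table above.
  val measurements: Seq[(Int, Long, Long)] = Seq(
    (100, 297L, 354L),
    (500, 750L, 386L),
    (1000, 2231L, 430L),
    (5000, 53870L, 1400L),
    (10000, 243465L, 3474L))

  // Speedup factor: values above 1.0 mean the new implementation is faster.
  def speedup(oldMs: Long, newMs: Long): Double = oldMs.toDouble / newMs

  def main(args: Array[String]): Unit =
    measurements.foreach { case (n, oldMs, newMs) =>
      println(f"$n%6d polygons: ${speedup(oldMs, newMs)}%.1fx")
    }
}
```

Doubling the input from 5,000 to 10,000 polygons multiplies the old runtime by about 4.5 and the new runtime by about 2.5, which is consistent with the new method scaling better on this workload.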

|SELECT explode(array($polygonArrayStr)) AS geom
""".stripMargin

sparkSession.sql(sqlQuery).createOrReplaceTempView("geometry_table")
Member

Can you return a reference to the DataFrame from the function instead of creating a new temp view? Otherwise this might pollute the global namespace and lead to bugs that are hard to find.
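
A rough sketch of what this suggestion could look like. The full body of the helper is not visible in this thread, so everything here is an assumption: the object name `PolygonFixture`, the explicit `SparkSession` parameter, the `wkt` column, and the way the squares are generated.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object PolygonFixture {
  // Hypothetical variant of the test helper: it returns the DataFrame itself,
  // so nothing is registered in the session catalog.
  def createPolygonDataFrame(spark: SparkSession, numPolygons: Int): DataFrame = {
    import spark.implicits._
    // One unit square per row, shifted along the x axis so neighbors share an edge.
    val wkts = (0 until numPolygons).map { i =>
      s"POLYGON (($i 0, $i 1, ${i + 1} 1, ${i + 1} 0, $i 0))"
    }
    wkts.toDF("wkt")
  }
}
```

With Sedona registered on the session, the caller could then run something like `df.selectExpr("ST_Union_Aggr(ST_GeomFromWKT(wkt))")` on the returned DataFrame instead of querying a shared view name.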

createPolygonDataFrame(numPolygons)

// cache the table to eliminate the time of table scan
sparkSession.sql("cache table geometry_table")
Member

Can you also unpersist this table at the end of the test case? Otherwise this will lead to a memory leak.
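
One possible shape for this, sketched under the assumption that the test works with a DataFrame reference; the names `CachedTiming` and `withCached` are hypothetical and not part of the patch.

```scala
import org.apache.spark.sql.DataFrame

object CachedTiming {
  // Runs `body` against a cached DataFrame and always releases the cache afterwards,
  // even if the body throws.
  def withCached[T](df: DataFrame)(body: DataFrame => T): T = {
    val cached = df.cache()
    try {
      cached.count() // materialize the cache so the timed code does not pay for the scan
      body(cached)
    } finally {
      cached.unpersist()
    }
  }
}
```

If the temp view is kept instead, `sparkSession.sql("UNCACHE TABLE geometry_table")` or `sparkSession.catalog.uncacheTable("geometry_table")` at the end of the test case would serve the same purpose.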

@jiayuasu jiayuasu merged commit bab1f77 into apache:master Jul 22, 2024
50 checks passed