
Performance of raster to grid #401

Open
JimShady opened this issue Jun 9, 2023 · 2 comments

Comments

@JimShady

JimShady commented Jun 9, 2023

We have discussed this already, Milos, but I thought it would be good to log an official ticket: I believe that the performance of the raster-to-grid reader could and should be improved, please.

https://databrickslabs.github.io/mosaic/api/raster-format-readers.html#mos-read-format-raster-to-grid

For files of 3-4 GB it takes hours and hours, so long that we are not able to make use of it at all. I believe you explained that this is because it is not reading the file using the Spark workers, and is only using one processor on the worker node?

My feeling is that "window" reads should be issued by each processor/Spark worker, so that the file is read in parallel and therefore much faster.

Thanks.
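A window-based read like the one suggested above could be sketched as follows. This is illustrative only: the `tile_windows` helper is a hypothetical name, and the GDAL call mentioned in the comment is an assumption about how a worker would consume each window, not Mosaic's actual implementation.

```python
def tile_windows(raster_width, raster_height, tile_size):
    """Split a raster's pixel extent into tile windows.

    Each window is (col_off, row_off, width, height); edge tiles are
    clipped so the union of windows exactly covers the raster.
    """
    windows = []
    for row_off in range(0, raster_height, tile_size):
        for col_off in range(0, raster_width, tile_size):
            width = min(tile_size, raster_width - col_off)
            height = min(tile_size, raster_height - row_off)
            windows.append((col_off, row_off, width, height))
    return windows

# A 25,000 x 10,000 raster with tile_size=1000 yields 250 windows.
# Each window could become one Spark task, with the worker reading
# only its region, e.g. band.ReadAsArray(col_off, row_off, width, height).
windows = tile_windows(25_000, 10_000, 1_000)
print(len(windows))  # 250
```

Distributing the window list as a Spark RDD or DataFrame would let every executor core read a disjoint pixel region concurrently, instead of one thread scanning the whole 3-4 GB file.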

@milos-colic
Contributor

@JimShady Thanks for formally logging it. It is good to have a formal track of it, and it will also flag the PR once it is opened and merged.

FYI, the work going on in PR #393 will really unlock these types of capabilities, thanks to the ability to retile rasters on the fly more efficiently. It is still in draft since we are actively working on it.

The re-tiling is important for blob storage and for unlocking GDAL outside of DBFS.
Also, 3-4 GB files are too big for a single thread, and re-tiling is the proper way forward for other raster operations as well. Smaller tiles work much better on distributed systems.
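As a rough back-of-the-envelope illustration of why smaller tiles help (the numbers here are assumed, not taken from this thread): a 4 GB single-band float32 raster split into 1000x1000-pixel tiles yields on the order of a thousand ~4 MB tiles, each small enough for an executor core to process independently.

```python
raster_bytes = 4 * 1024**3   # assumed 4 GB input raster
tile_px = 1_000 * 1_000      # 1000 x 1000-pixel tiles
bytes_per_px = 4             # one float32 band

tile_bytes = tile_px * bytes_per_px   # 4,000,000 bytes, ~4 MB per tile
n_tiles = raster_bytes // tile_bytes  # ~1073 tiles to distribute

print(tile_bytes, n_tiles)
```

A thousand small tasks spread across a cluster finishes far sooner than one 4 GB read on a single thread, and each task's memory footprint stays well under executor limits.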

I will drop in updates here as we progress.

@MajaHildebrandt

Hello, we have a similar issue. We are trying to convert TIFF files (stored on an Azure storage account) using the raster-to-grid function and save the results as Delta tables in a Databricks notebook. The file sizes are less than 1 GB.

We have used these options:
```python
df = (
    mos.read()
    .format("raster_to_grid")
    .option("fileExtension", "*.tif")
    .option("resolution", "6")
    .option("retile", "true")
    .option("tileSize", "1000")
    .load(f"{landing_path}/{file_name}")
)
```
We have tried several types of cluster, but the task never ends.

Sometimes we get a GC error:
```
java.lang.OutOfMemoryError: GC overhead limit exceeded
      at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:984)
      at scala.collection.MapLike$MappedValues.foreach(MapLike.scala:257)
      at scala.collection.TraversableLike.map(TraversableLike.scala:286)
      at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
      at scala.collection.AbstractTraversable.map(Traversable.scala:108)
      at com.databricks.labs.mosaic.expressions.raster.base.RasterToGridExpression.$anonfun$serialize$1(RasterToGridExpression.scala:107)
      at com.databricks.labs.mosaic.expressions.raster.base.RasterToGridExpression$$Lambda$2391/269566009.apply(Unknown Source)
      at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
      at scala.collection.TraversableLike$$Lambda$87/966567431.apply(Unknown Source)
      at scala.collection.Iterator.foreach(Iterator.scala:943)
```
