
Performance of raster to grid #401

Open
JimShady opened this issue Jun 9, 2023 · 2 comments

Comments

@JimShady

JimShady commented Jun 9, 2023

We have discussed this already, Milos, but I thought it would be good to log an official ticket: I believe that the performance of the raster-to-grid reader could and should be improved, please.

https://databrickslabs.github.io/mosaic/api/raster-format-readers.html#mos-read-format-raster-to-grid

For files of 3-4 GB it takes hours and hours, so long that we are not able to make use of it at all. I believe you explained that this is because it is not reading the file using the Spark workers, and is only using one processor on the worker node?

My feeling is that "window" reads should be issued by each processor/Spark worker, so that the file is read in parallel and therefore much faster.

Thanks.
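A window-based read like the one suggested above could be sketched as follows. This is illustrative only: the `tile_windows` helper is a hypothetical name, and the GDAL call mentioned in the comment is an assumption about how a worker would consume each window, not Mosaic's actual implementation.

```python
def tile_windows(raster_width, raster_height, tile_size):
    """Split a raster's pixel extent into tile windows.

    Each window is (col_off, row_off, width, height); edge tiles are
    clipped so the union of windows exactly covers the raster.
    """
    windows = []
    for row_off in range(0, raster_height, tile_size):
        for col_off in range(0, raster_width, tile_size):
            width = min(tile_size, raster_width - col_off)
            height = min(tile_size, raster_height - row_off)
            windows.append((col_off, row_off, width, height))
    return windows

# A 25,000 x 10,000 raster with tile_size=1000 yields 250 windows.
# Each window could become one Spark task, with the worker reading
# only its region, e.g. band.ReadAsArray(col_off, row_off, width, height).
windows = tile_windows(25_000, 10_000, 1_000)
print(len(windows))  # 250
```

Distributing the window list as a Spark RDD or DataFrame would let every executor core read a disjoint pixel region concurrently, instead of one thread scanning the whole 3-4 GB file.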

@milos-colic
Contributor

@JimShady Thanks for formally logging it. It is good to have a formal track of it, and it will also flag the PR once it is opened and merged.

FYI, the work going on in PR #393 will really unlock these types of capabilities, thanks to the ability to retile rasters on the fly more efficiently. It is still in draft since we are actively working on it.

The re-tiling is important for blob storage and for unlocking GDAL outside of DBFS.
Also, 3-4 GB files are too big for a single thread, and re-tiling is the proper way forward for other raster operations as well. Smaller tiles work much better on distributed systems.
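As a rough back-of-the-envelope illustration of why smaller tiles help (the numbers here are assumed, not taken from this thread): a 4 GB single-band float32 raster split into 1000x1000-pixel tiles yields on the order of a thousand ~4 MB tiles, each small enough for an executor core to process independently.

```python
raster_bytes = 4 * 1024**3   # assumed 4 GB input raster
tile_px = 1_000 * 1_000      # 1000 x 1000-pixel tiles
bytes_per_px = 4             # one float32 band

tile_bytes = tile_px * bytes_per_px   # 4,000,000 bytes, ~4 MB per tile
n_tiles = raster_bytes // tile_bytes  # ~1073 tiles to distribute

print(tile_bytes, n_tiles)
```

A thousand small tasks spread across a cluster finishes far sooner than one 4 GB read on a single thread, and each task's memory footprint stays well under executor limits.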

I will drop in updates here as we progress.

@MajaHildebrandt

Hello, we have a similar issue. We are trying to convert TIFF files (stored on an Azure storage account) using the raster-to-grid function and save the results as Delta tables in a Databricks notebook. The file sizes are less than 1 GB.

We have used these options:
```python
df = (
    mos.read()
    .format("raster_to_grid")
    .option("fileExtension", "*.tif")
    .option("resolution", "6")
    .option("retile", "true")
    .option("tileSize", "1000")
    .load(f"{landing_path}/{file_name}")
)
```
We have tried several types of cluster, but the task never ends.

Sometimes we get a GC error:
```
java.lang.OutOfMemoryError: GC overhead limit exceeded
      at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:984)
      at scala.collection.MapLike$MappedValues.foreach(MapLike.scala:257)
      at scala.collection.TraversableLike.map(TraversableLike.scala:286)
      at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
      at scala.collection.AbstractTraversable.map(Traversable.scala:108)
      at com.databricks.labs.mosaic.expressions.raster.base.RasterToGridExpression.$anonfun$serialize$1(RasterToGridExpression.scala:107)
      at com.databricks.labs.mosaic.expressions.raster.base.RasterToGridExpression$$Lambda$2391/269566009.apply(Unknown Source)
      at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
      at scala.collection.TraversableLike$$Lambda$87/966567431.apply(Unknown Source)
      at scala.collection.Iterator.foreach(Iterator.scala:943)
```
