Performance of raster to grid #401
Comments
@JimShady Thanks for formally logging it. It is good to have a formal record, and the issue will be flagged once the PR is opened and merged. FYI, the work going on in PR #393 will really unlock these types of capabilities, because it allows rasters to be retiled on the fly more efficiently. It is still in draft since we are actively working on it. The retiling is important for blob storage and for unlocking GDAL outside of DBFS. I will drop in updates here as we progress.
Hello, we have a similar issue. We are trying to convert TIF files (stored on an Azure storage account) using the raster-to-grid function and save the results as Delta tables in a Databricks notebook. The file sizes are less than 1 GB. We have used these options: Sometimes we get a GC error:
We have discussed this already, Milos, but I thought it would be good to log an official ticket. I believe that the performance of the raster to grid reader could, and should, be improved.
https://databrickslabs.github.io/mosaic/api/raster-format-readers.html#mos-read-format-raster-to-grid
For files of 3-4 GB it takes hours and hours, which is too long for us to be able to make use of it. I believe you explained that this is because it does not read the file using the Spark workers, and only uses one processor on the worker node?
My feeling is that "window" reads should be used by each processor/spark worker, and that way the file will be read much faster.
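To illustrate the "window" idea, here is a minimal sketch of how a raster could be split into independent read windows, one per task. It is not Mosaic's implementation: the `Window` tuple, the `iter_windows` helper, and the tile size are hypothetical names and values chosen for illustration, and the actual pixel read (for example a GDAL or rasterio windowed read on each worker) is left out.

```python
from typing import Iterator, NamedTuple


class Window(NamedTuple):
    """Hypothetical description of a rectangular pixel window in a raster."""
    col_off: int  # x offset of the window's left edge, in pixels
    row_off: int  # y offset of the window's top edge, in pixels
    width: int    # window width, in pixels
    height: int   # window height, in pixels


def iter_windows(raster_width: int, raster_height: int, tile_size: int) -> Iterator[Window]:
    """Yield non-overlapping windows that together cover the whole raster.

    Edge windows are clipped so they never extend past the raster bounds.
    """
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    for row_off in range(0, raster_height, tile_size):
        height = min(tile_size, raster_height - row_off)
        for col_off in range(0, raster_width, tile_size):
            width = min(tile_size, raster_width - col_off)
            yield Window(col_off, row_off, width, height)


# Example: a 10,000 x 7,500 pixel raster split into tiles of at most 4,096 pixels.
# Each window could then be handed to a separate Spark task that reads only
# that region of the file (assumption: the file format supports windowed reads).
windows = list(iter_windows(10_000, 7_500, 4_096))
for w in windows:
    print(w)
```

Because the windows do not overlap and do not depend on each other, they can be distributed across workers in any order, and the per-window results combined afterwards.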
Thanks.