
Turning a raster (geotiff) into a H3 table? #1047

Closed
JimShady opened this issue Oct 4, 2023 · 17 comments · Fixed by #1140

Comments

@JimShady (Contributor) commented Oct 4, 2023

Hello,

I've been following the development of Apache Sedona with interest. Thanks for the work.

I was wondering if there are now sufficient tools/functions in Apache Sedona to convert a GeoTiff into an H3 Spark table? How would I go about this?

Thanks.

@jiayuasu (Member) commented Oct 6, 2023

@JimShady I wonder what exactly you mean by GeoTiff to H3?

With the new H3 functions added in Sedona, the following are now possible:

  1. GeoTiff geo bounding envelope -> H3 cells
df1 = df.selectExpr("raster", "ST_H3CellIDs(RS_Envelope(raster), 6, true) as h3")  # level 6, full cover
df2 = df1.select(col("raster"), explode(col("h3")).alias("h3"))
  2. Individual pixel location + corresponding observed band value -> H3 cells + values
df1 = df.selectExpr("raster", "ST_Point(RS_RasterToWorldCoordX(raster, colX, rowY), RS_RasterToWorldCoordY(raster, colX, rowY)) as pixelWorldCoord")
df2 = df1.selectExpr("RS_Value(raster, pixelWorldCoord) as observed_value", "ST_H3CellIDs(pixelWorldCoord, 6, true) as h3")
df3 = df2.select(col("observed_value"), explode(col("h3")).alias("h3"))
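For intuition, the pixel-to-world mapping that RS_RasterToWorldCoordX/Y performs can be sketched in plain Python, assuming a simple north-up raster with no rotation or shear. This is not the Sedona implementation; the origin and pixel sizes below are made-up values for illustration.

```python
# Sketch of a pixel-to-world mapping for a north-up raster (no rotation/shear).
# upper_left_x/y and the pixel sizes are hypothetical values.
def raster_to_world(col_x, row_y, upper_left_x, upper_left_y,
                    pixel_width, pixel_height):
    """World coordinate of the upper-left corner of pixel (col_x, row_y)."""
    world_x = upper_left_x + col_x * pixel_width
    world_y = upper_left_y - row_y * pixel_height  # y decreases going down rows
    return world_x, world_y

# Example: a 0.5-degree grid anchored at (-180, 90)
print(raster_to_world(2, 3, -180.0, 90.0, 0.5, 0.5))  # (-179.0, 88.5)
```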

@JimShady (Contributor, author) commented Oct 6, 2023

Your second example is, I think, the type of operation I am thinking of. Obviously a raster is essentially a grid of values. So I'm thinking to use a H3 resolution that is similar to the raster resolution, and then extract the values from the raster into a H3 table.

In your example above, how is the H3 resolution chosen in example 2 please?

@JimShady (Contributor, author) commented Oct 6, 2023

I'm wondering really if Apache-Sedona has a similar function to this:

https://databrickslabs.github.io/mosaic/api/raster-format-readers.html#mos-read-format-raster-to-grid

Because I've tried to use this mosaic function and the performance is really slow.

databrickslabs/mosaic#401

@jiayuasu (Member) commented Oct 6, 2023

@JimShady

Here is the full doc of ST_H3CellIds in Sedona 1.5.0 (https://github.com/apache/sedona/blob/master/docs/api/sql/Function.md#st_h3cellids). You can choose an H3 level (resolution). 1.5.0 will be released next week.

Speaking of the raster_to_grid function, we don't have an equivalent because exploding a raster grid into individual rows will likely exhaust Sedona/Spark memory, and it might NOT be the best way to explore/manipulate raster images.

But we provide the following functions to let you directly manipulate the pixels of bands:

  1. RS_Resample: https://github.com/apache/sedona/blob/master/docs/api/sql/Raster-operators.md#rs_resample
  2. RS_MapAlgebra: https://github.com/apache/sedona/blob/master/docs/api/sql/Raster-operators.md#rs_mapalgebra
  3. Of course, RS_RasterToWorldCoordX, RS_RasterToWorldCoordY, RS_Value, RS_Values

Is raster_to_grid really needed? If you are comfortable describing your use case, we can probably give you better alternatives.

@JimShady (Contributor, author) commented Oct 6, 2023

I've a few use cases. ☺️ I'll explain them when I'm back in work Monday. Thanks again.

@JimShady (Contributor, author) commented Oct 17, 2023

Hi @jiayuasu . Sorry for the delay in getting back to you. What I am trying to do is something like:

  1. Read in 6 different rasters (they are all the same origin, resolution and pixel type etc).
  2. Use algebra to 'combine' them on a pixel-by-pixel basis into a new raster. For example, something like:

WHERE raster6 > 10 AND raster5 < 4 THEN (raster1 + raster2 + raster3) / raster4

Then I would write out the result to a new geotiff file.

Is that kind of thing possible? I think so, but I'm struggling with the syntax. All the examples in the documentation assume that we are dealing with rasters with multiple bands, for example here. I understand it, but my rasters are separate files, not bands of the same file.
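For what it's worth, the per-pixel rule described above can be pinned down with numpy arrays standing in for the six rasters. This is only a sketch of the intended semantics, using toy 2x2 grids with made-up values, not a Sedona operation.

```python
import numpy as np

# Toy 2x2 grids standing in for raster1..raster6 (made-up values)
r1, r2, r3 = np.full((2, 2), 2.0), np.full((2, 2), 4.0), np.full((2, 2), 6.0)
r4 = np.full((2, 2), 3.0)
r5 = np.array([[1.0, 5.0], [2.0, 9.0]])     # condition: raster5 < 4
r6 = np.array([[20.0, 20.0], [5.0, 20.0]])  # condition: raster6 > 10

# WHERE raster6 > 10 AND raster5 < 4 THEN (r1 + r2 + r3) / r4 ELSE 0
out = np.where((r6 > 10) & (r5 < 4), (r1 + r2 + r3) / r4, 0.0)
print(out.tolist())  # [[4.0, 0.0], [0.0, 0.0]]
```

Only the top-left pixel satisfies both conditions, so it gets (2+4+6)/3 = 4.0 and the rest fall back to 0.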

[screenshot: multi-band example from the documentation]

But my actual situation is this:

[screenshot: the separate raster files]

@jiayuasu (Member) commented Oct 18, 2023

@JimShady This is definitely possible in Sedona 1.5.0. We need to do this in 3 steps in PySpark:

First use sedona.read.format("binaryFile").load("/tmp/AU_202105*") to read these GeoTiffs, then use RS_FromGeoTiff() to create the raster-type column for this DataFrame. Pay attention to the 1 in the ID column; you can add it via SELECT 1 as ID, path, raster FROM df

| ID | path | raster |
|----|------|--------|
| 1 | XX_1 | GridCoverageXXX |
| 1 | XX_2 | GridCoverageXXX |
| 1 | XX_3 | GridCoverageXXX |
| 1 | XX_4 | GridCoverageXXX |
| 1 | XX_5 | GridCoverageXXX |
| 1 | XX_6 | GridCoverageXXX |

Now the exciting part comes:

  1. Transpose your DataFrame (swap row and column) using pivot function (https://sparkbyexamples.com/spark/spark-transpose-rows-to-columns-in-dataframe/amp/).
from pyspark.sql.functions import first

df.groupBy("ID").pivot("path").agg(first("raster"))

The resulting DF should be

| ID | XX_1 | XX_2 | XX_3 | XX_4 | XX_5 | XX_6 |
|----|------|------|------|------|------|------|
| 1 | GridCoverageXXX | GridCoverageXXX | GridCoverageXXX | GridCoverageXXX | GridCoverageXXX | GridCoverageXXX |
  2. Now add up all bands together using RS_AddBand (https://sedona.apache.org/1.5.0/api/sql/Raster-operators/#rs_addband), assuming the band order is the same as the original file names:
sedona.sql("SELECT RS_AddBand(RS_AddBand(RS_AddBand(RS_AddBand(RS_AddBand(XX_1, XX_2), XX_3), XX_4), XX_5), XX_6) as raster FROM pivot_df")

The resulting df should be like this:

| raster |
|--------|
| GridCoverageXXX |

Note: this raster is now a single raster with 6 bands.

  3. Now let's use RS_MapAlgebra to do the trick. Make sure you understand the logical operators in the Jiffle script language (https://github.com/geosolutions-it/jai-ext/wiki/Jiffle---language-summary#logical-operators). Note that RS_MapAlgebra uses 0-indexed band IDs while other Sedona raster functions use 1-indexed band IDs.
sedona.sql("SELECT RS_MapAlgebra(raster, 'D', 'out = rast[5] > 10 && rast[4] < 4 ? (rast[0] + rast[1] + rast[2]) / rast[3] : 0') FROM df")

We are exploring the possibility of supporting multiple rasters (not only bands) in the Jiffle script. If we add that support in the future, Step 2 can be dropped.
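As an aside, the deeply nested RS_AddBand(...) expression in step 2 can be generated programmatically rather than written by hand. A small sketch, assuming the hypothetical XX_1..XX_6 column names from the table above:

```python
from functools import reduce

# Hypothetical column names produced by the pivot (XX_1..XX_6 as above)
cols = [f"XX_{i}" for i in range(1, 7)]

# Fold the columns into a nested RS_AddBand(...) SQL expression
expr = reduce(lambda acc, c: f"RS_AddBand({acc}, {c})", cols[1:], cols[0])
print(expr)
# RS_AddBand(RS_AddBand(RS_AddBand(RS_AddBand(RS_AddBand(XX_1, XX_2), XX_3), XX_4), XX_5), XX_6)
```

The resulting string could then be interpolated into the query, e.g. sedona.sql(f"SELECT {expr} as raster FROM pivot_df"), which avoids typos when there are many input files.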

@JimShady (Contributor, author) commented:

I'm a little stuck with Step 2, I think. This is my progress:

my_files = sedona.read.format("binaryFile").load("/tmp/PE_FLRF_UD_Q*")
my_files.createOrReplaceTempView("my_files")
my_files_new = spark.sql("SELECT 1 as id, RS_FromGeoTiff(content) AS raster, modificationTime, length, path FROM my_files")

from pyspark.sql.functions import first

pivot_df = my_files_new.groupBy("id").pivot("path").agg(first("raster"))
pivot_df.createOrReplaceTempView("pivot_df")

added_df = sedona.sql('SELECT RS_AddBand(RS_AddBand(RS_AddBand(RS_AddBand(RS_AddBand("dbfs:/tmp/PE_FLRF_UD_Q20_RE_02.tif", "dbfs:/tmp/PE_FLRF_UD_Q50_RE_02.tif"), "dbfs:/tmp/PE_FLRF_UD_Q100_RE_02.tif"), "dbfs:/tmp/PE_FLRF_UD_Q200_RE_02.tif"), "dbfs:/tmp/PE_FLRF_UD_Q500_RE_02.tif"), "dbfs:/tmp/PE_FLRF_UD_Q1500_RE_02.tif") as raster FROM pivot_df')

But I'm getting this error:

AnalysisException: [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "rs_addband(dbfs:/tmp/PE_FLRF_UD_Q20_RE_02.tif, dbfs:/tmp/PE_FLRF_UD_Q50_RE_02.tif)" due to data type mismatch: parameter 1 requires "BINARY" type, however, "dbfs:/tmp/PE_FLRF_UD_Q20_RE_02.tif" is of "STRING" type.; line 1 pos 51;

@jiayuasu (Member) commented Oct 18, 2023

@JimShady The quoting is wrong in the last query. Try sedona.sql("XXX"), and the column names should not be quoted. The error message simply says that Spark treats the input as a string value, not a column name.

In addition, I'm not sure whether Spark considers a value like dbfs:/tmp/PE_FLRF_UD_Q50_RE_02.tif a legal column name. If my comment doesn't solve the problem, then after loading your data via sedona.read, please clean up the path values by removing all special characters. See this answer to learn how to drop special characters in PySpark: https://stackoverflow.com/a/53149013
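One way to sanitize the pivoted column names is a regex that keeps only alphanumerics and underscores. A sketch in plain Python (the path below is just an example value):

```python
import re

def sanitize(name: str) -> str:
    """Lower-case, replace spaces with underscores, and drop every
    remaining character that is not alphanumeric or underscore."""
    return re.sub(r"[^0-9a-z_]", "", name.lower().replace(" ", "_"))

# Hypothetical column names as produced by pivoting on `path`
cols = ["ID", "dbfs:/tmp/PE_FLRF_UD_Q20_RE_02.tif"]
print([sanitize(c) for c in cols])
# ['id', 'dbfstmppe_flrf_ud_q20_re_02tif']
```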

@JimShady (Contributor, author) commented:

I am getting there, I think. This is my complete code now.

my_files = sedona.read.format("binaryFile").load("/tmp/PE_FLRF_UD_Q*")
my_files.createOrReplaceTempView("my_files")
my_files_new = spark.sql("SELECT 1 as id, RS_FromGeoTiff(content) AS raster, modificationTime, length, path FROM my_files")

from pyspark.sql.functions import first

pivot_df = my_files_new.groupBy("id").pivot("path").agg(first("raster"))

newColumns = []
problematic_chars = './,;{}()=:'
for column in pivot_df.columns:
    column = column.lower()
    column = column.replace(' ', '_')
    for c in problematic_chars:
        column = column.replace(c, '')
    newColumns.append(column)
new_pivot_df = pivot_df.toDF(*newColumns)

new_pivot_df.createOrReplaceTempView("new_pivot_df")

added_df = sedona.sql("SELECT RS_AddBand(RS_AddBand(RS_AddBand(RS_AddBand(RS_AddBand(dbfstmppe_flrf_ud_q20_re_02tif, dbfstmppe_flrf_ud_q50_re_02tif), dbfstmppe_flrf_ud_q100_re_02tif), dbfstmppe_flrf_ud_q200_re_02tif), dbfstmppe_flrf_ud_q500_re_02tif), dbfstmppe_flrf_ud_q1500_re_02tif) as raster FROM new_pivot_df")
added_df.createOrReplaceTempView("added_df")

output = sedona.sql("SELECT RS_MapAlgebra(raster, 'D', 'out = rast[5] > 10 && rast[4] < 4 ? (rast[0] + rast[1] + rast[2]) / rast[3] : 0') raster FROM added_df")
output.createOrReplaceTempView("output")

new = sedona.sql("SELECT RS_AsGeoTiff(raster, 'LZW', '0.75') FROM output")

new.write.format("raster").save("my_raster_file.tif")

It seems to run fine until I try to save to file, at which point I get "java.lang.OutOfMemoryError: Java heap space":

Py4JJavaError: An error occurred while calling o6369.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 52.0 failed 4 times, most recent failure: Lost task 3.3 in stage 52.0 (TID 91) (10.130.180.118 executor 8): java.lang.OutOfMemoryError: Java heap space

@jiayuasu (Member) commented:

@JimShady It seems that your data is pretty large. Can you try giving more RAM to your executors?

@JimShady (Contributor, author) commented Oct 20, 2023

I'll try that when I get to work. Is there no "memory safe" way to do this, though? Maybe I should write tables instead of views?

@JimShady (Contributor, author) commented Oct 20, 2023

I wouldn't say that the files are that big. Here are the sizes:

[screenshot: file sizes]

I changed my Databricks cluster configuration to the below, but still it crashed with memory java heap issues.

[screenshot: Databricks cluster configuration]

@JimShady (Contributor, author) commented:

Any thoughts on this @jiayuasu?

@jiayuasu (Member) commented:

@JimShady This is actually quite big, because the pivot function essentially merges all the tiff images into one giant row, 1.6GB in total, which means the executor assigned this row needs to be prepared for a huge image. Also, GeoTiff images are stored in a compressed format on disk, so once loaded into memory their in-memory size is probably 10x larger.

@JimShady (Contributor, author) commented:

I see. Do you have any suggestions as to how this should be approached using Apache Sedona, then? Or is this sort of operation not possible?

@JimShady (Contributor, author) commented:

> @JimShady This is actually quite big, because the pivot function essentially merges all the tiff images into one giant row, 1.6GB in total, which means the executor assigned this row needs to be prepared for a huge image. Also, GeoTiff images are stored in a compressed format on disk, so once loaded into memory their in-memory size is probably 10x larger.

@jiayuasu -- 1.6GB x 10 = 16GB. Given my cluster has 256GB of memory, I'd have thought that it would be ok?

Do you have any ideas about how to do this? Should the raster be tiled, maybe?
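The tiling idea is generic: process the grid in bounded tiles so that only one tile's worth of pixels is in memory at a time. A plain numpy sketch of the pattern follows; this is an illustration of the concept, not a Sedona API.

```python
import numpy as np

def iter_tiles(arr, tile_h, tile_w):
    """Yield (row, col, tile) views over a 2-D array, one tile at a time."""
    h, w = arr.shape
    for r in range(0, h, tile_h):
        for c in range(0, w, tile_w):
            yield r, c, arr[r:r + tile_h, c:c + tile_w]

# Process a 4x6 grid in 2x3 tiles; only one tile is "in flight" at a time
grid = np.arange(24, dtype=float).reshape(4, 6)
out = np.empty_like(grid)
for r, c, tile in iter_tiles(grid, 2, 3):
    out[r:r + tile.shape[0], c:c + tile.shape[1]] = tile * 2.0  # per-tile work

print(np.array_equal(out, grid * 2.0))  # True
```

The same pattern, applied at the DataFrame level (one row per tile instead of one giant row per raster), is what keeps per-executor memory bounded.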
