
Turning a raster (geotiff) into a H3 table? #1047

Closed
JimShady opened this issue Oct 4, 2023 · 17 comments · Fixed by #1140

Comments

@JimShady (Contributor) commented Oct 4, 2023

Hello,

I've been following the development of Apache Sedona with interest. Thanks for the work.

I was wondering if there are now sufficient tools/functions in Apache Sedona to convert a GeoTiff into an H3 Spark table? How would I go about this?

Thanks.

@jiayuasu (Member) commented Oct 6, 2023

@JimShady I wonder what exactly you mean by GeoTiff to H3?

With the new H3 functions added in Sedona, the following are now possible:

  1. GeoTiff geo bounding envelope -> H3 cells
df1 = df.selectExpr("raster", "ST_H3CellIDs(RS_Envelope(raster), 6, true) as h3")  # level 6, full cover
df2 = df1.select(col("raster"), explode(col("h3")).alias("h3"))
  2. Individual pixel location + corresponding observed band value -> H3 cells + values
df1 = df.selectExpr("raster", "ST_Point(RS_RasterToWorldCoordX(raster, colX, rowY), RS_RasterToWorldCoordY(raster, colX, rowY)) as pixelWorldCoord")
df2 = df1.selectExpr("RS_Value(raster, pixelWorldCoord) as observed_value", "ST_H3CellIDs(pixelWorldCoord, 6, true) as h3")
df3 = df2.select(col("observed_value"), explode(col("h3")).alias("h3"))
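For intuition, the pixel-to-world mapping that RS_RasterToWorldCoordX/Y performs can be sketched in plain Python, assuming a simple north-up raster with no rotation or shear. This is not the Sedona implementation; the origin and pixel sizes below are made-up values for illustration.

```python
# Sketch of a pixel-to-world mapping for a north-up raster (no rotation/shear).
# upper_left_x/y and the pixel sizes are hypothetical values.
def raster_to_world(col_x, row_y, upper_left_x, upper_left_y,
                    pixel_width, pixel_height):
    """World coordinate of the upper-left corner of pixel (col_x, row_y)."""
    world_x = upper_left_x + col_x * pixel_width
    world_y = upper_left_y - row_y * pixel_height  # y decreases going down rows
    return world_x, world_y

# Example: a 0.5-degree grid anchored at (-180, 90)
print(raster_to_world(2, 3, -180.0, 90.0, 0.5, 0.5))  # (-179.0, 88.5)
```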

@JimShady (Contributor, author) commented Oct 6, 2023

Your second example is, I think, the type of operation I am thinking of. Obviously a raster is essentially a grid of values. So I'm thinking to use a H3 resolution that is similar to the raster resolution, and then extract the values from the raster into a H3 table.

In your example above, how is the H3 resolution chosen in example 2 please?

@JimShady (Contributor, author) commented Oct 6, 2023

I'm wondering really if Apache-Sedona has a similar function to this:

https://databrickslabs.github.io/mosaic/api/raster-format-readers.html#mos-read-format-raster-to-grid

Because I've tried to use this mosaic function and the performance is really slow.

databrickslabs/mosaic#401

@jiayuasu (Member) commented Oct 6, 2023

@JimShady

Here is the full doc of ST_H3CellIds in Sedona 1.5.0 (https://github.com/apache/sedona/blob/master/docs/api/sql/Function.md#st_h3cellids). You can choose an H3 level (resolution). 1.5.0 will be released next week.

Speaking of the raster_to_grid function, we don't have an equivalent because exploding a raster grid into individual rows will likely exhaust Sedona/Spark memory, and it might NOT be the best way to explore/manipulate raster images.

But we provide the following functions to let you directly manipulate the pixels of bands:

  1. RS_Resample: https://github.com/apache/sedona/blob/master/docs/api/sql/Raster-operators.md#rs_resample
  2. RS_MapAlgebra: https://github.com/apache/sedona/blob/master/docs/api/sql/Raster-operators.md#rs_mapalgebra
  3. Of course, RS_RasterToWorldCoordX, RS_RasterToWorldCoordY, RS_Value, RS_Values

Is raster_to_grid really needed? If you are comfortable describing your use case, we can probably give you better alternatives.

@JimShady (Contributor, author) commented Oct 6, 2023

I've a few use cases. ☺️ I'll explain them when I'm back in work Monday. Thanks again.

@JimShady (Contributor, author) commented Oct 17, 2023

Hi @jiayuasu . Sorry for the delay in getting back to you. What I am trying to do is something like:

  1. Read in 6 different rasters (they are all the same origin, resolution and pixel type etc).
  2. Use algebra to 'combine' them on a pixel-by-pixel basis into a new raster. For example, something like:

WHERE raster6 > 10 AND raster5 < 4 THEN (raster1 + raster2 + raster3) / raster4

Then I would write out the result to a new geotiff file.

Is that kind of thing possible? I think so, but I'm struggling with the syntax. All the examples in the documentation assume that we are dealing with rasters with multiple bands, for example here. I understand it, but my rasters are separate files, not bands of the same file.
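For what it's worth, the per-pixel rule described above can be pinned down with numpy arrays standing in for the six rasters. This is only a sketch of the intended semantics, using toy 2x2 grids with made-up values, not a Sedona operation.

```python
import numpy as np

# Toy 2x2 grids standing in for raster1..raster6 (made-up values)
r1, r2, r3 = np.full((2, 2), 2.0), np.full((2, 2), 4.0), np.full((2, 2), 6.0)
r4 = np.full((2, 2), 3.0)
r5 = np.array([[1.0, 5.0], [2.0, 9.0]])     # condition: raster5 < 4
r6 = np.array([[20.0, 20.0], [5.0, 20.0]])  # condition: raster6 > 10

# WHERE raster6 > 10 AND raster5 < 4 THEN (r1 + r2 + r3) / r4 ELSE 0
out = np.where((r6 > 10) & (r5 < 4), (r1 + r2 + r3) / r4, 0.0)
print(out.tolist())  # [[4.0, 0.0], [0.0, 0.0]]
```

Only the top-left pixel satisfies both conditions, so it gets (2+4+6)/3 = 4.0 and the rest fall back to 0.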

[screenshot: multi-band example from the documentation]

But my actual situation is this:

[screenshot: the separate raster files]

@jiayuasu (Member) commented Oct 18, 2023

@JimShady This is definitely possible in Sedona 1.5.0. We need to do this in 3 steps in PySpark:

First use sedona.read.format("binaryFile").load("/tmp/AU_202105*") to read these GeoTiffs, then use RS_FromGeoTiff() to create the raster-type column for this DataFrame. Pay attention to the 1 in the ID column; you can add it via SELECT 1 as ID, path, raster FROM df

| ID | path | raster |
|----|------|--------|
| 1 | XX_1 | GridCoverageXXX |
| 1 | XX_2 | GridCoverageXXX |
| 1 | XX_3 | GridCoverageXXX |
| 1 | XX_4 | GridCoverageXXX |
| 1 | XX_5 | GridCoverageXXX |
| 1 | XX_6 | GridCoverageXXX |

Now the exciting part comes:

  1. Transpose your DataFrame (swap row and column) using pivot function (https://sparkbyexamples.com/spark/spark-transpose-rows-to-columns-in-dataframe/amp/).
from pyspark.sql.functions import first

df.groupBy("ID").pivot("path").agg(first("raster"))

The resulting DF should be

| ID | XX_1 | XX_2 | XX_3 | XX_4 | XX_5 | XX_6 |
|----|------|------|------|------|------|------|
| 1 | GridCoverageXXX | GridCoverageXXX | GridCoverageXXX | GridCoverageXXX | GridCoverageXXX | GridCoverageXXX |
  2. Now add up all bands together using RS_AddBand (https://sedona.apache.org/1.5.0/api/sql/Raster-operators/#rs_addband), assuming the band order is the same as the original file names:
sedona.sql("SELECT RS_AddBand(RS_AddBand(RS_AddBand(RS_AddBand(RS_AddBand(XX_1, XX_2), XX_3), XX_4), XX_5), XX_6) as raster FROM pivot_df")

The resulting df should be like this:

| raster |
|--------|
| GridCoverageXXX |

Note: this raster is now a single raster with 6 bands.

  3. Now let's use RS_MapAlgebra to do the trick. Make sure you understand the logical operators in the Jiffle script language (https://github.com/geosolutions-it/jai-ext/wiki/Jiffle---language-summary#logical-operators). Note that RS_MapAlgebra uses 0-indexed band IDs while other Sedona raster functions use 1-indexed band IDs.
sedona.sql("SELECT RS_MapAlgebra(raster, 'D', 'out = rast[5] > 10 && rast[4] < 4 ? (rast[0] + rast[1] + rast[2]) / rast[3] : 0') FROM df")

We are exploring the possibility of supporting multiple rasters (not only bands) in the Jiffle script. If we add that support in the future, Step 2 can be dropped.
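As an aside, the deeply nested RS_AddBand(...) expression in step 2 can be generated programmatically rather than written by hand. A small sketch, assuming the hypothetical XX_1..XX_6 column names from the table above:

```python
from functools import reduce

# Hypothetical column names produced by the pivot (XX_1..XX_6 as above)
cols = [f"XX_{i}" for i in range(1, 7)]

# Fold the columns into a nested RS_AddBand(...) SQL expression
expr = reduce(lambda acc, c: f"RS_AddBand({acc}, {c})", cols[1:], cols[0])
print(expr)
# RS_AddBand(RS_AddBand(RS_AddBand(RS_AddBand(RS_AddBand(XX_1, XX_2), XX_3), XX_4), XX_5), XX_6)
```

The resulting string could then be interpolated into the query, e.g. sedona.sql(f"SELECT {expr} as raster FROM pivot_df"), which avoids typos when there are many input files.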

@JimShady (Contributor, author) commented:

I'm a little stuck with Step 2, I think. This is my progress:

my_files = sedona.read.format("binaryFile").load("/tmp/PE_FLRF_UD_Q*")
my_files.createOrReplaceTempView("my_files")
my_files_new = spark.sql("SELECT 1 as id, RS_FromGeoTiff(content) AS raster, modificationTime, length, path FROM my_files")

from pyspark.sql.functions import first

pivot_df = my_files_new.groupBy("id").pivot("path").agg(first("raster"))
pivot_df.createOrReplaceTempView("pivot_df")

added_df = sedona.sql('SELECT RS_AddBand(RS_AddBand(RS_AddBand(RS_AddBand(RS_AddBand("dbfs:/tmp/PE_FLRF_UD_Q20_RE_02.tif", "dbfs:/tmp/PE_FLRF_UD_Q50_RE_02.tif"), "dbfs:/tmp/PE_FLRF_UD_Q100_RE_02.tif"), "dbfs:/tmp/PE_FLRF_UD_Q200_RE_02.tif"), "dbfs:/tmp/PE_FLRF_UD_Q500_RE_02.tif"), "dbfs:/tmp/PE_FLRF_UD_Q1500_RE_02.tif") as raster FROM pivot_df')

But I'm getting this error:

AnalysisException: [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "rs_addband(dbfs:/tmp/PE_FLRF_UD_Q20_RE_02.tif, dbfs:/tmp/PE_FLRF_UD_Q50_RE_02.tif)" due to data type mismatch: parameter 1 requires "BINARY" type, however, "dbfs:/tmp/PE_FLRF_UD_Q20_RE_02.tif" is of "STRING" type.; line 1 pos 51;

@jiayuasu (Member) commented Oct 18, 2023

@JimShady The quoting is wrong in the last query. Try sedona.sql("XXX"), and the column names should not be quoted. The error message simply says that Spark treats the input as a string value, not a column name.

In addition, I'm not sure whether Spark considers a value like dbfs:/tmp/PE_FLRF_UD_Q50_RE_02.tif a legal column name. If my comment doesn't solve the problem, then after loading your data via sedona.read, please clean up the path values by removing all special characters. See this answer to learn how to drop special characters in PySpark: https://stackoverflow.com/a/53149013
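One way to sanitize the pivoted column names is a regex that keeps only alphanumerics and underscores. A sketch in plain Python (the path below is just an example value):

```python
import re

def sanitize(name: str) -> str:
    """Lower-case, replace spaces with underscores, and drop every
    remaining character that is not alphanumeric or underscore."""
    return re.sub(r"[^0-9a-z_]", "", name.lower().replace(" ", "_"))

# Hypothetical column names as produced by pivoting on `path`
cols = ["ID", "dbfs:/tmp/PE_FLRF_UD_Q20_RE_02.tif"]
print([sanitize(c) for c in cols])
# ['id', 'dbfstmppe_flrf_ud_q20_re_02tif']
```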

@JimShady (Contributor, author) commented:

I am getting there, I think. This is my complete code now.

my_files = sedona.read.format("binaryFile").load("/tmp/PE_FLRF_UD_Q*")
my_files.createOrReplaceTempView("my_files")
my_files_new = spark.sql("SELECT 1 as id, RS_FromGeoTiff(content) AS raster, modificationTime, length, path FROM my_files")

from pyspark.sql.functions import first

pivot_df = my_files_new.groupBy("id").pivot("path").agg(first("raster"))

newColumns = []
problematic_chars = './,;{}()=:'
for column in pivot_df.columns:
    column = column.lower()
    column = column.replace(' ', '_')
    for c in problematic_chars:
        column = column.replace(c, '')
    newColumns.append(column)
new_pivot_df = pivot_df.toDF(*newColumns)

new_pivot_df.createOrReplaceTempView("new_pivot_df")

added_df = sedona.sql("SELECT RS_AddBand(RS_AddBand(RS_AddBand(RS_AddBand(RS_AddBand(dbfstmppe_flrf_ud_q20_re_02tif, dbfstmppe_flrf_ud_q50_re_02tif), dbfstmppe_flrf_ud_q100_re_02tif), dbfstmppe_flrf_ud_q200_re_02tif), dbfstmppe_flrf_ud_q500_re_02tif), dbfstmppe_flrf_ud_q1500_re_02tif) as raster FROM new_pivot_df")
added_df.createOrReplaceTempView("added_df")

output = sedona.sql("SELECT RS_MapAlgebra(raster, 'D', 'out = rast[5] > 10 && rast[4] < 4 ? (rast[0] + rast[1] + rast[2]) / rast[3] : 0') raster FROM added_df")
output.createOrReplaceTempView("output")

new = sedona.sql("SELECT RS_AsGeoTiff(raster, 'LZW', '0.75') FROM output")

new.write.format("raster").save("my_raster_file.tif")

It seems to run fine until I try to save to file, at which point I get "java.lang.OutOfMemoryError: Java heap space":

Py4JJavaError: An error occurred while calling o6369.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 52.0 failed 4 times, most recent failure: Lost task 3.3 in stage 52.0 (TID 91) (10.130.180.118 executor 8): java.lang.OutOfMemoryError: Java heap space

@jiayuasu (Member) commented:

@JimShady It seems that your data is pretty large. Can you try giving more RAM to your executors?

@JimShady (Contributor, author) commented Oct 20, 2023

I'll try that when I get to work. Is there no "memory safe" way to do this, though? Maybe I should write tables instead of views?

@JimShady (Contributor, author) commented Oct 20, 2023

I wouldn't say that the files are that big. Here are the sizes:

[screenshot: file sizes]

I changed my Databricks cluster configuration to the below, but still it crashed with memory java heap issues.

[screenshot: Databricks cluster configuration]

@JimShady (Contributor, author) commented:

Any thoughts on this @jiayuasu?

@jiayuasu (Member) commented:

@JimShady This is actually quite big, because the pivot function essentially merges all the tiff images into one giant row, 1.6GB in total, which means the executor assigned this row needs to be prepared for a huge image. Also, GeoTiff images are stored in a compressed format on disk, so once loaded into memory their in-memory size is probably 10x larger.

@JimShady (Contributor, author) commented:

I see. Do you have any suggestions as to how this should be approached using Apache Sedona, then? Or is this sort of operation not possible?

@JimShady (Contributor, author) commented:

> @JimShady This is actually quite big, because the pivot function essentially merges all the tiff images into one giant row, 1.6GB in total, which means the executor assigned this row needs to be prepared for a huge image. Also, GeoTiff images are stored in a compressed format on disk, so once loaded into memory their in-memory size is probably 10x larger.

@jiayuasu -- 1.6GB x 10 = 16GB. Given my cluster has 256GB of memory, I'd have thought that it would be ok?

Do you have any ideas about how to do this? Should the raster be tiled, maybe?
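The tiling idea is generic: process the grid in bounded tiles so that only one tile's worth of pixels is in memory at a time. A plain numpy sketch of the pattern follows; this is an illustration of the concept, not a Sedona API.

```python
import numpy as np

def iter_tiles(arr, tile_h, tile_w):
    """Yield (row, col, tile) views over a 2-D array, one tile at a time."""
    h, w = arr.shape
    for r in range(0, h, tile_h):
        for c in range(0, w, tile_w):
            yield r, c, arr[r:r + tile_h, c:c + tile_w]

# Process a 4x6 grid in 2x3 tiles; only one tile is "in flight" at a time
grid = np.arange(24, dtype=float).reshape(4, 6)
out = np.empty_like(grid)
for r, c, tile in iter_tiles(grid, 2, 3):
    out[r:r + tile.shape[0], c:c + tile.shape[1]] = tile * 2.0  # per-tile work

print(np.array_equal(out, grid * 2.0))  # True
```

The same pattern, applied at the DataFrame level (one row per tile instead of one giant row per raster), is what keeps per-executor memory bounded.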
