Avoid pickling shapely object in Exposures.write_hdf5 #1051

emanuel-schmid · 2025-04-30T14:18:33Z

Changes proposed in this PR:

By default convert geometry data to wkb format before writing Exposures to HDF5 files.
Add a flag to Exposures.write_hdf5 that allows to keep the previous behavior.
Adapt the from_hdf5 methods accordingly.

The data is still be pickled but pickled byte arrays are unlikely to be subject to change.
This is not a performance boost. Quite likely there are faster ways to store geometry data as wkb arrays,
but they are not obvious and maybe hard to get right. In light of future plans to introduce the parquet file format as standard in climada, any more efforts to optimize have been omitted.

This PR fixes #

PR Author Checklist

PR Reviewer Checklist

…le for geometry serialization

peanutfun

Thanks for the update! I am not sure that the current version actually converts the data to WKB, see my comment. Is there a particular reason why we need to keep the old behavior of write_hdf5 at all? We could simply switch to the new one, and avoid adding a new parameter (which will need to be deprecated at some point etc...)

peanutfun · 2025-05-06T14:10:50Z

climada/entity/exposures/base.py

@@ -1121,20 +1122,31 @@ def plot_basemap(
        self.to_crs(crs_ori, inplace=True)
        return axis

-    def write_hdf5(self, file_name):
+    def write_hdf5(self, file_name, pickle_geometry_as_shapely=False):


The parameter name is very long and the fact that data is pickled internally might confuse users. How about

Suggested change

def write_hdf5(self, file_name, pickle_geometry_as_shapely=False):

def write_hdf5(self, file_name, geometry_to_wkb=True):

peanutfun · 2025-05-06T14:16:06Z

climada/entity/exposures/base.py

+                if pickle_geometry_as_shapely:
+                    pandas_df[col] = np.asarray(self.gdf[col])
+                else:
+                    pandas_df[col] = gpd.GeoSeries.to_wkb(pandas_df[col])


GeoSeries.to_wkb calls shapely.to_wkb on self. Are you sure this does what it's supposed to do?

Suggested change

pandas_df[col] = gpd.GeoSeries.to_wkb(pandas_df[col])

pandas_df[col] = gpd.GeoSeries(pandas_df[col]).to_wkb()

It might be more elegant to use GeoDataFrame.to_wkb on self, which already returns a pandas.DataFrame.

Are you sure this does what it's supposed to do?

Sure! 😁 what makes you doubt it? Do you think we are missing a test?

Because you are calling a method of an uninstantiated object. Maybe use shapely.to_wkb(pandas_df[col]) directly?

I'll keep it as it is (pandas_df[col] = gpd.GeoSeries(pandas_df[col]).to_wkb()) so geopandas can take care of it.

emanuel-schmid · 2025-05-06T15:47:23Z

Thanks for the review!

Is there a particular reason why we need to keep the old behavior of write_hdf5 at all?

Two reasons:

I'm afraid the new way is slower, so the old way would be preferable if files don't have to be kept forever.
For testing readability of old style files.

Co-authored-by: Lukas Riedel <[email protected]>

peanutfun · 2025-05-07T07:18:28Z

I'm afraid the new way is slower, so the old way would be preferable if files don't have to be kept forever.

If it's only a little, I would still be in favor of dropping it. If it's considerable, we need to tell that in the docs and probably change the default value?

For testing readability of old style files.

I only want to drop the support of writing old-style files. Reading should "silently" read both types.

emanuel-schmid · 2025-05-07T16:55:23Z

If it's only a little, ..

I've checked for Litpop from Russia: wkb is 5 (reading) to 10 (writing) times faster than shapely pickles.
So I agree, probably it's better to drop it.

For the test it's a pity though. Proof that the result is still the same. depends on a file that has been written with the deprecated method.
I'm tempted to skip that regression test as well and just trust the equality of the current implementation. 🤔

emanuel-schmid · 2025-05-08T11:38:46Z

I decided to skip tests for the old style files as well.
When all exposures from the data api are being replaced we have plenty of proof that reading old files is working and afterwards we don't need it anymore.

peanutfun · 2025-05-09T09:31:45Z

climada/entity/exposures/base.py

        pandas_df = pd.DataFrame(self.gdf)
+        wkb_columns = []
        for col in pandas_df.columns:
            if str(pandas_df[col].dtype) == "geometry":
-                pandas_df[col] = np.asarray(self.gdf[col])
+                pandas_df[col] = gpd.GeoSeries(pandas_df[col]).to_wkb()
+                wkb_columns.append(col)


We can simplify this with GeoDataFrame.to_wkb():

wkb_columns = self.gdf.columns[self.gdf.dtypes == "geometry"].to_list() pandas_df = self.gdf.to_wkb() # Done!

See https://github.com/geopandas/geopandas/blob/7255c57371eafbff5a09f881db2b905d1f9b28a7/geopandas/geodataframe.py#L1254
(I cannot make this a suggestion on GitHub because there is a deleted line 🤷 )

Hm. I thought I have checked that and found that gdf.to_wkb() only converts the 'geometry' column itself, leaving all other columns of type "geometry" untouched. Wrong?

Code and docs say it converts all geometry-type columns. The code I linked above is equivalent to our original one except for the explicit str cast on the returned dtype, which apparently is not necessary.

According to your test, it works fine, right?

emanuel-schmid added 4 commits April 16, 2025 10:07

refactor Exposures.write_hdf5 and .from_hdf5: use wkb instead of pick…

297326e

…le for geometry serialization

refactor Exposures.write_hdf5

5e8e583

Merge branch 'develop' into feature/exposures_hdf5_io

66e1d47

change of plan: just pickle geometries in wkb format

ab18675

emanuel-schmid requested review from chahank and peanutfun as code owners April 30, 2025 14:18

emanuel-schmid changed the title ~~Feature/exposures hdf5 io~~ Avoid pickling shapely object in Exposures.write_hdf5 Apr 30, 2025

peanutfun requested changes May 6, 2025

View reviewed changes

Update climada/entity/exposures/base.py

347f1f9

Co-authored-by: Lukas Riedel <[email protected]>

abandon shapely pickling

9312b30

peanutfun reviewed May 9, 2025

View reviewed changes

emanuel-schmid added 3 commits May 12, 2025 08:38

simplify wkb columns collection

8973490

simplify wkb conversion

5ff1e19

cosmetics

dea3e99

emanuel-schmid merged commit 1a6df8e into develop May 16, 2025
19 checks passed

emanuel-schmid deleted the feature/exposures_hdf5_io branch May 16, 2025 10:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid pickling shapely object in Exposures.write_hdf5 #1051

Avoid pickling shapely object in Exposures.write_hdf5 #1051

emanuel-schmid commented Apr 30, 2025 •

edited

Loading

peanutfun left a comment

peanutfun May 6, 2025

peanutfun May 6, 2025

emanuel-schmid May 6, 2025 •

edited

Loading

peanutfun May 7, 2025

emanuel-schmid May 8, 2025

emanuel-schmid commented May 6, 2025

peanutfun commented May 7, 2025

emanuel-schmid commented May 7, 2025

emanuel-schmid commented May 8, 2025

peanutfun May 9, 2025 •

edited

Loading

emanuel-schmid May 12, 2025

peanutfun May 14, 2025

peanutfun May 14, 2025

	def write_hdf5(self, file_name, pickle_geometry_as_shapely=False):
	def write_hdf5(self, file_name, geometry_to_wkb=True):

	pandas_df[col] = gpd.GeoSeries.to_wkb(pandas_df[col])
	pandas_df[col] = gpd.GeoSeries(pandas_df[col]).to_wkb()

Avoid pickling shapely object in Exposures.write_hdf5 #1051

Avoid pickling shapely object in Exposures.write_hdf5 #1051

Conversation

emanuel-schmid commented Apr 30, 2025 • edited Loading

PR Author Checklist

PR Reviewer Checklist

peanutfun left a comment

Choose a reason for hiding this comment

peanutfun May 6, 2025

Choose a reason for hiding this comment

peanutfun May 6, 2025

Choose a reason for hiding this comment

emanuel-schmid May 6, 2025 • edited Loading

Choose a reason for hiding this comment

peanutfun May 7, 2025

Choose a reason for hiding this comment

emanuel-schmid May 8, 2025

Choose a reason for hiding this comment

emanuel-schmid commented May 6, 2025

peanutfun commented May 7, 2025

emanuel-schmid commented May 7, 2025

emanuel-schmid commented May 8, 2025

peanutfun May 9, 2025 • edited Loading

Choose a reason for hiding this comment

emanuel-schmid May 12, 2025

Choose a reason for hiding this comment

peanutfun May 14, 2025

Choose a reason for hiding this comment

peanutfun May 14, 2025

Choose a reason for hiding this comment

emanuel-schmid commented Apr 30, 2025 •

edited

Loading

emanuel-schmid May 6, 2025 •

edited

Loading

peanutfun May 9, 2025 •

edited

Loading