Skip to content

Avoid pickling shapely object in Exposures.write_hdf5 #1051

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
May 16, 2025

Conversation

emanuel-schmid
Copy link
Collaborator

@emanuel-schmid emanuel-schmid commented Apr 30, 2025

Changes proposed in this PR:

  • By default convert geometry data to wkb format before writing Exposures to HDF5 files.
  • Add a flag to Exposures.write_hdf5 that allows to keep the previous behavior.
  • Adapt the from_hdf5 methods accordingly.

The data is still be pickled but pickled byte arrays are unlikely to be subject to change.
This is not a performance boost. Quite likely there are faster ways to store geometry data as wkb arrays,
but they are not obvious and maybe hard to get right. In light of future plans to introduce the parquet file format as standard in climada, any more efforts to optimize have been omitted.

This PR fixes #

PR Author Checklist

PR Reviewer Checklist

@emanuel-schmid emanuel-schmid changed the title Feature/exposures hdf5 io Avoid pickling shapely object in Exposures.write_hdf5 Apr 30, 2025
Copy link
Member

@peanutfun peanutfun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update! I am not sure that the current version actually converts the data to WKB, see my comment. Is there a particular reason why we need to keep the old behavior of write_hdf5 at all? We could simply switch to the new one, and avoid adding a new parameter (which will need to be deprecated at some point etc...)

@@ -1121,20 +1122,31 @@ def plot_basemap(
self.to_crs(crs_ori, inplace=True)
return axis

def write_hdf5(self, file_name):
def write_hdf5(self, file_name, pickle_geometry_as_shapely=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameter name is very long and the fact that data is pickled internally might confuse users. How about

Suggested change
def write_hdf5(self, file_name, pickle_geometry_as_shapely=False):
def write_hdf5(self, file_name, geometry_to_wkb=True):

if pickle_geometry_as_shapely:
pandas_df[col] = np.asarray(self.gdf[col])
else:
pandas_df[col] = gpd.GeoSeries.to_wkb(pandas_df[col])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GeoSeries.to_wkb calls shapely.to_wkb on self. Are you sure this does what it's supposed to do?

Suggested change
pandas_df[col] = gpd.GeoSeries.to_wkb(pandas_df[col])
pandas_df[col] = gpd.GeoSeries(pandas_df[col]).to_wkb()

It might be more elegant to use GeoDataFrame.to_wkb on self, which already returns a pandas.DataFrame.

Copy link
Collaborator Author

@emanuel-schmid emanuel-schmid May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure this does what it's supposed to do?

Sure! 😁 what makes you doubt it? Do you think we are missing a test?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because you are calling a method of an uninstantiated object. Maybe use shapely.to_wkb(pandas_df[col]) directly?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll keep it as it is (pandas_df[col] = gpd.GeoSeries(pandas_df[col]).to_wkb()) so geopandas can take care of it.

@emanuel-schmid
Copy link
Collaborator Author

Thanks for the review!

Is there a particular reason why we need to keep the old behavior of write_hdf5 at all?

Two reasons:

  • I'm afraid the new way is slower, so the old way would be preferable if files don't have to be kept forever.
  • For testing readability of old style files.

@peanutfun
Copy link
Member

I'm afraid the new way is slower, so the old way would be preferable if files don't have to be kept forever.

If it's only a little, I would still be in favor of dropping it. If it's considerable, we need to tell that in the docs and probably change the default value?

For testing readability of old style files.

I only want to drop the support of writing old-style files. Reading should "silently" read both types.

@emanuel-schmid
Copy link
Collaborator Author

If it's only a little, ..

I've checked for Litpop from Russia: wkb is 5 (reading) to 10 (writing) times faster than shapely pickles.
So I agree, probably it's better to drop it.

For the test it's a pity though. Proof that the result is still the same. depends on a file that has been written with the deprecated method.
I'm tempted to skip that regression test as well and just trust the equality of the current implementation. 🤔

@emanuel-schmid
Copy link
Collaborator Author

I decided to skip tests for the old style files as well.
When all exposures from the data api are being replaced we have plenty of proof that reading old files is working and afterwards we don't need it anymore.

Comment on lines 1135 to 1140
pandas_df = pd.DataFrame(self.gdf)
wkb_columns = []
for col in pandas_df.columns:
if str(pandas_df[col].dtype) == "geometry":
pandas_df[col] = np.asarray(self.gdf[col])
pandas_df[col] = gpd.GeoSeries(pandas_df[col]).to_wkb()
wkb_columns.append(col)
Copy link
Member

@peanutfun peanutfun May 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can simplify this with GeoDataFrame.to_wkb():

wkb_columns = self.gdf.columns[self.gdf.dtypes == "geometry"].to_list()
pandas_df = self.gdf.to_wkb()  # Done!

See https://github.com/geopandas/geopandas/blob/7255c57371eafbff5a09f881db2b905d1f9b28a7/geopandas/geodataframe.py#L1254
(I cannot make this a suggestion on GitHub because there is a deleted line 🤷 )

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. I thought I have checked that and found that gdf.to_wkb() only converts the 'geometry' column itself, leaving all other columns of type "geometry" untouched. Wrong?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code and docs say it converts all geometry-type columns. The code I linked above is equivalent to our original one except for the explicit str cast on the returned dtype, which apparently is not necessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to your test, it works fine, right?

@emanuel-schmid emanuel-schmid merged commit 1a6df8e into develop May 16, 2025
19 checks passed
@emanuel-schmid emanuel-schmid deleted the feature/exposures_hdf5_io branch May 16, 2025 10:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants