-
Notifications
You must be signed in to change notification settings - Fork 128
Avoid pickling shapely object in Exposures.write_hdf5 #1051
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update! I am not sure that the current version actually converts the data to WKB, see my comment. Is there a particular reason why we need to keep the old behavior of write_hdf5
at all? We could simply switch to the new one, and avoid adding a new parameter (which will need to be deprecated at some point etc...)
climada/entity/exposures/base.py
Outdated
@@ -1121,20 +1122,31 @@ def plot_basemap( | |||
self.to_crs(crs_ori, inplace=True) | |||
return axis | |||
|
|||
def write_hdf5(self, file_name): | |||
def write_hdf5(self, file_name, pickle_geometry_as_shapely=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parameter name is very long and the fact that data is pickled internally might confuse users. How about
def write_hdf5(self, file_name, pickle_geometry_as_shapely=False): | |
def write_hdf5(self, file_name, geometry_to_wkb=True): |
climada/entity/exposures/base.py
Outdated
if pickle_geometry_as_shapely: | ||
pandas_df[col] = np.asarray(self.gdf[col]) | ||
else: | ||
pandas_df[col] = gpd.GeoSeries.to_wkb(pandas_df[col]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GeoSeries.to_wkb
calls shapely.to_wkb
on self
. Are you sure this does what it's supposed to do?
pandas_df[col] = gpd.GeoSeries.to_wkb(pandas_df[col]) | |
pandas_df[col] = gpd.GeoSeries(pandas_df[col]).to_wkb() |
It might be more elegant to use GeoDataFrame.to_wkb
on self
, which already returns a pandas.DataFrame
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure this does what it's supposed to do?
Sure! 😁 what makes you doubt it? Do you think we are missing a test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because you are calling a method of an uninstantiated object. Maybe use shapely.to_wkb(pandas_df[col])
directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll keep it as it is (pandas_df[col] = gpd.GeoSeries(pandas_df[col]).to_wkb()
) so geopandas can take care of it.
Thanks for the review!
Two reasons:
|
Co-authored-by: Lukas Riedel <[email protected]>
If it's only a little, I would still be in favor of dropping it. If it's considerable, we need to tell that in the docs and probably change the default value?
I only want to drop the support of writing old-style files. Reading should "silently" read both types. |
I've checked for Litpop from Russia: wkb is 5 (reading) to 10 (writing) times faster than shapely pickles. For the test it's a pity though. Proof that the result is still the same. depends on a file that has been written with the deprecated method. |
I decided to skip tests for the old style files as well. |
climada/entity/exposures/base.py
Outdated
pandas_df = pd.DataFrame(self.gdf) | ||
wkb_columns = [] | ||
for col in pandas_df.columns: | ||
if str(pandas_df[col].dtype) == "geometry": | ||
pandas_df[col] = np.asarray(self.gdf[col]) | ||
pandas_df[col] = gpd.GeoSeries(pandas_df[col]).to_wkb() | ||
wkb_columns.append(col) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can simplify this with GeoDataFrame.to_wkb()
:
wkb_columns = self.gdf.columns[self.gdf.dtypes == "geometry"].to_list()
pandas_df = self.gdf.to_wkb() # Done!
See https://github.com/geopandas/geopandas/blob/7255c57371eafbff5a09f881db2b905d1f9b28a7/geopandas/geodataframe.py#L1254
(I cannot make this a suggestion on GitHub because there is a deleted line 🤷 )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm. I thought I have checked that and found that gdf.to_wkb()
only converts the 'geometry' column itself, leaving all other columns of type "geometry" untouched. Wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code and docs say it converts all geometry-type columns. The code I linked above is equivalent to our original one except for the explicit str
cast on the returned dtype
, which apparently is not necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to your test, it works fine, right?
Changes proposed in this PR:
Exposures.write_hdf5
that allows to keep the previous behavior.from_hdf5
methods accordingly.The data is still be pickled but pickled byte arrays are unlikely to be subject to change.
This is not a performance boost. Quite likely there are faster ways to store geometry data as wkb arrays,
but they are not obvious and maybe hard to get right. In light of future plans to introduce the parquet file format as standard in climada, any more efforts to optimize have been omitted.
This PR fixes #
PR Author Checklist
develop
)PR Reviewer Checklist