
df.to_msgpack and pandas.read_msgpack removed from pandas #82

Closed
zakv opened this issue Nov 5, 2020 · 9 comments · Fixed by #95


@zakv
Contributor

zakv commented Nov 5, 2020

Pandas dropped support for msgpack, I believe in the 0.25 → 1.0 transition, judging by a comparison of the tagged versions. Unfortunately this breaks lyse's "save dataframe" and "load dataframe" features.

Maybe it's best to switch to df.to_pickle() and pandas.read_pickle()? I don't see any good way to support loading dataframes from old msgpack saves, though, except maybe through some third-party library.
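A switch from msgpack to pickle would be a near drop-in replacement. A minimal sketch (the dataframe contents here are made up for illustration; lyse's actual schema isn't shown in this thread):

```python
import os
import tempfile

import pandas as pd

# A small dataframe with MultiIndex columns, similar in shape to what
# lyse produces (hypothetical example data).
df = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0]],
    columns=pd.MultiIndex.from_tuples(
        [("my_routine", "result"), ("my_routine", "uncertainty")]
    ),
)

path = os.path.join(tempfile.mkdtemp(), "dataframe.pkl")
df.to_pickle(path)             # was: df.to_msgpack(path)
loaded = pd.read_pickle(path)  # was: pandas.read_msgpack(path)
assert loaded.equals(df)
```

Unlike feather (and parquet before pandas 1.2), pickle round-trips MultiIndex columns without any special handling.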

@lakodarmstadt

I think using pandas.to_parquet is an option. I tried pandas.to_feather first, but it has difficulties with the multiindex structure of our dataframes; for pandas >= 1.2 this is solved for to_parquet.
How can I set the requirement pandas>=1.2? I can add it to the requirements text file, but if I understand correctly that is only applied during installation.

@dihm
Contributor

dihm commented Dec 1, 2021

Sorry for being very late to the party here, but I'm finally getting to grips with the issue. I'm a little concerned about using parquet. I just tried to test #91. Not only does it require an optional dependency with lots of its own dependencies (pyarrow, or fastparquet plus numba), but the current pyarrow conda package is bugged and doesn't even work (you have to install via pip, see AnacondaRecipes/pyarrow-feedstock#2).

For now, I'm more inclined to use the pickle interface (with or without compression) as a more portable solution that works out of the box. Based on a random blog post from years ago, it seems pickle isn't far off in terms of performance either.
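On the "with or without compression" point: to_pickle and read_pickle both default to compression='infer', so opting into gzip is just a matter of the file extension. A quick sketch with throwaway example data:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"run": [0, 1, 2], "signal": [0.1, 0.4, 0.9]})

# The .gz suffix triggers gzip compression on write...
path = os.path.join(tempfile.mkdtemp(), "dataframe.pkl.gz")
df.to_pickle(path)
# ...and decompression on read, with no extra arguments.
loaded = pd.read_pickle(path)
assert loaded.equals(df)
```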

As for backwards compatibility, I'm not really sure what to do. There is a third-party package, but support seems extremely limited. Perhaps the best solution is to just not support loading old msgpacked files and to tell the user to either manually convert to pickle (using third-party resources or an older version of pandas) or reload the source files and resave the dataframes as pickle files.

Does anyone have any thoughts?

@zakv
Contributor Author

zakv commented Dec 1, 2021

Yeah I don't think there's a good way to handle this. Either we add a bunch of dependencies or we lose the ability to load old dataframes. My vote is to lose the ability to load old dataframes.

Maybe the call to pandas.read_msgpack can be left in a try/except block so that users who have an old version of pandas installed can still load dataframes saved the old way. The saving should now always call the new saving method though. That way a user could update their dataframe save format by opening then re-saving their dataframe using an up-to-date version of lyse in an environment with an old version of pandas.
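That fallback could look something like the following (load_dataframe is a hypothetical helper name, not lyse's actual API). Since read_msgpack was removed in pandas 1.0, its mere presence on the pandas module identifies an old install:

```python
import pandas as pd


def load_dataframe(path):
    """Load a saved dataframe, preferring pickle but falling back to
    msgpack when running under pandas < 1.0 (hypothetical helper)."""
    try:
        return pd.read_pickle(path)
    except Exception:
        # read_msgpack only exists on pandas < 1.0; on modern installs
        # there is nothing to fall back to, so re-raise the error.
        if not hasattr(pd, "read_msgpack"):
            raise
        return pd.read_msgpack(path)
```

Saving would always go through to_pickle, so a user who opens an old file in an environment with old pandas and resaves it gets migrated to the new format automatically.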

As for parquet vs pickle, I agree that we should use pickle if using parquet requires adding dependencies. That's unfortunate given that lakodarmstadt was kind enough to put in the work to implement saving with parquet, but I think going with pickle instead is the right decision.

@philipstarkey
Member

I would be interested to know how well the to_hdf option worked in pandas....

Might get confusing having both HDF shot files and HDF data frame files (since we wouldn't want to save the data frame in an existing shot file), but having it human-readable via HDFView might be beneficial, and it's a format we could manually process ourselves should support be dropped like it was for msgpack. The biggest issue is probably whether it handles the multi-index properly (or any of the other weird things we do with the data frame).

@dihm
Contributor

dihm commented Dec 2, 2021

@philipstarkey hdf5 is an interesting prospect! According to my quick read of the documentation, it should handle multiindexes and even arbitrary objects (via pickling) without issue.

Looking more closely, it accomplishes this via the pytables package. Trying to install pytables from conda downgrades the hdf5 version unless you are using Python 3.9. Installing from pip (under a different name, tables) does not have this problem, though it is likely less capable when it comes to compression. It also hasn't had a release since Oct 2019, which is mildly concerning.
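For reference, the to_hdf path being weighed here looks roughly like this; the sketch guards on the optional pytables package (pip name tables) being importable, which is exactly the dependency concern above:

```python
import importlib.util
import os
import tempfile

import pandas as pd

# Hypothetical example data with a MultiIndex, standing in for a lyse
# dataframe.
df = pd.DataFrame(
    {"x": [1.0, 2.0]},
    index=pd.MultiIndex.from_tuples([("shot", 0), ("shot", 1)]),
)

# to_hdf only works when pytables is installed, so check first.
if importlib.util.find_spec("tables") is not None:
    path = os.path.join(tempfile.mkdtemp(), "dataframe.h5")
    df.to_hdf(path, key="dataframe")  # pickles object columns internally
    loaded = pd.read_hdf(path, key="dataframe")
    assert loaded.equals(df)
```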

Some quick testing (using the pip install) is even less inspiring. Up front caveat, I'm testing with three very simple single shot runs.

Here is my global list that is being saved.
[screenshot: the globals list being saved]

When I pickle a shot, the resulting file is 3kB. When I hdf5 it with default options, it is 1056kB. The structure it uses to save it is also pretty involved, to the point of being completely unintelligible. It appears a lot of fancy magic is happening under the hood with plenty of repetition. For instance, each level0 dataset shown is all of the column labels from the dataframe. My brief understanding is that each block stores data from all the columns of a single datatype. No doubt this is all necessary and performant, but it isn't very human readable in my opinion.
[screenshot: HDFView view of the saved .h5 file, showing the repeated level0 label datasets and per-dtype blocks]

I'm willing to be persuaded otherwise, but I'm still leaning towards pickle. Pickle support should basically never be dropped (as it is based on core Python). I like the idea of being able to read the file with HDFView, but that doesn't really seem workable in practice, and I worry that parsing all that out manually, should support be dropped, would entail just reproducing the functionality of pytables.

@dihm
Contributor

dihm commented Dec 2, 2021

Maybe the call to pandas.read_msgpack can be left in a try/except block so that users who have an old version of pandas installed can still load dataframes saved the old way. The saving should now always call the new saving method though. That way a user could update their dataframe save format by opening then re-saving their dataframe using an up-to-date version of lyse in an environment with an old version of pandas.

I do really like this method of maintaining some semblance of backwards compat, whichever method we end up using.

@philipstarkey
Member

Eugh. Requires pytables. Forget that idea then!

@dihm dihm mentioned this issue Dec 3, 2021
@dihm
Contributor

dihm commented Dec 3, 2021

Eugh. Requires pytables. Forget that idea then!

Glad to see I'm not alone in that opinion.

I'll go ahead and make a PR using pickle.

@dihm dihm closed this as completed in #95 Dec 7, 2021
@lakodarmstadt

Thanks for solving this.
