df.to_msgpack and pandas.read_msgpack removed from pandas #82
I think using `pandas.to_parquet` is an option. I tried `pandas.to_feather` first, but it has difficulties with the multiindex structure of our dataframes. For pandas >= 1.2 this is solved for `pandas.to_parquet`.
Sorry for being very late to the party here, but I'm finally getting to grips with the issue. I'm a little concerned about using parquet. I just tried to test #91. Not only does it require an optional dependency with lots of its own dependencies (`pyarrow`, or `fastparquet` + `numba`), but the current `pyarrow` conda package is bugged and doesn't even work (you have to install via pip, see AnacondaRecipes/pyarrow-feedstock#2).

For now, I'm more inclined to use the pickle interface (with or without compression) as a more portable, works-out-of-the-box solution. Based on this random blog post from years ago, pickle isn't far off in terms of performance either.

As for backwards compatibility, I'm not really sure what to do. There is a third party package, but support seems extremely limited. Perhaps the best solution is to simply not support loading old msgpacked files and tell the user to either manually convert to pickle (using third party resources or an older version of pandas) or reload the source files and resave the dataframes as pickle files. Does anyone have any thoughts?
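To make the dependency-free option concrete, here is a minimal sketch (not lyse's actual code; the helper names are hypothetical) of saving and loading a multiindex dataframe with pandas' built-in pickle interface. Compression is inferred from the file extension:

```python
import os
import tempfile

import pandas as pd

# Hypothetical helpers sketching the pickle-based save/load path.
def save_dataframe(df, path):
    # Compression is inferred from the extension, e.g. '.pkl.gz' -> gzip.
    df.to_pickle(path)

def load_dataframe(path):
    return pd.read_pickle(path)

# A toy dataframe with multiindex columns, loosely mimicking lyse's structure.
df = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0]],
    columns=pd.MultiIndex.from_tuples([("device", "a"), ("device", "b")]),
)

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "dataframe.pkl.gz")
    save_dataframe(df, path)
    roundtripped = load_dataframe(path)
```

The multiindex columns survive the round trip with no optional dependencies installed, which is the main argument for pickle made in this thread.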
Yeah, I don't think there's a good way to handle this. Either we add a bunch of dependencies or we lose the ability to load old dataframes. My vote is to lose the ability to load old dataframes. Maybe the call to load a dataframe could raise a helpful error when it encounters an old msgpack file.

As for parquet vs pickle, I agree that we should use pickle if using parquet requires adding dependencies. That's unfortunate given that lakodarmstadt was kind enough to put in the work to implement saving with parquet, but I think going with pickle instead is the right decision.
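A sketch of that kind of graceful failure, assuming a hypothetical loader name and that legacy msgpack saves can be recognised by a `.msg` extension (both assumptions, not lyse's actual behaviour):

```python
import pandas as pd

def load_dataframe(path):
    """Hypothetical loader: pickle for current saves, a clear error for legacy ones."""
    if path.endswith(".msg"):
        raise ValueError(
            "This looks like a legacy msgpack dataframe, which pandas >= 1.0 "
            "can no longer read. Convert it to pickle with an older pandas "
            "(< 1.0), or reload the source shot files and re-save."
        )
    return pd.read_pickle(path)

# Demonstrate the error path for an old-style save.
try:
    load_dataframe("old_dataframe.msg")
    raised = False
except ValueError:
    raised = True
```

This drops support for reading old files, as proposed, but at least tells the user exactly what to do instead of failing with an obscure `AttributeError`.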
I would be interested to know how well the HDF5 route would work. It might get confusing having both HDF shot files and HDF dataframe files (since we wouldn't want to save the dataframe in an existing shot file), but having it human readable via HDFView might be beneficial, and it's a format we could manually process ourselves should support be dropped like it was for msgpack. The biggest issue is probably whether it handles the multi-index properly (or any of the other weird things we do with the dataframe).
@philipstarkey hdf5 is an interesting prospect! According to my quick read of the documentation, it should handle multiindexes and even arbitrary objects (via pickling) without issue. Looking more closely, it accomplishes this via the `pytables` package. Trying to install `pytables` from conda downgrades the hdf5 version unless you are using python 3.9. Installing from pip (under a different name: `tables`) does not have this problem, though it is likely less capable when it comes to compression. It also hasn't had a release since Oct 2019, which is mildly concerning.

Some quick testing (using the pip install) is even less inspiring. Up-front caveat: I'm testing with three very simple single-shot runs. Here is my global list that is being saved. When I pickle a shot, the resulting file is 3 kB. When I hdf5 it with default options, it is 1056 kB. The structure it uses to save is also pretty involved, to the point of being completely unintelligible. It appears a lot of fancy magic is happening under the hood, with plenty of repetition. For instance, each level0 dataset shown is all of the column labels from the dataframe. My brief understanding is that each block stores data from all the columns of a single datatype. No doubt this is all necessary and performant, but it isn't very human readable in my opinion.

I'm willing to be persuaded otherwise, but I'm still leaning towards pickle. Pickle support should basically never be dropped (as it is based on core python). I like the idea of being able to read it with HDFView, but that doesn't really seem reasonable, and I worry that trying to parse all that out manually should support drop is going to entail just reproducing the functionality of `pytables`.
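For reference, a small sketch of the HDF5 path being tested above. `DataFrame.to_hdf` requires the optional `pytables` package, so this example skips itself when the dependency is missing, which illustrates the portability objection:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    [[1.0, 2.0]],
    columns=pd.MultiIndex.from_tuples([("device", "a"), ("device", "b")]),
)

try:
    import tables  # noqa: F401 -- pytables, the optional dependency to_hdf needs
    have_pytables = True
except ImportError:
    have_pytables = False

if have_pytables:
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "dataframe.h5")
        df.to_hdf(path, key="df")  # default 'fixed' format, written via pytables
        roundtripped = pd.read_hdf(path, "df")
        ok = roundtripped.equals(df)
else:
    ok = None  # can't even attempt the round trip without the extra dependency
```

The default `fixed` format handles multiindex columns, but (as noted above) the resulting file layout is opaque and considerably larger than the pickle equivalent for small dataframes.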
I do really like this method of maintaining some semblance of backwards compat, whichever method we end up using.
Eugh. Requires pytables. Forget that idea then! |
Glad to see I'm not alone in that opinion. I'll go ahead and make a PR using pickle. |
Thanks for solving this. |
Pandas dropped support for msgpack, I believe in the 0.25 to 1.0 transition, judging by comparing different tagged versions here. Unfortunately this breaks lyse's "save dataframe" and "load dataframe" features.

Maybe it's best to switch to `df.to_pickle()` and `pandas.read_pickle()`? I don't see any good ways to support loading dataframes from old `msgpack` saves though, except maybe through some third party library.