Registered serializer for common classes of additional array-like objects #762
base: master
Conversation
**Codecov Report.** Attention: Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master     #762       +/-  ##
===========================================
- Coverage   88.01%   15.22%    -72.80%
===========================================
  Files          53       53
  Lines       15817    15801        -16
  Branches     1610     2817      +1207
===========================================
- Hits        13922     2406     -11516
- Misses       1893    13062     +11169
- Partials        2      333       +331
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Thank you for your contribution! Do you think you can create a simple test that shows the issue you are addressing? We could perhaps add the libraries to the optional dependencies list.
I don't think we should add these as-is, and would strongly suggest avoiding adding anything but numpy directly. However, we could use protocols to try to identify these and similar types that can be coerced to numpy arrays. Perhaps looking for an
@effigies - you mean not adding

@effigies I'm not aware of a common method like
I mean something like:

```python
class ArrayLike(ty.Protocol):
    def __array__(self, *args, **kwargs): ...


@register_serializer
def bytes_repr_arraylike(obj: ArrayLike, cache: Cache) -> Iterator[bytes]:
    yield f"{obj.__class__.__module__}{obj.__class__.__name__}:{obj.size}:".encode()
    array = np.asanyarray(obj)
    if array.dtype == "object":
        yield from bytes_repr_sequence_contents(iter(array.ravel()), cache)
    else:
        yield array.tobytes(order="C")
```

As a quick proof of concept:

```python
from functools import singledispatch
import typing as ty

import numpy as np
import pandas as pd
import tensorflow as tf
import torch


@ty.runtime_checkable  # required so singledispatch can issubclass() against the Protocol
class ArrayLike(ty.Protocol):
    def __array__(self, *args, **kwargs): ...


@singledispatch
def identify(obj: object) -> str:
    return obj.__class__.__name__


@identify.register
def _(obj: ArrayLike) -> str:
    return "ArrayLike"


print(f"{identify([0, 0])=}")                       # list
print(f"{identify(np.array([0, 0]))=}")             # ArrayLike
print(f"{identify(pd.DataFrame({'a': [0, 0]}))=}")  # ArrayLike
print(f"{identify(tf.constant([0, 0]))=}")          # ArrayLike
print(f"{identify(torch.tensor([0, 0]))=}")         # ArrayLike
```
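The proof of concept above works because `functools.singledispatch` resolves a registered Protocol through `issubclass()` checks, which requires the Protocol to be decorated with `@typing.runtime_checkable`. A minimal, dependency-free sketch of the same dispatch mechanism (the `FakeTensor` class is a hypothetical stand-in for an array-like type, not part of the PR):

```python
from functools import singledispatch
import typing as ty


@ty.runtime_checkable  # needed so issubclass() works against this Protocol
class ArrayLike(ty.Protocol):
    def __array__(self, *args, **kwargs): ...


@singledispatch
def identify(obj: object) -> str:
    # Fallback: report the concrete class name
    return obj.__class__.__name__


@identify.register
def _(obj: ArrayLike) -> str:
    # Any type structurally matching the Protocol dispatches here
    return "ArrayLike"


class FakeTensor:
    """Hypothetical stand-in: any class defining __array__ matches the Protocol."""

    def __array__(self, *args, **kwargs):
        return [0, 0]


print(identify([0, 0]))        # -> list
print(identify(FakeTensor()))  # -> ArrayLike
```

Without `@ty.runtime_checkable`, the `issubclass()` check that `singledispatch` performs internally raises a `TypeError` for Protocol classes.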
@effigies I tried something like you mention, or at least how I understood it, and copied it below. I haven't written official test cases, but I have tried it in a Colab notebook where I was originally getting the error this was meant to fix. The issue is that, though your proof of concept works, this doesn't work for DataFrames (which throw an unserializable error) nor for torch Tensors, which continue to fail silently. For some reason it did seem to work for tensorflow.

```python
class ArrayLike(ty.Protocol):
    def __array__(self, *args, **kwargs): ...
    def __array_interface__(self, *args, **kwargs): ...


if HAVE_NUMPY:

    @register_serializer
    def bytes_repr_arraylike(obj: ArrayLike, cache: Cache) -> Iterator[bytes]:
        yield f"{obj.__class__.__module__}{obj.__class__.__name__}:".encode()
        array = numpy.asanyarray(obj)
        yield f"{array.size}:".encode()
        if array.dtype == "object":
            yield from bytes_repr_sequence_contents(iter(array.ravel()), cache)
        else:
            yield array.tobytes(order="C")
```
This works for me in a quick IPython session:

```python
class ArrayLike(ty.Protocol):
    def __array__(self, *args, **kwargs): ...


@register_serializer
def bytes_repr_arraylike(obj: ArrayLike, cache: Cache) -> Iterator[bytes]:
    yield f"{obj.__class__.__module__}.{obj.__class__.__name__}:".encode()
    array = np.asanyarray(obj)
    yield f"{array.dtype.str}[{array.shape}]:".encode()
    if array.dtype == "object":
        yield from bytes_repr_sequence_contents(iter(array.ravel()), cache)
    else:
        yield array.tobytes(order="C")
```

Testing with the test objects above:

```python
print(f"{b''.join(bytes_repr([0, 0], Cache()))}")
print(f"{b''.join(bytes_repr(np.array([0, 0]), Cache()))}")
print(f"{b''.join(bytes_repr(pd.DataFrame({'a': [0, 0]}), Cache()))}")
print(f"{b''.join(bytes_repr(tf.constant([0, 0]), Cache()))}")
print(f"{b''.join(bytes_repr(torch.tensor([0, 0]), Cache()))}")
```

```
b'list:(\xb8\xa7\xf7\x08q;\x12\x80rw^\xc4\x12L:-\xb8\xa7\xf7\x08q;\x12\x80rw^\xc4\x12L:-)'
b'numpyndarray:2:\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
b'pandas.core.frame.DataFrame:<i8[(2, 1)]:\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
b'tensorflow.python.framework.ops.EagerTensor:<i4[(2,)]:\x00\x00\x00\x00\x00\x00\x00\x00'
b'torch.Tensor:<i8[(2,)]:\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
```

Can you share a pandas dataframe that fails? Mixing and matching types, I see:

```python
>>> b''.join(bytes_repr(pd.DataFrame({'a': [0, 0], 'b': [1.5, 2.5], 'c': ['one', 'two']}), Cache()))
b'pandas.core.frame.DataFrame:|O[(2, 3)]:\xb8\xa7\xf7\x08q;\x12\x80rw^\xc4\x12L:-\xd1\x94\x8a?2\x95a\x85=\x8c\xb8|06\xb2\xfe\xa7\xce\xcb+\xd1\xbbS\xd2\xd1\xe5\xa9\x8c|O\xed\xf9\xb8\xa7\xf7\x08q;\x12\x80rw^\xc4\x12L:-\xe3\xcd\x89$_K\x88\x98\xf8\x19l\x95fk\xf1\xee;\xf5\xfdW|\xf5>\xf8\xf7\x94a\x8c\xfa\xff\x15\x04'
```

Another approach could be to use
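The `{dtype}[{shape}]` prefix in the serializer above matters because the raw buffer alone cannot distinguish arrays that share bytes but differ in shape or dtype. A stdlib-only sketch of that collision (the `repr_bytes` helper is hypothetical, written just to illustrate the prefixing idea):

```python
import struct

# Two logically different arrays with identical raw buffers:
flat = struct.pack("<4i", 1, 2, 3, 4)    # int32 values, shape (4,)
square = struct.pack("<4i", 1, 2, 3, 4)  # same values, viewed as shape (2, 2)
assert flat == square  # raw bytes collide


def repr_bytes(dtype: str, shape: tuple, raw: bytes) -> bytes:
    """Hypothetical helper mirroring the dtype/shape prefix used above."""
    return f"{dtype}[{shape}]:".encode() + raw


# With the prefix, the two representations no longer collide:
print(repr_bytes("<i4", (4,), flat) != repr_bytes("<i4", (2, 2), square))  # -> True
```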
Running your above code in a Colab notebook produces the following for me:

I don't see where there should be a difference between the way IPython runs and Colab, but perhaps there is. And this behavior of torch and tensorflow is consistent with the issue I have otherwise been running into, where the hash will be the same for all tensors.
That's really weird. Can you activate
Apologies, I am not sure how to use pdb in this context. I have updated the link to the Colab to allow editing. I will continue to try to figure it out if you don't currently have the time to look into it.
I've never used Colab and don't have time to figure it out. I don't understand why you're getting weak references in your environment, hence why I suggested using pdb. It appears to be a Jupyter notebook, so you should be able to run the
In hash.py, when calling bytes_repr with the DataFrame, we go to the general function, where it breaks the dataframe into

and when the key/value

I think that the issue is that somehow your code is registering the serializer for DataFrames to be converted to NumPy, whereas mine is not. Theoretically, mine is the result you would expect if we don't register a new serializer and leave the code as is. If I run with
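The registration difference described here is consistent with how `singledispatch` resolves implementations: an exact-class registration takes precedence over a Protocol match, so whether a DataFrame-specific serializer is registered changes which code path a DataFrame takes. A dependency-free sketch (`Frame` is a hypothetical stand-in for `pd.DataFrame`):

```python
from functools import singledispatch
import typing as ty


@ty.runtime_checkable
class ArrayLike(ty.Protocol):
    def __array__(self, *args, **kwargs): ...


class Frame:
    """Hypothetical stand-in for pd.DataFrame: structurally matches ArrayLike."""

    def __array__(self, *args, **kwargs):
        return [0, 0]


@singledispatch
def serialize(obj: object) -> str:
    return "generic"


@serialize.register
def _(obj: ArrayLike) -> str:
    return "arraylike"


print(serialize(Frame()))  # -> arraylike (only the Protocol matches)


@serialize.register
def _(obj: Frame) -> str:
    return "frame"


print(serialize(Frame()))  # -> frame (exact class wins once registered)
```

So the same object can hash through entirely different serializers depending on which registrations have run, which would explain the diverging results between environments.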
Types of changes

Summary

- `bytes_repr_numpy`

Checklist

- `bytes_repr_numpy`
- which all changes are based off of