Registered serializer for common classes of additional array-like objects #762

wilke0818 · 2024-07-18T02:40:15Z

Types of changes

Bug fix (non-breaking change which fixes an issue), see splitting list of torch.Tensor causes "repeating" behavior of the output #761

Summary

Registers serializers for 3 common array-like packages: torch, tensorflow, and pandas
Copies existing methodology for serializing numpy in bytes_repr_numpy

Checklist

I have added tests to cover my changes (if necessary)
- No tests exist for bytes_repr_numpy which all changes are based off of
I have updated documentation (if necessary)
- No, let me know if there is a specific place that references this hash code that needs updating

…ects

for more information, see https://pre-commit.ci

codecov · 2024-07-18T02:44:46Z

Codecov Report

Attention: Patch coverage is 40.00000% with 18 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@d674840). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
pydra/utils/hash.py	40.00%	15 Missing and 3 partials ⚠️

❌ Your patch check has failed because the patch coverage (40.00%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #762   +/-   ##
=======================================
  Coverage        ?   15.22%           
=======================================
  Files           ?       53           
  Lines           ?    15801           
  Branches        ?     2817           
=======================================
  Hits            ?     2406           
  Misses          ?    13062           
  Partials        ?      333

Flag	Coverage Δ
unittests	`15.18% <40.00%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

djarecka · 2024-07-18T15:10:59Z

Thank you for your contribution! do you think you can create a simple test that shows the issue you are addressing?

we could perhaps add the libraries to the optional dependencies list
and run the tests in one of the environments we are using in GA
(if this is not clear, I can help you with it)

effigies · 2024-07-18T15:21:22Z

I don't think we should add these as-is, and would strongly suggest avoiding adding anything but numpy directly. However, we could use protocols to try to identify these and similar types that can be coerced to numpy arrays. Perhaps looking for an __array__() method, or one of these .to_numpy() methods?

djarecka · 2024-07-18T15:33:54Z

@effigies - you mean not adding bytes_repr_tensorflowat all, and just try to re-use bytes_repr_numpy for tensorflow objects?

wilke0818 · 2024-07-18T15:44:42Z

@effigies I'm not aware of a common method like __array__ that many of these libraries use. I considered the to_numpy() and numpy() route as you mention, but this would require changing bytes_repr_numpy to accept more than just numpy types, right? Doing so might then make the optional dependencies weird? I fallback on y'all's judgement though. @djarecka I will try to look at creating a test in the meantime.

effigies · 2024-07-18T16:23:10Z

I mean something like:

class ArrayLike(ty.Protocol):
    def __array__(self, *args, **kwargs): ...

@register_serializer
def bytes_repr_arraylike(obj: ArrayLike, cache: Cache) -> Iterator[bytes]:
        yield f"{obj.__class__.__module__}{obj.__class__.__name__}:{obj.size}:".encode()
        array = np.asanyarray(obj)
        if array.dtype == "object":
            yield from bytes_repr_sequence_contents(iter(array.ravel()), cache)
        else:
            yield array.tobytes(order="C")

As a quick proof of concept:

from functools import singledispatch
import typing as ty

import numpy as np
import pandas as pd
import tensorflow as tf
import torch

class ArrayLike(ty.Protocol):
    def __array__(self, *args, **kwargs): ...

@singledispatch
def identify(obj: object) -> str:
    return obj.__class__.__name__

@identify.register
def _(obj: ArrayLike) -> str:
    return "ArrayLike"

print(f"{identify([0, 0])=}")  # list
print(f"{identify(np.array([0, 0]))=}")  # ArrayLike
print(f"{identify(pd.DataFrame({"a": [0, 0]}))=}")  # ArrayLike
print(f"{identify(tf.constant([0, 0]))=}")  # ArrayLike
print(f"{identify(torch.tensor([0, 0]))=}")  # ArrayLike

wilke0818 · 2024-07-18T20:58:35Z

@effigies I tried something like you mention, or at least how I understood it, and copied it below. I haven't written official test cases, but have tried it in a colab notebook where I was originally getting the error this was meant to fix. The issue is that though your proof of concept works, this doesn't work for DataFrames (which throw an unserializable error) nor torch Tensors which continue to fail silently. For some reason it did seem to work for tensorflow.

class ArrayLike(ty.Protocol):
    def __array__(self, *args, **kwargs): ...

    def __array_interface__(self, *args, **kwargs): ...


if HAVE_NUMPY:
    @register_serializer
    def bytes_repr_arraylike(obj: ArrayLike, cache: Cache) -> Iterator[bytes]:
            yield f"{obj.__class__.__module__}{obj.__class__.__name__}:".encode()
            array = numpy.asanyarray(obj)
            yield f"{array.size}:".encode()
            if array.dtype == "object":
                yield from bytes_repr_sequence_contents(iter(array.ravel()), cache)
            else:
                yield array.tobytes(order="C")

effigies · 2024-07-19T18:02:17Z

This works for me in a quick IPython session:

class ArrayLike(ty.Protocol):
    def __array__(self, *args, **kwargs): ...

@register_serializer
def bytes_repr_arraylike(obj: ArrayLike, cache: Cache) -> Iterator[bytes]:
    yield f"{obj.__class__.__module__}.{obj.__class__.__name__}:".encode()
    array = np.asanyarray(obj)
    yield f"{array.dtype.str}[{array.shape}]:".encode()
    if array.dtype == "object":
        yield from bytes_repr_sequence_contents(iter(array.ravel()), cache)
    else:
        yield array.tobytes(order="C")

Testing with the test objects above:

print(f"{b''.join(bytes_repr([0, 0], Cache()))}")
print(f"{b''.join(bytes_repr(np.array([0, 0]), Cache()))}")
print(f"{b''.join(bytes_repr(pd.DataFrame({"a": [0, 0]}), Cache()))}")
print(f"{b''.join(bytes_repr(tf.constant([0, 0]), Cache()))}")
print(f"{b''.join(bytes_repr(torch.tensor([0, 0]), Cache()))}")

b'list:(\xb8\xa7\xf7\x08q;\x12\x80rw^\xc4\x12L:-\xb8\xa7\xf7\x08q;\x12\x80rw^\xc4\x12L:-)'
b'numpyndarray:2:\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
b'pandas.core.frame.DataFrame:<i8[(2, 1)]:\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
b'tensorflow.python.framework.ops.EagerTensor:<i4[(2,)]:\x00\x00\x00\x00\x00\x00\x00\x00'
b'torch.Tensor:<i8[(2,)]:\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

Can you share a pandas dataframe that fails? Mixing and matching types, I see:

>>> b''.join(bytes_repr(pd.DataFrame({'a': [0, 0], 'b': [1.5, 2.5], 'c': ['one', 'two']}), Cache()))
b'pandas.core.frame.DataFrame:|O[(2, 3)]:\xb8\xa7\xf7\x08q;\x12\x80rw^\xc4\x12L:-\xd1\x94\x8a?2\x95a\x85=\x8c\xb8|06\xb2\xfe\xa7\xce\xcb+\xd1\xbbS\xd2\xd1\xe5\xa9\x8c|O\xed\xf9\xb8\xa7\xf7\x08q;\x12\x80rw^\xc4\x12L:-\xe3\xcd\x89$_K\x88\x98\xf8\x19l\x95fk\xf1\xee;\xf5\xfdW|\xf5>\xf8\xf7\x94a\x8c\xfa\xff\x15\x04'

Another approach could be to use yield pickle.dumps(obj) (via @satra), which will work for pickle-able objects, but if a dataframe is not serializable, that may not work either.

wilke0818 · 2024-07-19T18:35:15Z

Running your above code in a Colab notebook produces the following for me:

and when including the DataFrame I get

I don't see where there should be a difference in the way iPython runs from using a Colab but perhaps there is.
https://colab.research.google.com/drive/1f-XYbg2Gr4m2YyzFj8zmoYzoNCLA0_jN?usp=sharing

and this behavior of torch and tensorflow is consistent with the issue I have otherwise been running into, where the hash will be the same for all tensors.

effigies · 2024-07-19T18:49:07Z

That's really weird. Can you activate pdb and see what object is actually causing the problem?

wilke0818 · 2024-07-19T19:04:23Z

apologies I am not sure how to use pdb in this context. I have added made the link to the colab to allow editing. will continue to try and figure it out if you don't have the time currently to look into it.

effigies · 2024-07-19T19:12:37Z

I've never used colab and don't have time to figure it out. I don't understand why you're getting weak references in your environment, hence why I suggest using pdb. It appears to be a Jupyter notebook, so you should be able to run the %pdb magic before calling the erroring function and walk through the call stack to find where you got a weak reference. pd.DataFrame([[0,0],[1,1]]) gives me a normal Pandas dataframe and serializes to numpy without problem.

wilke0818 · 2024-07-19T21:17:00Z

In hash.py, when calling with the DataFrame to bytes_repr, we go to the general function where it breaks the dataframe into
{'_is_copy': None, '_mgr': BlockManager Items: Index(['b'], dtype='object') Axis 1: RangeIndex(start=0, stop=2, step=1) ObjectBlock: slice(0, 1, 1), 1 x 2, dtype: object, '_item_cache': {}, '_attrs': {}, '_flags': <Flags(allows_duplicate_labels=True)>} using the __dict__ function in pandas which is then passed to bytes_repr_mapping_contents and this whole process continues for a while until we try to hash the key/value pair of '_flags': <Flags(allows_duplicate_labels=True)>. The issue is that when Flags.__dict__ is called, it returns

{'_allows_duplicate_labels': True, '_obj': <weakref at 0x7eac8b3931f0; to 'DataFrame' at 0x7eac888e9a20>}

and when the key/value '_obj': <weakref at 0x7eac8b3931f0; to 'DataFrame' at 0x7eac888e9a20>} is attempted to be hash, we try to run __dict__ on the weakref which fails.

I think that the issue is somehow your code is registering the serializer for DataFrames to be converted to NumPy whereas mine is not. Theoretically mine is the result you would expect if we don't register a new serializer/leave the code as is. If I run with @register_serializer(pd.DataFrame) it works (for pandas, not torch or tensorflow), but with @register_serializer or @register_serializer(ArrayLike) it doesn't work

wilke0818 and others added 2 commits July 17, 2024 16:32

Registered serializer for common classes of additional array-like obj…

233470d

…ects

[pre-commit.ci] auto fixes from pre-commit.com hooks

83c721b

for more information, see https://pre-commit.ci

effigies mentioned this pull request Jul 18, 2024

splitting list of torch.Tensor causes "repeating" behavior of the output #761

Open

effigies closed this Mar 17, 2025

effigies reopened this Mar 17, 2025

tclose added the to consider suggesting changes that require more discussion label Apr 2, 2025

tclose mentioned this pull request Apr 29, 2025

Better hashing of generic objects #776

Open

tclose added this to Pydra Roadmap Apr 29, 2025

github-project-automation bot moved this to In progress in Pydra Roadmap Apr 29, 2025

tclose moved this from In progress to v1.0 in Pydra Roadmap Apr 29, 2025

tclose removed the to consider suggesting changes that require more discussion label Apr 29, 2025

tclose assigned effigies May 8, 2025

tclose mentioned this pull request May 8, 2025

[wip] adding a part for pandas.df to hash_value #575

Closed

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Registered serializer for common classes of additional array-like objects #762

Registered serializer for common classes of additional array-like objects #762

Uh oh!

wilke0818 commented Jul 18, 2024

Uh oh!

codecov bot commented Jul 18, 2024 •

edited

Loading

Uh oh!

djarecka commented Jul 18, 2024 •

edited

Loading

Uh oh!

effigies commented Jul 18, 2024

Uh oh!

djarecka commented Jul 18, 2024

Uh oh!

wilke0818 commented Jul 18, 2024

Uh oh!

effigies commented Jul 18, 2024

Uh oh!

wilke0818 commented Jul 18, 2024

Uh oh!

effigies commented Jul 19, 2024

Uh oh!

wilke0818 commented Jul 19, 2024 •

edited

Loading

Uh oh!

effigies commented Jul 19, 2024

Uh oh!

wilke0818 commented Jul 19, 2024

Uh oh!

effigies commented Jul 19, 2024

Uh oh!

wilke0818 commented Jul 19, 2024 •

edited

Loading

Uh oh!

Uh oh!

Registered serializer for common classes of additional array-like objects #762

Are you sure you want to change the base?

Registered serializer for common classes of additional array-like objects #762

Uh oh!

Conversation

wilke0818 commented Jul 18, 2024

Types of changes

Summary

Checklist

Uh oh!

codecov bot commented Jul 18, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

djarecka commented Jul 18, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

effigies commented Jul 18, 2024

Uh oh!

djarecka commented Jul 18, 2024

Uh oh!

wilke0818 commented Jul 18, 2024

Uh oh!

effigies commented Jul 18, 2024

Uh oh!

wilke0818 commented Jul 18, 2024

Uh oh!

effigies commented Jul 19, 2024

Uh oh!

wilke0818 commented Jul 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

effigies commented Jul 19, 2024

Uh oh!

wilke0818 commented Jul 19, 2024

Uh oh!

effigies commented Jul 19, 2024

Uh oh!

wilke0818 commented Jul 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

codecov bot commented Jul 18, 2024 •

edited

Loading

djarecka commented Jul 18, 2024 •

edited

Loading

wilke0818 commented Jul 19, 2024 •

edited

Loading

wilke0818 commented Jul 19, 2024 •

edited

Loading