Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: control attrs better as described in issue #3277 #3344

Merged
merged 6 commits into from
Dec 18, 2024

Conversation

pfackeldey
Copy link
Collaborator

This changes the behavior of attrs as described in #3277. This PR closes #3277.

An example that works with this PR is:

import awkward as ak

arr = ak.Array([1])
arr.attrs["foo"] = "bar"

arr2 = ak.copy(arr)
assert arr2.attrs == arr.attrs

arr2.attrs["foo"] = "baz"
assert arr2.attrs != arr.attrs

Because of this change (and that we have now a Attrs class as proposed in #3277) I had to change is to == in some of the tests.

@pfackeldey pfackeldey requested a review from jpivarski December 16, 2024 17:52
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is both a nice API design and it's the first good use of weak references that I've seen. These are intended to be short-lived objects that need to refer back to the ak.Array that made them, and creating a cyclic reference (putting the array into a list of things that won't be deleted until garbage collection) just for that doesn't make sense.

In fact, there's one more proxy like this: the ak.Array.mask property makes a proxy like this, for applying a slice as ak.mask instead of __getitem__. The fact that masking creates cyclic references might be why some physics analyses are using more memory than others, in a way that would surely be surprising/unexpected.

@jpivarski
Copy link
Member

Oh, I forgot to mention: I don't see anything here that enforces the attribute type or immutability. I like the mutability in this object, though: seeing how this can be used, I'm convinced that

my_array.attrs["something"] = 123

is a good thing. About the type: this PR enforces the key type to be strings, but it doesn't enforce the value type to be anything in particular, right? Persistent attributes would eventually need to be JSON or pickleable to be saved in files, perhaps Parquet files or pickles, but the transient attributes don't need to be restricted. There might be good reason to let a transient attribute be an unpickleable lambda function.

@pfackeldey
Copy link
Collaborator Author

pfackeldey commented Dec 16, 2024

I can have a look at ak.Array.mask aswell with weakref in a different PR.

this PR enforces the key type to be strings, but it doesn't enforce the value type to be anything in particular, right?

No, the types have never been enforced, it's just type annotations. With and without this PR the types can be anything that a Mapping would accept (key: Hashable, value: Any).

I don't know how the serialization of attrs works. Shouldn't the serialization step just try to serialize attrs, and if it can't drop a warning and save the arrays without attrs? (I'd expect something like this as a behavior)

edit: from a brief look at the ak.to_parquet serialization method I don't see any serialization of .attrs right now, it seems to serialize on the layout level, which doesn't have .attrs. The serialization of .attrs is only defined on __reduce_ex__, so for pickle. Are you sure that ever worked for parquet?

@jpivarski
Copy link
Member

I mentioned immutability and controlling the type because that's what the issue (#3277) was saying, although we can certainly change our minds on that.

Ah! "Controlling the type" meant ensuring that it's a dict. It doesn't say anything about other interpretations:

  • controlling the key type: I think we always want these to be strings (as it is for parameters)
  • controlling persistent value types: maybe—at least to raise an exception when trying to serialize it if it's not serializable (JSON only or fallback to pickle?)
  • controlling transient value types: no. That's too strict.

I'd rather not raise a warning when failing to serialize. This is a general problem with warnings in Python: user A of package B that depends on package C that depends on Awkward will see the warning and (1) not know what it's for, since it's something internal to B or C, and (2) not get a stack trace to report. It could be an exception or silently skip writing the attribute, but there are more problems with silently skipping.

I just checked parameters (here): it doesn't do any runtime guarantee of the key type, either. Okay, at least that's consistent. Our policy is that the key types should be strings, but that's only enforced at the type annotation level.

For persistent value types, if it's not serializable, I guess it raises an exception now, right?

Yes, both before

>>> a = ak.Array([1, 2, 3])
>>> a.attrs["yikes"] = lambda x: x
>>> pickle.dumps(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_pickle.PicklingError: Can't pickle <function <lambda> at 0x752e88cd9a80>: attribute lookup <lambda> on __main__ failed

and after

>>> a = ak.Array([1, 2, 3])
>>> a.attrs["yikes"] = lambda x: x
>>> pickle.dumps(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_pickle.PicklingError: Can't pickle <function <lambda> at 0x791809fd5a80>: attribute lookup <lambda> on __main__ failed

this PR. I think everything above is in a good state and this PR is ready to merge.

@pfackeldey
Copy link
Collaborator Author

pfackeldey commented Dec 16, 2024

Yes, exactly, that's the policy right now. Maybe it does make sense to enforce at least the key type to be a string because otherwise this is a bad error:

a = ak.Array([1, 2, 3])
a.attrs[1] = lambda x: x
pickle.dumps(a)
# > AttributeError: 'int' object has no attribute 'startswith'

because the pickling step checks for transient keys first.

I have an idea:
We don't enforce the key type to be str, but instead define transient keys only as strings; i.e. 'if you're not using a string, it can't be transient'. This would be a simple change in this function that would additionally check for isinstance(key, str).

What do you think about this?

@jpivarski
Copy link
Member

We don't enforce the key type to be str, but instead define transient keys only as strings; i.e. 'if you're not using a string, it can't be transient'.

That makes sense to me. The current rule for identifying a transient is that it starts with an "@"; the new rule would be that it's a string and starts with an "@" (which was kind-of implicit in the current rule).

We do want people to only make parameter and attr keys that are strings. The type annotations should be enough to assert that this is the API. (In the past, I would have made this a runtime check.)

src/awkward/highlevel.py Outdated Show resolved Hide resolved
src/awkward/_attrs.py Show resolved Hide resolved
Co-authored-by: Angus Hollands <[email protected]>
@pfackeldey pfackeldey merged commit 27faa82 into main Dec 18, 2024
39 checks passed
@pfackeldey pfackeldey deleted the pfackeldey/control_attrs_better branch December 18, 2024 14:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

We should control the type of ak.Array.attrs, and it should be immutable
3 participants