Skip to content

feat: add ak.enforce_type #2365

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 62 commits into from
May 5, 2023
Merged

feat: add ak.enforce_type #2365

merged 62 commits into from
May 5, 2023

Conversation

agoose77
Copy link
Collaborator

@agoose77 agoose77 commented Apr 5, 2023

Warning
Depends upon #2424 and #2425

Closes #1805, closes #2338 (?) by adding a function ak.enforce_type that converts an existing layout to a new high-level type. The following rules apply:

  • Indexed layouts are projected out if their type changes.
  • Option types can be either:
    • added
    • removed (if there are no missing values)
  • Unions can either:
    • grow
    • shrink (if the lost content(s) are not referenced by any tags)
    • remain equivalent
    • change by a single content
    • project to a single type
    • convert to a single type
  • Records can be converted:
    • tuple→tuple
      • N contents to M types (N < M) if the trailing (M-N) types are options
      • N contents to M types (N > M) drops the trailing (N-M) contents
    • record→record
      • matching fields are kept, missing fields dropped, new fields added (but must be option types)
  • List types can become regular/irregular (regularity is validated with known data)
  • NumPy arrays can change dtypes
  • Unknown contents (EmptyArray) can:
    • be removed at any node,
    • be added at any node, provided that they're wrapped in an option

@agoose77 agoose77 temporarily deployed to docs-preview April 5, 2023 22:05 — with GitHub Actions Inactive
@codecov
Copy link

codecov bot commented Apr 6, 2023

Codecov Report

Merging #2365 (f8dd513) into main (b9c7c6e) will increase coverage by 0.07%.
The diff coverage is 82.00%.

Additional details and impacted files
Impacted Files Coverage Δ
src/awkward/operations/ak_enforce_type.py 81.84% <81.84%> (ø)
src/awkward/forms/form.py 85.46% <83.33%> (+0.57%) ⬆️
src/awkward/operations/__init__.py 100.00% <100.00%> (ø)
src/awkward/operations/ak_from_buffers.py 90.90% <100.00%> (ø)

... and 6 files with indirect coverage changes

@agoose77 agoose77 temporarily deployed to docs-preview April 6, 2023 05:58 — with GitHub Actions Inactive
@agoose77 agoose77 temporarily deployed to docs-preview April 6, 2023 13:02 — with GitHub Actions Inactive
@jpivarski
Copy link
Member

It's looking good so far!

@agoose77 agoose77 force-pushed the agoose77/feat-enforce-type branch from fcea2d2 to d3adb09 Compare April 24, 2023 20:55
@agoose77 agoose77 temporarily deployed to docs-preview April 24, 2023 21:16 — with GitHub Actions Inactive
@agoose77 agoose77 requested a review from jpivarski April 24, 2023 21:59
@agoose77 agoose77 temporarily deployed to docs-preview April 24, 2023 22:07 — with GitHub Actions Inactive
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read over the tests, the main public function API, and the coding style within ak.enforce_type, and everything looks very nice! Let me know if there are particular points you want me to evaluate in detail.

@agoose77 agoose77 temporarily deployed to docs-preview April 24, 2023 22:26 — with GitHub Actions Inactive
@agoose77
Copy link
Collaborator Author

agoose77 commented Apr 24, 2023

@jpivarski fab! Thank you, it's a lot of code. I'm heading off until tomorrow; I plan to revisit the rules for records and tuples.

Given that record fields can't be renamed with this function (or fields added/removed), record → record shouldn't need to compute permutations. Similarly, if we require that tuples can't be reordered), we don't need permutations there. I think removing these d.o.f should simplify things.

@agoose77 agoose77 marked this pull request as ready for review April 25, 2023 06:53
@agoose77
Copy link
Collaborator Author

So, the original implementation of this function permuted record contents, and required that the permutation had matching fields if both layout and type were tuples. I don't think re-ordering records makes sense for this function; that's a fundamentally different operation at which we can only guess by convertability.

So, now this function respects fields/positions.

@agoose77 agoose77 temporarily deployed to docs-preview April 25, 2023 07:04 — with GitHub Actions Inactive
@agoose77 agoose77 temporarily deployed to docs-preview April 25, 2023 08:41 — with GitHub Actions Inactive
@agoose77
Copy link
Collaborator Author

@jpivarski the motivation for the "thunk" pattern was to separate convertibility tests from the actual conversion logic. However, the only practical need for this was for the branches that occured with unions and records.

In the above commits & comments, you can see that I decided to make records and unions more strict; records cannot convert between tuple-record, and unions must now be strictly equal to the type in all "existing" contents, unless the total number of contents does not change (in which case, up-to one content can differ in order to be converted).

Regarding these changes for unions, it makes reasoning about them far simpler; we only have to consider type equality, rather than type enforceability.

As such, the need for a two-phase conversion is gone, and this logic is much easier to write as a direct recursive function. I found the need for a layout_equals_type function; although we can just write layout.form.type.is_equal_to, I suspect we want to avoid generating the type at multiple points during the tree.

Now, an existing example of where we keep the separate "can I do X" and "do X" logic is the the content merging protocol, which is defined over _mergeable_next and _mergemany. I don't think the motivations for separating these two processes holds here; the mergeable logic essentially looks like:

  1. is X mergeable with Y?
  2. if so, merge, else create union
  3. proceed ...

whereas in our case, if something isn't enforceable to T, that's an error (and we blow up).

So, I'm happy that this code no longer separates those two code pathways.

N.B., to be explicit I'm not talking about "thunks" vs two separate functions, rather I'm referring to the need to have "can I do X" at all.

@agoose77
Copy link
Collaborator Author

@jpivarski this is converging on a final implementation. I'd benefit from the following:

  • Thoughts on whether layout_equals_type should be a public function, e.g. defined on layout as def is_equal_to_type()?
  • Thoughts on rules regarding unions and records (see main PR description for details)

@agoose77 agoose77 temporarily deployed to docs-preview April 25, 2023 08:54 — with GitHub Actions Inactive
@agoose77 agoose77 force-pushed the agoose77/feat-enforce-type branch 3 times, most recently from 8d94ae8 to 8b3bc04 Compare April 25, 2023 09:51
@pre-commit-ci pre-commit-ci bot temporarily deployed to docs-preview April 25, 2023 10:09 Inactive
@jpivarski
Copy link
Member

  • Thoughts on whether layout_equals_type should be a public function, e.g. defined on layout as def is_equal_to_type()?

Just for wording, a layout has a type, but it can't be equal to a type. It's not a big deal for the internal function, but would be for an external API.

If an array has both ak.Array.type == t and ak.Array.layout.has_type(t), users will wonder which one they should use. It looks to me like layout_equals_type does more than just type equality because it allows for some fields to be unknown_length and the types are still considered equal. From inside the codebase, it can be clear that such a thing is needed, but from outside the codebase, that's a subtle difference.

I think we should not expose this as a public API unless it's needed for some application. Before we get to this one, dask-awkward and Coffea are reaching into L3 functions, and those are functions with demonstrated need to be L2.

  • Thoughts on rules regarding unions and records (see main PR description for details)

A few:

  • Indexed layouts are projected out.

Are they necessarily projected out? If the node above the IndexedArray and the node below the IndexedArray do not need to be changed (determined via types only, not values), how about if the IndexedArray is left as-is? Such a determination can only be made on the bottom-up return-pass of the recursion (i.e. after recursing down, when you're on the way back up again, you'll be able to tell that the IndexedArray's content didn't change).

The reason that's valuable is because it minimally touches RecordArrays with many fields; maybe most of them are unaffected. The RecordArray may be within an IndexedArray (a common case).

  • Option types are packed, and can be either:
    • added
    • removed (if there are no missing values)

Great!

  • Unions can either:
    • grow
    • shrink (if the lost content(s) are not referenced by any tags)
    • remain equivalent
    • change by a single content

Great!

  • Records can be converted:
    • tuple→tuple
    • record→record (fields must match)

Maybe fields of a tuple can be dropped if they're at the end (tuples get shorter). That might be expected, though I can't think of any use-cases in which it would matter.

For record→record, it can be converted into a subset of fields, right?

Also for record→record, can it be converted into a superset of fields if the new fields are option-type? That would be a useful feature.

  • NumPy arrays can change dtypes

Great!

  • 1D list types cannot be changed

What about

  • regular → variable-length (always)
  • variable-length → regular (runtime exception if the lengths don't match)?

We already have ak.from_regular and ak.to_regular, but forcing users to do that part of the type conversion as a separate step (and figure out the axis parameters) is unfriendly.

  • Unknown contents (EmptyArray) can be added/removed at any node

By "removing EmptyArray" (outside the context of a RecordArray field), do you mean "convert EmptyArray into any type"? That's what I would expect: just create a zero-length array from the Form and replace the EmptyArray with that.

@agoose77
Copy link
Collaborator Author

List conversion is already there - I can see how my description didn't give that thought! Let me update it.

Records and tuples can drop fields, let's add this. I didn't notice that we can introduce none's for new contents - that would be a good feature too! I hadn't originally thought it would be sensible / possible.

Currently we touch all indexed nodes if we walk into the layout, and we walk into all nodes to set parameters. We could perhaps early exit if a type matches, but that would be double walking the tree at each level.

@jpivarski
Copy link
Member

Double-walking is better than touching, since double-walking is metadata-only and touching means that a whole column will be loaded.

Setting parameters shouldn't touch any buffers; that's a different part of the Content instance. If you need to make a new Content instance with different parameters, it can just reference the Index or buffer that the original Content had without touching.

The usefulness of adding new record fields that are all None is because this function will be used in normalizing dynamic types as specific static types. Suppose some system doesn't make attributes if their value would be missing (i.e. {"x": 123} instead of {"x": 123, "y": null}).

  • If all instances of the attribute are present, ArrayBuilder would make the field non-option-type, and the user would need to insert their knowledge that it should be option-type with no missing values.
  • If only some instances of the attribute are present, ArrayBuilder would make the field option-type (instead of making a union of the record with and without the field: that was a choice).
  • If no the instances of the attribute are present, ArrayBuilder would not and could not make the field. The user would need to insert their knowledge that it should be there.

So I think this would come up quite a bit.

@agoose77 agoose77 temporarily deployed to docs-preview May 4, 2023 14:12 — with GitHub Actions Inactive
@agoose77 agoose77 temporarily deployed to docs-preview May 4, 2023 14:47 — with GitHub Actions Inactive
@agoose77
Copy link
Collaborator Author

agoose77 commented May 4, 2023

OK @jpivarski this is now nearly done.

I've summarised the rules in the docstring.

Since your last review, I've:

  • added a new rule - introducing option[unknown] in place of any existing type
  • removed the union_erasure flag in favour of figuring this out from the type1
  • added docstrings with doctest examples
  • added an _type_is_enforceable which tests enforceability and whether the operation will require a projection of the current node contents
  • fixed some bugs

If you get 5 minutes, I'd appreciate if you could play around with some conversions to ensure it works :P

On my remaining todo list:

  • check typetracer arrays
  • strictly check layouts, rather than almost_equal

Footnotes

  1. I was worried about some corner cases here, such as length-zero regulararrays. After some experimenting, I have since convinced myself that we are able to do this robustly.

@agoose77 agoose77 temporarily deployed to docs-preview May 4, 2023 17:17 — with GitHub Actions Inactive
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! I heartily approve it for merging!

You said "nearly done," so I'll let you decide when to merge it.

@agoose77
Copy link
Collaborator Author

agoose77 commented May 5, 2023

I've now finished with this for now. There are no typetracer tests, but that is because the layout equality checks need to be extended to handle this. I'll do so in a follow up PR.

@agoose77 agoose77 enabled auto-merge (squash) May 5, 2023 15:04
@agoose77 agoose77 temporarily deployed to docs-preview May 5, 2023 15:12 — with GitHub Actions Inactive
@agoose77 agoose77 temporarily deployed to docs-preview May 5, 2023 15:27 — with GitHub Actions Inactive
@agoose77 agoose77 merged commit bbd4fc4 into main May 5, 2023
@agoose77 agoose77 deleted the agoose77/feat-enforce-type branch May 5, 2023 15:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make a function that assigns types to unknowns from an overall type/Form description Allow specifying type for empty Array
2 participants