feat: add pretty nbytes repr to .show and jupyter repr #3348

Merged: 6 commits into main from pfackeldey/add_bytes_repr, Dec 18, 2024

pfackeldey
Collaborator

@pfackeldey pfackeldey commented Dec 17, 2024

This is a small PR that adds a new line with the array's byte size to .show(...) (opt-in via nbytes=True; the default is False) and to the Jupyter repr of high-level ak.Array and ak.Record objects. It's for the lazy (like me) who don't want to do the calculation in their head...

Preview (.show()):

[Image: Screenshot 2024-12-16 at 8 40 21 PM]

Preview (Jupyter repr):
[Image: Screenshot 2024-12-16 at 8 44 03 PM]

What do you think?

@agoose77
Collaborator

Love this improvement! There's the usual caveat about nbytes changing depending on how you share an array, but that's true already!

@pfackeldey
Collaborator Author

Yes 👍. It also doesn't account for the Python objects/structure or whatever is in attrs, but I'd argue that's only reasonable: the array(s) (should) make up 99% of the consumed memory.

@jpivarski
Member

This is very nice! But can we make it say "nbytes:" instead of "size:" (even if it's converted into units like "MB")? That would do two things: it would provide a hint as to how to access this number programmatically and it would avoid a confusion in which NumPy uses "size" to mean the number of items (size * itemsize = nbytes).

Also, are you making the right distinction between "MB" (megabytes) and "MiB" (mebibytes)? One system uses factors of 1000 and the other uses factors of 1024.

@pfackeldey
Collaborator Author

pfackeldey commented Dec 17, 2024

I chose size initially because of the 4 letters that align visually nicely with axes and type for the repr (this may not be a good argument though). I can change it to nbytes 👍

> Also, are you making the right distinction between "MB" (megabytes) and "MiB" (mebibytes)? One system uses factors of 1000 and the other uses factors of 1024.

That should be correct: I'm using factors of 1000 on ak.Array.nbytes to format the unit, so the result should be in "MB" and not "MiB", right?
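
For reference, the two unit conventions can be sketched as follows. This is a minimal illustration, not the code from this PR: `format_nbytes` is a hypothetical helper name, and the one-decimal rounding is an assumption.

```python
def format_nbytes(nbytes: int, binary: bool = False) -> str:
    """Render a byte count with SI (factor 1000) or IEC (factor 1024) units.

    Hypothetical helper for illustration; not Awkward Array's implementation.
    """
    factor = 1024 if binary else 1000
    units = (
        ["B", "KiB", "MiB", "GiB", "TiB"]
        if binary
        else ["B", "kB", "MB", "GB", "TB"]
    )
    value = float(nbytes)
    unit = units[0]
    for unit in units:
        # Stop at the first unit in which the value is below one factor step,
        # or at the largest unit in the table.
        if value < factor or unit == units[-1]:
            break
        value /= factor
    if unit == "B":
        return f"{nbytes} B"
    return f"{value:.1f} {unit}"


print(format_nbytes(2_000_000))               # 2.0 MB
print(format_nbytes(2_000_000, binary=True))  # 1.9 MiB
```

The same byte count reads about 5% smaller in MiB than in MB, which is why the label needs to match the factor actually used.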

@pfackeldey
Collaborator Author

Oh btw, one thing that I noticed is that the typetracer backend reports nbytes as if it were a NumPy array. I can see why we want to mimic NumPy arrays in every aspect with the typetracer backend, but wouldn't it make sense to set nbytes for typetracers explicitly to 0? This is just a minor design choice...

@jpivarski
Member

https://simple.wikipedia.org/wiki/Mebibyte (I'm in a meeting)

@pfackeldey
Collaborator Author

pfackeldey commented Dec 17, 2024

As discussed, I changed:

  • "KB" -> "kB"
  • "size:" -> "nbytes:"
  • added backend as an argument
  • sorted the rows (other than array and type) in ascending order for .show and in descending order for the Jupyter repr

Example:

In [1]: import awkward as ak; import numpy as np

In [2]: array = ak.with_named_axis(
   ...:     ak.zip({
   ...:         "one": np.full((1234567,), 1.0),
   ...:         "two": np.full((1234567,), 2.0),
   ...:         "three": np.full((1234567,), 3.0),
   ...:     }),
   ...:     named_axis=("some",),
   ...: )

In [3]: array.show(type=True, named_axis=True, nbytes=True, backend=True)
type: 1234567 * {
    one: float64,
    two: float64,
    three: float64
}
axes: some:0
nbytes: 29.6 MB
backend: cpu
[{one: 1, two: 2, three: 3},
 {one: 1, two: 2, three: 3},
 {one: 1, two: 2, three: 3},
 {one: 1, two: 2, three: 3},
 {one: 1, two: 2, three: 3},
 {one: 1, two: 2, three: 3},
 {one: 1, two: 2, three: 3},
 {one: 1, two: 2, three: 3},
 {one: 1, two: 2, three: 3},
 {one: 1, two: 2, three: 3},
 ...,
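
The nbytes line in this output can be checked by hand. The sketch below assumes the array holds exactly three float64 buffers of 1234567 items each and nothing else (an assumption about the layout, though it matches the reported value); the variable names are illustrative.

```python
# NumPy-style bookkeeping: size is the item count, itemsize the bytes per item.
size = 1234567   # items per field
itemsize = 8     # bytes per float64
n_fields = 3     # "one", "two", "three"

nbytes = n_fields * size * itemsize
print(nbytes)                         # 29629608
print(f"{nbytes / 1000**2:.1f} MB")   # 29.6 MB  (factors of 1000)
print(f"{nbytes / 1024**2:.1f} MiB")  # 28.3 MiB (factors of 1024)
```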

For .show(...) the order is:
(1) type (2) axes, nbytes, backend, in ascending order of their <prefix>: length (3) array

For the Jupyter repr the order is:
(1) array (2) ------ (3) backend, nbytes, axes, in descending order of their <prefix>: length (4) type
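
The ordering rule described above can be sketched in a few lines. This is an illustration only, not the PR's code: the `rows` mapping is made up here (its values are taken from the example output), and sorting on the length of the `<prefix>:` label follows the description above.

```python
# Illustrative metadata rows keyed by their "<prefix>:" label.
rows = {
    "backend:": "cpu",
    "axes:": "some:0",
    "nbytes:": "29.6 MB",
}

# .show(...): metadata rows in ascending order of prefix length.
show_order = sorted(rows, key=len)
# Jupyter repr: the same rows in descending order of prefix length.
jupyter_order = sorted(rows, key=len, reverse=True)

print(show_order)     # ['axes:', 'nbytes:', 'backend:']
print(jupyter_order)  # ['backend:', 'nbytes:', 'axes:']
```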

@pfackeldey pfackeldey requested a review from jpivarski December 17, 2024 20:00
src/awkward/highlevel.py (3 outdated review threads, resolved)
@pfackeldey
Collaborator Author

@jpivarski could you have one last look at this PR? thanks 🙏

Member

@jpivarski jpivarski left a comment

It looks great! I see that the code duplication is gone (now in highlevel_array_show_rows) and it has an all argument. I think the PR is ready to merge.

@pfackeldey pfackeldey merged commit 55f1909 into main Dec 18, 2024
39 checks passed
@pfackeldey pfackeldey deleted the pfackeldey/add_bytes_repr branch December 18, 2024 16:05