How to create an empty tree with several var-length branches that share the var length #759

Open
HDembinski opened this issue Oct 18, 2022 · 3 comments
Labels
docs Improvements or additions to documentation

Comments

@HDembinski
Member

HDembinski commented Oct 18, 2022

I want to write several variable length arrays that have the same length to an empty tree. https://uproot.readthedocs.io/en/latest/basic.html#extending-ttrees-with-large-datasets explains

  1. How to generate an empty tree with mktree, which can be extended later (this is what I need)
  2. How to generate a tree with several arrays that share the length and already contain data, with awkward.zip

What I am missing is the combination of the two. I want to start with an empty tree as in 1) whose branches share the variable length, and then extend this tree in subsequent steps. Is this possible? If not, may I suggest the following intuitive API:

Instead of {"x": "var * float32", "y": "var * float32"}, which generates two extra branches "nx" and "ny", please allow this declaration, where the counting branch is explicit:

{"n": "int32", "x": "n * float32", "y": "n * float32"}

As the Zen of Python says, explicit is better than implicit.

@HDembinski HDembinski added the docs Improvements or additions to documentation label Oct 18, 2022
@HDembinski
Member Author

Related: scikit-hep/awkward#1805

@HDembinski
Member Author

HDembinski commented Oct 18, 2022

A workaround for my situation is this:

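import awkward as ak
import uproot

# `file`, `chunks`, and `generator` are placeholders defined elsewhere
with uproot.recreate(file) as f:
    for ichunk in range(chunks):
        px = []
        py = []
        for event in generator(ichunk):
            px.append(event.px)
            py.append(event.py)
        # zipping px and py makes them share a single counter branch
        data = {"": ak.zip({"px": ak.Array(px), "py": ak.Array(py)})}
        if "tree" in f:
            # the tree already exists: append this chunk
            f["tree"].extend(data)
        else:
            # first chunk: create the tree from the data
            f["tree"] = data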
  1. It would be great to add this to the docs.
  2. It would be great if calling f["tree"].extend() would be enough here, in other words, if the tree is still empty, make f["tree"].extend(data) do the same as f["tree"] = data if that's technically possible.
  3. Thanks for supporting the special case {"": ak.zip(...)} correctly.

@jpivarski
Member

Adopting

{"n": "int32", "x": "n * float32", "y": "n * float32"}

would be hard because "n * float32" doesn't parse to a valid ak.types.Type, and Uproot is just using Awkward Array here. This request cuts across the boundaries of modular code, and across the boundary between Uproot and Awkward. It could, perhaps, be implemented by adding a flag to the Awkward type-parser to allow identifiers in place of dimension sizes (as long as they're not named "var"!). The resulting tree would then contain these non-ak.types.Type placeholders, which Uproot would have to resolve across the whole dict, but that makes this a much larger project than you might have been thinking.
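
To make the "resolve across the whole dict" step concrete, here is a minimal pure-Python sketch. It is purely hypothetical: `resolve_counters` and its regular expression are invented for illustration and are not part of Uproot or Awkward Array, and a real implementation would extend the Awkward type-parser rather than use a regex.

```python
import re

# Hypothetical illustration only: neither Uproot nor Awkward Array has this.
# Matches "<identifier> * <item type>", e.g. "n * float32".
_DIM = re.compile(r"^\s*([A-Za-z_]\w*)\s*\*\s*(.+?)\s*$")


def resolve_counters(spec):
    """Split a branch spec into plain branches and counter groups.

    Returns (plain, counters), where counters maps each counter name to a
    dict of the branches that use it as their length.
    """
    plain = {}
    counters = {}
    for name, typestr in spec.items():
        match = _DIM.match(typestr)
        # "var" keeps its usual meaning and is never treated as a counter name
        if match and match.group(1) != "var":
            counter, item_type = match.groups()
            if counter not in spec:
                raise ValueError(f"{name!r} refers to unknown counter {counter!r}")
            counters.setdefault(counter, {})[name] = item_type
        else:
            plain[name] = typestr
    return plain, counters


plain, counters = resolve_counters(
    {"n": "int32", "x": "n * float32", "y": "n * float32"}
)
print(plain)     # {'n': 'int32'}
print(counters)  # {'n': {'x': 'float32', 'y': 'float32'}}
```

Even in this toy form, the counter can only be validated after the whole dict has been seen, which is the cross-module coupling described above.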

About adding your workaround to the docs, I'm trying to figure out how it would fit in. Uproot currently has only one tutorial, the Getting Started Guide. I'm trying to understand your original problem well enough that this code block would be a helpful solution to the stated problem.

Oh! I get it: you have two Awkward Arrays with var * float32 type and they happen to align (all of their lists have the same lengths, list by list). You want the resulting TTree to have

{"n": "int32", "x": "var * float32", "y": "var * float32"}

with TBranches x and y both using TBranch n as their shared counting branch. This has come up multiple times, and my answer is to fill the TTree with exactly what you've found:

{"": ak.zip({"x": x, "y": y})}

(though I usually name the outer field, but this is fine).

The question for documentation, then, is where it should go in that Getting Started Guide (because creating a new top-level tutorial for this would make people wonder why it's singled out like this). There's a Writing TTrees to a file section, then an Extending TTrees with large datasets section, then Specifying the compression. Probably before Specifying the compression.

How about this?


Ragged arrays with a shared "counter" TBranch

Often, a computation on Awkward Arrays results in several jagged arrays that have the same list lengths, list by list. For example, suppose we have

>>> pt = ak.Array([[0.0, 11, 22], [], [33, 44], [55], [66, 77, 88, 99]])
>>> eta = ak.Array([[0.0, 1.1, 2.2], [], [3.3, 4.4], [5.5], [6.6, 7.7, 8.8, 9.9]])

The lengths of all the lists in pt and eta are

>>> ak.num(pt)
<Array [3, 0, 2, 1, 4] type='5 * int64'>
>>> ak.num(eta)
<Array [3, 0, 2, 1, 4] type='5 * int64'>

which are the same.

>>> ak.all(ak.num(pt) == ak.num(eta))
True

Since dynamic-length arrays in ROOT require a "counter" branch (see Writing TTrees to a file, above), simply putting these jagged arrays into a TTree results in a separate "counter" branch for each array:

>>> file["tree6"] = {"Muon_pt": pt, "Muon_eta": eta}
>>> file["tree6"].show()
name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
nMuon_pt             | int32_t                  | AsDtype('>i4')
Muon_pt              | double[]                 | AsJagged(AsDtype('>f8'))
nMuon_eta            | int32_t                  | AsDtype('>i4')
Muon_eta             | double[]                 | AsJagged(AsDtype('>f8'))

Only one "counter" branch is needed, so the second one wastes disk space and hides the fact that the lists in pt and eta have the same lengths.
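
As a rough mental model (a plain-Python sketch, not how Uproot stores data internally), each jagged array can be pictured as a list of per-entry counts plus flattened content. The helper name `to_counts_and_content` is made up for this illustration.

```python
def to_counts_and_content(jagged):
    """Split a list of lists into per-entry counts and flattened values."""
    counts = [len(sublist) for sublist in jagged]
    content = [value for sublist in jagged for value in sublist]
    return counts, content


pt = [[0.0, 11, 22], [], [33, 44], [55], [66, 77, 88, 99]]
eta = [[0.0, 1.1, 2.2], [], [3.3, 4.4], [5.5], [6.6, 7.7, 8.8, 9.9]]

pt_counts, pt_content = to_counts_and_content(pt)
eta_counts, eta_content = to_counts_and_content(eta)

# The two count lists are identical, so storing both is redundant.
print(pt_counts)                # [3, 0, 2, 1, 4]
print(pt_counts == eta_counts)  # True
```

When the counts agree like this, one stored copy of them can serve both content arrays.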

In Awkward Array, jagged arrays with the same length lists can be zipped together with ak.zip, and Uproot recognizes zipped arrays as ones that can have a common "counter" branch.

If you have this case, you'll most likely want to zip your jagged arrays together at some point before writing to a file. Here's an example of zipping immediately before writing to a file:

>>> file["tree7"] = {"Muon": ak.zip({"pt": pt, "eta": eta})}
>>> file["tree7"].show()
name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
nMuon                | int32_t                  | AsDtype('>i4')
Muon_pt              | double[]                 | AsJagged(AsDtype('>f8'))
Muon_eta             | double[]                 | AsJagged(AsDtype('>f8'))

If you need to declare the TTree before filling it using mktree (see Extending TTrees with large datasets, above), the type given by the zipped array is what you want:

>>> muons = ak.zip({"pt": pt, "eta": eta})
>>> file.mktree("tree8", {"Muon": muons.type})
<WritableTree '/tree8' at 0x00011eda67d0>
>>> file["tree8"].extend({"Muon": ak.zip({"pt": pt, "eta": eta})})
>>> file["tree8"].show()
name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
nMuon                | int32_t                  | AsDtype('>i4')
Muon_pt              | double[]                 | AsJagged(AsDtype('>f8'))
Muon_eta             | double[]                 | AsJagged(AsDtype('>f8'))

As a string, muons.type is

>>> print(muons.type)
5 * var * {pt: float64, eta: float64}

so

>>> file.mktree("tree8", {"Muon": "var * {pt: float64, eta: float64}"})

is equivalent to the above.


That should cover it, right? (If so, resolving this issue would be a matter of copy-pasting the above into the Uproot documentation. I hope I don't have to convert it to reST...)


2 participants