How to create an empty tree with several var-length branches that share the var length #759

HDembinski · 2022-10-18T11:53:18Z

I want to write several variable-length arrays that share the same lengths to an initially empty tree. https://uproot.readthedocs.io/en/latest/basic.html#extending-ttrees-with-large-datasets explains

1) How to generate an empty tree with mktree, which can be extended later (I need this).
2) How to generate a tree with several arrays that share the length and already contain data, with awkward.zip.

What I am missing is the combination of the two. I want to start with an empty tree like in 1) that has branches which share the varlength. Then I want to extend this tree subsequently. Is this possible? If not, may I suggest the following intuitive API:

Instead of {"x": "var * float32", "y": "var * float32"}, which generates two extra branches "nx" and "ny", please allow this declaration, where the counting branch is explicit:

{"n": "int32", "x": "n * float32", "y": "n * float32"}

As the Zen of Python says, explicit is better than implicit.


HDembinski · 2022-10-18T12:14:19Z

Related: scikit-hep/awkward#1805

HDembinski · 2022-10-18T12:36:12Z

A workaround for my situation is this:

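import awkward as ak
import uproot

# file, chunks, and generator are placeholders for your own path,
# number of chunks, and event source.
with uproot.recreate(file) as f:
    for ichunk in range(chunks):
        px = []
        py = []
        for event in generator(ichunk):
            px.append(event.px)
            py.append(event.py)
        # Zipping makes px and py share a single counter branch.
        data = {"": ak.zip({"px": ak.Array(px), "py": ak.Array(py)})}
        if "tree" in f:
            # Later chunks: append to the existing tree.
            f["tree"].extend(data)
        else:
            # First chunk: create the tree from the data.
            f["tree"] = data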

It would be great to add this to the docs.
It would be great if calling f["tree"].extend() were enough here. In other words, if the tree is still empty, f["tree"].extend(data) would do the same as f["tree"] = data, if that's technically possible.
Thanks for supporting the special case {"": ak.zip(...)} correctly.

jpivarski · 2022-10-18T17:02:49Z

Adopting

{"n": "int32", "x": "n * float32", "y": "n * float32"}

would be hard because "n * float32" doesn't parse to a valid ak.types.Type, and Uproot is just using Awkward Array here. This request cuts across the boundaries of modular code, and across the boundary between Uproot and Awkward. It could, perhaps, be implemented by adding a flag to the Awkward type-parser to allow identifiers in place of dimension sizes (as long as they're not named "var"!). The resulting tree would contain these non-ak.types.Type placeholders, which would then have to be resolved by Uproot across the whole dict. That makes this a much larger project than you might have been thinking.

About adding your workaround to the docs, I'm trying to figure out how it would fit in. Uproot currently has only one tutorial, the Getting Started Guide. I'm trying to understand your original problem well enough that this code block would be a helpful solution to the stated problem.

Oh! I get it: you have two Awkward Arrays with var * float32 type and they happen to align (all of their lists have the same lengths, list by list). You want the resulting TTree to have

{"n": "int32", "x": "var * float32", "y": "var * float32"}

with TBranches x and y both using TBranch n as their shared counting branch. This has come up multiple times, and my answer is to fill the TTree with exactly what you've found:

{"": ak.zip({"x": x, "y": y})}

(though I usually name the outer field, but this is fine).

The question for documentation, then, is where it should go in that Getting Started Guide (because creating a new top-level tutorial for this would make people wonder why it's singled out like this). There's a Writing TTrees to a file section, then an Extending TTrees with large datasets section, then Specifying the compression. Probably before Specifying the compression.

How about this?

Ragged arrays with a shared "counter" TBranch

Often, a computation on Awkward Arrays results in several jagged arrays that have the same list lengths, list by list. For example, suppose we have

>>> pt = ak.Array([[0.0, 11, 22], [], [33, 44], [55], [66, 77, 88, 99]])
>>> eta = ak.Array([[0.0, 1.1, 2.2], [], [3.3, 4.4], [5.5], [6.6, 7.7, 8.8, 9.9]])

The lengths of all the lists in pt and eta are

>>> ak.num(pt)
<Array [3, 0, 2, 1, 4] type='5 * int64'>
>>> ak.num(eta)
<Array [3, 0, 2, 1, 4] type='5 * int64'>

which are the same.

>>> ak.all(ak.num(pt) == ak.num(eta))
True

Since dynamic-length arrays in ROOT require a "counter" branch (see Writing TTrees to a file, above), simply putting these jagged arrays into a TTree results in a separate "counter" branch for each array:

>>> file["tree6"] = {"Muon_pt": pt, "Muon_eta": eta}
>>> file["tree6"].show()
name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
nMuon_pt             | int32_t                  | AsDtype('>i4')
Muon_pt              | double[]                 | AsJagged(AsDtype('>f8'))
nMuon_eta            | int32_t                  | AsDtype('>i4')
Muon_eta             | double[]                 | AsJagged(AsDtype('>f8'))

Only one "counter" branch is needed, so the second one wastes disk space and hides the fact that pt and eta have lists of the same lengths.
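
To see why one counter is enough, it may help to picture a jagged array as a list of per-entry counts plus one flat list of values. The plain-Python sketch below is only a conceptual illustration under that assumption; counts_and_content is a hypothetical helper, and this is not how Uproot serializes data.

```python
def counts_and_content(jagged):
    """Split a list of lists into per-entry counts and one flat content list."""
    counts = [len(sublist) for sublist in jagged]
    content = [value for sublist in jagged for value in sublist]
    return counts, content


pt = [[0.0, 11, 22], [], [33, 44], [55], [66, 77, 88, 99]]
eta = [[0.0, 1.1, 2.2], [], [3.3, 4.4], [5.5], [6.6, 7.7, 8.8, 9.9]]

pt_counts, pt_content = counts_and_content(pt)
eta_counts, eta_content = counts_and_content(eta)

# Both arrays yield the same counts, so one counter would describe both.
print(pt_counts)                # [3, 0, 2, 1, 4]
print(pt_counts == eta_counts)  # True
```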

In Awkward Array, jagged arrays with the same length lists can be zipped together with ak.zip, and Uproot recognizes zipped arrays as ones that can have a common "counter" branch.

If you have this case, you'll most likely want to zip your jagged arrays together at some point before writing to a file. Here's an example of zipping immediately before writing to a file:

>>> file["tree7"] = {"Muon": ak.zip({"pt": pt, "eta": eta})}
>>> file["tree7"].show()
name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
nMuon                | int32_t                  | AsDtype('>i4')
Muon_pt              | double[]                 | AsJagged(AsDtype('>f8'))
Muon_eta             | double[]                 | AsJagged(AsDtype('>f8'))

If you need to declare the TTree before filling it using mktree (see Extending TTrees with large datasets, above), the type given by the zipped array is what you want:

>>> muons = ak.zip({"pt": pt, "eta": eta})
>>> file.mktree("tree8", {"Muon": muons.type})
<WritableTree '/tree8' at 0x00011eda67d0>
>>> file["tree8"].extend({"Muon": ak.zip({"pt": pt, "eta": eta})})
>>> file["tree8"].show()
name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
nMuon                | int32_t                  | AsDtype('>i4')
Muon_pt              | double[]                 | AsJagged(AsDtype('>f8'))
Muon_eta             | double[]                 | AsJagged(AsDtype('>f8'))

As a string, muons.type is

>>> print(muons.type)
5 * var * {pt: float64, eta: float64}

so

>>> file.mktree("tree8", {"Muon": "var * {pt: float64, eta: float64}"})

is equivalent to the above.

That should cover it, right? (If so, resolving this issue would be a matter of copy-pasting the above into the Uproot documentation. I hope I don't have to convert it to reST...)

HDembinski added the docs Improvements or additions to documentation label Oct 18, 2022
