-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: removing output arrays from reference cycles so they don't have to wait for GC #1305
perf: removing output arrays from reference cycles so they don't have to wait for GC #1305
Conversation
--- a/src/uproot/behaviors/TBranch.py
+++ b/src/uproot/behaviors/TBranch.py
@@ -212,10 +212,12 @@ def iterate(
report = report.to_global(global_offset)
popper = [arrays]
del arrays
+ del item
yield popper.pop(), report
else:
popper = [library.global_index(item, global_offset)]
+ del item
yield popper.pop() I'm happy to run your memory-over-time graph with this change, if you have a script. The inner code from the slack thread was this - you did import gc
import uproot
filenames = [{"/tmp/zlib9-jagged0.root": "tree"}] * 5
for batch in uproot.iterate(filenames, step_size="1 GB"):
# gc.collect()
print(repr(batch)) |
I can't believe I missed that! If I'm reading this right, I don't have scripts to produce the memray plots from Slack—I'd lose them anyway, if I kept them—so the messages in Slack include the code. |
I made a to-do item to finish this off. I'll include your diff. |
My loop for lowest memory usage was like this: for i, batch in enumerate(uproot.iterate(filenames, step_size="1 GB")):
# gc.collect()
if i:
objgraph.show_growth()
else:
objgraph.growth()
print(repr(batch))
del batch
(so not very useful; I don't know what ReferenceType is) I am thinking there might be many of these non-functional-programming invocations down the graph (with or without generators) that is making refcounting hard for python. |
I discovered the same thing with Pympler: it was clearly calling |
Well, I added that commit (7cfc360), but it didn't fix the original memory issue. Here's memray of import gc
import uproot
filenames = [{"/tmp/zlib9-jagged0.root": "tree"}] * 5
for batch in uproot.iterate(filenames, step_size="1 GB"):
# gc.collect()
print(repr(batch)) in this PR, before the commit: Here's after the commit: And, just for reference, here it is with the Even if objgraph isn't showing it, the big arrays are still held until garbage collection. I'm setting up this PR to be merged, anyway, since it's a step forward, at least. |
Yeah; I spent some time trying to find out where the cycle(s) is, but no luck. The code goes deep, and I couldn't figure out which tbranch/basket/etc refer to one-enother. Doing more |
Some more investigations for cyclic references in uproot:
# uproot_refcycles.py
import gc
import uproot
import refcycle
filenames = [{"~/Downloads/zlib9-jagged0.root": "tree"}] * 5
def loop():
for i, batch in enumerate(uproot.iterate(filenames, step_size="1 GB", library="np")):
# let's just run 2 iterations
if i:
break
if __name__ == "__main__":
graph = refcycle.creators.cycles_created_by(loop)
# this filters the graph by the subgraphs that can't be
# properly cleaned up by GC because they contain strong cyclic refs
cyclic_refs = graph.source_components()
# plot the subgraphs that contain problematic cyclic refs
for i, cyclic in enumerate(cyclic_refs):
cyclic.export_image(f"cyclic_{i}.pdf") as an output we get 3 plots (attached). I'm not sure how to proceed from here, especially given how |
So it is TTree and TBranch. This is the half-conclusion I had come to too, but I couldn't see it. Branch referes back to its Tree?? Each of the Att objects in the list hanging off Tree refer back to it?? The closure ones were already seen by @jpivarski and maybe fixed by the |
TBranch does refer back to its TTree. Since more data might be read from an open ROOT file at any time, and those reads may be triggered by an object that came from a ROOT file, all extracted objects point back to the thing they were extracted from:
That's how it's possible to read an array by calling But then, why are any arrays referenced from that reference cycle? Arrays are not part of the There is an optional |
Can we try to break a few of these cycles with weakref to see if it makes a difference? |
@pfackeldey, you can break the object cycles (the references between |
I'm not sure how |
Applying this patch: diff --git a/src/uproot/behaviors/TBranch.py b/src/uproot/behaviors/TBranch.py
index 2d27785..64a36c3 100644
--- a/src/uproot/behaviors/TBranch.py
+++ b/src/uproot/behaviors/TBranch.py
@@ -860,26 +860,29 @@ class HasBranches(Mapping):
cache_key = f"{self.cache_key}:{expression}:{interpretation.cache_key}:{entry_start}-{entry_stop}:{library.name}"
array_cache[cache_key] = arrays[branch.cache_key]
- output = language.compute_expressions(
- self,
- arrays,
- expression_context,
- keys,
- aliases,
- self.file.file_path,
- self.object_path,
- )
-
- # no longer needed; save memory
- del arrays
- expression_context = [
- (e, c) for e, c in expression_context if c["is_primary"] and not c["is_cut"]
- ]
-
- return _ak_add_doc(
- library.group(output, expression_context, how), self, ak_add_doc
- )
+ return arrays
+ # output = language.compute_expressions(
+ # self,
+ # arrays,
+ # expression_context,
+ # keys,
+ # aliases,
+ # self.file.file_path,
+ # self.object_path,
+ # )
+
+ # # no longer needed; save memory
+ # del arrays
+
+ # expression_context = [
+ # (e, c) for e, c in expression_context if c["is_primary"] and not c["is_cut"]
+ # ]
+
+ # return _ak_add_doc(
+ # library.group(output, expression_context, how), self, ak_add_doc
+ # )
+ # return output
def iterate(
self, seems to fix the issue. import uproot
def loop():
fname, tname, n = next(iter(uproot.num_entries("~/Downloads/zlib9-jagged0.root:tree")))
tree = uproot.open(fname, array_cache=None, object_cache=None)["tree"]
how_often = 8
start = 0
stepsize = n // how_often
for _ in range(how_often):
batch = tree.arrays(entry_start=start, entry_stop=start + stepsize, library="np")
print(repr(batch))
start += stepsize
if __name__ == "__main__":
loop() I checked that _ak_add_doc(
library.group(output, expression_context, how), self, ak_add_doc
) is not the problem, so I think it the issue has to be in It's the same for |
The problem is fixed by: diff --git a/src/uproot/language/python.py b/src/uproot/language/python.py
index e3ce883..a189569 100644
--- a/src/uproot/language/python.py
+++ b/src/uproot/language/python.py
@@ -516,6 +516,7 @@ class PythonLanguage(uproot.language.Language):
else:
output[name] = output[name][cut]
+ values.clear()
return output |
No description provided.