Torch gradient_checkpoint_scope could trigger segmentation fault? #1581
I got this now a second time (CI log). It occurs in roughly 10% of the cases (very approximately). I assume the …
I just pushed something which should check for this. So let's see if this occurs again.
I can reproduce the crash locally.
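As a general debugging aid for a crash like this (not from the issue itself), the standard `faulthandler` module can at least print the Python-level stack of every thread when the segfault happens; a minimal sketch:

```python
# Enable faulthandler as early as possible (e.g. at the top of the test script),
# so a segfault dumps the Python stack of all threads before the process dies.
import faulthandler

faulthandler.enable(all_threads=True)

# Alternatively, run the reproducing script with the interpreter flag:
#   python -X faulthandler your_test_script.py   (script name is a placeholder)
```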
I was playing around with iterating through all alive objects at the end, and that also triggers the crash. Something like this:

```python
print("**** remaining objects:")
import gc

for obj in gc.get_objects():
    if type(obj) in {tuple, list, dict}:
        continue
    print(type(obj), obj)
```

Crash:
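Note that printing `obj` calls each object's `repr()`, which is exactly the kind of access that can blow up on a half-destructed object. A hedged variant of the dump that only lists torch-related objects by id and type (the filter criterion is my own choice, not from the issue):

```python
import gc

print("**** remaining torch objects:")
for obj in gc.get_objects():
    mod = getattr(type(obj), "__module__", "") or ""
    if not mod.startswith("torch"):
        continue
    # Only print id and type: avoid calling repr() on possibly broken objects.
    print("0x%x" % id(obj), type(obj))
print("**** done.")
```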
With python3-dbg, there is some more detail:
With:

```python
print("**** remaining objects:")
import gc

for obj in gc.get_objects():
    print("0x%x" % id(obj), type(obj), obj)
print("**** done.")
```

Another variant of the crash:
Note, this object you see here in …
Ok, I added this:

```python
def __init__(self):
    self.record_graph_scope = _RecordGraph()
    self.record_graph_scope.graph.gradient_checkpoint_scope_backref = self
    # Note: saved_tensors_hooks is thread local.
    self.saved_tensors_hooks_scope = torch.autograd.graph.saved_tensors_hooks(self._pack_hook, self._unpack_hook)
    print("*** pack hook: 0x%x" % id(self.saved_tensors_hooks_scope.pack_hook))
```

Then I get this at the end:

So I guess we have already freed the method but we are still trying to access it here.
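For context on why that id can go stale: `self._pack_hook` is not a stored attribute but a fresh bound-method object created on every attribute access, so the only thing keeping that method object alive is whatever reference `saved_tensors_hooks` holds onto. A small, self-contained illustration (independent of the issue's code):

```python
class A:
    def hook(self, x):
        return x


a = A()
# Each attribute access builds a new bound-method object:
print(a.hook is a.hook)  # False
m = a.hook
print("0x%x" % id(m))    # this id is only meaningful while `m` is referenced
del m                    # with the last reference gone, the bound method is freed,
                         # even though `a` itself is still alive
```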
Added some debug code:

```python
def _custom_saved_tensors_hooks_exit(
    self: torch.autograd.graph.saved_tensors_hooks, exc_type=None, exc_val=None, exc_tb=None
):
    print(f"*** _custom_saved_tensors_hooks_exit, stack {_custom_saved_tensors_hooks_tls_ctx.stack}")
    f = sys._getframe()
    while f:
        co = f.f_code
        print("-", co.co_name, co.co_filename, f.f_lineno)
        f = f.f_back
    ...
```

Then:
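As a side note, the same kind of stack dump can be produced with the standard `traceback` module instead of walking frames by hand; a sketch (the helper name is mine, not from the code above):

```python
import traceback


def dump_python_stack(label: str) -> None:
    print(f"*** {label}: called from:")
    # format_stack() walks the same frame chain as sys._getframe()/f_back;
    # drop the last entry, which is this helper itself.
    for line in traceback.format_stack()[:-1]:
        print(line.rstrip())
```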
So, maybe the problem is that we call …
I have a standalone test case:

```python
def test_saved_tensors_hooks_gc_segfault2():
    # https://github.com/rwth-i6/returnn/issues/1581
    shape = (101, 103)
    for i in range(10):
        v1 = torch.nn.Parameter(torch.randn(shape))
        v2 = torch.nn.Parameter(torch.randn(shape))

        class _Handler:
            def __init__(self, exit_in_unpack: bool = False):
                self.scope = torch.autograd.graph.saved_tensors_hooks(self._pack_hook, self._unpack_hook)
                self.exit_in_unpack = exit_in_unpack
                self.exited = False

            def _pack_hook(self, x):
                print(f"*** _pack_hook {self}")
                return self, x

            @staticmethod
            def _unpack_hook(x):
                self, x = x
                print(f"*** _unpack_hook {self}")
                if self.exit_in_unpack and not self.exited:
                    self.exited = True
                    self.scope.__exit__()
                return x

        handler1 = _Handler(exit_in_unpack=False)
        handler1.scope.__enter__()
        v1_ = v1 + torch.randn(shape)
        handler2 = _Handler(exit_in_unpack=True)
        handler2.scope.__enter__()
        v2_ = v2 + torch.randn(shape)
        x = v1_ * v2_
        x.sum().backward()
        del x
        handler1.scope.__exit__()
```

I'm trying to simplify this now further.
Slightly different version:

```python
def test_saved_tensors_hooks_gc_segfault2():
    # https://github.com/rwth-i6/returnn/issues/1581
    shape = (101, 103)
    for i in range(10):
        print("**** iter", i)
        v = torch.nn.Parameter(torch.randn(shape))

        class _Handler:
            def __init__(self):
                self.scope = torch.autograd.graph.saved_tensors_hooks(self._pack_hook, self._unpack_hook)
                self.scope.__enter__()
                self.exited = False

            def _pack_hook(self, x):
                print(f"*** _pack_hook {self}")
                return x

            def _unpack_hook(self, x):
                print(f"*** _unpack_hook {self}")
                if not self.exited:
                    self.exited = True
                    self.scope.__exit__()
                return x

        with torch.autograd.graph.saved_tensors_hooks(lambda x: x, lambda x: x):
            handler = _Handler()  # keep ref... # noqa
            x = v * torch.randn(shape)
            x.sum().backward()
```
I reported that upstream: pytorch/pytorch#130734
I pushed a workaround now. See …
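For illustration, one possible mitigation pattern (a sketch under my own assumptions, not necessarily the workaround that was actually pushed) is to never pop the `saved_tensors_hooks` scope from inside the unpack hook while `backward()` is running, but only mark it for exit and pop it afterwards from regular code:

```python
import torch


class _DeferredExitHandler:
    """Like the _Handler in the repro above, but the scope exit is deferred."""

    def __init__(self):
        self.scope = torch.autograd.graph.saved_tensors_hooks(self._pack_hook, self._unpack_hook)
        self.scope.__enter__()
        self.exit_pending = False
        self.exited = False

    def _pack_hook(self, x):
        return x

    def _unpack_hook(self, x):
        # Do not touch the hooks stack here; just remember the request.
        self.exit_pending = True
        return x

    def maybe_exit(self):
        # Call this from normal code, e.g. right after backward(), and before
        # any enclosing saved_tensors_hooks scope exits (the stack is LIFO).
        if self.exit_pending and not self.exited:
            self.exited = True
            self.scope.__exit__(None, None, None)
```

In the simplified repro, this would mean calling `handler.maybe_exit()` after `x.sum().backward()` instead of exiting inside `_unpack_hook`.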
Actually, let's keep this open until we get some response, and then wait until we can update …
Also note, the current solution is maybe not so optimal. The current potential ways that we would exit the …

So, this means, in practice, with the current … The …
I just saw this in the CI (at commit d5b954b):

So the tests ran through, but at exit we got a segmentation fault. Maybe the gradient scope was only cleaned up at that late point?
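To probe that hypothesis, one could make the late cleanup explicit: report any leftover scope objects and force a collection at process exit, while the interpreter (and torch) are still intact. A hedged sketch (the type-name match is just an illustrative heuristic, not the project's actual fix):

```python
import atexit
import gc


@atexit.register
def _report_leftover_scopes():
    # Purely diagnostic: count objects whose type name suggests a checkpoint scope.
    leftover = [obj for obj in gc.get_objects()
                if "checkpoint" in type(obj).__name__.lower()]
    print(f"*** leftover checkpoint-scope objects at exit: {len(leftover)}")
    # Collect now, before interpreter shutdown tears down torch internals.
    gc.collect()
```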