You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Hi, I recently found an issue when stress testing pybind related code in PyTorch but have been able to repro it with just pure pybind11 code (d4l3k@4988742).
The issue seems to be that when using gil_scoped_release in an init() function with multiple threads it causes memory corruption in the object registration and thus crashes when deleting the pybinded object.
terminate called after throwing an instance of 'std::runtime_error'
what(): pybind11_object_dealloc(): Tried to deallocate unregistered instance!
I suspect what's happening is that the gil_scoped_release is causing the object registration to not be correctly protected resulting in memory corruption. When we then delete it, the entry has been overridden and thus can't be found.
C++ stack trace (originally discovered in PyTorch)
#0 0x00007ffff7c8bacc in __pthread_kill_implementation () from /lib64/libc.so.6
#1 0x00007ffff7c3e686 in raise () from /lib64/libc.so.6
#2 0x00007ffff7c28833 in abort () from /lib64/libc.so.6
#3 0x00007ffff4cd4f00 in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95
#4 0x00007ffff4cd343c in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#5 0x00007ffff4cd348e in std::terminate () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:58
#6 0x00007ffff4cd3680 in __cxxabiv1::__cxa_throw (obj=0x7fdd2805b700, tinfo=0x7ffff4df0560 <typeinfo for std::runtime_error>, dest=0x7ffff4ce0272 <std::runtime_error::~runtime_error()>)
at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:98
#7 0x00007fffeec78c82 in pybind11::pybind11_fail(char const*) () from /home/tristanr/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/torch/lib/libtorch_python.so
#8 0x00007fffeeec93f7 in pybind11::detail::clear_instance(_object*) () from /home/tristanr/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/torch/lib/libtorch_python.so
#9 0x00007fffeeec9c71 in pybind11_object_dealloc () from /home/tristanr/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/torch/lib/libtorch_python.so
#10 0x00000000004f10a7 in _Py_DECREF (op=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>) at /usr/local/src/conda/python-3.10.13/Include/object.h:500
python stack trace (gdb py-bt)
File "/home/tristanr/pybind11/tests/test_init_race.py", line 17, in run
del store
File "/home/tristanr/.conda/envs/pybind-3.10/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/tristanr/.conda/envs/pybind-3.10/lib/python3.10/concurrent/futures/thread.py", line 83, in _worker
work_item.run()
File "/home/tristanr/.conda/envs/pybind-3.10/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/tristanr/.conda/envs/pybind-3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/home/tristanr/.conda/envs/pybind-3.10/lib/python3.10/threading.py", line 973, in _bootstrap
self._bootstrap_inner()
`py::call_guard<py::gil_scoped_release>` is not safe when using multiple threads. This instead moves it into the init function which is safe.
For more details see #143593pybind/pybind11#5473
Test plan:
```
python setup.py develop
```
CI
```py
import time
from concurrent.futures import ThreadPoolExecutor
from torch import distributed as dist
def run():
store = dist.TCPStore(
host_name="localhost",
port=0,
is_master=True,
wait_for_workers=False,
)
# this sleep is required to trigger the crash
time.sleep(0.1)
del store
futures = []
with ThreadPoolExecutor(
max_workers=100,
) as executor:
for i in range(100000):
print(i)
futures.append(executor.submit(run))
if len(futures) > 100:
futures.pop(0).result()
```
Pull Request resolved: #143598
Approved by: https://github.com/c-p-i-o
Required prerequisites
What version (or hash if on master) of pybind11 are you using?
2.13.6, 2.12.1, 741d86f
Problem description
Hi, I recently found an issue when stress testing pybind related code in PyTorch but have been able to repro it with just pure pybind11 code (d4l3k@4988742).
The issue seems to be that when using
gil_scoped_release
in an init() function with multiple threads it causes memory corruption in the object registration and thus crashes when deleting the pybinded object.I suspect what's happening is that the gil_scoped_release is causing the object registration to not be correctly protected resulting in memory corruption. When we then delete it, the entry has been overridden and thus can't be found.
C++ stack trace (originally discovered in PyTorch)
python stack trace (gdb
py-bt
)Reproducible example code
full repro: d4l3k@4988742
invoked with:
Python
Using Python 3.10 installed via conda.
Is this a regression? Put the last known working version here if it is.
Not a regression
The text was updated successfully, but these errors were encountered: