[Python] Segmentation fault occurs on libarrow load when using the pyarrow 17.0.0 arm64 wheel #44342
Comments
Could you also share Is there any other Python extension module that also uses jemalloc? |
The output is quite large, so I've attached it in a file. None of the extensions that I built use jemalloc, but it's possible that something else being loaded into the environment does (e.g. numpy or scipy). |
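One rough way to check which shared objects are mapped into a process is to read /proc/self/maps after the imports have run. The sketch below is Linux only and the helper names are hypothetical. It only lists libraries by file name, so it can reveal a separately loaded libjemalloc but cannot see a jemalloc that is statically linked into another library such as libarrow.so.

```python
import os


def shared_objects_from_maps(maps_text):
    """Return the sorted set of shared-object paths named in /proc/<pid>/maps text."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            name = os.path.basename(path)
            if name.endswith(".so") or ".so." in name:
                paths.add(path)
    return sorted(paths)


def loaded_shared_objects():
    """Read the mappings of the current process (Linux only)."""
    with open("/proc/self/maps") as f:
        return shared_objects_from_maps(f.read())
```

Calling `loaded_shared_objects()` right after the imports under suspicion and filtering for `"jemalloc"` in the base name gives a quick first answer. An empty result does not rule out a vendored copy.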
Thanks but sorry. |
Hi @vyasr , I would recommend you switch to mimalloc instead of jemalloc, see https://arrow.apache.org/docs/cpp/memory.html#default-memory-pool Note that mimalloc becomes the default in 18.0.0 as well (see #43254). On our side, perhaps we should simply disable jemalloc on Linux aarch64 wheels? @raulcd |
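A minimal sketch of the switch suggested above, assuming the `ARROW_DEFAULT_MEMORY_POOL` environment variable described in the linked memory docs and the `backend_name` attribute of pyarrow's memory pool. The helper names are hypothetical. The variable has to be set before pyarrow is first imported, so the check runs in a fresh interpreter. This only changes which pool Arrow allocates from; it does not remove jemalloc from the binary.

```python
import os
import subprocess
import sys

# Backend names accepted by ARROW_DEFAULT_MEMORY_POOL according to the Arrow memory docs.
VALID_BACKENDS = ("jemalloc", "mimalloc", "system")


def env_with_pool(backend, base_env=None):
    """Return a copy of the environment that requests the given Arrow allocator."""
    if backend not in VALID_BACKENDS:
        raise ValueError(f"unknown backend {backend!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["ARROW_DEFAULT_MEMORY_POOL"] = backend
    return env


def report_backend(backend):
    """Import pyarrow in a fresh interpreter and report the pool it actually chose.

    Requires pyarrow to be installed in the current environment.
    """
    code = "import pyarrow as pa; print(pa.default_memory_pool().backend_name)"
    proc = subprocess.run(
        [sys.executable, "-c", code],
        env=env_with_pool(backend),
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout.strip()
```

`report_backend("mimalloc")` returns the child's exit code together with the backend name it printed, which makes it easy to confirm that the setting took effect.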
No problem @kou, I know these kinds of issues can be a huge pain to track down, especially from this limited information. If it helps, you can see the error in this GHA run on this PR.
@pitrou thanks for finding that! That makes sense since it certainly seems like the underlying issue comes from jemalloc and is not arrow-specific.
Good idea, at least for testing. I'm testing that now in this GH workflow. The arm wheel-tests-cudf job is the one to look out for, let's see if using mimalloc bypasses the issue. That being said:
This seems like the right long-term solution if your suggestion to try mimalloc works for me above. pyarrow is a common enough dependency that a user could end up having pyarrow loaded in their environment without even realizing it, and if the import alone is sufficient to trigger the seg fault it would be quite challenging for the average user to debug. Making mimalloc the default seems sufficient to me since IMHO it's reasonable to expect a user explicitly setting the allocator to recognize this as a potential cause, but I wouldn't be opposed to disabling jemalloc altogether on arm either. |
Hmm, @pitrou I still see segfaults in the job that I linked above. Am I configuring the allocator in the correct way in rapidsai/cudf@635b5e0? If so, that suggests that there is an issue with jemalloc that occurs by simply loading the relevant parts of the binary even if no allocation subroutine is invoked, in which case building aarch64 wheels without jemalloc is definitely the way to go because this is beyond the realm of user configuration. |
jemalloc may have a problem on ARM. See also: jemalloc/jemalloc#467
It seems that the jemalloc/jemalloc#467 problem was solved by #10940 . |
Could you try a nightly wheel, which uses mimalloc by default? |
Ah, that might be the case indeed, if the crash occurs right when importing PyArrow :( |
@vyasr is there any way to validate the issue has gone away with the nightly wheels? |
I am happy to test out a nightly wheel, but unfortunately I'm not confident that it will tell us anything conclusive. As I mentioned above, in my use case I had a lot of difficulty constructing a true MWE because even small changes like defining a new variable, moving around my imports, or moving imports from one file into another but preserving the order (which still has some effect due to the logic for loading the importing module itself) were sufficient to change whether the error appeared or not, which suggests that some sort of process memory corruption is occurring when the DSO is loaded. As a result, since I assume the nightly wheels will have accumulated many changes since the 17.0.0 release, even if I don't observe the same error it may just be that the error is now simply being hidden by other changes. I can try a few different iterations with different modifications to my scripts to see what happens, though. |
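Since the failure depends on import order, one way to search for a smaller reproducer is to try each ordering in a fresh interpreter and record which ones die from a signal. This is only a sketch with hypothetical helper names, not the script used in this thread. The number of orderings grows factorially, so it is practical only for a handful of modules.

```python
import itertools
import subprocess
import sys


def try_import_orders(modules, timeout=120):
    """Import `modules` in every possible order, each in a fresh interpreter.

    Returns {order_tuple: returncode}. On POSIX a negative return code -N
    means the child was killed by signal N (SIGSEGV is 11 on Linux).
    """
    results = {}
    for order in itertools.permutations(modules):
        code = "\n".join(f"import {name}" for name in order)
        proc = subprocess.run(
            [sys.executable, "-c", code],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        results[order] = proc.returncode
    return results


def crashing_orders(results):
    """Return the orderings whose interpreter was killed by a signal."""
    return [order for order, rc in results.items() if rc < 0]
```

Running the same orderings against two installed versions and comparing the `crashing_orders` output would show whether a change merely shifted the failing orderings or removed them.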
Er, are you telling us that it's not simply |
If you're asking whether
None of the modules that I directly control do any sort of relevant stateful initialization on import, but I cannot guarantee that the same is true for the other modules, so it is entirely possible that something in the stack (e.g. scipy) is doing some sort of initialization of a memory pool that introduces conflicting jemalloc symbols, or some other similar problem (it wouldn't actually be a symbol collision since IIUC libarrow does not make any of its jemalloc symbols publicly visible, but that's illustrative of the class of problems I mean). So roughly speaking, I have
and changing the sequence of |
So, perhaps there's nothing particular that we should do in PyArrow? |
(at least if you could |
Well OK, to my (pleasant) surprise upgrading to the latest nightly did not make the error vanish (well I suppose not pleasant that I have a seg fault, but at least pleasant that there's something reproducible happening):
The backtrace is the same, still in
I think compiling out jemalloc or recompiling using the appropriate page size for arm could still make sense. While I haven't been able to reduce my example much further yet, the fact that pyarrow < 17.0.0 works while 17.0.0 and the 18 alphas both fail indicates that something meaningful has changed in the pyarrow binary, and anyone could hit it.
I would be happy to try that, but I would also need to be able to build pyarrow wheels that are equivalent to the build process you have. As I mentioned above
Since the latest pyarrow nightlies fail for me, that suggests that I was indeed not compiling exactly equivalent C++ to what you produce (or perhaps I was but there's also something in the Python build that's relevant since I simply LD_PRELOADed libarrow.so). The nightly index linked above unfortunately doesn't go back far enough for me to install nightlies in between 16.1 and 17 to see where the issue might have arisen. |
jemalloc may have a problem on ARM. See also: apache#44342
Could you try https://github.com/ursacomputing/crossbow/actions/runs/11285538259#artifacts (download the "wheel" artifact) that disables jemalloc? |
Also, can you tell us which hardware exactly you're using, and what the default page size is? And it would be nice if you could try to disassemble at the point of the crash. |
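A small sketch for collecting the basic platform details asked about here, using only the standard library. The function name is hypothetical, and `mmap.PAGESIZE` reports the page size seen by the running Python process. The exact CPU model would still have to come from something like `lscpu` or /proc/cpuinfo.

```python
import mmap
import platform


def platform_report():
    """Collect architecture, kernel, Python version, and page size."""
    return {
        "machine": platform.machine(),      # e.g. "aarch64" or "x86_64"
        "system": platform.system(),
        "release": platform.release(),      # kernel release string
        "python": platform.python_version(),
        "page_size": mmap.PAGESIZE,         # bytes
    }


if __name__ == "__main__":
    for key, value in platform_report().items():
        print(f"{key}: {value}")
```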
I am re-running some CI jobs but this is currently the only blocker to create the initial Release Candidate for 18.0.0. Should we merge disabling jemalloc by default on ARM and create the first RC? Should I create the first RC and potentially add this as a patch release if it solves the issue? @kou @pitrou ? |
Let's merge GH-44380 and create the first RC! |
Note that, for now, this is the only report about segfaults on Linux aarch64, so we're not sure if it's really a problem in general or specific to that use case. Ideally I would like answers to the questions in #44342 (comment) specifically :-) It's probably ok to disable jemalloc at least for 18.0.0, though. |
### Rationale for this change

jemalloc may have a problem on ARM. See also: #44342

### What changes are included in this PR?

* Disable jemalloc by default on ARM.
* Disable jemalloc for manylinux wheel for ARM.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: #44342

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
Issue resolved by pull request 44380 |
I've merged disabling jemalloc by default on ARM to move 18.0.0 forward. We can re-open this issue once we get feedback or we can open a follow up one if necessary |
Installing the version with jemalloc disabled does seem to fix the problem. I installed the artifact from https://github.com/ursacomputing/crossbow/actions/runs/11286890348 (slightly different than the link posted above because I'm on Python 3.12) and tested it out, then downgraded again to be sure:
So that certainly seems promising. Once pyarrow 18 is released our CI will pick it up automatically, so we'll see if the problem recurs in any way in future builds. |
Here is some information, let me know if you would like anything else. The page size is 4kb, so jemalloc/jemalloc#467 doesn't immediately seem to be implicated to me, but I haven't done much more than skim that issue.
Here is the gdb disassembly output. I don't know enough about jemalloc to debug this without spending a bit more time to familiarize myself unfortunately, but perhaps it will be meaningful to you. |
Hmm, I've tried to understand the disassembly output (not an expert, sorry). I think the crash is happening in this function: Perhaps you could look for similar issues in the jemalloc issue tracker, or open an issue there? Feel free to notify me. |
I took a look through the issue tracker but didn't see any that really seemed quite right. I'll take another look tomorrow and if I can't find anything I will open a new issue, link here, and tag you. |
I opened jemalloc/jemalloc#2739 for further discussion on the jemalloc side. |
Just stumbled over this. In conda-forge we build with
and that seems to work fine? I remember @xhochy mentioning that we cannot unvendor jemalloc in conda-forge due to "special options" being required. After reading some of the references here, I'm now assuming this is due to jemalloc/jemalloc#467. |
We already pass |
I think I'm actually hitting the same problem on x86 CPUs, but only some. I can reproduce on I'm fairly confident it's the same issue because
Building the wheel with |
I don't really know if it's useful, but bisect pointed to #37822. I don't know whether this is any kind of root cause, though. My reproducers start failing here, but it could be that a prior change is the root cause and this one just shuffled the dependencies so that the failures align with my reproducers. |
@Tom-Newton Can you disassemble using gdb at the crash point? |
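One possible way to capture this non-interactively is to run the reproducer under gdb in batch mode and dump the backtrace, registers, and the instructions at the faulting program counter. The sketch below only builds the command line. The script name `repro.py` and the helper name are hypothetical, and the chosen gdb commands are a suggestion rather than what was used in this thread.

```python
import shlex
import sys


def gdb_crash_command(script, python=sys.executable, n_instructions=16):
    """Build a gdb batch invocation that runs `script` and inspects the crash site."""
    return [
        "gdb", "-batch",
        "-ex", "run",                        # run until the signal stops the process
        "-ex", "bt",                         # backtrace of the faulting thread
        "-ex", "info registers",
        "-ex", f"x/{n_instructions}i $pc",   # disassemble from the program counter
        "--args", python, script,
    ]


if __name__ == "__main__":
    # Print a copy-pasteable shell command instead of running gdb directly.
    print(shlex.join(gdb_crash_command("repro.py")))
```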
Also, it would seem weird for #37822 to have caused this, because it should only affect statically-linked builds and emscripten builds... |
Describe the bug, including details regarding any error messages, version, and platform.
Under some very specific set of circumstances, importing pyarrow 17.0.0 from an arm wheel triggers a segmentation fault. The error comes from the jemalloc function background_thread_entry that is statically linked into libarrow.so. I can see libarrow.so being opened via strace, and when I run under gdb I see the following backtrace:

This error is quite difficult to reproduce. In addition to appearing only with the pyarrow 17.0.0 release (the issue vanishes if I downgrade to an earlier version) and only when testing on arm architectures, it is also highly sensitive to the exact order of prior operations. In my application I load multiple Python extension modules before importing pyarrow, and the order of those imports affects whether or not this issue manifests. The cases where the issue arises do manifest reliably, so it is not a flaky error, but simply adding an unrelated extra import or reordering unrelated imports is often sufficient to make the problem vanish.

I attempted to rebuild libarrow.so using the same flags used to build the wheel (I can't be sure that I got them all right, though; I based my compilation on the flags in https://github.com/apache/arrow/blob/main/ci/scripts/python_wheel_manylinux_build.sh) and then preload the library, but that too caused the segmentation fault to disappear, so it's also unlikely that I can get debug symbols into the build in any useful way. I am attempting to reduce this to an MWE in rapidsai/cudf#17022, but I am not very hopeful that it can be reduced much further.
Component(s)
Python