This issue only applies to the branch for #2479. I'm recording it here because it doesn't affect how our CI currently works so it isn't necessarily a blocker to merging.
To reproduce, check out a branch containing the changes for #2479, enable and build the L0 and native cpu adapters, and run the test-enqueue CTS suite. The problem is intermittent, but it shouldn't take many attempts to see either a segfault in urQueueCreate or an error returned from it.
The pathology of this behaviour is that the context handle (importantly it seems to be the loader handle, not the adapter handle) passed to urQueueCreate is corrupted somehow, resulting in the wrong adapter's implementation of urQueueCreate getting called. Most commonly this happens during a native cpu test, where the level zero implementation is called and returns UR_RESULT_ERROR_INVALID_DEVICE when it doesn't recognize the (native cpu) device. The problem doesn't respond well to debuggers but I've instrumented various bits of loader and adapter code and been able to observe the address for the urQueueCreate entry point changing from test to test when this occurs.
This same issue is behind various other spooky behaviours in a few test suites. You can see problems running the test-queue suite, and sometimes rather than what's described above in test-enqueue you'll get wrong results or a hang.
Removing this line from the l0 urContextRelease implementation (effectively leaking all the contexts) makes the problem go away.
No problems are observed when only the native cpu + opencl adapters are enabled. Strangely, the opencl adapter seems completely unaffected.
The issue isn't anything to do with a bad urEnqueue operation (initially I thought it might be related to a bad buffer operation or something). It can be reproduced in the test-queue suite running tests that only call the following entry points:
Valgrind, UB sanitizer and address sanitizer have all come up empty-handed, although this must be some kind of memory corruption. As mentioned, it mostly doesn't reproduce while running in a debugger, so it isn't too surprising that these tools perturb things enough to hide whatever's going on.
My current best guess is that something in the l0 adapter is retaining a reference to a data member from a context after it gets destroyed, and that's getting used somewhere such that bad memory accesses occur, although I haven't actually produced any evidence of this.
I've investigated this further, and it comes down to a desync in the reference counts from the driver and the loader. The loader still thinks a pointer is valid whilst the adapter frees the memory.
Basically:
1. L0/Hip allocates a new context, and both loader and adapter ref counts are 1.
2. The loader incorrectly increments its ref count without notifying the adapter.
3. The context is then released, meaning that L0/Hip frees the memory, but the loader thinks it is still valid.
4. In a future test, Native CPU happens to get allocated the same memory for its handle.
5. The loader tries to register it, sees that it is already allocated and returns a cached pointer... which is for L0/Hip.
6. L0/Hip gets sent Native CPU handles and probably isn't initialized properly any more, which causes it to break.
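To make the sequence above easier to follow, here is a minimal, self-contained simulation of it. This is only a sketch: `FakeHeap`, `LoaderCache`, `LoaderHandle` and `simulateDesync` are hypothetical names invented for illustration, not the real loader or adapter code, and the "addresses" are plain integers so that nothing is actually freed or dereferenced. It assumes the loader caches its wrappers keyed by the adapter handle's address, which is what the behaviour described above suggests.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the heap: hands out fake addresses and reuses
// the most recently freed one, as a real allocator may well do.
struct FakeHeap {
    std::uintptr_t next = 0x1000;
    std::vector<std::uintptr_t> freeList;

    std::uintptr_t alloc() {
        if (!freeList.empty()) {
            std::uintptr_t addr = freeList.back();
            freeList.pop_back();
            return addr;
        }
        std::uintptr_t addr = next;
        next += 0x10;
        return addr;
    }
    void release(std::uintptr_t addr) { freeList.push_back(addr); }
};

// Hypothetical loader-side wrapper: remembers which adapter owns the handle.
struct LoaderHandle {
    std::uintptr_t adapterHandle;
    std::string adapter;
    int refCount;
};

// Hypothetical loader cache keyed by the adapter handle's address.
struct LoaderCache {
    std::unordered_map<std::uintptr_t, LoaderHandle> entries;

    // Returns the cached wrapper if the address is already known.
    LoaderHandle &registerHandle(std::uintptr_t h, const std::string &adapter) {
        auto it = entries.find(h);
        if (it != entries.end()) {
            ++it->second.refCount;
            return it->second;
        }
        return entries.emplace(h, LoaderHandle{h, adapter, 1}).first->second;
    }
    void release(std::uintptr_t h) {
        auto it = entries.find(h);
        if (it != entries.end() && --it->second.refCount == 0)
            entries.erase(it);
    }
};

// Walks through the steps and returns the adapter the loader would
// dispatch to for the Native CPU context.
std::string simulateDesync() {
    FakeHeap heap;
    LoaderCache loader;

    // Step 1: L0 allocates a context; both ref counts are 1.
    std::uintptr_t l0Ctx = heap.alloc();
    int adapterRefs = 1;
    loader.registerHandle(l0Ctx, "level_zero");

    // Step 2: the loader bumps its own count without telling the adapter.
    loader.registerHandle(l0Ctx, "level_zero"); // loader: 2, adapter: 1

    // Step 3: one release. The adapter frees; the loader keeps its entry.
    loader.release(l0Ctx); // loader: 1
    if (--adapterRefs == 0)
        heap.release(l0Ctx);

    // Step 4: Native CPU is handed the same address for its context.
    std::uintptr_t cpuCtx = heap.alloc();
    assert(cpuCtx == l0Ctx);

    // Steps 5 and 6: the stale cache entry wins, so calls go to L0.
    return loader.registerHandle(cpuCtx, "native_cpu").adapter;
}
```

In this model `simulateDesync()` returns `"level_zero"` even though the handle was registered by Native CPU, which matches the symptom of the wrong adapter's urQueueCreate being called.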
Annoyingly, there's no simple solution here, because it's unclear which allocations need to be freed and which don't. In addition, I'd be surprised if code in either the CTS or the wild is diligent about freeing the memory it allocates.
I spent a bit of time hacking together a solution, but it causes a lot of test failures: #2598