Loader corruption when running CTS with native cpu + level zero enabled #2511

Open
aarongreig opened this issue Dec 31, 2024 · 2 comments
Labels
loader Loader related feature/bug

Comments

@aarongreig
Contributor

This issue only applies to the branch for #2479. I'm recording it here because it doesn't affect how our CI currently works so it isn't necessarily a blocker to merging.

To reproduce, check out a branch containing the changes for #2479, enable and build the L0 and native cpu adapters, and run the test-enqueue CTS suite. The problem is intermittent, but it shouldn't take many attempts to see either a segfault in urQueueCreate or an error returned from it.

The pathology of this behaviour is that the context handle passed to urQueueCreate is corrupted somehow (importantly, it seems to be the loader handle, not the adapter handle), resulting in the wrong adapter's implementation of urQueueCreate getting called. Most commonly this happens during a native cpu test, where the level zero implementation is called and returns UR_RESULT_ERROR_INVALID_DEVICE when it doesn't recognize the (native cpu) device. The problem doesn't respond well to debuggers, but I've instrumented various bits of loader and adapter code and been able to observe the address for the urQueueCreate entry point changing from test to test when this occurs.
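To make the failure mode concrete, here is a minimal self-contained sketch of how a corrupted loader handle could route a call to the wrong adapter. Every name in it (`Result`, `Device`, `DispatchTable`, `LoaderContext`, `loaderQueueCreate`) is a hypothetical stand-in rather than a real Unified Runtime type, and it assumes the loader handle carries a pointer to its adapter's table of entry points.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins; none of these are the real Unified Runtime types.
enum class Result { Success, ErrorInvalidDevice };

struct Device {
    std::string adapter; // name of the adapter that owns this device
};

// Per-adapter table of entry points (only queue creation is modelled).
struct DispatchTable {
    Result (*queueCreate)(const Device &device);
};

// Each adapter rejects devices it does not own.
static Result levelZeroQueueCreate(const Device &device) {
    return device.adapter == "level_zero" ? Result::Success
                                          : Result::ErrorInvalidDevice;
}

static Result nativeCpuQueueCreate(const Device &device) {
    return device.adapter == "native_cpu" ? Result::Success
                                          : Result::ErrorInvalidDevice;
}

static const DispatchTable kLevelZeroTable{&levelZeroQueueCreate};
static const DispatchTable kNativeCpuTable{&nativeCpuQueueCreate};

// Assumed shape of a loader-side context handle: it remembers which
// adapter's entry points to call.
struct LoaderContext {
    const DispatchTable *dispatch;
};

// The loader forwards the call through whatever table the handle points at.
static Result loaderQueueCreate(const LoaderContext &context,
                                const Device &device) {
    return context.dispatch->queueCreate(device);
}
```

With a native cpu device, a `LoaderContext` pointing at `kNativeCpuTable` succeeds, while one pointing at `kLevelZeroTable` returns `Result::ErrorInvalidDevice`, which mirrors the error seen in the test logs.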

This same issue is behind various other spooky behaviours in a few test suites. You can see problems running the test-queue suite, and sometimes rather than what's described above in test-enqueue you'll get wrong results or a hang.

Removing this line from the l0 urContextRelease implementation (effectively leaking all the contexts) makes the problem go away.

No problems are observed when only the native cpu + opencl adapters are enabled. Strangely, the opencl adapter seems completely unaffected.

The issue isn't anything to do with a bad urEnqueue operation (initially I thought it might be related to a bad buffer operation or something). It can be reproduced in the test-queue suite running tests that only call the following entry points:

   ---> urAdapterGet
   ---> urAdapterGetInfo
   ---> urContextCreate
   ---> urContextRelease
   ---> urDeviceGet
   ---> urDeviceGetInfo
   ---> urPlatformGet
   ---> urPlatformGetInfo
   ---> urQueueCreate
   ---> urQueueGetInfo
   ---> urQueueRelease

Valgrind, UB sanitizer and address sanitizer have all come up empty-handed, although this must be some kind of memory corruption. As mentioned, it mostly doesn't reproduce while running in a debugger, so it isn't too surprising that these tools perturb things enough to hide whatever's going on.

My current best guess is that something in the l0 adapter is retaining a reference to a data member from a context after it gets destroyed, and that's getting used somewhere such that bad memory accesses occur, although I haven't actually produced any evidence of this.

@aarongreig aarongreig added the level-zero L0 adapter specific issues label Dec 31, 2024
@RossBrunton RossBrunton self-assigned this Jan 17, 2025
@RossBrunton
Contributor

Just stumbled into this issue, and can reproduce it with Native CPU and HIP. I'll remove the level zero tag and look into this.

@RossBrunton RossBrunton added loader Loader related feature/bug and removed level-zero L0 adapter specific issues labels Jan 17, 2025
@RossBrunton
Contributor

I've investigated this further, and it comes down to a desync between the reference counts held by the adapter and by the loader. The loader still thinks a pointer is valid after the adapter has freed the memory.

Basically:

  • L0/HIP allocates a new context, and both loader and adapter ref counts are 1.
  • The loader incorrectly increments its ref count without notifying the adapter.
  • The context is then released, meaning that L0/HIP frees the memory, but the loader thinks it is still valid.
  • In a future test, Native CPU happens to get allocated the same memory for its handle.
  • The loader tries to register it, sees that the address is already registered and returns a cached pointer, which is for L0/HIP.
  • L0/HIP gets sent Native CPU handles and probably isn't initialized properly any more, which causes it to break.
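
The steps above can be replayed in a small self-contained model. This is only a sketch: `LoaderEntry`, `LoaderHandleCache` and `replayScenario` are hypothetical names, the cache keyed by handle address is an assumption about how the loader tracks handles, and a fixed fake address stands in for the allocator happening to reuse the same memory.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical model of the loader's bookkeeping; not the real loader code.
struct LoaderEntry {
    std::string adapter; // adapter the loader will dispatch to
    int refCount;        // the loader's own view of the reference count
};

class LoaderHandleCache {
public:
    // Registers an adapter handle. If the address is already known, the
    // cached entry is returned unchanged, even if it is stale.
    const LoaderEntry &registerHandle(std::uintptr_t address,
                                      const std::string &adapter) {
        auto it = entries_.find(address);
        if (it == entries_.end()) {
            it = entries_.emplace(address, LoaderEntry{adapter, 1}).first;
        }
        return it->second;
    }

    void retain(std::uintptr_t address) { ++entries_.at(address).refCount; }

    // Drops the entry only when the loader's own count reaches zero.
    void release(std::uintptr_t address) {
        auto it = entries_.find(address);
        if (it != entries_.end() && --it->second.refCount == 0) {
            entries_.erase(it);
        }
    }

private:
    std::unordered_map<std::uintptr_t, LoaderEntry> entries_;
};

// Replays the scenario and returns the adapter the loader ends up
// associating with the Native CPU handle.
std::string replayScenario(bool spuriousRetain) {
    const std::uintptr_t address = 0x1000; // fake address, reused below
    LoaderHandleCache cache;

    cache.registerHandle(address, "level_zero"); // both ref counts are 1
    if (spuriousRetain) {
        cache.retain(address); // loader count becomes 2, adapter stays at 1
    }
    cache.release(address); // adapter frees its memory; loader only decrements

    // Native CPU is handed the same address for its new handle.
    return cache.registerHandle(address, "native_cpu").adapter;
}
```

In this model `replayScenario(true)` yields `"level_zero"` because the stale entry survives the release, while `replayScenario(false)` yields `"native_cpu"` because the entry is evicted when the counts stay in sync.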

Annoyingly, there's no simple solution here, because it's unclear which allocations need to be freed and which don't. In addition, I'd be surprised if code in both the CTS and the wild is diligent about freeing the memory it allocates.

I spent a bit of time hacking together a solution, but it causes a lot of test failures: #2598
