Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase refcount to ze_loader/CUDA libraries when Level Zero/CUDA providers are used #1086

Merged
merged 5 commits into from
Feb 17, 2025

Conversation

vinser52
Copy link
Contributor

@vinser52 vinser52 commented Feb 7, 2025

Description

Prior to this PR, the Level Zero provider did not dlopen(“libze_loader.so”) at all. We suppose that UMF’s client, who uses Level Zero provider, loaded libze_loader.so into the process. But what if there are two clients in the process:

  1. The first client loads libze_loader.so into the process and uses the Level Zero provider. In that case, UMF inits Level Zero symbols via dlsym. Then the first client destroys the Level Zero Memory provider and unloads the libze_loader.so.
  2. Then the second client loads libze_loader.so into the process and uses the Level Zero provider. But our current implementation does not catch such situation and Level Zero symbols are considered initialized, but the dlsym should be done again.

This PR fixes that. The UMF acquires handle to the libze_loader.so via dlopen. There is a RTLD_NOLOAD flag that tells dlopen not to load the library and succeed only if the library is already loaded. It allows to increase the refcount to the libze_loader.so and not unload it when the first client calls dlclose.

This PR should fix the #926.

Fixes: intel/llvm#16944 (confirmed by @ldorau)

Checklist

  • Code compiles without errors locally
  • All tests pass locally
  • CI workflows execute properly
  • New tests added, especially if they will fail without my changes
  • All newly added source files have a license
  • All newly added source files are referenced in CMake files

@vinser52 vinser52 requested a review from igchor February 7, 2025 17:25
@vinser52 vinser52 force-pushed the svinogra_l0_linking branch from 809e6ab to a24a920 Compare February 7, 2025 17:33
@vinser52 vinser52 force-pushed the svinogra_l0_linking branch 2 times, most recently from 0e2989c to e75bed8 Compare February 8, 2025 00:20
@bratpiorka
Copy link
Contributor

I know this is probably not easy, but could we add the simplest test for this case?

@bratpiorka bratpiorka added this to the v0.11.x milestone Feb 10, 2025
@vinser52
Copy link
Contributor Author

I know this is probably not easy, but could we add the simplest test for this case?

yeah, I am working on it, that's why this PR is a draft.
Also, I will take a look into the CUDA provider, I believe it requires the same changes.

@vinser52 vinser52 force-pushed the svinogra_l0_linking branch from e75bed8 to 2bdefbd Compare February 10, 2025 13:38
@vinser52 vinser52 changed the title Increase refcount to ze_loader library when Level Zero provider is used Increase refcount to ze_loader/CUDA libraries when Level Zero/CUDA providers are used Feb 10, 2025
@vinser52 vinser52 force-pushed the svinogra_l0_linking branch 6 times, most recently from 63c5902 to 96624e6 Compare February 11, 2025 21:26
@vinser52 vinser52 mentioned this pull request Feb 11, 2025
3 tasks
@vinser52 vinser52 force-pushed the svinogra_l0_linking branch 13 times, most recently from 42576d2 to 8246397 Compare February 13, 2025 15:40
@vinser52 vinser52 force-pushed the svinogra_l0_linking branch 4 times, most recently from 6b0e8e8 to 6271502 Compare February 13, 2025 22:00
@ldorau
Copy link
Contributor

ldorau commented Feb 14, 2025

@vinser52 @bratpiorka @pbalcer This PR fixes intel/llvm#16944

@vinser52 vinser52 force-pushed the svinogra_l0_linking branch from 6271502 to 7a78113 Compare February 14, 2025 13:52
@vinser52 vinser52 force-pushed the svinogra_l0_linking branch from 7a78113 to 664484f Compare February 14, 2025 13:55
@vinser52 vinser52 marked this pull request as ready for review February 14, 2025 13:55
@vinser52 vinser52 requested a review from a team as a code owner February 14, 2025 13:55
@vinser52
Copy link
Contributor Author

I know this is probably not easy, but could we add the simplest test for this case?

Ok, new tests have been added. Actually, I split the umf-provider_level_zero_dlopen and umf-provider_cuda_dlopen to the umf-provider_level_zero_dlopen_global/umf-provider_level_zero_dlopen_local and umf-provider_cuda_dlopen_global/umf-provider_cuda_dlopen_local tests correspondingly.

Without my changes the umf-provider_level_zero_dlopen_local and umf-provider_cuda_dlopen_local tests are failed.

@vinser52
Copy link
Contributor Author

@vinser52 @bratpiorka @pbalcer This PR fixes intel/llvm#16944

so this PR is ready for the review.

@ldorau
Copy link
Contributor

ldorau commented Feb 14, 2025

@lukaszstolarczuk @bratpiorka @pbalcer Hanging on Windows CUDA CI builds ...

@ldorau ldorau merged commit 5a515c5 into oneapi-src:main Feb 17, 2025
77 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[SYCL][CUDA] Nsys profiling broken after memory providers change
5 participants