[SYCL][CUDA] Nsys profiling broken after memory providers change #16944

Closed
Bensuo opened this issue Feb 10, 2025 · 9 comments · Fixed by oneapi-src/unified-memory-framework#1086 or #17034
Labels
bug Something isn't working cuda CUDA back-end

Comments

@Bensuo
Contributor

Bensuo commented Feb 10, 2025

Describe the bug

Profiling a SYCL application with nsys profile has been broken since #16761: urPlatformGet() fails with an unknown error when calling CreateDeviceMemoryProviders(). This makes it impossible to profile with nsys, although the application executes normally when run directly.

Error output:

nsys profile ./usm_fill
Collecting data...
<CUDA>[ERROR]:
UR ERROR:
        Value:           UR_RESULT_ERROR_UNKNOWN
        Function:        operator()
        Source Location: <llvm build dir>/_deps/unified-runtime-src/source/adapters/cuda/platform.cpp:137

terminate called after throwing an instance of 'sycl::_V1::exception'
  what():  Native API failed. Native API returns: 2147483646 (UR_RESULT_ERROR_UNKNOWN)
Generating '/tmp/nsys-report-bb87.qdstrm'
[1/1] [========================100%] report1.nsys-rep
Generated:
   ./report1.nsys-rep

To reproduce

Example commands to reproduce from LLVM build with an E2E test:

cd <llvm build dir>
./bin/clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda ../sycl/test-e2e/Graph/Explicit/usm_fill.cpp -o usm_fill

export LD_LIBRARY_PATH=<llvm build dir>/lib/:$LD_LIBRARY_PATH
export ONEAPI_DEVICE_SELECTOR=cuda:gpu

# Executes fine
./usm_fill

# Fails
nsys profile ./usm_fill

Environment

  • OS: Linux
  • Target device and vendor: Nvidia GPU
  • DPC++ version: 2be11a1
  • Dependencies version: Tested with CUDA 12.3 and 12.5, and with nsys versions 2024.6.2.225-246235244400v0 and 2025.1.1.103-251135427971v0

Additional context

Reproduced on two different systems, so local configuration doesn't seem to be the issue. nsys profile works fine with commits before the linked PR.

@Bensuo Bensuo added bug Something isn't working cuda CUDA back-end labels Feb 10, 2025
@npmiller
Contributor

Tagging @ldorau for awareness. I'm not sure what's going on, but it looks like the UMF provider change caused it somehow.

@ldorau
Contributor

ldorau commented Feb 10, 2025

> tagging @ldorau for awareness, not sure what's going on but it looks like the UMF provider change caused it somehow

Thanks! Tagging @pbalcer and @bratpiorka for awareness.

@ldorau
Contributor

ldorau commented Feb 11, 2025

@Bensuo Could you reproduce this issue with DPC++ version ldorau@e78f196 from the https://github.com/ldorau/llvm/tree/DEBUG_UR_and_UMF branch with the following environment variable set, and post the output logs here, please?
export UMF_LOG="level:debug;flush:debug;output:stderr;pid:yes"
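
One way to capture these logs is to redirect the stderr of the profiled run into a file. The following is a minimal sketch, assuming that nsys forwards the application's stderr; the helper name capture_umf_log and the log file name are hypothetical, and only the UMF_LOG value comes from the request above.

```shell
# Run a command with UMF debug logging enabled and save its stderr to a file.
# $1: log file to write (hypothetical name); remaining arguments: command to run.
capture_umf_log() {
    umf_log_file=$1
    shift
    # The UMF_LOG value is the one requested in this thread.
    UMF_LOG="level:debug;flush:debug;output:stderr;pid:yes" "$@" 2> "$umf_log_file"
}

# Example invocation (not executed here):
#   capture_umf_log umf_debug.log nsys profile ./usm_fill
```

Setting the variable only for the wrapped command keeps it out of the surrounding shell session, so later runs are not affected.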

@Bensuo
Contributor Author

Bensuo commented Feb 11, 2025

@ldorau Thanks for looking into it. Here are the logs you requested:

[PID:1808310 TID:1808310 INFO  UMF] utils_log_init: Logger enabled (UMF version: 0.11.0-dev1.git20.gc58f188, level: DEBUG, flush: DEBUG, pid: yes, timestamp: no)
[PID:1808310 TID:1808310 DEBUG UMF] umf_ba_create_global: UMF base allocator created
[PID:1808310 TID:1808310 DEBUG UMF] umfMemoryTrackerCreate: tracker created, handle=0x7f79f04a9068, segment map=0x7f79f0499008
[PID:1808310 TID:1808310 DEBUG UMF] umfInit: UMF tracker created
[PID:1808310 TID:1808310 DEBUG UMF] umfInit: UMF IPC cache initialized
[PID:1808310 TID:1808310 DEBUG UMF] umfInit: UMF library initialized
[PID:1808310 TID:1808310 INFO  UMF] utils_log_init: Logger enabled (UMF version: 0.11.0-dev1.git20.gc58f188, level: DEBUG, flush: DEBUG, pid: yes, timestamp: no)
[PID:1808310 TID:1808310 ERROR UMF] utils_get_symbol_addr: required symbol not found: cuMemGetAllocationGranularity (error: /opt/nvidia/nsight-systems/2025.1.1/target-linux-x64/libToolsInjection64.so: undefined symbol: dlsym hook: 'cuMemGetAllocationGranularity')
[PID:1808310 TID:1808310 ERROR UMF] utils_get_symbol_addr: required symbol not found: cuMemAlloc_v2 (error: /opt/nvidia/nsight-systems/2025.1.1/target-linux-x64/libToolsInjection64.so: undefined symbol: dlsym hook: 'cuMemAlloc_v2')
[PID:1808310 TID:1808310 ERROR UMF] utils_get_symbol_addr: required symbol not found: cuMemAllocHost_v2 (error: /opt/nvidia/nsight-systems/2025.1.1/target-linux-x64/libToolsInjection64.so: undefined symbol: dlsym hook: 'cuMemAllocHost_v2')
[PID:1808310 TID:1808310 ERROR UMF] utils_get_symbol_addr: required symbol not found: cuMemAllocManaged (error: /opt/nvidia/nsight-systems/2025.1.1/target-linux-x64/libToolsInjection64.so: undefined symbol: dlsym hook: 'cuMemAllocManaged')
[PID:1808310 TID:1808310 ERROR UMF] utils_get_symbol_addr: required symbol not found: cuMemFree_v2 (error: /opt/nvidia/nsight-systems/2025.1.1/target-linux-x64/libToolsInjection64.so: undefined symbol: dlsym hook: 'cuMemFree_v2')
[PID:1808310 TID:1808310 ERROR UMF] utils_get_symbol_addr: required symbol not found: cuMemFreeHost (error: /opt/nvidia/nsight-systems/2025.1.1/target-linux-x64/libToolsInjection64.so: undefined symbol: dlsym hook: 'cuMemFreeHost')
[PID:1808310 TID:1808310 ERROR UMF] utils_get_symbol_addr: required symbol not found: cuGetErrorName (error: /opt/nvidia/nsight-systems/2025.1.1/target-linux-x64/libToolsInjection64.so: undefined symbol: dlsym hook: 'cuGetErrorName')
[PID:1808310 TID:1808310 ERROR UMF] utils_get_symbol_addr: required symbol not found: cuGetErrorString (error: /opt/nvidia/nsight-systems/2025.1.1/target-linux-x64/libToolsInjection64.so: undefined symbol: dlsym hook: 'cuGetErrorString')
[PID:1808310 TID:1808310 ERROR UMF] utils_get_symbol_addr: required symbol not found: cuCtxGetCurrent (error: /opt/nvidia/nsight-systems/2025.1.1/target-linux-x64/libToolsInjection64.so: undefined symbol: dlsym hook: 'cuCtxGetCurrent')
[PID:1808310 TID:1808310 ERROR UMF] utils_get_symbol_addr: required symbol not found: cuCtxSetCurrent (error: /opt/nvidia/nsight-systems/2025.1.1/target-linux-x64/libToolsInjection64.so: undefined symbol: dlsym hook: 'cuCtxSetCurrent')
[PID:1808310 TID:1808310 ERROR UMF] utils_get_symbol_addr: required symbol not found: cuIpcGetMemHandle (error: /opt/nvidia/nsight-systems/2025.1.1/target-linux-x64/libToolsInjection64.so: undefined symbol: dlsym hook: 'cuIpcGetMemHandle')
[PID:1808310 TID:1808310 ERROR UMF] utils_get_symbol_addr: required symbol not found: cuIpcOpenMemHandle_v2 (error: /opt/nvidia/nsight-systems/2025.1.1/target-linux-x64/libToolsInjection64.so: undefined symbol: dlsym hook: 'cuIpcOpenMemHandle_v2')
[PID:1808310 TID:1808310 ERROR UMF] utils_get_symbol_addr: required symbol not found: cuIpcCloseMemHandle (error: /opt/nvidia/nsight-systems/2025.1.1/target-linux-x64/libToolsInjection64.so: undefined symbol: dlsym hook: 'cuIpcCloseMemHandle')
[PID:1808310 TID:1808310 ERROR UMF] init_cu_global_state: Required CUDA symbols not found.
[PID:1808310 TID:1808310 ERROR UMF] cu_memory_provider_initialize: Loading CUDA symbols failed
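
The failing lookups can be pulled out of a captured log with a short filter. This is a sketch that assumes the log was saved to a file and that the message format matches the lines above; the function name missing_umf_symbols is hypothetical.

```shell
# Print the unique symbol names that UMF reported as "required symbol not found".
# $1: path to a file containing the UMF log output.
missing_umf_symbols() {
    sed -n 's/.*required symbol not found: \([A-Za-z0-9_]*\).*/\1/p' "$1" | sort -u
}

# Example invocation (not executed here):
#   missing_umf_symbols umf_debug.log
```

In the log above, every reported failure carries the same error text from the dlsym hook in libToolsInjection64.so, so the list of names is mainly useful for comparing runs with and without nsys.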

@ldorau
Contributor

ldorau commented Feb 12, 2025

@Bensuo If you need it, there is a workaround for this issue: run the application with the path to libcuda.so in LD_PRELOAD, for example:

$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libcuda.so nsys profile ./usm_fill
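
The library path in this workaround differs between distributions. A minimal sketch for looking it up from the dynamic linker cache instead of hard-coding it, assuming a glibc-based Linux system where ldconfig -p lists the CUDA driver library; the function name find_libcuda is hypothetical.

```shell
# Read `ldconfig -p` output on stdin and print the path of the first
# libcuda.so or libcuda.so.1 entry.
# Assumption: the first match is the library for the application's architecture.
find_libcuda() {
    awk '$1 ~ /^libcuda\.so(\.1)?$/ { print $NF; exit }'
}

# Example invocation (not executed here):
#   LD_PRELOAD="$(ldconfig -p | find_libcuda)" nsys profile ./usm_fill
```

If the lookup prints nothing, LD_PRELOAD ends up empty and the run behaves as it would without the workaround, so it is worth checking the resolved path first.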

@nscottnichols
Contributor

Just want to confirm that the above workaround worked for me. I was running into the same issue for a different app.

@ldorau
Contributor

ldorau commented Feb 14, 2025

@Bensuo The fix is already under review in UMF: oneapi-src/unified-memory-framework#1086

@ldorau
Contributor

ldorau commented Feb 17, 2025

The fix (oneapi-src/unified-memory-framework#1086) for this issue has just been merged to UMF.

ldorau added a commit to ldorau/unified-runtime that referenced this issue Feb 17, 2025
Update UMF to the latest commit:

commit 5a515c56c92be75944c8246535c408cee7711114
Author: Lukasz Dorau <[email protected]>
Date:   Mon Feb 17 10:56:05 2025 +0100
Merge pull request oneapi-src#1086 from vinser52/svinogra_l0_linking

to fix the issue in LLVM (SYCL/CUDA):

intel/llvm#16944
[SYCL][CUDA] Nsys profiling broken after memory providers change

Fixes: intel/llvm#16944

Signed-off-by: Lukasz Dorau <[email protected]>
@ldorau
Contributor

ldorau commented Feb 17, 2025

@Bensuo I have just submitted the PR: oneapi-src/unified-runtime#2708 to UR with the fix for this issue. Please review.

ldorau added a commit to ldorau/llvm that referenced this issue Feb 17, 2025
Update UMF to the latest commit:

    commit 5a515c56c92be75944c8246535c408cee7711114
    Author: Lukasz Dorau <[email protected]>
    Date:   Mon Feb 17 10:56:05 2025 +0100
    Merge pull request intel#1086 from vinser52/svinogra_l0_linking

to fix the issue in LLVM (SYCL/CUDA):

    intel#16944
    [SYCL][CUDA] Nsys profiling broken after memory providers change

Fixes: intel#16944

Signed-off-by: Lukasz Dorau <[email protected]>
ldorau added a commit to ldorau/unified-runtime that referenced this issue Feb 18, 2025
ldorau added a commit to ldorau/llvm that referenced this issue Feb 18, 2025
ldorau added a commit to ldorau/llvm that referenced this issue Feb 19, 2025