
UCT/CUDA: Runtime CUDA >= 12.3 to enable VMM #10396

Open · tvegas1 wants to merge 16 commits into master from cuda_ctx_set_flags_runtime
Conversation

@tvegas1 tvegas1 commented Dec 20, 2024

What?

Do not use cuCtxSetFlags() if the CUDA driver does not support it.

Why?

An unresolved symbol for cuCtxSetFlags() on CUDA driver < 12.1 causes a crash.

How?

Assumptions:

  • cuCtxSetFlags() is only needed for VMM, for which UCX support requires CUDA driver >= 12.3
  • cuCtxSetFlags() is not strictly needed for malloc async

Testing

Locally tested; needs final testing on a platform with an actual older driver.

UCX_IB_GPU_DIRECT_RDMA=no ./rfs/bin/ucx_perftest -t tag_bw -m cuda 

@tvegas1 tvegas1 force-pushed the cuda_ctx_set_flags_runtime branch from 1ce967f to 68a5f51 Compare December 20, 2024 10:46
yosefe commented Dec 20, 2024

We have tests for different CUDA versions, which include CUDA memory hooks (for example, Test Cuda Docker ubuntu18_cuda_12_0). Can we add a test that would have caught the new API usage?

tvegas1 commented Jan 6, 2025

@yosefe, do we need this before release?

tvegas1 commented Jan 6, 2025

> we have tests for different cuda versions, which include cuda memory hooks (for example, Test Cuda Docker ubuntu18_cuda_12_0). can we add a test that would have caught the new api usage?

I think it is difficult, because we would need to build with a later driver version and run with an older one. For instance, when I run this container on rock, we are only running the later driver version, and I don't think we can easily switch driver versions, since the driver has to match the kernel module, as per my understanding.

root@905eb7691066:/# readelf -a /usr/lib/x86_64-linux-gnu/libcuda.so | grep -w cuCtxSetFlags
   731: 00000000002516f0    30 FUNC    GLOBAL DEFAULT   13 cuCtxSetFlags

}
#else
unsigned value = 1;
(void)ctx_set_flags_func;
Contributor:

why needed?
maybe we could just remove #if HAVE_CUDA_FABRIC now, since we don't use cuCtxSetFlags directly?

Contributor Author:

fixed

Contributor Author:

restored as it is needed by CU_CTX_SYNC_MEMOPS

{
static ucs_status_t status = UCS_ERR_LAST;

#if CUDA_VERSION >= 12000
Contributor:

why needed?

Contributor Author:

The cuGetProcAddress() prototype changed at >= 12000, and we know cuCtxSetFlags() also only appeared after 12000, so there is no need to support the older cuGetProcAddress() prototype for this check.

@@ -823,6 +834,37 @@ static uct_md_ops_t md_ops = {
.detect_memory_type = uct_cuda_copy_md_detect_memory_type
};

static ucs_status_t uct_cuda_copy_md_check_is_ctx_set_flags_supported(void)
Contributor:

To simplify the code, we could have this function call the needed function pointer, and move the global var inside it.
Something like
ucs_status_t uct_cuda_copy_set_ctx_flags(unsigned flags)
and have it return UCS_ERR_UNSUPPORTED if the func pointer is not found.

Contributor Author:

I thought about it, but went for a two-step approach, as we need to:

  1. disable fabric at init time
  2. set the flag with md and address as parameters, in case we cannot use cuCtxSetFlags()

@tvegas1 tvegas1 force-pushed the cuda_ctx_set_flags_runtime branch from fd0d161 to f8f88ae Compare January 6, 2025 18:48
@tvegas1 tvegas1 force-pushed the cuda_ctx_set_flags_runtime branch from f8f88ae to 6563253 Compare January 6, 2025 18:49
@tvegas1 tvegas1 force-pushed the cuda_ctx_set_flags_runtime branch from 7acee45 to ff4313c Compare January 7, 2025 09:07
brminich
brminich previously approved these changes Jan 7, 2025
rakhmets
rakhmets previously approved these changes Jan 7, 2025
brminich
brminich previously approved these changes Jan 8, 2025

brminich commented Jan 9, 2025

@yosefe, pls review

}

ucs_diag("disabled fabric memory allocations");
md->config.enable_fabric = UCS_NO;
Contributor:

Looks like it affects only cuda_copy memory allocations, but what happens if we get fabric memory from a user buffer and then don't actually set sync memops for it?
We could return UNSUPPORTED from uct_cuda_copy_sync_memops, and if not supported, return an error from CUDA memory detection.

Contributor Author:

this should now be handled, right?

@tvegas1 tvegas1 dismissed stale reviews from brminich and rakhmets via 03094e5 January 10, 2025 10:21
@tvegas1 tvegas1 force-pushed the cuda_ctx_set_flags_runtime branch from 03094e5 to da07d62 Compare January 10, 2025 10:24
@tvegas1 tvegas1 force-pushed the cuda_ctx_set_flags_runtime branch from da07d62 to 8657d54 Compare January 10, 2025 10:33
@@ -636,7 +667,7 @@ uct_cuda_copy_md_query_attributes(uct_cuda_copy_md_t *md, const void *address,
return UCS_ERR_NO_DEVICE;
}

-uct_cuda_copy_sync_memops(md, address);
+uct_cuda_copy_sync_memops(md, address, is_vmm);
Contributor:

should we call it also from cuda allocate flow?

Contributor Author:

this could end up calling cuPointerSetAttribute() twice when the set-flags function is not available

Contributor:

where would be the 2nd time? AFAIK we don't call pointer query on allocated memory (such as rndv fragments)

Contributor Author:

ok

CUdriverProcAddressQueryResult sym_status;
CUresult cu_err;
ucs_status_t status;
uct_cuda_cuCtxSetFlags_t cuda_cuCtxSetFlags_func =
Contributor:

  1. initialized vars should be first
  2. should be static??

Contributor Author:

thanks, missed the static

@@ -553,8 +554,7 @@ static void uct_cuda_copy_sync_memops(uct_cuda_copy_md_t *md,

if (is_vmm) {
ucs_fatal("failed to set sync_memops on CUDA VMM without "
-          "cuCtxSetFlags() (address=%p)",
-          address);
+          "cuCtxSetFlags() (address=%p)", address);
Contributor:

Thinking of it again, it should be a warning, since a failure in the cuPointerSetAttribute() call is also a warning

Contributor Author:

so when is_vmm == 1, you want to call cuPointerSetAttribute() and let it fail, right?

Contributor Author:

moved to ucs_warn

Contributor:

hmm right, actually we can return from the function after ucs_warn, and not call cuPointerSetAttribute at all


@@ -379,6 +431,9 @@ uct_cuda_copy_mem_alloc(uct_md_h uct_md, size_t *length_p, void **address_p,
}

allocated:
uct_cuda_copy_sync_memops(md, (void *)alloc_handle->ptr,
Contributor:

i wonder if it will work with MANAGED memory ... maybe only on coherent platforms that allow managed memory registration with ODP?

Contributor Author:

I would expect it to work on managed memory, but shall I remove that line, since we want to backport to v1.18?

Contributor:

we can add memory type check during the backport to v1.18.x

}

if (is_vmm) {
ucs_warn("failed to set sync_memops on CUDA VMM without "
Contributor:

@tvegas1 Current changes look good to me, but @yosefe brought up an issue where the library is built against a >= 12.3 compatible driver version while the system where that library gets used has a driver version < 12.1. On such a system, VMM/mallocAsync allocations are allowed (as VMM and mallocAsync are supported on driver versions < 12.1), but there would be a need to report an error or fail even if UCX isn't compiled with HAVE_CUDA_FABRIC (driver version >= 12.3). The condition met here is when UCX is built with a >= 12.3 driver.

Contributor Author:

agree, that's the case where VMM is independently allocated but we still don't have HAVE_FABRIC set; in this case I will move is_vmm out of the #ifdef and fatal if is_vmm == 1.

Contributor:

i don't think it will help - if HAVE_FABRIC is not set, we will never know in UCX that it is VMM memory and will assume it is legacy memory. Then we can only hope that cuPointerSetAttribute() would fail.

Contributor Author (tvegas1, Jan 13, 2025):

yes, also enabled VMM detection because:

  • cuMemRelease >= 10.2
  • cuMemRetainAllocationHandle >= 11.0
  • cuMemGetAllocationPropertiesFromHandle >= 10.2

assuming we build UCX with CUDA >= 11 anyway

Contributor Author:

double-checked the actual function prototype against the online documentation of older CUDA releases
