-
Notifications
You must be signed in to change notification settings - Fork 432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UCT/CUDA: Runtime CUDA >= 12.3 to enable VMM #10396
base: master
Are you sure you want to change the base?
Conversation
1ce967f
to
68a5f51
Compare
we have tests for different cuda versions, which include cuda memory hooks (for example, Test Cuda Docker ubuntu18_cuda_12_0). can we add a test that would have caught the new api usage? |
@yosefe, do we need this before release? |
I think it is difficult because we need to build with later driver version and run it with older driver version. But for instance, when I run this container on rock, we are only running later driver version, and I don't think we can easily switch driver version since it has to match kernel module as per my understanding.
|
} | ||
#else | ||
unsigned value = 1; | ||
(void)ctx_set_flags_func; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why needed?
maybe we could just remove #if HAVE_CUDA_FABRIC
now, since we don't use cuCtxSetFlags directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
restored as it is needed by CU_CTX_SYNC_MEMOPS
{ | ||
static ucs_status_t status = UCS_ERR_LAST; | ||
|
||
#if CUDA_VERSION >= 12000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cuGetProcAddress()
prototype changed at >=12000 and we know that cuCtxSetFlags()
also appeared after 12000 so no need to use older cuGetProcAddress() prototype to check.
@@ -823,6 +834,37 @@ static uct_md_ops_t md_ops = { | |||
.detect_memory_type = uct_cuda_copy_md_detect_memory_type | |||
}; | |||
|
|||
static ucs_status_t uct_cuda_copy_md_check_is_ctx_set_flags_supported(void) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To simplify the code, we could have this function call the needed function pointer, and move the global var inside it.
Something like
ucs_status_t uct_cuda_copy_set_ctx_flags(unsigned flags)
and have it return UCS_ERR_UNSUPPORTED if the func pointer is not found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about it but went for two step approach as we need:
- disable fabric at init time
- set the flag with
md
andaddress
as parameter, in case we cannot usecuCtxSetFlags()
fd0d161
to
f8f88ae
Compare
f8f88ae
to
6563253
Compare
7acee45
to
ff4313c
Compare
@yosefe, pls review |
} | ||
|
||
ucs_diag("disabled fabric memory allocations"); | ||
md->config.enable_fabric = UCS_NO; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like it affects only cuda_copy memory allocations, but what happens if we get a fabric memory from user buffer and then we don't actually set sync memops for it?
we could return UNSUPPORTED from uct_cuda_copy_sync_memops and if not - return error from cuda memory detection
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should show now be handled right?
03094e5
to
da07d62
Compare
da07d62
to
8657d54
Compare
@@ -636,7 +667,7 @@ uct_cuda_copy_md_query_attributes(uct_cuda_copy_md_t *md, const void *address, | |||
return UCS_ERR_NO_DEVICE; | |||
} | |||
|
|||
uct_cuda_copy_sync_memops(md, address); | |||
uct_cuda_copy_sync_memops(md, address, is_vmm); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we call it also from cuda allocate flow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this could end-up calling cuPointerSetAttribute twice when set flags function is not available
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where would be the 2nd time? AFAIK we don't call pointer query on allocated memory (such as rndv fragments)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
CUdriverProcAddressQueryResult sym_status; | ||
CUresult cu_err; | ||
ucs_status_t status; | ||
uct_cuda_cuCtxSetFlags_t cuda_cuCtxSetFlags_func = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- initialized vars should be first
- should be static??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks missed the static
@@ -553,8 +554,7 @@ static void uct_cuda_copy_sync_memops(uct_cuda_copy_md_t *md, | |||
|
|||
if (is_vmm) { | |||
ucs_fatal("failed to set sync_memops on CUDA VMM without " | |||
"cuCtxSetFlags() (address=%p)", | |||
address); | |||
"cuCtxSetFlags() (address=%p)", address); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking of it again it should be a warning, since failure in cuPointerSetAttribute() call is also a warning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so when is_vmm == 1
you want to call cuPointerSetAttribute
and let it fail right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved to ucs_warn
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm right, actually we can return from the function after ucs_warn, and not call cuPointerSetAttribute at all
@@ -636,7 +667,7 @@ uct_cuda_copy_md_query_attributes(uct_cuda_copy_md_t *md, const void *address, | |||
return UCS_ERR_NO_DEVICE; | |||
} | |||
|
|||
uct_cuda_copy_sync_memops(md, address); | |||
uct_cuda_copy_sync_memops(md, address, is_vmm); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where would be the 2nd time? AFAIK we don't call pointer query on allocated memory (such as rndv fragments)
@@ -379,6 +431,9 @@ uct_cuda_copy_mem_alloc(uct_md_h uct_md, size_t *length_p, void **address_p, | |||
} | |||
|
|||
allocated: | |||
uct_cuda_copy_sync_memops(md, (void *)alloc_handle->ptr, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i wonder if it will work with MANAGED memory ... maybe only on coherent platforms that allow managed memory registration with ODP?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect to work on managed, but shall I remove that line since we want to backport in v1.18?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can add memory type check during the backport to v1.18.x
} | ||
|
||
if (is_vmm) { | ||
ucs_warn("failed to set sync_memops on CUDA VMM without " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tvegas1 Current changes look good to me but @yosefe brought up an issue where library is built with >=12.3 compatible driver version but the system where that library gets used has driver version < 12.1. On such a system, VMM/Mallocasync allocations are allowed (as VMM and MallocAsync is supported on driver versions < 12.1). But there would be a need to report an error or fail even if UCX isn't compiled with HAVE_CUDA_FABRIC (driver version >= 12.3). The condition met here is when UCX is built with >=12.3 driver.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree, that's where we have VMM independently allocated, but still we don't have HAVE_FABRIC
set, in this case I will move is_vmm
out of the #ifdef and fatal if is_vmm == 1
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't think it will help - if HAVE_FABRIC is not set we will never know in UCX it is VMM memory and assume it is legacy memory. Then we can only hope that cuPointerSetAttribute would fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes also enabled detect vmm because:
- cuMemRelease >= 10.2
- cuMemRetainAllocationHandle >= 11.0
- cuMemGetAllocationPropertiesFromHandle >= 10.2
assuming we built UCX with cuda >= 11 anyways
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let me know if you read it differently: https://rocm.docs.amd.com/projects/HIPIFY/en/latest/tables/CUDA_Driver_API_functions_supported_by_HIP.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
double checked actual function prototype with cuda older release online documentation
What?
Do not use
cuCtxSetFlags()
if CUDA driver does not support it.Why?
Unresolved symbol for
cuCtxSetFlags
on CUDA driver < 12.1 causes crash.How?
Assumptions:
cuCtxSetFlags
is only needed for VMM, which has UCX support starting from CUDA driver >= 12.3cuCtxSetFlags
is not strictly needed for malloc asyncTesting
Locally tested, needs final testing on platform with actual older drivers.