btl/smcuda: add delayed stream initialization #12354

edgargabriel · 2024-02-20T21:13:08Z

Introduce two new MCA parameters to the smcuda component:

- Allow for delayed initialization of the internal IPC stream and the array of events. This handles situations where the user code did not set the device before MPI_Init AND the internal stream and/or event structures depend on the device id used during creation.
- Add a parameter to control how many events the smcuda component creates during initialization.
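The two parameters described above can be pictured with a small mock. This is only a sketch under stated assumptions: `delayed_init`, `num_events`, `component_init()`, `first_use()`, and the `mock_*` names are hypothetical stand-ins and are not the identifiers used in the smcuda component. It shows resource creation being skipped at init time and performed on first use, so the events pick up whichever device is current at that point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for an accelerator event; it only records the
 * device that was current when the event was created. */
typedef struct {
    int device_id;
} mock_event_t;

/* Hypothetical state: the device the user selected (e.g. via hipSetDevice()). */
static int mock_current_device = 0;

/* Hypothetical analogues of the two new parameters. */
static bool delayed_init = true; /* postpone stream/event creation */
static int num_events = 4;       /* how many events to create */

static mock_event_t *event_array = NULL;
static bool ipc_initialized = false;

/* Create the event array once; later calls are no-ops. */
int ipc_resources_init(void)
{
    if (ipc_initialized) {
        return 0;
    }
    assert(num_events > 0);
    event_array = calloc((size_t) num_events, sizeof(*event_array));
    if (NULL == event_array) {
        return -1;
    }
    for (int i = 0; i < num_events; i++) {
        /* Events are bound to whatever device is current right now. */
        event_array[i].device_id = mock_current_device;
    }
    ipc_initialized = true;
    return 0;
}

/* Runs at component initialization (i.e. during MPI_Init). */
int component_init(void)
{
    return delayed_init ? 0 : ipc_resources_init();
}

/* Runs before the first IPC operation that needs the events. */
int first_use(void)
{
    return ipc_resources_init();
}
```

With `delayed_init` enabled in this mock, a device selected after `component_init()` but before `first_use()` is the one the events end up associated with.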

edgargabriel · 2024-02-29T14:25:23Z

I was wondering whether somebody would have time to review this PR. I would like to have it merged before the weekend if possible.

edgargabriel · 2024-02-29T14:32:49Z

bot:retest

wenduwan · 2024-02-29T18:45:09Z

Running AWS CI. We test on both x86 and arm.

edgargabriel · 2024-02-29T18:49:34Z

@wenduwan ok, thank you!

wenduwan · 2024-02-29T22:17:36Z

FYI our CI passed.

edgargabriel · 2024-03-06T13:37:25Z

bot:retest

edgargabriel · 2024-03-06T14:21:37Z

Hm, I am not sure whether the mpi4py failure is truly caused by something in this PR:

Error: The action 'Test mpi4py (np=4)' has timed out after 10 minutes.

opal/mca/btl/smcuda/btl_smcuda.h

opal/mca/btl/smcuda/btl_smcuda_accelerator.c

bosilca · 2024-03-06T16:23:09Z

opal/mca/btl/smcuda/btl_smcuda_accelerator.c

+
+    /* Create the events since they can be reused. */
+    for (i = 0; i < accelerator_event_max; i++) {
+        rc = opal_accelerator.create_event(MCA_ACCELERATOR_NO_DEVICE_ID, &accelerator_event_ipc_array[i], opal_accelerator_use_sync_memops ? false : true);


I'm slightly confused here about what this PR really does. You mentioned it improves the handling of the process GPU/stream, but I see here that the events are created without a device association (MCA_ACCELERATOR_NO_DEVICE_ID) as if they were expected to be generic (which is not the case, at least not with CUDA)?

Moreover, the CUDA accelerator simply ignores the device id, which suggests either that we don't need it or that the CUDA accelerator framework is careless in its handling of devices.

edgargabriel · 2024-03-06 (edited)

So it's a tricky question: neither hipEventCreateWithFlags() nor cuEventCreate() takes a device_id as an argument; the device_id argument to the API was added for ze devices. However, it seems that the device_id does play a role internally. Specifically, in some ROCm releases, if the code did something like:

    hipSetDevice(device_id)
    MPI_Init()

everything worked, since the events were created on the device that was used subsequently. If, however, the sequence was the other way around (i.e. the code set the device using hipSetDevice() after MPI_Init()), the events (and streams) were created on the default device (0), but afterwards we tried to use them for operations on a different device, and that caused an error. The delayed init fixes this.

I agree, however, that I can make this a bit cleaner by first retrieving the current device id and passing that id as an argument to the event_create and stream_create functions. Would that be ok?

bosilca · 2024-03-06

That would be awesome, and it will make the code much more readable.
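The ordering problem and the proposed cleanup discussed in this thread can be illustrated with a mock. This is a sketch, not the code that was merged: every `mock_*` function, `MOCK_NO_DEVICE_ID`, and `create_event_on_current_device()` are hypothetical, and the rule that an event only works on the device it was created for is an assumption modeled on the ROCm behavior described above.

```c
#include <assert.h>

/* Hypothetical analogue of MCA_ACCELERATOR_NO_DEVICE_ID. */
#define MOCK_NO_DEVICE_ID (-1)

typedef struct {
    int device_id; /* device the event is bound to */
} mock_event_t;

/* Mock runtime state: device 0 is the default until the user picks another. */
static int mock_active_device = 0;

void mock_set_device(int dev)
{
    mock_active_device = dev;
}

int mock_get_device(int *dev)
{
    *dev = mock_active_device;
    return 0;
}

/* Binds the event to dev_id, or to the active device when no id is given. */
int mock_create_event(int dev_id, mock_event_t *ev)
{
    ev->device_id = (MOCK_NO_DEVICE_ID == dev_id) ? mock_active_device : dev_id;
    return 0;
}

/* Assumed rule: using an event on a device it was not created for fails. */
int mock_record_event(const mock_event_t *ev)
{
    return (ev->device_id == mock_active_device) ? 0 : -1;
}

/* The cleanup proposed in the thread: query the current device first,
 * then pass it explicitly instead of a "no device" placeholder. */
int create_event_on_current_device(mock_event_t *ev)
{
    int dev_id;
    if (0 != mock_get_device(&dev_id)) {
        return -1;
    }
    return mock_create_event(dev_id, ev);
}
```

In this mock, an event created before `mock_set_device(1)` is bound to device 0 and fails when recorded afterwards, while an event created after the call is bound to device 1 and succeeds. Passing the queried id explicitly makes the device association visible at the call site.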

edgargabriel · 2024-03-06T17:17:00Z

@bosilca thank you for your review. I tried to address your comments and updated the code accordingly. Please let me know whether this looks ok now. Thanks again!

introduce two new mca parameters to the smcuda component:
- allow for delayed initialization of the internal ipc stream and the array of events. This allows to handle situations where the user code did not set the device before MPI_Init AND the internal stream and/or event structures have some dependence on the device id used during creation.
- add a parameter to control how many events are created during initialization.

Signed-off-by: Edgar Gabriel <[email protected]>

edgargabriel requested a review from bosilca February 20, 2024 21:13

github-actions bot added the Target: main label Feb 20, 2024

edgargabriel requested review from wenduwan and lrbison February 20, 2024 21:13

edgargabriel force-pushed the topic/smcuda-delayed-sae-init branch from ccba7e8 to 71b404b Compare March 6, 2024 13:46

bosilca reviewed Mar 6, 2024

edgargabriel force-pushed the topic/smcuda-delayed-sae-init branch from 71b404b to 4f83e45 Compare March 6, 2024 17:16

edgargabriel force-pushed the topic/smcuda-delayed-sae-init branch from 4f83e45 to 835eef5 Compare March 6, 2024 17:45

bosilca approved these changes Mar 6, 2024

wenduwan approved these changes Mar 6, 2024

edgargabriel merged commit a25dd5f into open-mpi:main Mar 6, 2024
7 checks passed

edgargabriel deleted the topic/smcuda-delayed-sae-init branch July 12, 2024 12:29
