
prov/shm: fix incorrect capability set #10362

Merged
merged 1 commit into from
Sep 10, 2024

Conversation

aingerson
Contributor

SMR_DOMAIN_CAPS was recently updated to include FI_PEER and FI_AV_USER_ID,
but a typo caused smr_info data corruption

Signed-off-by: Alexia Ingerson <[email protected]>
@aingerson
Contributor Author

@shijin-aws Double-checking that the AWS CI failure isn't related? This is a pretty obvious bug, but I don't want to merge if it'll cause any issues for you.

@shijin-aws
Contributor

shijin-aws commented Sep 9, 2024

The PR consistently failed the SHM provider's fabtests on a certain OS (Amazon Linux 2023):

[2024-09-06 18:50:34] test_suites/libfabric/test_efa_ut.py::test_run_efa_unit_tests SKIPPED [ 22%]

[2024-09-06 18:50:34] test_suites/libfabric/test_fabtests.py::test_fabtests_functional[single_node-no_xpmem-rma-write-non-dmabuf-release] FAILED [ 22%]

[2024-09-06 23:25:09] test_suites/libfabric/test_fabtests.py::test_fabtests_functional[single_node-no_xpmem-rma-write-non-dmabuf-debug] Sending interrupt signal to process

2024-09-07 02:33:53,742 - INFO - test_orchestrator - Stopping timer...

2024-09-07 02:33:53,742 - INFO - remote_executor - Downloading [email protected]:/home/ec2-user/PortaFiducia/tests/outputs/LibfabricPRCI-PR-10362-1-alinux2023-g4dn8xlar-lKkk9NqB.xml to /var/jenkins_home/agent/fargate-cloud-pr-------1-zjnmx-f5hh1/workspace/Libfabric_PR_CI_PR-103

But I don't have more logs; I need to reproduce it manually and get back to you.

@aingerson
Contributor Author

@shijin-aws I repushed with a little simplification (FI_LOCAL_COMM was OR'ed in twice), so hopefully this round of tests will provide more information on reproducibility.

@shijin-aws
Contributor

@aingerson I don't think the failure is related to this PR, as I'm seeing it in all AWS CI runs now. But let me get back to you before merging it, thx!

@aingerson
Contributor Author

@shijin-aws Ok, thanks! Also, heads up that the commit that introduced this made it into the 2.0 alpha and would probably result in multiple issues that could disable shm. It's this one: 50fb455

@shijin-aws
Contributor

> @shijin-aws Ok, thanks! Also heads up that the commit that introduced this made it into 2.0 alpha and would probably result in multiple issues that would disable shm. It's this one 50fb455

Can you elaborate? You mean #10314, merged earlier, can cause the shm provider to be disabled inside efa?

@aingerson
Contributor Author

@shijin-aws Yes. That commit added the FI_PEER capability to shm and added it to efa's shm hints to request it from shm. The issue is that because there was a comma instead of an OR, the new SMR_DOMAIN_CAPS value got assigned to the next field in the fi_info. As a result, caps did not include FI_PEER and FI_AV_USER_ID, while mode got set to FI_LOCAL_COMM | FI_PEER | FI_AV_USER_ID, so fi_getinfo would not return success: the modes were not supported and the capabilities were not set. The bug was only in the regular fi_info and not the hmem one though, so it probably wouldn't disable shm but rather would select the shm hmem info, which would disable CMA and injects; it would probably still work but hurt performance.

@shijin-aws
Contributor

shijin-aws commented Sep 9, 2024

I think the CI failed because the CUDA setup has a problem:

(env) [ec2-user@ip-172-31-82-7 pr10362]$ FI_LOG_LEVEL=warn fi_rma_bw -e rdm -o writedata -I 5 -v -p shm -E
libfabric:735977:1725904206::core:core:cuda_hmem_verify_devices():594<warn> Failed to perform cudaGetDeviceCount: cudaErrorInitializationError:initialization error
libfabric:735977:1725904206::core:core:ofi_hmem_init():663<warn> Failed to initialize hmem iface FI_HMEM_CUDA: Input/output error
libfabric:735977:1725904206::core:core:cuda_hmem_verify_devices():594<warn> Failed to perform cudaGetDeviceCount: cudaErrorInitializationError:initialization error
libfabric:735977:1725904206::core:core:ofi_hmem_init():663<warn> Failed to initialize hmem iface FI_HMEM_CUDA: Input/output error

so all CUDA hmem tests failed... we will get it fixed

@shijin-aws
Contributor

> The bug was only in the regular fi_info and not the hmem one though so it probably wouldn't disable shm but rather would select the shm hmem which would disable CMA and injects so it would probably still work but hurt performance.

Interesting, our dashboard doesn't show a regression in intra-node performance on the main branch around 08/27 (when that commit was merged). I don't think we are using HMEM on those platforms.

@aingerson
Contributor Author

@shijin-aws Weird, it should definitely be disabling the optimized shm info. Not sure how there was no regression... maybe it was never using that path?

@shijin-aws
Contributor

shijin-aws commented Sep 9, 2024

Just did a quick test on the main branch; efa's fi_getinfo call for shm succeeded (non-hmem):

(env) ubuntu@ip-172-31-40-58:~/PortaFiducia/build/libraries/libfabric/main/source/libfabric$ FI_LOG_LEVEL=warn fi_rdm -p efa -E
libfabric:2500754:1725905690::efa:domain:efa_shm_info_create():109<warn> shm fi_getinfo returns 0
Waiting for message from client...
Data check OK
Received data from client: Hello from Client!
1725905691 efa_destroy_qp: destroying qp

@aingerson
Contributor Author

aingerson commented Sep 9, 2024

@shijin-aws Does it return the info with or without FI_HMEM? Can you print out the info (or at least caps) that are returned/used?

@shijin-aws
Contributor

It is without FI_HMEM; efa only requests FI_HMEM from shm when the application requests FI_HMEM, which fi_rdm shouldn't do by default.

@aingerson
Contributor Author

@shijin-aws Oh, I guess that makes sense. It would get turned off by default during matching because it wasn't requested. What about the inject size?

@aingerson aingerson merged commit 036e2dd into ofiwg:main Sep 10, 2024
12 of 13 checks passed