
prov/shm: fix incorrect capability set #10362

Merged
merged 1 commit into from
Sep 10, 2024

Conversation

aingerson
Contributor

SMR_DOMAIN_CAPS was recently updated to include FI_PEER and FI_AV_USER_ID,
but a typo caused smr_info data corruption

Signed-off-by: Alexia Ingerson <[email protected]>
@aingerson
Contributor Author

@shijin-aws Double-checking that the AWS CI failure isn't related? This is a pretty obvious bug, but I don't want to merge if it'll cause any issues for you.

@shijin-aws
Contributor

shijin-aws commented Sep 9, 2024

The PR consistently failed the SHM provider's fabtests on a certain OS (Amazon Linux 2023):

[2024-09-06 18:50:34] test_suites/libfabric/test_efa_ut.py::test_run_efa_unit_tests SKIPPED [ 22%]

[2024-09-06 18:50:34] test_suites/libfabric/test_fabtests.py::test_fabtests_functional[single_node-no_xpmem-rma-write-non-dmabuf-release] FAILED [ 22%]

[2024-09-06 23:25:09] test_suites/libfabric/test_fabtests.py::test_fabtests_functional[single_node-no_xpmem-rma-write-non-dmabuf-debug] Sending interrupt signal to process

2024-09-07 02:33:53,742 - INFO - test_orchestrator - Stopping timer...

2024-09-07 02:33:53,742 - INFO - remote_executor - Downloading [email protected]:/home/ec2-user/PortaFiducia/tests/outputs/LibfabricPRCI-PR-10362-1-alinux2023-g4dn8xlar-lKkk9NqB.xml to /var/jenkins_home/agent/fargate-cloud-pr-------1-zjnmx-f5hh1/workspace/Libfabric_PR_CI_PR-103

But I don't have more logs; I need to reproduce it manually and get back to you.

@aingerson
Contributor Author

@shijin-aws I repushed with a little simplification (FI_LOCAL_COMM was OR'ed in twice), so hopefully this round of tests will provide more information on reproducibility.

@shijin-aws
Contributor

@aingerson I don't think the failure is related to this PR, as I'm seeing it in all AWS CI runs now. But let me get back to you before merging it, thx!

@aingerson
Contributor Author

@shijin-aws Ok, thanks! Also, heads up that the commit that introduced this made it into the 2.0 alpha and would probably result in multiple issues that could disable shm. It's this one: 50fb455

@shijin-aws
Contributor

> @shijin-aws Ok, thanks! Also heads up that the commit that introduced this made it into 2.0 alpha and would probably result in multiple issues that would disable shm. It's this one 50fb455

Can you elaborate? You mean #10314, merged earlier, can cause the shm provider to be disabled inside efa?

@aingerson
Contributor Author

@shijin-aws Yes. That commit added the FI_PEER capability to shm and added it to efa's shm hints to request it from shm. The issue is that because there was a comma instead of an OR, the new SMR_DOMAIN_CAPS value got assigned to the next field in the fi_info. As a result, caps did not include FI_PEER and FI_AV_USER_ID, while mode got set to FI_LOCAL_COMM | FI_PEER | FI_AV_USER_ID, so fi_getinfo would not return success: the modes were not supported and the capabilities were not set. The bug was only in the regular fi_info and not the hmem one though, so it probably wouldn't disable shm but rather would select the shm hmem info, which would disable CMA and injects; it would probably still work but hurt performance.

@shijin-aws
Contributor

shijin-aws commented Sep 9, 2024

I think the CI failed because the CUDA setup has a problem:

(env) [ec2-user@ip-172-31-82-7 pr10362]$ FI_LOG_LEVEL=warn fi_rma_bw -e rdm -o writedata -I 5 -v -p shm -E
libfabric:735977:1725904206::core:core:cuda_hmem_verify_devices():594<warn> Failed to perform cudaGetDeviceCount: cudaErrorInitializationError:initialization error
libfabric:735977:1725904206::core:core:ofi_hmem_init():663<warn> Failed to initialize hmem iface FI_HMEM_CUDA: Input/output error
libfabric:735977:1725904206::core:core:cuda_hmem_verify_devices():594<warn> Failed to perform cudaGetDeviceCount: cudaErrorInitializationError:initialization error
libfabric:735977:1725904206::core:core:ofi_hmem_init():663<warn> Failed to initialize hmem iface FI_HMEM_CUDA: Input/output error

so all CUDA hmem tests failed... we will get it fixed

@shijin-aws
Contributor

> The bug was only in the regular fi_info and not the hmem one though so it probably wouldn't disable shm but rather would select the shm hmem which would disable CMA and injects so it would probably still work but hurt performance.

Interesting, our dashboard doesn't show a regression in intra-node performance on the main branch around 08/27 (when that commit was merged). I don't think we are using HMEM on those platforms.

@aingerson
Contributor Author

@shijin-aws Weird, it should definitely be disabling the optimized shm info. Not sure how there was no regression... maybe it was never using that path?

@shijin-aws
Contributor

shijin-aws commented Sep 9, 2024

Just did a quick test on the main branch; efa's fi_getinfo call for shm succeeded (non-hmem):

(env) ubuntu@ip-172-31-40-58:~/PortaFiducia/build/libraries/libfabric/main/source/libfabric$ FI_LOG_LEVEL=warn fi_rdm -p efa -E
libfabric:2500754:1725905690::efa:domain:efa_shm_info_create():109<warn> shm fi_getinfo returns 0
Waiting for message from client...
Data check OK
Received data from client: Hello from Client!
1725905691 efa_destroy_qp: destroying qp

@aingerson
Contributor Author

aingerson commented Sep 9, 2024

@shijin-aws Does it return the info with or without FI_HMEM? Can you print out the info (or at least caps) that are returned/used?

@shijin-aws
Contributor

It is without FI_HMEM; efa only requests FI_HMEM from shm when the application requests FI_HMEM, which fi_rdm shouldn't do by default.

@aingerson
Contributor Author

@shijin-aws Oh, I guess that makes sense. It would get turned off by default during matching because it wasn't requested. What about the inject size?

@aingerson aingerson merged commit 036e2dd into ofiwg:main Sep 10, 2024
12 of 13 checks passed