-
Notifications
You must be signed in to change notification settings - Fork 376
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
prov/shm: fix incorrect capability set #10362
Conversation
SMR_DOMAIN_CAPS was recently added to include FI_PEER, FI_AV_USER_ID but a typo caused an smr_info data corruption Signed-off-by: Alexia Ingerson <[email protected]>
@shijin-aws Double checking that the AWS CI failure isn't related? This is a pretty obvious bug but I don't want to merge if it'll cause any issues for you. |
The PR consistently failed SHM provider's fabtests on a certain OS (Amazon linux 2023)
But I don't have more logs, I need to reproduce it manually and get back to you |
@shijin-aws I repushed with a little simplification (FI_LOCAL_COMM was OR'ed in twice) so hopefully this round of tests will provide more information on reproducibility |
@aingerson I don't think the failure is related to this PR as I find it in all AWS CI now. But let me get back to you before merging it, thx! |
@shijin-aws Ok, thanks! Also heads up that the commit that introduced this made it into 2.0 alpha and would probably result in multiple issues that would disable shm. It's this one 50fb455 |
Can you elaborate? You mean #10314 merged earlier can cause shm provider to be disabled inside efa? |
@shijin-aws Yes, so that commit added the FI_PEER capability to shm and added it to efa shm hints to request it from shm. The issue is that because there was a comma instead of an OR, the new SMR_DOMAIN_CAPS got set as the next field in the fi_info which means the caps did not have FI_PEER and FI_AV_USER_ID and the mode was set to FI_LOCAL_COMM | FI_PEER | FI_AV_USER_ID which means that fi_info would not return success because the modes were not supported and the capability was not set. The bug was only in the regular fi_info and not the hmem one though so it probably wouldn't disable shm but rather would select the shm hmem which would disable CMA and injects so it would probably still work but hurt performance. |
I think the CI failed as the cuda setup has some problem
so all cuda hmem test failed... we will get it fixed |
Interesting, our dashboard doesn't show a regression on intra-node performance for main branch around 08/27 (that commit was merged). I don't think we are using HMEM for those platform |
@shijin-aws Weird, it should definitely be disabling the optimized shm info. Not sure how there was no regression... maybe it was never using that path? |
Just did a quick test for main branch, efa's fi_getinfo call for shm succeeded (non-hmem)
|
@shijin-aws Does it return the info with or without FI_HMEM? Can you print out the info (or at least caps) that are returned/used? |
It is without FI_HMEM, efa only request FI_HMEM to shm when application request FI_HMEM, which fi_rdm shouldn't do by default |
@shijin-aws Oh I guess that makes sense. It would get turned off by default when matching because it wasn't requested. What about the inject size? |
SMR_DOMAIN_CAPS was recently added to include FI_PEER, FI_AV_USER_ID but a typo caused an smr_info data corruption