Fix race check failures in shared memory groupby #17985
Conversation
@@ -213,6 +211,7 @@ CUDF_KERNEL void single_pass_shmem_aggs_kernel(cudf::size_type num_rows,
  block.sync();

  while (col_end < num_cols) {
    block.sync();
Missing this block sync is the culprit of the Spark query failure and I still don't understand why
Is it equivalent if you put the `block.sync();` after `compute_final_aggregations`? I think that would be better. When entering this `while` loop, we should already be synced by line 211. Syncing at the end of the loop would feel safer, since it forces a final sync that won't happen if the sync is at the beginning of this `while` loop.
> Is it equivalent if you put the `block.sync();` after `compute_final_aggregations`?

No, the current implementation actually syncs after `compute_final_aggregations` (or rather, at the end of `compute_final_aggregations`) and still failed the Spark query. That's why I'm saying I don't understand 😕
More info for reference:

If we sync at the end of the loop right after `compute_final_aggregations`, compute-sanitizer will report a race condition between `++col_end;` and `while (col_end < num_cols) {`. Relocating the block sync from the end of the loop to the current position eliminates the race check failure and resolves the Spark query issue.
Does this explain it?

1. We're at the beginning of the `while` loop.
2. All threads read the shared variable `col_end` to compare `col_end < num_cols`.
3. We are adding a block sync here!
4. Thread 0 runs `calculate_columns_to_aggregate` and updates the shared variable `col_end`.

Other threads besides thread 0 could have a race condition between (2) and (4), unless we force them to be ordered by inserting a sync at (3).
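To make the ordering concrete, here is a minimal, self-contained sketch of that pattern. It is not the actual `single_pass_shmem_aggs_kernel`; the kernel name, the `cols_per_pass` parameter, and the bounds arithmetic are simplified stand-ins for what `calculate_columns_to_aggregate` does, but it shows why the read of `col_end` in the loop condition (2) must be ordered before thread 0's update (4) by the sync at (3).

```cpp
// Sketch only: a block-wide loop whose bound lives in shared memory and is
// advanced by thread 0 each iteration (stand-in for the real cuDF kernel).
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

__global__ void shared_loop_bound_sketch(int num_cols, int cols_per_pass)
{
  __shared__ int col_start;
  __shared__ int col_end;

  auto block = cg::this_thread_block();

  if (block.thread_rank() == 0) { col_start = col_end = 0; }
  block.sync();  // corresponds to the sync just before the loop (line 211)

  while (col_end < num_cols) {  // (2) every thread reads shared col_end here
    block.sync();               // (3) the sync added in this PR: orders (2) before (4)

    if (block.thread_rank() == 0) {
      // (4) thread 0 advances the shared bounds; a simplified stand-in for
      //     calculate_columns_to_aggregate, which increments col_end
      col_start = col_end;
      col_end   = min(col_end + cols_per_pass, num_cols);
    }
    block.sync();  // publish the new bounds before any thread uses them

    // ... per-thread work over columns [col_start, col_end) would go here,
    //     ending with the block-wide final aggregation step ...
  }
}
```

Without the sync at (3), a straggler thread can still be reading `col_end` for the loop test while thread 0 is already rewriting it, which is the pair of accesses `compute-sanitizer --tool racecheck` flags when the sync sits at the end of the loop instead.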
These functions are not inlined, so I can see how maybe moving the block sync could change the timing such that the bug is now masked. But I think it may still be there hiding. I wouldn't be surprised at all if this was a false positive on racecheck. Or this change is just lucky for it.
> Other threads besides thread 0 could have a race condition between (2) and (4), unless we force them to be ordered by inserting a sync at (3).
Totally makes sense. 🔥
Ah yes, that's it @bdice
> These functions are not inlined, so I can see how maybe moving the block sync could change the timing such that the bug is now masked. But I think it may still be there hiding. I wouldn't be surprised at all if this was a false positive on racecheck. Or this change is just lucky for it.
I see your concern. To be honest, I still don’t understand how the current change resolves the Spark failure. However, if it were an illegal memory access issue, our nightly memcheck tests should have already detected it, and our Python and C++ tests wouldn’t have survived thousands of runs.
This PR applies a patch to 24.12 that contains the core fix to the groupby algorithm that [was merged](rapidsai/cudf#17985) into cuDF 25.02. Signed-off-by: Paul Mattione <[email protected]>
….12 (NVIDIA#2920)" This reverts commit 50125d6. Signed-off-by: Peixin Li <[email protected]>
…nch-24.12 hotfix [skip ci] (#2940) This reverts commit 50125d6 for automerge from branch-24.12 Signed-off-by: Peixin Li <[email protected]>
Description
Replaces #17976.
This fixes the race check failures in shared memory groupby and resolves NVIDIA/spark-rapids#11835.
Checklist