
Fix race check failures in shared memory groupby #17985

Merged

Conversation

@PointKernel PointKernel (Member) commented Feb 11, 2025

Description

Replaces #17976

This fixes the race check failures in shared memory groupby and resolves NVIDIA/spark-rapids#11835.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@PointKernel PointKernel added labels Feb 11, 2025: bug (Something isn't working), 3 - Ready for Review (Ready for review by team), libcudf (Affects libcudf (C++/CUDA) code), 4 - Needs Review (Waiting for reviewer to review or respond), ! - Hotfix (Hotfix is a bug that affects the majority of users for which there is no reasonable workaround), non-breaking (Non-breaking change)
@PointKernel PointKernel self-assigned this Feb 11, 2025
@PointKernel PointKernel requested a review from a team as a code owner February 11, 2025 19:17
@PointKernel PointKernel requested review from shrshi and vuule February 11, 2025 19:17
@@ -213,6 +211,7 @@ CUDF_KERNEL void single_pass_shmem_aggs_kernel(cudf::size_type num_rows,
block.sync();

while (col_end < num_cols) {
block.sync();
@PointKernel PointKernel (Member, Author) commented Feb 11, 2025

The missing block sync is the culprit of the Spark query failure, and I still don't understand why.

Contributor

Is it equivalent if you put the block.sync(); after compute_final_aggregations? I think that would be better. When entering this while loop, we should already be synced by line 211. Syncing at the end of the loop would feel safer, since it forces a final sync that won't happen if the sync is at the beginning of this while loop.

@PointKernel PointKernel (Member, Author) commented Feb 11, 2025

Is it equivalent if you put the block.sync(); after compute_final_aggregations?

No, the current implementation is actually already synced after compute_final_aggregations (or rather, at the end of compute_final_aggregations), and it still failed the Spark query. That's why I'm saying I don't understand 😕

@PointKernel PointKernel (Member, Author)

More info for reference:

If we sync at the end of the loop, right after compute_final_aggregations, compute-sanitizer will report a race condition between the col_end update in calculate_columns_to_aggregate and

while (col_end < num_cols) {

Relocating the block sync from the end of the loop to its current position eliminates the race check failure and resolves the Spark query issue.

Contributor

Does this explain it?

  1. We're at the beginning of the while loop.
  2. All threads read the shared variable col_end to compare col_end < num_cols.
  3. We are adding a block sync here!
  4. Thread 0 runs calculate_columns_to_aggregate and updates the shared variable col_end.

Other threads besides thread 0 could have a race condition between (2) and (4), unless we force them to be ordered by inserting a sync at (3).
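
For reference, a minimal, self-contained CUDA sketch of the pattern described in the numbered steps above (this is not the libcudf kernel; the kernel name, the window-advance step of 8, and the placeholder processing comment are illustrative, but the sync placement mirrors the fix):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Illustrative sketch only: thread 0 advances a shared column window each
// iteration while every thread reads col_end in the loop condition.
__global__ void shared_window_loop(int num_cols)
{
  auto block = cg::this_thread_block();
  __shared__ int col_start;
  __shared__ int col_end;

  if (block.thread_rank() == 0) { col_start = 0; col_end = 0; }
  block.sync();  // make the initial values visible to all threads (the sync before the loop)

  while (col_end < num_cols) {  // (2) every thread reads col_end here
    block.sync();               // (3) the added sync: orders the read above
                                //     before thread 0's write below
    if (block.thread_rank() == 0) {
      col_start = col_end;                     // (4) thread 0 updates the
      col_end   = min(col_end + 8, num_cols);  //     shared window bounds
    }
    block.sync();  // publish the new [col_start, col_end) window

    // ... all threads process columns in [col_start, col_end) ...
  }
}

Without the sync at (3), the condition reads by non-leader threads and thread 0's write at (4) happen in the same iteration with no barrier between them; the end-of-loop sync alone cannot order those two accesses.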

@pmattione-nvidia pmattione-nvidia (Contributor) commented Feb 11, 2025

These functions are not inlined, so I can see how maybe moving the block sync could change the timing such that the bug is now masked. But I think it may still be there hiding. I wouldn't be surprised at all if this was a false positive on racecheck. Or this change is just lucky for it.

@PointKernel PointKernel (Member, Author)

Other threads besides thread 0 could have a race condition between (2) and (4), unless we force them to be ordered by inserting a sync at (3).

Totally makes sense. 🔥

@pmattione-nvidia pmattione-nvidia (Contributor) commented Feb 11, 2025

Ah yes, that's it @bdice

@PointKernel PointKernel (Member, Author)

These functions are not inlined, so I can see how maybe moving the block sync could change the timing such that the bug is now masked. But I think it may still be there hiding. I wouldn't be surprised at all if this was a false positive on racecheck. Or this change is just lucky for it.

I see your concern. To be honest, I still don’t understand how the current change resolves the Spark failure. However, if it were an illegal memory access issue, our nightly memcheck tests should have already detected it, and our Python and C++ tests wouldn’t have survived thousands of runs.

@raydouglass raydouglass merged commit 4b2ce98 into rapidsai:branch-25.02 Feb 11, 2025
127 of 128 checks passed
@PointKernel PointKernel deleted the fix-groupby-race-check branch February 11, 2025 21:18
pxLi pushed a commit to NVIDIA/spark-rapids-jni that referenced this pull request Feb 17, 2025
This PR applies a patch to 24.12 that contains the core fix to the groupby algorithm that [was merged](rapidsai/cudf#17985) into cuDF 25.02.

Signed-off-by: Paul Mattione <[email protected]>
pxLi added a commit to pxLi/spark-rapids-jni that referenced this pull request Feb 17, 2025
pxLi added a commit to NVIDIA/spark-rapids-jni that referenced this pull request Feb 17, 2025
…nch-24.12 hotfix [skip ci] (#2940)

This reverts commit 50125d6 for automerge from branch-24.12

Signed-off-by: Peixin Li <[email protected]>