
Fix race check failures in shared memory groupby #17985

Merged

Conversation

@PointKernel PointKernel (Member) commented Feb 11, 2025

Description

Replaces #17976

This fixes the race check failures in shared memory groupby and resolves NVIDIA/spark-rapids#11835.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@PointKernel PointKernel added labels Feb 11, 2025: bug (Something isn't working), 3 - Ready for Review (Ready for review by team), libcudf (Affects libcudf (C++/CUDA) code), 4 - Needs Review (Waiting for reviewer to review or respond), ! - Hotfix (Hotfix is a bug that affects the majority of users for which there is no reasonable workaround), non-breaking (Non-breaking change)
@PointKernel PointKernel self-assigned this Feb 11, 2025
@PointKernel PointKernel requested a review from a team as a code owner February 11, 2025 19:17
@PointKernel PointKernel requested review from shrshi and vuule February 11, 2025 19:17
@@ -213,6 +211,7 @@ CUDF_KERNEL void single_pass_shmem_aggs_kernel(cudf::size_type num_rows,
block.sync();

while (col_end < num_cols) {
block.sync();
@PointKernel PointKernel (Member, Author) commented Feb 11, 2025

The missing block sync is the culprit of the Spark query failure, and I still don't understand why.

Contributor

Is it equivalent if you put the block.sync(); after compute_final_aggregations? I think that would be better. When entering this while loop, we should already be synced by line 211. Syncing at the end of the loop would feel safer, since it forces a final sync that won't happen if the sync is at the beginning of this while loop.

@PointKernel PointKernel (Member, Author) commented Feb 11, 2025

Is it equivalent if you put the block.sync(); after compute_final_aggregations?

No, the current implementation is actually already synced after compute_final_aggregations (or rather, at the end of compute_final_aggregations), and it still failed the Spark query. That's why I'm saying I don't understand 😕

@PointKernel PointKernel (Member, Author)

More info for reference:

If we sync at the end of the loop, right after compute_final_aggregations, compute-sanitizer will report a race condition between the col_end update in calculate_columns_to_aggregate and

while (col_end < num_cols) {

Relocating the block sync from the end of the loop to its current position eliminates the race check failure and resolves the Spark query issue.

Contributor

Does this explain it?

  1. We're at the beginning of the while loop.
  2. All threads read the shared variable col_end to compare col_end < num_cols.
  3. We are adding a block sync here!
  4. Thread 0 runs calculate_columns_to_aggregate and updates the shared variable col_end.

Other threads besides thread 0 could have a race condition between (2) and (4), unless we force them to be ordered by inserting a sync at (3).
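
For reference, a minimal, self-contained CUDA sketch of the pattern described in the numbered steps above (this is not the libcudf kernel; the kernel name, the window-advance step of 8, and the placeholder processing comment are illustrative, but the sync placement mirrors the fix):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Illustrative sketch only: thread 0 advances a shared column window each
// iteration while every thread reads col_end in the loop condition.
__global__ void shared_window_loop(int num_cols)
{
  auto block = cg::this_thread_block();
  __shared__ int col_start;
  __shared__ int col_end;

  if (block.thread_rank() == 0) { col_start = 0; col_end = 0; }
  block.sync();  // make the initial values visible to all threads (the sync before the loop)

  while (col_end < num_cols) {  // (2) every thread reads col_end here
    block.sync();               // (3) the added sync: orders the read above
                                //     before thread 0's write below
    if (block.thread_rank() == 0) {
      col_start = col_end;                     // (4) thread 0 updates the
      col_end   = min(col_end + 8, num_cols);  //     shared window bounds
    }
    block.sync();  // publish the new [col_start, col_end) window

    // ... all threads process columns in [col_start, col_end) ...
  }
}

Without the sync at (3), the condition reads by non-leader threads and thread 0's write at (4) happen in the same iteration with no barrier between them; the end-of-loop sync alone cannot order those two accesses.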

@pmattione-nvidia pmattione-nvidia (Contributor) commented Feb 11, 2025

These functions are not inlined, so I can see how maybe moving the block sync could change the timing such that the bug is now masked. But I think it may still be there hiding. I wouldn't be surprised at all if this was a false positive on racecheck. Or this change is just lucky for it.

@PointKernel PointKernel (Member, Author)

Other threads besides thread 0 could have a race condition between (2) and (4), unless we force them to be ordered by inserting a sync at (3).

Totally makes sense. 🔥

@pmattione-nvidia pmattione-nvidia (Contributor) commented Feb 11, 2025

Ah yes, that's it @bdice

@PointKernel PointKernel (Member, Author)

These functions are not inlined, so I can see how maybe moving the block sync could change the timing such that the bug is now masked. But I think it may still be there hiding. I wouldn't be surprised at all if this was a false positive on racecheck. Or this change is just lucky for it.

I see your concern. To be honest, I still don’t understand how the current change resolves the Spark failure. However, if it were an illegal memory access issue, our nightly memcheck tests should have already detected it, and our Python and C++ tests wouldn’t have survived thousands of runs.

@raydouglass raydouglass merged commit 4b2ce98 into rapidsai:branch-25.02 Feb 11, 2025
127 of 128 checks passed
@PointKernel PointKernel deleted the fix-groupby-race-check branch February 11, 2025 21:18
pxLi pushed a commit to NVIDIA/spark-rapids-jni that referenced this pull request Feb 17, 2025
This PR applies a patch to 24.12 that contains the core fix to the groupby algorithm that [was merged](rapidsai/cudf#17985) into cuDF 25.02.

Signed-off-by: Paul Mattione <[email protected]>
pxLi added a commit to pxLi/spark-rapids-jni that referenced this pull request Feb 17, 2025
pxLi added a commit to NVIDIA/spark-rapids-jni that referenced this pull request Feb 17, 2025
…nch-24.12 hotfix [skip ci] (#2940)

This reverts commit 50125d6 for automerge from branch-24.12

Signed-off-by: Peixin Li <[email protected]>