[WIP][CELEBORN-1319] Optimize skew partition logic for Reduce Mode to avoid sorting shuffle files #2373

wangshengjie123 · 2024-03-11T02:46:37Z

What changes were proposed in this pull request?

Add logic that avoids sorting shuffle files in Reduce mode when optimizing skew partitions.

Why are the changes needed?

The current logic has to sort shuffle files when reading skewed-partition shuffle files in Reduce mode; we have seen shuffle sorting timeouts and performance issues as a result.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Cluster tests and unit tests (UTs).

client/src/main/java/org/apache/celeborn/client/read/CelebornInputStream.java

waitinfuture

Thanks @wangshengjie123 for this PR! I left some comments. In addition, is the small change to Spark missing?

client/src/main/java/org/apache/celeborn/client/read/CelebornInputStream.java

waitinfuture · 2024-03-12T02:13:42Z

client/src/main/java/org/apache/celeborn/client/read/CelebornInputStream.java

+
+    int step = locations.length / subPartitionSize;
+
+    // if partition location is [1,2,3,4,5,6,7,8,9,10], and skew partition split to 3 task:


Seems the logic should be like this:

    // if partition location is [1,2,3,4,5,6,7,8,9,10], and skew partition split to 3 task:
    // task 0: 1, 4, 7, 10
    // task 1: 2, 5, 8
    // task 2: 3, 6, 9
    for (int i = 0; i < step + 1; i++) {
      int index = i * step + subPartitionIndex;
      if (index < locations.length) {
        result.add(orderedPartitionLocations[index]);
      }
    }

If I am not wrong, the idea is to minimize the per-row size, which is why column 0 goes "down" the array index while column 1 goes "up", and so on alternating, so that as the sizes keep increasing they are distributed more evenly across rows (essentially a way to approximate the multi-way partition problem).

The result would be different for the formulation above @waitinfuture.

For example:

partition sizes: {1000, 1100, 1300, 1400, 2000, 2500, 3000, 10000, 20000, 25000, 28000, 30000}
subPartitionSize == 3
subPartitionIndex == 1

In formulation from PR we have:

task 0: 1000 , 2500 , 3000 , 30000
task 1: 1100 , 2000 , 10000 , 28000
task 2: 1300 , 1400 , 20000 , 25000

So the sizes will be:
task 0: 36500
task 1: 41100
task 2: 47700

As formulated above, we will end up with:

task 0: 1000 , 1400 , 3000 , 25000
task 1: 1100 , 2000 , 10000 , 28000
task 2: 1300 , 2500 , 20000 , 30000

In this case, the sizes will be:
task 0: 30400
task 1: 41100
task 2: 53800
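
The two sets of totals above can be re-derived with a small, self-contained Java sketch. The names `SkewSplitDemo`, `serpentineLoads` and `roundRobinLoads` are hypothetical and not part of the PR; the serpentine version is one reading of the column-alternating idea described above, not the PR's actual code.

```java
import java.util.Arrays;

// Hypothetical demo class; not part of the Celeborn code base.
class SkewSplitDemo {

  // Serpentine order: each row of `tasks` entries alternates direction
  // (0,1,2 then 2,1,0, ...), so the larger entries are spread across tasks.
  static long[] serpentineLoads(long[] sortedSizes, int tasks) {
    long[] loads = new long[tasks];
    for (int i = 0; i < sortedSizes.length; i++) {
      int row = i / tasks;
      int col = i % tasks;
      int task = (row % 2 == 0) ? col : tasks - 1 - col;
      loads[task] += sortedSizes[i];
    }
    return loads;
  }

  // Plain round-robin: entry i always goes to task i % tasks.
  static long[] roundRobinLoads(long[] sortedSizes, int tasks) {
    long[] loads = new long[tasks];
    for (int i = 0; i < sortedSizes.length; i++) {
      loads[i % tasks] += sortedSizes[i];
    }
    return loads;
  }

  public static void main(String[] args) {
    long[] sizes = {1000, 1100, 1300, 1400, 2000, 2500,
                    3000, 10000, 20000, 25000, 28000, 30000};
    // prints [36500, 41100, 47700]
    System.out.println(Arrays.toString(serpentineLoads(sizes, 3)));
    // prints [30400, 41100, 53800]
    System.out.println(Arrays.toString(roundRobinLoads(sizes, 3)));
  }
}
```

The spread between the smallest and largest task is 11200 for the serpentine order and 23400 for round-robin on this input.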

Personally, I would have looked into either largest remainder or knapsack heuristic (given we are sorting anyway).

(Do let me know if I am missing something here @wangshengjie123)
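
For comparison, here is a minimal sketch of a greedy "largest first, least-loaded task" pass (often called LPT), which is one common heuristic for the multi-way partition problem. It is a swapped-in illustration, not the largest-remainder or knapsack formulation mentioned above and not code from the PR; `GreedySplitDemo` and `greedyLoads` are hypothetical names.

```java
import java.util.Arrays;

// Hypothetical demo class; not part of the Celeborn code base.
class GreedySplitDemo {

  // Greedy LPT: visit sizes from largest to smallest and give each one
  // to the currently least-loaded task (the lowest index wins ties).
  static long[] greedyLoads(long[] sizes, int tasks) {
    long[] sorted = sizes.clone();
    Arrays.sort(sorted); // ascending; iterate from the end for descending order
    long[] loads = new long[tasks];
    for (int i = sorted.length - 1; i >= 0; i--) {
      int target = 0;
      for (int t = 1; t < tasks; t++) {
        if (loads[t] < loads[target]) {
          target = t;
        }
      }
      loads[target] += sorted[i];
    }
    return loads;
  }

  public static void main(String[] args) {
    long[] sizes = {1000, 1100, 1300, 1400, 2000, 2500,
                    3000, 10000, 20000, 25000, 28000, 30000};
    // prints [40000, 40300, 45000]
    System.out.println(Arrays.toString(greedyLoads(sizes, 3)));
  }
}
```

On the 12 sizes above the largest task ends up at 45000, compared with 47700 for the PR's ordering and 53800 for plain round-robin.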

@mridulm Sorry for the late reply. Your understanding is correct, and I should optimize the logic.

Thanks @mridulm for the explanation, I actually didn't get the idea and was thinking of the naive way :)

client/src/main/java/org/apache/celeborn/client/ShuffleClientImpl.java

client/src/main/java/org/apache/celeborn/client/read/CelebornInputStream.java

client/src/main/scala/org/apache/celeborn/client/LifecycleManager.scala

common/src/main/java/org/apache/celeborn/common/write/PushState.java

common/src/main/proto/TransportMessages.proto

common/src/main/scala/org/apache/celeborn/common/protocol/message/ControlMessages.scala

mridulm

Interesting work @wangshengjie123 !

client/src/main/java/org/apache/celeborn/client/read/CelebornInputStream.java

mridulm · 2024-03-13T04:58:48Z

client/src/main/java/org/apache/celeborn/client/read/CelebornInputStream.java

client/src/main/scala/org/apache/celeborn/client/CommitManager.scala

client/src/main/scala/org/apache/celeborn/client/commit/ReducePartitionCommitHandler.scala

common/src/main/java/org/apache/celeborn/common/write/PushFailedBatch.java

worker/src/main/scala/org/apache/celeborn/service/deploy/worker/FetchHandler.scala

cfmcgrady · 2024-03-14T07:28:43Z

Thanks @wangshengjie123 for this PR! I left some comments. In addition, is the small change to Spark missing?

Hi @wangshengjie123,
Can you please update the Spark patch? It will help the reviewers understand this PR better. Thanks!

wangshengjie123 · 2024-03-16T07:58:48Z

Thanks @wangshengjie123 for this PR! I left some comments. In addition, is the small change to Spark missing?

Hi @wangshengjie123, Can you please update the Spark patch? It will help the reviewers understand this PR better. Thanks!

Sorry for the late reply; the PR will be updated today or tomorrow.

codecov · 2024-03-16T08:28:59Z

Codecov Report

Attention: Patch coverage is 1.20482%, with 82 lines in your changes missing coverage. Please review.

Project coverage is 48.51%. Comparing base (12c3779) to head (ef81070).
Report is 12 commits behind head on main.

Files	Patch %	Lines
...born/common/protocol/message/ControlMessages.scala	0.00%	38 Missing ⚠️
.../apache/celeborn/common/write/PushFailedBatch.java	0.00%	24 Missing ⚠️
...org/apache/celeborn/common/util/PbSerDeUtils.scala	0.00%	9 Missing ⚠️
...g/apache/celeborn/common/protocol/StorageInfo.java	0.00%	6 Missing ⚠️
...va/org/apache/celeborn/common/write/PushState.java	16.67%	5 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2373      +/-   ##
==========================================
- Coverage   48.77%   48.51%   -0.26%     
==========================================
  Files         209      210       +1     
  Lines       13109    13186      +77     
  Branches     1134     1139       +5     
==========================================
+ Hits         6393     6396       +3     
- Misses       6294     6368      +74     
  Partials      422      422

RexXiong

Thanks @wangshengjie123, nice PR! Another suggestion: it would be better to add UTs for this feature.

RexXiong · 2024-03-19T09:14:29Z

client/src/main/java/org/apache/celeborn/client/ShuffleClientImpl.java

@@ -1393,7 +1414,13 @@ public void onSuccess(ByteBuffer response) {
                    Arrays.toString(partitionIds),
                    groupedBatchId,
                    Arrays.toString(batchIds));
-
+                if (dataPushFailureTrackingEnabled) {


There is no need to do this for HARD_SPLIT, as the worker never writes the batch when it returns HARD_SPLIT. cc @waitinfuture

I'm not sure if it's possible that the master copy succeeds but the copy fails due to HARD_SPLIT. I will check it again

RexXiong · 2024-03-19T09:46:24Z

client/src/main/java/org/apache/celeborn/client/read/CelebornInputStream.java

@@ -615,6 +663,17 @@ private boolean fillBuffer() throws IOException {

          // de-duplicate
          if (attemptId == attempts[mapId]) {
+            if (splitSkewPartitionWithoutMapRange) {


We can reuse one PushFailedBatch object and update its inner fields to improve memory efficiency.

Better to check whether failedBatches is empty first. Maybe we never need to check failed batches at all.

Got it.

Fixed to avoid an NPE.
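
The two suggestions in this thread (check for an empty set first, and reuse a single lookup object) can be sketched as follows. This is only an illustration under assumptions: `BatchKey` is a hypothetical stand-in for PushFailedBatch, whose real fields may differ, and `countAccepted` is not a method from the PR.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical demo class; not part of the Celeborn code base.
class FailedBatchFilterDemo {

  // Hypothetical stand-in for PushFailedBatch (field names are assumptions).
  static final class BatchKey {
    int mapId;
    int attemptId;
    int batchId;

    BatchKey set(int mapId, int attemptId, int batchId) {
      this.mapId = mapId;
      this.attemptId = attemptId;
      this.batchId = batchId;
      return this;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof BatchKey)) return false;
      BatchKey k = (BatchKey) o;
      return mapId == k.mapId && attemptId == k.attemptId && batchId == k.batchId;
    }

    @Override
    public int hashCode() {
      return Objects.hash(mapId, attemptId, batchId);
    }
  }

  // Counts batches not listed in failedBatches; each batch is {mapId, attemptId, batchId}.
  static int countAccepted(int[][] batches, Set<BatchKey> failedBatches) {
    // Fast path: also guards against a null set, so no lookup is ever attempted.
    if (failedBatches == null || failedBatches.isEmpty()) {
      return batches.length;
    }
    // One probe object is reused for every lookup. It is never stored in the
    // set, so mutating it between lookups is safe.
    BatchKey probe = new BatchKey();
    int accepted = 0;
    for (int[] b : batches) {
      if (!failedBatches.contains(probe.set(b[0], b[1], b[2]))) {
        accepted++;
      }
    }
    return accepted;
  }

  public static void main(String[] args) {
    Set<BatchKey> failed = new HashSet<>();
    failed.add(new BatchKey().set(1, 0, 7));
    int[][] batches = {{1, 0, 6}, {1, 0, 7}, {2, 0, 7}};
    System.out.println(countAccepted(batches, failed)); // prints 2
  }
}
```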

client/src/main/scala/org/apache/celeborn/client/commit/ReducePartitionCommitHandler.scala

RexXiong · 2024-03-19T09:58:01Z

common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala

@@ -4671,4 +4671,13 @@ object CelebornConf extends Logging {
      .version("0.5.0")
      .intConf
      .createWithDefault(10000)
+
+  val CLIENT_DATA_PUSH_FAILURE_TRACKING_ENABLED: ConfigEntry[Boolean] =


Maybe we can use another configuration name for enabling the skew join optimization. CLIENT_DATA_PUSH_FAILURE_TRACKING_ENABLED doesn't feel very straightforward.

wangshengjie123 · 2024-03-24T09:19:02Z

Thanks @wangshengjie123, nice PR! Another suggestion: it would be better to add UTs for this feature.

The UTs are in progress. We are testing on a cluster this week, and the UTs will be submitted later.

s0nskar · 2024-04-04T08:53:10Z

@wangshengjie123 Is there any doc or ticket explaining this approach? Also for the sort based approach that you mentioned.

s0nskar · 2024-04-04T13:02:29Z

From my understanding, in this PR we're diverging from the vanilla Spark approach based on mapIndex and just dividing the full partition into multiple sub-partitions based on some heuristics. I'm new to the Celeborn code, so I might be missing something basic, but this PR does not address the issue below. Consider a basic scenario where a partial partition read is happening and we see a FetchFailure.

ShuffleMapStage --> ResultStage

ShuffleMapStage (attempt 0) generated [P0, P1, P2], and P0 is skewed with partition locations [0,1,2,3,4,5].
AQE asks for three splits, and the logic in this PR will create three sub-partitions: [0, 1], [2, 3], [4, 5].
Now consider a reducer that reads [0, 1] and [2, 3] and gets a FetchFailure while reading [4, 5].
This will trigger a complete mapper stage retry according to this doc and will clear the map output corresponding to the shuffleID.
ShuffleMapStage (attempt 1) will again generate data for P0, at different partition locations [a, b, c, d, e, f], and it will be divided into [a, b], [c, d], [e, f].
Now if the reader stage is a ShuffleMapStage, it will read every sub-partition again, but if the reader is a ResultStage, it will only read the missing partition data, which is [e, f].

The data generated at location 1 and location a would be different because of factors like network delay (the same applies to the other locations). For example, data that was in the 1st location in the first attempt might be in the 2nd location, or any other location, in a different attempt, because of the order in which the mappers generated the data and the order in which the server received it.

This can cause both data loss and data duplication. It might be addressed somewhere else in the codebase that I'm not aware of, but I wanted to point this problem out.
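
The scenario can be reproduced with a toy simulation. Everything here is a simplified assumption for illustration: 12 record ids, 6 partition locations per attempt, and a fixed `shift` that stands in for the arrival-order differences between attempts. `StageRetryDemo` and its methods are hypothetical names, not Celeborn or Spark APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class; not part of the Celeborn code base.
class StageRetryDemo {
  static final int RECORDS = 12;
  static final int LOCATIONS = 6;

  // Record ids held by one partition location in a given attempt.
  // `shift` models a different arrival order in a retried attempt (an assumption).
  static List<Integer> location(int index, int shift) {
    int perLocation = RECORDS / LOCATIONS;
    List<Integer> out = new ArrayList<>();
    for (int j = 0; j < perLocation; j++) {
      out.add((index * perLocation + j + shift) % RECORDS);
    }
    return out;
  }

  // Sub-partition s covers locations [2s, 2s + 1], as in the three-way split above.
  static List<Integer> subPartition(int s, int shift) {
    List<Integer> out = new ArrayList<>(location(2 * s, shift));
    out.addAll(location(2 * s + 1, shift));
    return out;
  }

  // How often each record id is read when sub-partitions 0 and 1 come from
  // attempt 0 and only sub-partition 2 is re-read from the retried attempt.
  static Map<Integer, Integer> readCounts(int retryShift) {
    Map<Integer, Integer> counts = new TreeMap<>();
    for (int r = 0; r < RECORDS; r++) {
      counts.put(r, 0);
    }
    List<Integer> seen = new ArrayList<>();
    seen.addAll(subPartition(0, 0));
    seen.addAll(subPartition(1, 0));
    seen.addAll(subPartition(2, retryShift));
    for (int r : seen) {
      counts.merge(r, 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    System.out.println(readCounts(0)); // same layout on retry: every record read once
    System.out.println(readCounts(4)); // shifted layout on retry: duplicates and gaps
  }
}
```

With `readCounts(4)`, records 0 to 3 are read twice and records 8 to 11 are never read, which is the duplication and loss described above.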

pan3793 · 2024-04-04T14:00:08Z

@s0nskar Good point, this should be an issue for ResultStage, even though the ShuffleMapStage's output is deterministic.

IIRC, vanilla Spark also has some limitations on stage retry cases for ResultStage when ShuffleMapStage's output is indeterministic, for such cases, we need to fail the job, right?

s0nskar · 2024-04-04T14:13:24Z

@pan3793 This does not become a problem if we maintain the concept of mapIndex ranges, as Spark will always read deterministic output for each sub-partition.

As vanilla Spark always reads deterministic output because of the mapIndex range filter, it will not face this issue. In this approach, sub-partition data will be indeterministic across stage attempts. Failing would be the only option for such cases until Spark starts supporting ResultStage rollback.

s0nskar · 2024-04-04T14:26:45Z

Also, I think this issue would not be limited to ResultStage; it can happen with ShuffleMapStage as well in some complex cases. Consider another scenario:

ShuffleMapStage1 -----> ShuffleMapStage2 ----->

Similar to the above example, let's say the skewed partition P0 is generated by ShuffleMapStage1.
ShuffleMapStage2 gets a FetchFailure while reading sub-partitions of ShuffleMapStage1.
ShuffleMapStage1 will be recomputed and its shuffle outputs will be cleared.
Only the missing tasks of ShuffleMapStage2 will be retried, again causing the same issue.

In this case, though, we can roll back the whole lineage up to this point instead of failing the job, similar to what vanilla Spark does, but this will be very expensive.

pan3793 · 2024-04-04T16:23:48Z

@s0nskar I see your point. When consuming skew partitions, we should always treat the previous ShuffleMapStage's output as indeterministic under the current approach to avoid correctness issues.

waitinfuture · 2024-04-04T16:42:21Z

Hi @s0nskar , thanks for your point, I think you are correct. Seems this PR conflicts with stage rerun.

we should always treat the previous ShuffleMapStage's output as indeterministic under the current approach to avoid correctness issues.

@pan3793 Is it possible to force it to be treated as indeterministic?

Also, I think Spark doesn't correctly set stage's determinism for some cases, for example a row_number window operator followed by aggregation keyed by the row_number.

cc @mridulm @ErikFang

waitinfuture · 2024-04-04T16:56:40Z

@wangshengjie123 Is there any doc or ticket explaining this approach? Also for the sort based approach that you mentioned.

The sort based approach is roughly like this:

Each sub reducer reads from all partition splits of its partitionId for data within its map range
The first read request will trigger the partition split file to be sorted based on map ids, so each IO will be sequential
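
A minimal sketch of the idea, under simplifying assumptions: a split file is modeled as an in-memory array of `{mapId, payload}` batches, and a sub-reducer's map range is the half-open interval `[startMapIndex, endMapIndex)`. `SortedSplitReadDemo`, `sortByMapId` and `readRange` are hypothetical names and do not reflect the worker's actual sorting code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical demo class; not part of the Celeborn code base.
class SortedSplitReadDemo {

  // Batches arrive in arbitrary mapId order. Sorting them by mapId once means
  // every sub-reducer can then read one contiguous slice of the file.
  static int[][] sortByMapId(int[][] batches) {
    int[][] sorted = batches.clone();
    // Arrays.sort on object arrays is stable, so batches of one map keep their order.
    Arrays.sort(sorted, Comparator.comparingInt(b -> b[0]));
    return sorted;
  }

  // Payloads whose mapId falls in [startMapIndex, endMapIndex).
  static List<Integer> readRange(int[][] sortedBatches, int startMapIndex, int endMapIndex) {
    List<Integer> out = new ArrayList<>();
    for (int[] b : sortedBatches) {
      if (b[0] >= endMapIndex) {
        break; // input is sorted, so nothing further can match
      }
      if (b[0] >= startMapIndex) {
        out.add(b[1]);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    int[][] arrival = {{2, 20}, {0, 1}, {1, 10}, {0, 2}, {3, 30}, {1, 11}};
    int[][] sorted = sortByMapId(arrival);
    System.out.println(readRange(sorted, 0, 2)); // prints [1, 2, 10, 11]
    System.out.println(readRange(sorted, 2, 4)); // prints [20, 30]
  }
}
```

Because the map range is fixed by Spark, each sub-reducer sees the same data regardless of how the partition was split into locations, at the cost of the sort that this PR tries to avoid.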

s0nskar · 2024-04-04T17:49:59Z

Thanks a lot @waitinfuture for the sort based approach description.

Is it possible to force it to be treated as indeterministic?

IMO this would be very difficult to do from Celeborn itself, but it can be done by putting a patch in the Spark code. ShuffledRowRDD can set its determinacy level to INDETERMINATE if partial partition reads are happening and Celeborn is being used.

cc: @mridulm for viz

pan3793 · 2024-04-04T18:09:09Z

@waitinfuture It seems this PR is getting attention and some discussions happened offline. We'd better update the PR description (or a Google Doc) to summarize the whole design and the known issues so far.

s0nskar · 2024-04-05T04:11:00Z

assets/spark-patch/Celeborn-Optimize-Skew-Partitions-spark3_3.patch

+         }
+-        PartialReducerPartitionSpec(reducerId, startMapIndex, endMapIndex, dataSize)
+        if (splitSkewPartitionWithCeleborn) {
+          PartialReducerPartitionSpec(reducerId, mapStartIndices.length, i, dataSize)


We can maybe add a note here that these dataSize values will not be accurate. In the current downstream code we only take the sum of dataSize, which should be equal, but someone might be using these values differently.

mridulm · 2024-04-06T01:48:17Z

It has been a while since I looked at this PR - but as formulated, the split into subranges is deterministic (if it is not, it should be made so).
With that in place, this would not be an issue ...
(I will take a deeper look later next week, but do let me know if I am missing something so that I can add that to my analysis)

waitinfuture · 2024-04-06T03:05:00Z

It has been a while since I looked at this PR - but as formulated, the split into subranges is deterministic (if it is not, it should be made so). With that in place, this would not be an issue ... (I will take a deeper look later next week, but do let me know if I am missing something so that I can add that to my analysis)

the split into subranges is deterministic

The way Celeborn splits partitions is not deterministic across stage reruns (for example, any push failure will cause a split), so I'm afraid this statement does not hold...

mridulm · 2024-04-06T05:14:08Z

Ah, I see what you mean ... PartitionLocation would change between retries.
Yeah, this is a problem then - it will cause data loss. This would be a variant of SPARK-23207

I will need to relook at the PR, and how it interacts with Celeborn - but if scenarios directly described in SPARK-23207 (or variants of it) are applicable (and we can't mitigate it), we should not proceed down this path given the correctness implications unfortunately.

mridulm · 2024-04-06T05:27:58Z

+CC @otterc as well.

waitinfuture · 2024-04-06T11:59:22Z

Ah, I see what you mean ... PartitionLocation would change between retries. Yeah, this is a problem then - it will cause data loss. This would be a variant of SPARK-23207

I will need to relook at the PR, and how it interacts with Celeborn - but if scenarios directly described in SPARK-23207 (or variants of it) are applicable (and we can't mitigate it), we should not proceed down this path given the correctness implications unfortunately.

Maybe we can keep both this optimization and stage rerun, but allow only one of them to take effect at a time by checking configs for now. The performance issue this PR solves does happen in production.

github-actions · 2024-06-23T08:30:23Z

This PR is stale because it has been open 20 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions · 2024-11-10T08:32:46Z

This issue was closed because it has been staled for 10 days with no activity.

xy2953396112 · 2024-11-13T07:15:42Z

Is this optimization used in your production environment?

wangshengjie123 changed the title ~~[CELEBORN-1319] Optimize skew partition logic for Reduce Mode to avoid sorting shuffle files~~ [WIP][CELEBORN-1319] Optimize skew partition logic for Reduce Mode to avoid sorting shuffle files Mar 11, 2024

lyy-pineapple reviewed Mar 11, 2024

client/src/main/java/org/apache/celeborn/client/read/CelebornInputStream.java

waitinfuture reviewed Mar 12, 2024

mridulm reviewed Mar 13, 2024

wangshengjie123 closed this Mar 16, 2024

wangshengjie123 reopened this Mar 16, 2024

wangshengjie123 force-pushed the optimize-skew-partition branch from b3af836 to 599be24 Compare March 16, 2024 08:14

RexXiong reviewed Mar 19, 2024

wangshengjie123 force-pushed the optimize-skew-partition branch 2 times, most recently from 8fe3a13 to 12eca26 Compare March 26, 2024 14:12

s0nskar reviewed Apr 5, 2024

rebase main and fix npe

b3a30be

wangshengjie123 force-pushed the optimize-skew-partition branch from e96516e to b3a30be Compare June 2, 2024 14:28

wangshengjie123 added 3 commits June 3, 2024 09:41

fix ut

f5715c4

fix npe when memory storage enabled

eef849a

fix code style check error

81947e7

github-actions bot added the stale label Jun 23, 2024

RexXiong removed the stale label Jun 23, 2024

github-actions bot added the stale label Jul 14, 2024

waitinfuture removed the stale label Jul 14, 2024

github-actions bot added the stale label Aug 4, 2024

waitinfuture removed the stale label Aug 5, 2024

github-actions bot added the stale label Aug 26, 2024

RexXiong removed the stale label Aug 27, 2024

github-actions bot added the stale label Sep 16, 2024

RexXiong removed the stale label Sep 19, 2024

github-actions bot added the stale label Oct 9, 2024

RexXiong removed the stale label Oct 9, 2024

github-actions bot added the stale label Oct 30, 2024

github-actions bot closed this Nov 10, 2024

waitinfuture reopened this Nov 10, 2024

github-actions bot removed the stale label Nov 11, 2024

