
[Enhancement] reduce mem alloc failed because unfair memory sharing (backport #50686) #50752

Merged — 2 commits merged into branch-3.2 from mergify/bp/branch-3.2/pr-50686 on Sep 6, 2024

Conversation

mergify[bot] (Contributor) commented Sep 5, 2024

Why I'm doing:

One problem with TPC-H SF100 Q21 is that the performance is very unstable. If you look at the logs, you can see a lot of mem alloc failed cases inside scan node 8:

python run-bench.py --endpoint 127.0.0.1 --port 41003 --user root --mysql --db hive.zz_tpch_sf100_hive_parquet_lz4 --sql-file tpch --times 3 --warmup 1 --include Q21

running Q21 on mysql
>>> warmup begin
--> 20130. avg = 20130.
<<< warmup end
--> 12773. avg = 12773.
--> 12996. avg = 12884.
--> 17706. avg = 14491.

If you analyze the logs, you can observe that most of the memory is allocated by nodes 0, 2, and 5, which leaves node 8 with almost no way to get memory. (My machine has 64G of memory and scan_mem_ratio=0.3, so at most 15G of memory is allocated to the connector scan nodes.)

  • 0: mem = 6206458665(6G), io task = 24
  • 2: mem = 5338887537(5G), io task = 20
  • 5: mem = 4438315236(4G), io task = 17

The root cause of this problem is the connector mem arbitrator implementation. The original intent of this implementation was to share memory evenly between the nodes. The algorithm is as follows:

int64_t ConnectorScanOperatorMemShareArbitrator::update_chunk_source_mem_bytes(int64_t old_value, int64_t new_value) {
    // Fold this node's change into the global total of chunk-source memory.
    int64_t diff = new_value - old_value;
    int64_t total = total_chunk_source_mem_bytes.fetch_add(diff) + diff;
    if (new_value == 0) return 0;
    if (total <= 0) return scan_mem_limit;
    // Grant a share of scan_mem_limit proportional to this node's fraction
    // of the total reported memory.
    return scan_mem_limit * (new_value * 1.0 / std::max(total, new_value));
}

Each node starts with update_chunk_source_mem_bytes(0, mem). The problem is that the very first node to register sees total == new_value, so its ratio is 1.0 and it is granted the entire scan_mem_limit, leaving the following nodes with almost no memory to claim.
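
To make the failure mode concrete, here is a minimal standalone sketch (a hypothetical reduction of the arbitrator to file-scope globals and a free function, with made-up node sizes, not the actual class) showing that the first caller is granted the full limit and the N-th caller only limit / N:

#include <algorithm>
#include <atomic>
#include <cstdint>
#include <cstdio>

// Stand-in globals for the arbitrator state above (hypothetical values).
static std::atomic<int64_t> total_chunk_source_mem_bytes{0};
static const int64_t scan_mem_limit = 15LL << 30; // ~15G, as in the numbers above

static int64_t update_chunk_source_mem_bytes(int64_t old_value, int64_t new_value) {
    int64_t diff = new_value - old_value;
    int64_t total = total_chunk_source_mem_bytes.fetch_add(diff) + diff;
    if (new_value == 0) return 0;
    if (total <= 0) return scan_mem_limit;
    return scan_mem_limit * (new_value * 1.0 / std::max(total, new_value));
}

int main() {
    // Four scan nodes register one after another, each reporting 1G.
    // The first caller sees total == new_value, so its ratio is 1.0 and it
    // receives all of scan_mem_limit; later callers only get the remainder
    // ratio (limit/2, limit/3, limit/4, ...).
    for (int node = 0; node < 4; ++node) {
        int64_t share = update_chunk_source_mem_bytes(0, 1LL << 30);
        std::printf("node %d: share = %lld bytes\n", node, (long long)share);
    }
    return 0;
}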

What I'm doing:

So the fix here is to:

  • count exactly how many connector scan nodes exist on the FE,
  • preset each node to initially use 256MB of memory,
  • then have each node adjust from update_chunk_source_mem_bytes(256M, mem), as sketched below.
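
A minimal sketch of the fixed flow, reusing the globals from the sketch above; the function and parameter names here are illustrative assumptions, not the actual patch:

// Hypothetical: the FE counts the connector scan nodes in the plan and ships
// that count to the BE; the BE pre-charges 256MB per node so the first caller
// of update_chunk_source_mem_bytes() already sees a non-zero total and can no
// longer claim the whole scan_mem_limit for itself.
static const int64_t kInitialMemBytesPerNode = 256LL << 20; // 256MB preset

void preset_connector_scan_nodes(int num_connector_scan_nodes) {
    total_chunk_source_mem_bytes.fetch_add(
            (int64_t)num_connector_scan_nodes * kInitialMemBytesPerNode);
}

// Each node then adjusts from the preset rather than from zero:
//   int64_t share = update_chunk_source_mem_bytes(kInitialMemBytesPerNode, mem);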

Benchmark:

After the modification, the execution time stabilizes at around 9s:

running Q21 on mysql
>>> warmup begin
--> 9795. avg = 9795.
<<< warmup end
--> 9445. avg = 9445.
--> 10411. avg = 9928.
--> 9068. avg = 9641.

If more memory is given, the time can be reduced even further:

running Q21 on mysql
>>> warmup begin
--> 6679. avg = 6679.
<<< warmup end
--> 7262. avg = 7262.
--> 7958. avg = 7610.
--> 6572. avg = 7264.

For comparison, here is the result with the adaptive mechanism turned off; the time difference is not much:

>>> warmup begin
--> 9423. avg = 9423.
<<< warmup end
--> 7516. avg = 7516.
--> 7288. avg = 7402.
--> 7293. avg = 7365.

Fixes #issue

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.3
    • 3.2
    • 3.1
    • 3.0
    • 2.5

This is an automatic backport of pull request #50686 done by [Mergify](https://mergify.com).

…50686)

Signed-off-by: yanz <[email protected]>
(cherry picked from commit 15a6518)

# Conflicts:
#	be/src/connector/connector.h
#	be/src/exec/pipeline/fragment_executor.cpp
#	gensrc/thrift/InternalService.thrift
@mergify mergify bot added the conflicts label Sep 5, 2024
mergify bot (Contributor Author) commented Sep 5, 2024

Cherry-pick of 15a6518 has failed:

On branch mergify/bp/branch-3.2/pr-50686
Your branch is up to date with 'origin/branch-3.2'.

You are currently cherry-picking commit 15a6518fae.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   be/src/exec/pipeline/query_context.cpp
	modified:   be/src/exec/pipeline/query_context.h
	modified:   be/src/exec/pipeline/scan/connector_scan_operator.cpp
	modified:   be/src/exec/pipeline/scan/connector_scan_operator.h
	modified:   be/src/exec/workgroup/work_group.cpp
	modified:   be/src/runtime/exec_env.cpp
	modified:   fe/fe-core/src/main/java/com/starrocks/qe/scheduler/dag/JobSpec.java

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   be/src/connector/connector.h
	both modified:   be/src/exec/pipeline/fragment_executor.cpp
	both modified:   gensrc/thrift/InternalService.thrift

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

mergify bot (Contributor Author) commented Sep 5, 2024

@mergify[bot]: Backport conflict, please resolve the conflict and resubmit the PR

@mergify mergify bot deleted the mergify/bp/branch-3.2/pr-50686 branch September 5, 2024 08:28
@dirtysalt dirtysalt restored the mergify/bp/branch-3.2/pr-50686 branch September 5, 2024 09:47
@dirtysalt dirtysalt reopened this Sep 5, 2024
@wanpengfei-git wanpengfei-git enabled auto-merge (squash) September 5, 2024 09:48
Signed-off-by: yanz <[email protected]>
sonarcloud bot commented Sep 5, 2024

@dirtysalt dirtysalt changed the title [Enhancement] reduce mem alloc failed because unfair memory sharing (backport #50686) [BugFix] reduce mem alloc failed because unfair memory sharing (backport #50686) Sep 5, 2024
@dirtysalt dirtysalt changed the title [BugFix] reduce mem alloc failed because unfair memory sharing (backport #50686) [Enhancement] reduce mem alloc failed because unfair memory sharing (backport #50686) Sep 5, 2024
@wanpengfei-git wanpengfei-git merged commit 13d42f4 into branch-3.2 Sep 6, 2024
73 of 74 checks passed
@wanpengfei-git wanpengfei-git deleted the mergify/bp/branch-3.2/pr-50686 branch September 6, 2024 08:00