
[Enhancement] reduce mem alloc failed because unfair memory sharing (backport #50686) #50752

Merged — 2 commits merged into branch-3.2 from mergify/bp/branch-3.2/pr-50686 on Sep 6, 2024

Conversation

mergify[bot] (Contributor) commented Sep 5, 2024

Why I'm doing:

One problem with TPC-H SF100 Q21 is that the performance is very unstable. If you look at the logs, you can see a lot of mem alloc failed cases inside scan node 8:

python run-bench.py --endpoint 127.0.0.1 --port 41003 --user root --mysql --db hive.zz_tpch_sf100_hive_parquet_lz4 --sql-file tpch --times 3 --warmup 1 --include Q21

running Q21 on mysql
>>> warmup begin
--> 20130. avg = 20130.
<<< warmup end
--> 12773. avg = 12773.
--> 12996. avg = 12884.
--> 17706. avg = 14491.

If you analyze the logs, you can observe that most of the memory is allocated by nodes 0, 2, and 5, which leaves node 8 with almost no way to get memory. (My machine has 64G of memory and scan_mem_ratio=0.3, so at most 15G of memory is allocated to the connector scan nodes.)

  • 0: mem = 6206458665(6G), io task = 24
  • 2: mem = 5338887537(5G), io task = 20
  • 5: mem = 4438315236(4G), io task = 17

The root cause of this problem is the connector mem arbitrator implementation. The original intent of this implementation was to share memory evenly between the nodes. The algorithm is as follows:

int64_t ConnectorScanOperatorMemShareArbitrator::update_chunk_source_mem_bytes(int64_t old_value, int64_t new_value) {
    // Fold this node's change into the global total of chunk-source memory.
    int64_t diff = new_value - old_value;
    int64_t total = total_chunk_source_mem_bytes.fetch_add(diff) + diff;
    if (new_value == 0) return 0;
    if (total <= 0) return scan_mem_limit;
    // Grant a share of scan_mem_limit proportional to this node's fraction
    // of the total reported memory.
    return scan_mem_limit * (new_value * 1.0 / std::max(total, new_value));
}

Each node starts with update_chunk_source_mem_bytes(0, mem). The problem is that the very first node to register sees total == new_value, so its ratio is 1.0 and it is granted the entire scan_mem_limit, leaving the following nodes with almost no memory to claim.
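
To make the failure mode concrete, here is a minimal standalone sketch (a hypothetical reduction of the arbitrator to file-scope globals and a free function, with made-up node sizes, not the actual class) showing that the first caller is granted the full limit and the N-th caller only limit / N:

#include <algorithm>
#include <atomic>
#include <cstdint>
#include <cstdio>

// Stand-in globals for the arbitrator state above (hypothetical values).
static std::atomic<int64_t> total_chunk_source_mem_bytes{0};
static const int64_t scan_mem_limit = 15LL << 30; // ~15G, as in the numbers above

static int64_t update_chunk_source_mem_bytes(int64_t old_value, int64_t new_value) {
    int64_t diff = new_value - old_value;
    int64_t total = total_chunk_source_mem_bytes.fetch_add(diff) + diff;
    if (new_value == 0) return 0;
    if (total <= 0) return scan_mem_limit;
    return scan_mem_limit * (new_value * 1.0 / std::max(total, new_value));
}

int main() {
    // Four scan nodes register one after another, each reporting 1G.
    // The first caller sees total == new_value, so its ratio is 1.0 and it
    // receives all of scan_mem_limit; later callers only get the remainder
    // ratio (limit/2, limit/3, limit/4, ...).
    for (int node = 0; node < 4; ++node) {
        int64_t share = update_chunk_source_mem_bytes(0, 1LL << 30);
        std::printf("node %d: share = %lld bytes\n", node, (long long)share);
    }
    return 0;
}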

What I'm doing:

So the fix here is to:

  • count exactly how many connector scan nodes exist on the FE,
  • preset each node to initially use 256MB of memory,
  • then have each node adjust from update_chunk_source_mem_bytes(256M, mem), as sketched below.
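
A minimal sketch of the fixed flow, reusing the globals from the sketch above; the function and parameter names here are illustrative assumptions, not the actual patch:

// Hypothetical: the FE counts the connector scan nodes in the plan and ships
// that count to the BE; the BE pre-charges 256MB per node so the first caller
// of update_chunk_source_mem_bytes() already sees a non-zero total and can no
// longer claim the whole scan_mem_limit for itself.
static const int64_t kInitialMemBytesPerNode = 256LL << 20; // 256MB preset

void preset_connector_scan_nodes(int num_connector_scan_nodes) {
    total_chunk_source_mem_bytes.fetch_add(
            (int64_t)num_connector_scan_nodes * kInitialMemBytesPerNode);
}

// Each node then adjusts from the preset rather than from zero:
//   int64_t share = update_chunk_source_mem_bytes(kInitialMemBytesPerNode, mem);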

Benchmark:

After the modification, the execution time stabilizes at around 9s:

running Q21 on mysql
>>> warmup begin
--> 9795. avg = 9795.
<<< warmup end
--> 9445. avg = 9445.
--> 10411. avg = 9928.
--> 9068. avg = 9641.

If more memory is given, the time can be reduced even further:

running Q21 on mysql
>>> warmup begin
--> 6679. avg = 6679.
<<< warmup end
--> 7262. avg = 7262.
--> 7958. avg = 7610.
--> 6572. avg = 7264.

For comparison, here is the result with the adaptive mechanism turned off; the time difference is not much:

>>> warmup begin
--> 9423. avg = 9423.
<<< warmup end
--> 7516. avg = 7516.
--> 7288. avg = 7402.
--> 7293. avg = 7365.

Fixes #issue

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.3
    • 3.2
    • 3.1
    • 3.0
    • 2.5

This is an automatic backport of pull request #50686 done by [Mergify](https://mergify.com).

…50686)

Signed-off-by: yanz <[email protected]>
(cherry picked from commit 15a6518)

# Conflicts:
#	be/src/connector/connector.h
#	be/src/exec/pipeline/fragment_executor.cpp
#	gensrc/thrift/InternalService.thrift
@mergify mergify bot added the conflicts label Sep 5, 2024
mergify bot (Contributor Author) commented Sep 5, 2024

Cherry-pick of 15a6518 has failed:

On branch mergify/bp/branch-3.2/pr-50686
Your branch is up to date with 'origin/branch-3.2'.

You are currently cherry-picking commit 15a6518fae.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   be/src/exec/pipeline/query_context.cpp
	modified:   be/src/exec/pipeline/query_context.h
	modified:   be/src/exec/pipeline/scan/connector_scan_operator.cpp
	modified:   be/src/exec/pipeline/scan/connector_scan_operator.h
	modified:   be/src/exec/workgroup/work_group.cpp
	modified:   be/src/runtime/exec_env.cpp
	modified:   fe/fe-core/src/main/java/com/starrocks/qe/scheduler/dag/JobSpec.java

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   be/src/connector/connector.h
	both modified:   be/src/exec/pipeline/fragment_executor.cpp
	both modified:   gensrc/thrift/InternalService.thrift

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

mergify bot (Contributor Author) commented Sep 5, 2024

@mergify[bot]: Backport conflict, please resolve the conflict and resubmit the PR

@mergify mergify bot deleted the mergify/bp/branch-3.2/pr-50686 branch September 5, 2024 08:28
@dirtysalt dirtysalt restored the mergify/bp/branch-3.2/pr-50686 branch September 5, 2024 09:47
@dirtysalt dirtysalt reopened this Sep 5, 2024
@wanpengfei-git wanpengfei-git enabled auto-merge (squash) September 5, 2024 09:48
Signed-off-by: yanz <[email protected]>
sonarcloud bot commented Sep 5, 2024

@dirtysalt dirtysalt changed the title [Enhancement] reduce mem alloc failed because unfair memory sharing (backport #50686) [BugFix] reduce mem alloc failed because unfair memory sharing (backport #50686) Sep 5, 2024
@dirtysalt dirtysalt changed the title [BugFix] reduce mem alloc failed because unfair memory sharing (backport #50686) [Enhancement] reduce mem alloc failed because unfair memory sharing (backport #50686) Sep 5, 2024
@wanpengfei-git wanpengfei-git merged commit 13d42f4 into branch-3.2 Sep 6, 2024
73 of 74 checks passed
@wanpengfei-git wanpengfei-git deleted the mergify/bp/branch-3.2/pr-50686 branch September 6, 2024 08:00