
Resource Isolation Feature Checkpoint 4: Extend cache warmup to allow multiple resource isolation groups and multiple replicas #12

Merged

Conversation


@ctbrennan ctbrennan commented Sep 19, 2024

Why I'm doing:

This is part of the implementation of the Resource Isolation for Shared Data Mode feature.

What I'm doing:

Makes syntax and functionality changes to CACHE SELECT statements to allow caching data for specified resource isolation groups and more than one replica.

They can now be of the form:

```sql
CACHE SELECT <column_name> [, ...]
FROM [<catalog_name>.][<db_name>.]<table_name> [WHERE <boolean_expression>]
[PROPERTIES("verbose"="true", "resource_isolation_groups"="<GROUP_ID_1>,...,<GROUP_ID_N>", "num_replicas"="<NUM_REPLICAS>")]
```

If no resource_isolation_groups property is specified, the resource isolation group of the current frontend is used.
If num_replicas is not specified, it defaults to 1.
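For illustration, a concrete statement following the syntax above might look like this (the table, column, and group names here are made up, not taken from the PR):

```sql
CACHE SELECT user_id, event_ts
FROM my_catalog.analytics.events
WHERE event_ts >= '2024-09-01'
PROPERTIES("verbose"="true", "resource_isolation_groups"="group_a,group_b", "num_replicas"="2");
```

This would warm the cache for the selected columns on two replicas in each of the two named resource isolation groups.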

Fixes #issue https://jira.pinadmin.com/browse/RTA-6269

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.3
    • 3.2
    • 3.1
    • 3.0
    • 2.5

@ctbrennan ctbrennan marked this pull request as ready for review September 19, 2024 19:20

@zhenxiao zhenxiao left a comment


looks good
a few small things

.gitignore (review comment resolved, outdated)
```java
List<Long> cnIdsOrderedByPreference = mapper.computeNodesForTablet(
        tabletId, props.numReplicasDesired, resourceIsolationGroupId);
if (cnIdsOrderedByPreference.size() < props.numReplicasDesired) {
    throw new DdlException(String.format("Requesting more replicas than we have available CN" +
```


Shall we add a TODO: if requesting more replicas than available, shall we trigger a replica load task from S3 to the compute nodes?

@ctbrennan (Author) replied:

Not totally sure what you mean by "replica load task"; that basically describes what is already being executed here. Are you saying we should schedule it for later, if/when we have more compute nodes available? I don't think that's necessarily a good approach.

My thinking is this: if we're requesting more replicas than we have compute nodes, we can't fulfill the intent of the statement, so I'm throwing an exception. The other option would be to make a best-effort attempt and select all of the available compute nodes.
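The two policies being weighed in this thread can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: the class, method names, and the use of IllegalStateException in place of StarRocks' DdlException are all assumptions for the sake of a self-contained example.

```java
import java.util.List;

class ReplicaSelection {
    // Fail-fast policy (what the PR does): refuse the request outright when
    // fewer compute nodes are available than replicas were requested.
    static List<Long> selectStrict(List<Long> cnIdsOrderedByPreference, int numReplicasDesired) {
        if (cnIdsOrderedByPreference.size() < numReplicasDesired) {
            // Stand-in for the PR's DdlException.
            throw new IllegalStateException(String.format(
                    "Requesting %d replicas but only %d CNs available",
                    numReplicasDesired, cnIdsOrderedByPreference.size()));
        }
        return cnIdsOrderedByPreference.subList(0, numReplicasDesired);
    }

    // Best-effort alternative: cap the replica count at the number of
    // available compute nodes instead of failing.
    static List<Long> selectBestEffort(List<Long> cnIdsOrderedByPreference, int numReplicasDesired) {
        int n = Math.min(cnIdsOrderedByPreference.size(), numReplicasDesired);
        return cnIdsOrderedByPreference.subList(0, n);
    }

    public static void main(String[] args) {
        List<Long> cns = List.of(101L, 102L, 103L);
        System.out.println(selectStrict(cns, 2));      // [101, 102]
        System.out.println(selectBestEffort(cns, 5));  // [101, 102, 103]
    }
}
```

The fail-fast variant surfaces the shortfall to the user immediately; the best-effort variant silently degrades, which can mask capacity problems.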


@zhenxiao zhenxiao left a comment


looks good
one minor suggestion

```java
// really shouldn't be getting ComputeNode references for un-matching resource isolation groups or unhealthy
// ComputeNodes. Instead of changing a bunch of code which uses the WorkerProvider in a specific way, this way
// limits scope to only change behavior when the user of the WorkerProvider sets this very specific option.
public void setAllowGetAnyWorker(boolean allowGetAnyWorker) {
```


Shall we use more intuitive naming, e.g. getWorkerFromUnmatchingIsolationGroup?

@ctbrennan (Author) replied:

This specific function isn't getting any worker; it's signaling the intent to allow getting any worker.

…tion-3.3_cacheselect

Need to cherry-pick from upstream for a BE build fix related to an error seen when loading: /opt/starrocks/be/lib/starrocks_be: error while loading shared libraries: libbfd-2.38-system.so: cannot open shared object file: No such file or directory
@ctbrennan ctbrennan merged commit bec060d into pinterest-integration-3.3 Oct 11, 2024
3 checks passed