
Resource Isolation Feature Checkpoint 4: Extend cache warmup to allow multiple resource isolation groups and multiple replicas #12

Merged

Conversation


@ctbrennan ctbrennan commented Sep 19, 2024

Why I'm doing:

This is part of the implementation of the Resource Isolation for Shared Data Mode feature.

What I'm doing:

Makes syntax and functionality changes to CACHE SELECT statements to allow caching data for specified resource isolation groups and more than one replica.

They can now be of the form:

```sql
CACHE SELECT <column_name> [, ...]
FROM [<catalog_name>.][<db_name>.]<table_name> [WHERE <boolean_expression>]
[PROPERTIES("verbose"="true", "resource_isolation_groups"="<GROUP_ID_1>,...,<GROUP_ID_N>", "num_replicas"="<NUM_REPLICAS>")]
```

If no resource_isolation_groups property is specified, the resource isolation group of the current frontend is used.
If num_replicas is not specified, it defaults to 1.
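For illustration, a concrete statement following the syntax above might look like this (the table, column, and group names here are made up, not taken from the PR):

```sql
CACHE SELECT user_id, event_ts
FROM my_catalog.analytics.events
WHERE event_ts >= '2024-09-01'
PROPERTIES("verbose"="true", "resource_isolation_groups"="group_a,group_b", "num_replicas"="2");
```

This would warm the cache for the selected columns on two replicas in each of the two named resource isolation groups.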

Fixes #issue https://jira.pinadmin.com/browse/RTA-6269

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.3
    • 3.2
    • 3.1
    • 3.0
    • 2.5

@ctbrennan ctbrennan marked this pull request as ready for review September 19, 2024 19:20

@zhenxiao zhenxiao left a comment


looks good
a few small things

.gitignore (review comment resolved, outdated)
```java
List<Long> cnIdsOrderedByPreference = mapper.computeNodesForTablet(
        tabletId, props.numReplicasDesired, resourceIsolationGroupId);
if (cnIdsOrderedByPreference.size() < props.numReplicasDesired) {
    throw new DdlException(String.format("Requesting more replicas than we have available CN" +
```


Shall we add a TODO: if requesting more replicas than available, shall we trigger a replica load task from S3 to the compute nodes?

@ctbrennan (Author) replied:

Not totally sure what you mean by "replica load task"; that basically describes what is already being executed here. Are you saying we should schedule it for later, if/when we have more compute nodes available? I don't think that's necessarily a good approach.

My thinking is this: if we're requesting more replicas than we have compute nodes, we can't fulfill the intent of the statement, so I'm throwing an exception. The other option would be to make a best-effort attempt and select all of the available compute nodes.
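The two policies being weighed in this thread can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: the class, method names, and the use of IllegalStateException in place of StarRocks' DdlException are all assumptions for the sake of a self-contained example.

```java
import java.util.List;

class ReplicaSelection {
    // Fail-fast policy (what the PR does): refuse the request outright when
    // fewer compute nodes are available than replicas were requested.
    static List<Long> selectStrict(List<Long> cnIdsOrderedByPreference, int numReplicasDesired) {
        if (cnIdsOrderedByPreference.size() < numReplicasDesired) {
            // Stand-in for the PR's DdlException.
            throw new IllegalStateException(String.format(
                    "Requesting %d replicas but only %d CNs available",
                    numReplicasDesired, cnIdsOrderedByPreference.size()));
        }
        return cnIdsOrderedByPreference.subList(0, numReplicasDesired);
    }

    // Best-effort alternative: cap the replica count at the number of
    // available compute nodes instead of failing.
    static List<Long> selectBestEffort(List<Long> cnIdsOrderedByPreference, int numReplicasDesired) {
        int n = Math.min(cnIdsOrderedByPreference.size(), numReplicasDesired);
        return cnIdsOrderedByPreference.subList(0, n);
    }

    public static void main(String[] args) {
        List<Long> cns = List.of(101L, 102L, 103L);
        System.out.println(selectStrict(cns, 2));      // [101, 102]
        System.out.println(selectBestEffort(cns, 5));  // [101, 102, 103]
    }
}
```

The fail-fast variant surfaces the shortfall to the user immediately; the best-effort variant silently degrades, which can mask capacity problems.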


@zhenxiao zhenxiao left a comment


looks good
one minor suggestion

```java
// really shouldn't be getting ComputeNode references for un-matching resource isolation groups or unhealthy
// ComputeNodes. Instead of changing a bunch of code which uses the WorkerProvider in a specific way, this way
// limits scope to only change behavior when the user of the WorkerProvider sets this very specific option.
public void setAllowGetAnyWorker(boolean allowGetAnyWorker) {
```


Shall we use more intuitive naming, e.g. getWorkerFromUnmatchingIsolationGroup?

@ctbrennan (Author) replied:

This specific function isn't getting any worker; it's signaling the intent to allow getting any worker.

…tion-3.3_cacheselect

Need to cherry-pick from upstream for a BE build fix related to an error seen when loading: /opt/starrocks/be/lib/starrocks_be: error while loading shared libraries: libbfd-2.38-system.so: cannot open shared object file: No such file or directory
@ctbrennan ctbrennan merged commit bec060d into pinterest-integration-3.3 Oct 11, 2024
3 checks passed