Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tablet throttler: read and use MySQL host metrics #16904

Open
wants to merge 43 commits into
base: main
Choose a base branch
from

Conversation

shlomi-noach
Copy link
Contributor

@shlomi-noach shlomi-noach commented Oct 7, 2024

Description

In #16850 we added the necessary proto/gRPC calls to get hsot metrics from the MySQL daemon. #16850 is part of the v21 release.

This PR complements #16850 by having the tablet throttler read and use said MySQL host metrics. These will henceforth be available to the user by the names:

  • mysqld-loadavg (default threshold: 1.0)
  • mysqld-datadir-used-ratio (range 0.0 == empty, to 1.0 == full ; default threshold: 0.98)

The MySQL host metrics stand out in that the throttler only reads them sporadically. It would be too expensive to call a sequence of two gRPC (one to tablet manager, then to mysql daemon) multiple times a second. Certainly for a metric such as mysqld-datadir-used-ratio, but also for mysqld-loadavg. The throttler only actually reads those metrics once per 10sec, and caches the results for subsequent checks.

Doc PR: vitessio/website#1878

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
…MacOS, and MacOS is beign used in local dev env unit and endtoend testing

Signed-off-by: Shlomi Noach <[email protected]>
@shlomi-noach shlomi-noach added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: Throttler labels Oct 7, 2024
@shlomi-noach shlomi-noach removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels Oct 7, 2024
Copy link

codecov bot commented Oct 7, 2024

Codecov Report

Attention: Patch coverage is 41.21622% with 87 lines in your changes missing coverage. Please review.

Project coverage is 67.38%. Comparing base (ffaeba0) to head (8adc7eb).
Report is 33 commits behind head on main.

Files with missing lines Patch % Lines
...t/tabletserver/throttle/base/self_metric_mysqld.go 32.85% 47 Missing ⚠️
.../tabletserver/throttle/base/self_metric_loadavg.go 0.00% 16 Missing ⚠️
go/vt/vttablet/tabletserver/throttle/throttler.go 48.27% 15 Missing ⚠️
go/osutil/loadavg_linux.go 66.66% 2 Missing ⚠️
...etserver/throttle/base/self_metric_custom_query.go 0.00% 2 Missing ⚠️
...blet/tabletserver/throttle/base/self_metric_lag.go 0.00% 2 Missing ⚠️
...erver/throttle/base/self_metric_threads_running.go 0.00% 2 Missing ⚠️
.../tabletserver/throttle/base/self_metric_default.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16904      +/-   ##
==========================================
- Coverage   67.43%   67.38%   -0.06%     
==========================================
  Files        1571     1573       +2     
  Lines      252249   252980     +731     
==========================================
+ Hits       170098   170464     +366     
- Misses      82151    82516     +365     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@vitess-bot vitess-bot modified the milestones: v21.0.0, v22.0.0 Oct 8, 2024
@shlomi-noach shlomi-noach changed the title Tablet throttler: read and use MySQL host metrics (for v22) Tablet throttler: read and use MySQL host metrics Oct 8, 2024
Copy link
Contributor

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had some nits and questions. Can we also please add an e2e test (sorry if I'm missing it). The testing feels a bit light (and is reflected in the codecov results) and this seems like something we'd want to test through the RPC chain (to a fake and/or real mysqlctld).

Copy link
Contributor

@mattlord mattlord Oct 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do these changes relate to the PR? Is it that we're getting rid of the old loadavg OS metric now that we have the new mysql-loadavg metric and so we wanted to test another MySQL process metric here?

Is there any reason that we're not using any of the new MySQL host metrics in the e2e tests? Or am I missing something here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do these changes relate to the PR?

See #16904 (comment): it's about tests failing on MacOS machines.

Is there any reason that we're not using any of the new MySQL host metrics in the e2e tests? Or am I missing something here?

You're not missing anything here. Let me see what we can do for e2e tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added endtoend tests that verify the full MysqlHostMetrics path don to mysqld, and checking response values. We check the existence of both new mysqld-* values, and we specifically check that mysqld-datadir-used-ratio is non-zero (as can be expected in local hosts and in CI). We do not check the value of mysqld-loadavg as that could be arbitrarily low and even appear to be zero, as well as because we do not implement it in MacOS.

go/test/endtoend/throttler/util.go Outdated Show resolved Hide resolved
go/vt/vttablet/tabletserver/throttle/base/self_metric.go Outdated Show resolved Hide resolved
go/vt/mysqlctl/mysqld.go Outdated Show resolved Hide resolved
}

func (m *MysqldLoadAvgSelfMetric) DefaultScope() Scope {
return SelfScope
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having the type and variable exported makes this interface feel off to me. Since anyone can access the metric and this internal (?) variable as things are now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yet no one holds an instance of MysqldLoadAvgSelfMetric. The only instance in existence is:

var _ SelfMetric = registerSelfMetric(&MysqldLoadAvgSelfMetric{})

So yes, anyone can create their own new instance, but that cannot affect the existing instances.

}
assert.EqualValues(t, 10, val)
cancel()
// There can be a race condition where the rate limiter still emits one final tick after the context is cancelled.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, we should be able to drain the ticker's channel on cancel if it's not empty. That would be ideal, unless I'm missing something here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The race condition is a bit more nuanced:

		go func() {
			defer mysqlHostMetricsRateLimiter.Store(nil)
			defer rateLimiter.Stop()
			<-ctx.Done()
		}()

so when you hit cancel(), the rate limiter gets unblocked from <-ctx.Done(), but then it's still possible to get a new ticker event before defer rateLimiter.Stop() gets issued.

Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
…nce of mysqld-* throttler values, and for nonzero value of mysqld-datadir-used-ratio

Signed-off-by: Shlomi Noach <[email protected]>
…rAggregatedShardMysqldDatadirUsedRatio

Signed-off-by: Shlomi Noach <[email protected]>
@shlomi-noach
Copy link
Contributor Author

Docs: vitessio/website#1878

Copy link
Contributor

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! ❤️

go/textutil/strings.go Outdated Show resolved Hide resolved
@shlomi-noach
Copy link
Contributor Author

Now supporting loadavg values on darwin. I've added go/osutil package that knows how to handle both systems.

This is used in both tablet's loadavg as well as in Mysqld host metrics.

go/osutil/os.go Outdated Show resolved Hide resolved
Signed-off-by: Shlomi Noach <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Throttler Type: Enhancement Logical improvement (somewhere between a bug and feature)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants