Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sqlstats: add new MaybeFlushWithDrainer method to persistedSqlStats #140673

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

kyle-a-wong
Copy link
Contributor

@kyle-a-wong kyle-a-wong commented Feb 7, 2025

Adds a new SSDrainer interface that PersistedSQLStats uses to flush sql stats.

Instead of calling ConsumeStats on sslocal.SQLStats, persistedsqlstats will use a provided SSDrainer to either DrainStats or Reset.

The ConsumeStats method on sslocal.SQLStats has been removed and its logic has been redistributed to persistedsqlstats, as that was the only consumer of the method. Since all the logic now lives in persistedsqlstats, the flush related functions can be called directly instead of relying on nested callbacks.

Epic: CRDB-45771
Release note: None

Note: This is a stack committed, only the last commit in the chain needs to be reviewed

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link

blathers-crl bot commented Feb 9, 2025

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@kyle-a-wong kyle-a-wong force-pushed the sql_stats_drain branch 8 times, most recently from ce294dd to 9217e5c Compare February 10, 2025 21:51
@kyle-a-wong kyle-a-wong changed the title sqlstats: introduce new SSDrainer interface sqlstats: decouple local SQLStats from persistedSqlStats Feb 10, 2025
@kyle-a-wong kyle-a-wong marked this pull request as ready for review February 10, 2025 22:41
@kyle-a-wong kyle-a-wong requested review from a team as code owners February 10, 2025 22:41
@kyle-a-wong kyle-a-wong requested review from angles-n-daemons, dhartunian, xinhaoz and alyshanjahani-crl and removed request for a team February 10, 2025 22:41
@kyle-a-wong kyle-a-wong force-pushed the sql_stats_drain branch 2 times, most recently from fa1c2cf to 5264c73 Compare February 12, 2025 23:21
@kyle-a-wong kyle-a-wong changed the title sqlstats: decouple local SQLStats from persistedSqlStats sqlstats: add new MaybeFlushWithDrainer method to persistedSqlStats Feb 12, 2025
@kyle-a-wong kyle-a-wong force-pushed the sql_stats_drain branch 2 times, most recently from e884d43 to eabd876 Compare February 13, 2025 19:00
Copy link
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 41 of 41 files at r1, 3 of 33 files at r2, 10 of 36 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @alyshanjahani-crl, @angles-n-daemons, and @xinhaoz)


pkg/sql/sqlstats/persistedsqlstats/flush.go line 114 at r3 (raw file):

	}

	bucket := s.getBucket()

It's a bit odd that s provides the bucket and ssDrainer provides the data and we assume that they match. Anything we can do about that? Or maybe there's a connection I'm missing here.


pkg/sql/sqlstats/sslocal/sslocal_provider.go line 149 at r3 (raw file):

			log.Warningf(ctx, "failed to dump stats to log, %s", err.Error())
		}
		containerStmtStats, containerTxnStats := container.DrainStats(ctx)

Is there a way we could say that DrainStats gives the caller ownership of these slices? would be nice to to have to copy them over. I assume if we can do that, they're no longer being modified anyway.

Copy link
Contributor Author

@kyle-a-wong kyle-a-wong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @alyshanjahani-crl, @angles-n-daemons, @dhartunian, and @xinhaoz)


pkg/sql/sqlstats/persistedsqlstats/flush.go line 114 at r3 (raw file):

Previously, dhartunian (David Hartunian) wrote…

It's a bit odd that s provides the bucket and ssDrainer provides the data and we assume that they match. Anything we can do about that? Or maybe there's a connection I'm missing here.

This is how it was previously, I just created a struct to encapsulate this data. That said, the appstatspb.CollectedStatementStatistics and appstatspb.CollectedTransactionStatistics do have aggregated_ts and aggregation_interval fields on them already, so theoretically we could populate them when we call ssmemstorage.Container.DrainStats().

One thing I'm unclear about is the disconnect between collected statistics in the in memory containers and aggregation / bucketing . While the CollectedStats related types have these fields, they are only ever set when flushing via persistedstatsSQL, when being queried from crdb_internal.cluster_statement_statistics, or requested from one of the gRPC statement APIs. I'm not sure if I'm missing something somewhere, but it seems to me like the determination of which bucket a stat belongs to is tightly coupled to sql stats flush. If flushing is disabled, it seems like queried statistics will always be considered part of the most recent bucket (I don't know if this is correct or not). To me, it would make more sense to push the aggregation data all the way down to the write path of the ssmemstorage.Container, but I haven't spent time evaluation how big of an effort this would be. I don't think it is trivial.


pkg/sql/sqlstats/sslocal/sslocal_provider.go line 149 at r3 (raw file):

Previously, dhartunian (David Hartunian) wrote…

Is there a way we could say that DrainStats gives the caller ownership of these slices? would be nice to to have to copy them over. I assume if we can do that, they're no longer being modified anyway.

As discussed in person:

  1. container.DrainStats uses the sqlstats iterators to create new copies of the data and return before deleting the stats from its container. This would be a larger endeavor to re-implement, so I'm gonna punt on that for now.
  2. I'll look into the the performance of the operation below where we are flattening the stats using append and spread to see if there are any extra allocations that we may want to avoid.

Copy link
Contributor Author

@kyle-a-wong kyle-a-wong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @alyshanjahani-crl, @angles-n-daemons, @dhartunian, and @xinhaoz)


pkg/sql/sqlstats/sslocal/sslocal_provider.go line 149 at r3 (raw file):

Previously, kyle-a-wong (Kyle Wong) wrote…

As discussed in person:

  1. container.DrainStats uses the sqlstats iterators to create new copies of the data and return before deleting the stats from its container. This would be a larger endeavor to re-implement, so I'm gonna punt on that for now.
  2. I'll look into the the performance of the operation below where we are flattening the stats using append and spread to see if there are any extra allocations that we may want to avoid.

I added a new benchmark that benchmarks DrainStats. Locally, I added back in the ConsumeStats function updated the benchmark to call ConsumeStats ,with noop callbacks, and got the following results:

name                             old time/op    new time/op    delta
SqlStatsDrain/drainsql-10-12        469ns ±11%     456ns ±12%     ~     (p=0.382 n=8+9)
SqlStatsDrain/drainsql-50-12       2.12µs ±20%    2.24µs ±13%     ~     (p=0.280 n=10+10)
SqlStatsDrain/drainsql-100-12      4.21µs ±65%    3.44µs ±44%     ~     (p=0.079 n=10+9)
SqlStatsDrain/drainsql-10000-12     356µs ±19%     506µs ±65%     ~     (p=0.113 n=9+10)
SqlStatsDrain/drainsql-1000-12     23.9µs ±11%    28.3µs ±42%  +18.32%  (p=0.011 n=8+9)

name                             old alloc/op   new alloc/op   delta
SqlStatsDrain/drainsql-100-12      17.6kB ± 0%    17.4kB ± 0%   -1.02%  (p=0.000 n=10+10)
SqlStatsDrain/drainsql-10000-12    1.37MB ± 0%    1.36MB ± 0%   -0.96%  (p=0.000 n=10+8)
SqlStatsDrain/drainsql-10-12       1.68kB ± 0%    1.67kB ± 0%   -0.95%  (p=0.000 n=10+10)
SqlStatsDrain/drainsql-50-12       8.74kB ± 0%    8.65kB ± 0%   -0.95%  (p=0.000 n=10+10)
SqlStatsDrain/drainsql-1000-12      180kB ± 0%     178kB ± 0%   -0.91%  (p=0.000 n=10+10)

name                             old allocs/op  new allocs/op  delta
SqlStatsDrain/drainsql-10-12         3.00 ± 0%      3.00 ± 0%     ~     (all equal)
SqlStatsDrain/drainsql-50-12         12.0 ± 0%      12.0 ± 0%     ~     (all equal)
SqlStatsDrain/drainsql-100-12        22.0 ± 0%      22.0 ± 0%     ~     (all equal)
SqlStatsDrain/drainsql-1000-12        203 ± 0%       203 ± 0%     ~     (all equal)
SqlStatsDrain/drainsql-10000-12     1.50k ± 0%     1.50k ± 0%     ~     (all equal)

The results show that the new implementation of the DrainStats, that returns the list, results in ~1% more memory consumption than the ConsumeStats. My assumption that this increase in alloc/op is a result of a new slice being created on each append. Do you think this is worth going back to the ConsumeStats callback style of processing drained stats?

@kyle-a-wong kyle-a-wong force-pushed the sql_stats_drain branch 3 times, most recently from 25c5fbe to eb5dff8 Compare February 20, 2025 20:46
@@ -1046,3 +1044,8 @@ func getStatusJSONProto(
url := fmt.Sprintf("/_status/%s?start=%d&end=%d", path, startTime.Unix(), endTime.Unix())
return serverutils.GetJSONProto(ts, url, response)
}

func SqlStatsFlushTestServerLocal(ts serverutils.TestServerInterface) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The function name makes the code a little harder to understand, can we consider renaming or just removing this wrapper?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this function is even necessary. What's the value of hiding what the flush is doing? I'd rather know that MaybeFlush is being called in my test without having to hop around.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed. See other comment for an explanation of why I had originally made these helpers.

@@ -114,3 +114,19 @@ type RecordedTxnStats struct {
SessionData *sessiondata.SessionData
TxnErr error
}

// SSDrainer is the interface for draining or resetting sql stats.
type SSDrainer interface {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I just double check my understaning for the motivation of these changes (coming from the 1 pager):

What is going to be implementing SSDrainer aside from SQLStats? From my understanding, as part of the new job, we would do the grpc fanout (in some yet to be determined batched method) that will be calling DrainStats on each sslocal.SQLStats container. Where else do we need a drainer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There will be a new ssRemote.SQLStats that will also implement this and will be responsible for the fanout. This is what the job will provide to the MaybeFlushWithDrainer method to do the cluster wide drain / persist

Copy link
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 36 files at r3, 2 of 3 files at r4, 4 of 6 files at r5, 2 of 4 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @alyshanjahani-crl, @angles-n-daemons, @kyle-a-wong, and @xinhaoz)


pkg/sql/sqlstats/persistedsqlstats/flush.go line 382 at r6 (raw file):

}

func generateStatementStatsQuery(statementCount int) string {

what do you think of just inlining these? unless you're going to write extensive unit tests for them, I'd just inline since args and query need to match up and will be easier to see.


pkg/sql/sqlstats/test_utils.go line 53 at r5 (raw file):

	SkipZoneConfigBootstrap bool

	// FlushInterceptor intercepts persistsqlstats flush operation.

nit: persistedsqlstats


pkg/sql/sqlstats/test_utils.go line 20 at r6 (raw file):

	aggregatedTs time.Time,
	stmtStats []*appstatspb.CollectedStatementStatistics,
	txnStats []*appstatspb.CollectedTransactionStatistics)

nit: comma + line break before ). our formatter doesn't do that?


pkg/sql/sqlstats/persistedsqlstats/provider.go line 34 at r6 (raw file):

)

const statementStatisticUpsertQuery = `

why do these live so far away from where they're used? can we move next to usage?


pkg/sql/sqlstats/persistedsqlstats/sqlstatstestutil/testutils.go line 298 at r5 (raw file):

}

func FlushTestServerLocal(ts serverutils.TestServerInterface) {

@kyle-a-wong what does this helper provide? I think for the long line that it saves it also obscures something quite simple. I'd just leave the original call here unless there's a compelling reason I'm missing.

@@ -1046,3 +1044,8 @@ func getStatusJSONProto(
url := fmt.Sprintf("/_status/%s?start=%d&end=%d", path, startTime.Unix(), endTime.Unix())
return serverutils.GetJSONProto(ts, url, response)
}

func SqlStatsFlushTestServerLocal(ts serverutils.TestServerInterface) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this function is even necessary. What's the value of hiding what the flush is doing? I'd rather know that MaybeFlush is being called in my test without having to hop around.

Adds a new SSDrainer interface that PersistedSQLStats uses to flush
sql stats.

Instead of calling ConsumeStats on sslocal.SQLStats, persistedsqlstats
will use a provided SSDrainer to either DrainStats or Reset.

The ConsumeStats method on sslocal.SQLStats has been removed and its
logic has been redistributed to persistedsqlstats, as that was the
only consumer of the method. Since all the logic now lives in
persistedsqlstats, the flush related functions can be called
directly instead of relying on nested callbacks.

In addition to the new interface, this commit updates the flush
logic to upsert multiple rows in a single upsert statement,
determined by `sql.stats.flush.batch_size`.

Epic: CRDB-45771
Release note: None
Copy link
Contributor Author

@kyle-a-wong kyle-a-wong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @alyshanjahani-crl, @angles-n-daemons, @dhartunian, and @xinhaoz)


pkg/sql/sqlstats/test_utils.go line 53 at r5 (raw file):

Previously, dhartunian (David Hartunian) wrote…

nit: persistedsqlstats

done


pkg/sql/sqlstats/test_utils.go line 20 at r6 (raw file):

Previously, dhartunian (David Hartunian) wrote…

nit: comma + line break before ). our formatter doesn't do that?

I guess not. Done


pkg/sql/sqlstats/persistedsqlstats/flush.go line 382 at r6 (raw file):

Previously, dhartunian (David Hartunian) wrote…

what do you think of just inlining these? unless you're going to write extensive unit tests for them, I'd just inline since args and query need to match up and will be easier to see.

Done


pkg/sql/sqlstats/persistedsqlstats/provider.go line 34 at r6 (raw file):

Previously, dhartunian (David Hartunian) wrote…

why do these live so far away from where they're used? can we move next to usage?

Done


pkg/sql/sqlstats/persistedsqlstats/sqlstatstestutil/testutils.go line 298 at r5 (raw file):

Previously, dhartunian (David Hartunian) wrote…

@kyle-a-wong what does this helper provide? I think for the long line that it saves it also obscures something quite simple. I'd just leave the original call here unless there's a compelling reason I'm missing.

Originally, this PR contained a larger refactor that included getting rid of the embded sslocal.SQLStats from persistedSQLStats and changed the MaybeFlush signature to require the SSDrainer to be provided. Doing that required all updates to all tests t hat called MaybeFlush to get this localSqlStats object from the server and it felt cleaner to me to only have to do that in one place instead of every place we call MaybeFlush.

I've since went back on that idea for this PR and opted for the new MaybeFlushWithDrainer method instead, but i never removed these helps / the refactor. If necessary I can add this back in later, but for now i've removed this helper

@@ -114,3 +114,19 @@ type RecordedTxnStats struct {
SessionData *sessiondata.SessionData
TxnErr error
}

// SSDrainer is the interface for draining or resetting sql stats.
type SSDrainer interface {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There will be a new ssRemote.SQLStats that will also implement this and will be responsible for the fanout. This is what the job will provide to the MaybeFlushWithDrainer method to do the cluster wide drain / persist

@@ -1046,3 +1044,8 @@ func getStatusJSONProto(
url := fmt.Sprintf("/_status/%s?start=%d&end=%d", path, startTime.Unix(), endTime.Unix())
return serverutils.GetJSONProto(ts, url, response)
}

func SqlStatsFlushTestServerLocal(ts serverutils.TestServerInterface) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed. See other comment for an explanation of why I had originally made these helpers.

Copy link
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work. :lgtm: Can you link to #139143 which I assigned to you prior to merging?

Reviewed 6 of 13 files at r7, 1 of 1 files at r8, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @alyshanjahani-crl, @angles-n-daemons, and @xinhaoz)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants