rework on poller auto scaler #1411


Open
wants to merge 28 commits into base: master from autoscaler-rework

Conversation

shijiesheng
Member

@shijiesheng shijiesheng commented Dec 10, 2024

Detailed Description

Improve performance of poller auto scaler by using more accurate scaling signals and several implementation changes.

Changes

  • A new WorkerOptions field, AutoScalerOptions, is introduced (see the sketch after this list).
  • Several existing WorkerOptions are deprecated and become no-ops.
  • Reads a new signal (poller wait time) to drive scaling.
  • Allows kill-switching the poller auto scaler from the server.
  • New implementation that reacts more quickly to traffic changes.
  • Removes the no-longer-used autoscaler package entirely (the original implementation was over-complicated).
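For OSS users, the opt-in described in the rollout plan further down would look roughly like this. This is a minimal sketch: the field names inside AutoScalerOptions (Enabled, PollerMinCount, PollerMaxCount, Cooldown) are assumptions for illustration, not the confirmed final API.

package main

import (
	"time"

	"go.uber.org/cadence/worker"
)

func main() {
	// Hypothetical field names; check the released WorkerOptions for the exact shape.
	opts := worker.Options{
		AutoScalerOptions: worker.AutoScalerOptions{
			Enabled:        true,            // opt-in: leaving this false keeps the fixed poller count
			PollerMinCount: 2,               // lower bound on concurrent pollers
			PollerMaxCount: 16,              // upper bound on concurrent pollers
			Cooldown:       2 * time.Minute, // minimum gap between scaling decisions
		},
	}
	_ = opts
}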

Impact Analysis

  • Backward Compatibility: No. Existing autoscaling will be stopped, but this should not have a big impact since the feature was never rolled out in production. OSS users should follow the instructions in the rollout plan below.
  • Forward Compatibility: Yes; introduces the new AutoScalerOptions.

Testing Plan

  • Unit Tests: Yes
  • Persistence Tests: Not related
  • Integration Tests: No
  • Compatibility Tests: No, because the autoscaler is a feature that was never rolled out in production.

Rollout Plan

  • What is the rollout plan?
    For Uber services, standard client release steps.
    For OSS users, turn off the autoscaler feature before upgrading the client.

  • Does the order of deployment matter? No

  • Is it safe to rollback? Does the order of rollback matter? Yes

  • Is there a kill switch to mitigate the impact immediately? Yes, the new autoscaler feature is an opt-in feature.


codecov bot commented Dec 21, 2024

Codecov Report

Attention: Patch coverage is 94.41860% with 12 lines in your changes missing coverage. Please review.

Project coverage is 82.72%. Comparing base (6e22a27) to head (6b028ad).

Files with missing lines | Patch % | Lines
internal/internal_task_handlers.go | 20.00% | 7 Missing and 1 partial ⚠️
internal/worker/concurrency_auto_scaler.go | 98.08% | 2 Missing and 1 partial ⚠️
internal/internal_task_pollers.go | 50.00% | 1 Missing ⚠️

Files with missing lines | Coverage | Δ
internal/internal_worker_base.go | 86.08% <100.00%> | (+3.45%) ⬆️
internal/internal_task_pollers.go | 82.76% <50.00%> | (ø)
internal/worker/concurrency_auto_scaler.go | 98.08% <98.08%> | (ø)
internal/internal_task_handlers.go | 81.08% <20.00%> | (-0.54%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
@Groxx
Member

Groxx commented Jan 2, 2025

to stick it in here too: overall looks pretty good. simpler and the overall goal (and why it achieves it) is clearer too. seems like just minor tweaks (many optional) and it's probably good to go

Contributor

@3vilhamster 3vilhamster left a comment

Overall looks good, but I left some nits

@@ -301,7 +308,7 @@ func (bw *baseWorker) pollTask() {
var err error
var task interface{}

if bw.pollerAutoScaler != nil {
if bw.concurrencyAutoScaler != nil {
if pErr := bw.concurrency.PollerPermit.Acquire(bw.limiterContext); pErr == nil {
Contributor

nit: this looks like a leaking abstraction. This should be handled inside concurrencyAutoScaler.
I suggest moving all concurrencyAutoScaler != nil checks inside the methods where they are required.
This code should be simpler: just call methods on the autoscaler; if it is nil, do nothing.
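A sketch of the suggested shape, assuming nil-receiver-safe methods (method names are illustrative, not necessarily the ones in this PR):

// The caller in baseWorker just calls the method; a nil autoscaler does nothing.
type concurrencyAutoScaler struct {
	// fields elided
}

func (c *concurrencyAutoScaler) Start() {
	if c == nil {
		return // autoscaler not configured: no-op
	}
	// start the background ticker loop
}

func (c *concurrencyAutoScaler) Stop() {
	if c == nil {
		return
	}
	// signal shutdown and wait for the loop to exit
}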

Member Author

makes sense

Member

we are guarding calls against bw.concurrency based on the nilness of bw.concurrencyAutoScaler, which indicates that these two should be abstracted behind a single interface to avoid additional complexity in this file

Member Author

I've removed this check in all places. Regarding the comment "these two should be abstracted behind a single interface to avoid additional complexity in this file": I still think these are two separate entities. The client still needs concurrency whether the autoscaler is enabled or not.

return
case <-ticker.Chan():
c.logEvent(autoScalerEventMetrics)
c.lock.Lock()
Contributor

nit: push lock/unlock into updatePollerPermit; then you can use defer inside the function and ensure that the unlock happens even if something panics.

Member Author

right, it's simpler

Member Author

I later found a race condition. I actually need to lock around both logEvent and updatePollerPermit. The way I avoid deadlocks is to only lock/unlock in exported methods; locking in helper methods would easily lead to deadlocks.
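A minimal sketch of that locking convention, with illustrative names (lock once at the entry point; the *Locked helpers assume the lock is already held):

package worker

import "sync"

type scaler struct {
	lock        sync.Mutex
	pollerQuota int
}

// Tick is the entry point called from the ticker loop; it owns the lock.
func (s *scaler) Tick() {
	s.lock.Lock()
	defer s.lock.Unlock()
	s.logEventLocked("tick")
	s.updatePollerPermitLocked(s.pollerQuota)
}

// *Locked helpers must only be called with s.lock held, so they never
// re-acquire it and can never deadlock against the entry point.
func (s *scaler) logEventLocked(event string)    { /* emit the event log + metrics */ }
func (s *scaler) updatePollerPermitLocked(n int) { /* resize the poller permit */ }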

c.wg.Add(1)

go func() {
defer c.wg.Done()
Contributor

@3vilhamster 3vilhamster Jan 8, 2025

nit: any call that starts a goroutine should have a panic handler.
If a bug exists, it will crash the worker process, significantly impacting customer service.
This is optional functionality that should be safe to break; worst case, it won't update concurrency.
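A minimal sketch of the requested guard (type and function names are illustrative):

package worker

import (
	"log"
	"sync"
)

type autoScaler struct {
	wg sync.WaitGroup
}

// start runs the scaling loop in a goroutine and recovers any panic so a bug
// in this optional feature cannot crash the whole worker process.
func (c *autoScaler) start(loop func()) {
	c.wg.Add(1)
	go func() {
		defer c.wg.Done()
		defer func() {
			if r := recover(); r != nil {
				log.Printf("concurrency auto scaler panic recovered: %v", r)
			}
		}()
		loop() // worst case a panic here only stops concurrency updates
	}()
}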

Member Author

good catch

@shijiesheng shijiesheng force-pushed the autoscaler-rework branch 2 times, most recently from a9d3781 to 52dd229 on January 17, 2025 at 17:34
@@ -153,6 +165,20 @@ type (
}
)

func (t *workflowTask) getAutoConfigHint() *s.AutoConfigHint {
if t.task != nil {
return t.task.AutoConfigHint
Member

do we need to check whether t.task.AutoConfigHint is nil?

Member Author

good catch. I changed the order so it makes more sense.
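Presumably the reordered accessor looks roughly like this (illustrative; the exact final code may differ, and `s` is the existing alias for the shared thrift types):

func (t *workflowTask) getAutoConfigHint() *s.AutoConfigHint {
	if t.task == nil {
		return nil // no task, no hint
	}
	return t.task.AutoConfigHint // the hint itself may still be nil; callers must handle that
}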

return t.autoConfigHint
default:
return nil
}
Member

instead of this switch case (which is not future-proof), we can cast the task to an autoConfigHintAwareTask interface and get the auto config hint

Member Author

I've removed this switch in favor of autoConfigHintAwareTask
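A sketch of that approach; the interface name comes from the comment above, while the helper function is illustrative (again assuming the existing `s` alias for the shared thrift types):

// Any task type that carries a hint implements this small interface, so the
// caller does a single type assertion instead of a per-type switch.
type autoConfigHintAwareTask interface {
	getAutoConfigHint() *s.AutoConfigHint
}

func autoConfigHintOf(task interface{}) *s.AutoConfigHint {
	if aware, ok := task.(autoConfigHintAwareTask); ok {
		return aware.getAutoConfigHint()
	}
	return nil // task types without hints simply don't implement the interface
}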

Comment on lines 38 to 39
lowerPollerWaitTime = 16 * time.Millisecond
upperPollerWaitTime = 256 * time.Millisecond
Member

it looks like we would want to iterate on these to adjust sensitivity. consider exposing these to worker config

Member Author

@shijiesheng shijiesheng Jun 17, 2025

The poller wait time is an invariant; the user doesn't need to tune it. The sensitivity (time-to-react) is actually controlled by the Cooldown, which is already a parameter.
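For context, the cooldown gate being referred to amounts to something like this (type and field names assumed for illustration):

package worker

import "time"

type cooldownGate struct {
	cooldown   time.Duration
	lastUpdate time.Time
}

// allowUpdate skips scaling decisions until Cooldown has elapsed since the
// previous change; this is the knob that controls how quickly the scaler reacts.
func (g *cooldownGate) allowUpdate(now time.Time) bool {
	if now.Sub(g.lastUpdate) < g.cooldown {
		return false // the "skip update, cooldown" event in the tests below
	}
	g.lastUpdate = now
	return true
}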

},
},
{
"idl pollers waiting for tasks",
Member

nit: typo idle. same in other cases below

Member Author

fixed

name string
pollAutoConfigHint []*shared.AutoConfigHint
expectedEvents []eventLog
}{
Member

would be nice to add a case where it scales up and down a few times

Member Author

added

@shijiesheng
Member Author

coverage failed due to deprecation changes

Member

@Groxx Groxx left a comment

dropping notes for now, while reading tests carefully 👍

overall looks pretty good I think - fairly easy to follow, behavior looks good (e.g. up to 4x growth when "instant", 0.5x shrink when slow, one scale change every 10 seconds sounds reasonable), everything's pretty close.
so just a small pile of minor stuff, some nits some not.
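A rough illustration of the bounds described in the comment above (explicitly not the actual formula in this PR): however the wait-time signal maps to a new target, each decision is capped at roughly 4x growth or a halving, then clamped to the configured poller bounds.

func clampQuota(current, proposed, minCount, maxCount int) int {
	if proposed > current*4 {
		proposed = current * 4 // at most 4x growth per decision
	}
	if proposed < current/2 {
		proposed = current / 2 // at most a halving per decision
	}
	if proposed < minCount {
		proposed = minCount
	}
	if proposed > maxCount {
		proposed = maxCount
	}
	return proposed
}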

autoScalerEventStart autoScalerEvent = "auto-scaler-start"
autoScalerEventStop autoScalerEvent = "auto-scaler-stop"
autoScalerEventLogMsg string = "concurrency auto scaler event"
testTimeFormat string = "15:04:05"
Member

Suggested change
testTimeFormat string = "15:04:05"

shutdownChan: make(chan struct{}),
concurrency: input.Concurrency,
cooldown: input.Cooldown,
log: input.Logger.Named(metrics.ConcurrencyAutoScalerScope),
Member

I think this might be our first use of Named 🤔

since this isn't a concept in log/slog I kinda feel like we might drop it eventually, but for now I think it makes sense 👍

Member Author

Just curious what the alternative would be. I'll change it once we want to remove it.

Member

@Groxx Groxx Jun 20, 2025

the main alternative is probably either:

  • logger.WithGroup: https://pkg.go.dev/log/slog#Logger.WithGroup
    • i.e. pushing everything from "this logger" into a sub-field
  • or logger.With("logger", "concurrency-auto-scaler")
    • setting a top-level field ("logger") and leaving everything else flat / possibly conflicting in meaning.

which is not a clear win in either direction. flatter is much easier to query for shared fields, structured is much easier to be unambiguous and is often more efficient to index, etc.

I bring it up mostly because I think a move to log/slog is inevitable, and we'll have to decide [something] at that point. It'll probably just be .With("logger", "concurrency-auto-scaler") tho, since I don't think we'll care much about dot.separated.names at that point (and none exist now).
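In log/slog terms, the two alternatives look roughly like this:

package main

import (
	"log/slog"
	"os"
)

func main() {
	base := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// Option 1: push everything from this component into a sub-group.
	grouped := base.WithGroup("concurrency-auto-scaler")
	grouped.Info("poller quota updated", "poller_quota", 42)
	// poller_quota ends up nested under the "concurrency-auto-scaler" group.

	// Option 2: a flat top-level field identifying the logger.
	flat := base.With("logger", "concurrency-auto-scaler")
	flat.Info("poller quota updated", "poller_quota", 42)
	// poller_quota stays a flat top-level field alongside "logger".
}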

eventType: autoScalerEvent(event.ContextMap()["event"].(string)),
enabled: event.ContextMap()["enabled"].(bool),
pollerQuota: event.ContextMap()["poller_quota"].(int64),
time: event.ContextMap()["time"].(time.Time).Format(testTimeFormat),
Member

Suggested change
time: event.ContextMap()["time"].(time.Time).Format(testTimeFormat),
time: event.Time.Format(testTimeFormat),

with a logger.WithClock, I think this handles the "logs are hard to identify uniquely" thing that seems to be the intent here?

Member Author

Yes, I just didn't find a good way to assert event logs and found this easier (or quicker) with a test entity.

Member

yea, the "make a simplified struct for comparison" is a good choice I think, this was just for the time-context-map-field.

Member

Huh. zap.WithClock is unusable, given its definition 🤔

still seems like an odd addition here tbh

Member

@Groxx Groxx Jun 25, 2025

a possible option that appears to work: sort the logs by event.Time, and just make sure the other values occur in the same order. you can get rid of the time field entirely from eventLog then.

(afaict they are always in order in these tests, but an explicit sort is probably a good idea)

there is technically room for some out-of-order-ness to occur by doing that, but I don't think these tests really run that risk. and ensuring unique values in all logs would take care of it too.
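A sketch of that, assuming `entries` holds the observed zap entries (zaptest/observer) and `eventLog` drops its time field as suggested; the surrounding test harness (`tt`, testify's `assert`, the `sort` import) is assumed:

// Sort observed entries by timestamp, then compare just the interesting fields in order.
sort.Slice(entries, func(i, j int) bool {
	return entries[i].Time.Before(entries[j].Time)
})
var got []eventLog
for _, e := range entries {
	got = append(got, eventLog{
		eventType:   autoScalerEvent(e.ContextMap()["event"].(string)),
		enabled:     e.ContextMap()["enabled"].(bool),
		pollerQuota: e.ContextMap()["poller_quota"].(int64),
	})
}
assert.Equal(t, tt.expectedEvents, got)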

Comment on lines +129 to +140
"busy pollers, scale up to maximum",
[]*shared.AutoConfigHint{
{common.PtrOf(true), common.PtrOf(int64(0))}, // <- tick, in cool down
{common.PtrOf(true), common.PtrOf(int64(0))}, // <- tick, scale down to minimum
},
[]eventLog{
{autoScalerEventStart, false, 100, "00:00:00"},
{autoScalerEventEnable, true, 100, "00:00:00"},
{autoScalerEventPollerSkipUpdateCooldown, true, 100, "00:00:01"},
{autoScalerEventPollerScaleUp, true, 200, "00:00:02"},
{autoScalerEventStop, true, 200, "00:00:02"},
},
Member

might be easier to follow the actual behavior of this one with a less-than-1/2-maximum set of values, e.g. start with 10 rather than 100 -> it won't scale to maximum, it'll scale to 42.

kinda similar for others below, e.g. pollers, scale up and down multiple times becomes:

{autoScalerEventStart, false, 10, "00:00:00"},
{autoScalerEventEnable, true, 10, "00:00:00"},
{autoScalerEventPollerSkipUpdateCooldown, true, 10, "00:00:01"},
{autoScalerEventPollerScaleUp, true, 42, "00:00:02"},
{autoScalerEventPollerSkipUpdateCooldown, true, 42, "00:00:03"},
{autoScalerEventPollerScaleDown, true, 25, "00:00:04"},
{autoScalerEventPollerSkipUpdateCooldown, true, 25, "00:00:05"},
{autoScalerEventPollerScaleUp, true, 104, "00:00:06"},
{autoScalerEventPollerSkipUpdateCooldown, true, 104, "00:00:07"},
{autoScalerEventPollerScaleDown, true, 63, "00:00:08"},
{autoScalerEventStop, true, 63, "00:00:08"},

which seems a bit more informative than "to max, down, back to max, back to same down value"

"busy pollers, scale up to maximum",
[]*shared.AutoConfigHint{
{common.PtrOf(true), common.PtrOf(int64(0))}, // <- tick, in cool down
{common.PtrOf(true), common.PtrOf(int64(0))}, // <- tick, scale down to minimum
Member

Suggested change
{common.PtrOf(true), common.PtrOf(int64(0))}, // <- tick, scale down to minimum
{common.PtrOf(true), common.PtrOf(int64(0))}, // <- tick, scale up significantly

"idle pollers waiting for tasks",
[]*shared.AutoConfigHint{
{common.PtrOf(true), common.PtrOf(int64(1000))}, // <- tick, in cool down
{common.PtrOf(true), common.PtrOf(int64(1000))}, // <- tick, scale up
Member

Suggested change
{common.PtrOf(true), common.PtrOf(int64(1000))}, // <- tick, scale up
{common.PtrOf(true), common.PtrOf(int64(1000))}, // <- tick, scale down

"idle pollers, scale down to minimum",
[]*shared.AutoConfigHint{
{common.PtrOf(true), common.PtrOf(int64(60000))}, // <- tick, in cool down
{common.PtrOf(true), common.PtrOf(int64(60000))}, // <- tick, scale up
Member

Suggested change
{common.PtrOf(true), common.PtrOf(int64(60000))}, // <- tick, scale up
{common.PtrOf(true), common.PtrOf(int64(60000))}, // <- tick, scale down

"idle pollers but disabled scaling",
[]*shared.AutoConfigHint{
{common.PtrOf(false), common.PtrOf(int64(100))}, // <- tick, in cool down
{common.PtrOf(false), common.PtrOf(int64(100))}, // <- tick, scale up
Member

@Groxx Groxx Jun 25, 2025

Suggested change
{common.PtrOf(false), common.PtrOf(int64(100))}, // <- tick, scale up
{common.PtrOf(false), common.PtrOf(int64(100))}, // <- tick, no update

also this one isn't really "idle", that'd be ~60k / something larger than the scale-down value, yea?

"idle pollers but disabled scaling at a later time",
[]*shared.AutoConfigHint{
{common.PtrOf(true), common.PtrOf(int64(1000))}, // <- tick, in cool down
{common.PtrOf(true), common.PtrOf(int64(1000))}, // <- tick, scale up
Member

Suggested change
{common.PtrOf(true), common.PtrOf(int64(1000))}, // <- tick, scale up
{common.PtrOf(true), common.PtrOf(int64(1000))}, // <- tick, scale down

Comment on lines +49 to +55
PollerPermit: NewResizablePermit(100),
TaskPermit: NewResizablePermit(1000),
},
Cooldown: 2 * testTickTime,
Tick: testTickTime,
PollerMaxCount: 200,
PollerMinCount: 50,
Member

somewhat odd that PollerPermit starts at a different value than PollerMinCount, since I don't believe that'll ever be the case in practice?

it does seem harmless though, and kinda simplifies the "scale down" tests... just not sure that's worth breaking the normal pattern to achieve.

Member

@Groxx Groxx left a comment

Just minor stuff as optional cleanups, I think - looks good to go to me 👍

Labels: None yet
Projects: None yet
5 participants