rework on poller auto scaler #1411


Open
wants to merge 28 commits into base: master from autoscaler-rework

Conversation

shijiesheng
Member

@shijiesheng shijiesheng commented Dec 10, 2024

Detailed Description

Improve performance of poller auto scaler by using more accurate scaling signals and several implementation changes.

Changes

  • A new WorkerOptions field, AutoScalerOptions, is introduced (see the sketch after this list).
  • Several existing WorkerOptions are deprecated and become no-ops.
  • Reads a new signal (poller wait time) to drive scaling.
  • Allows kill-switching the poller auto scaler from the server.
  • New implementation that reacts more quickly to traffic changes.
  • Removes the no-longer-used autoscaler package entirely (the original implementation was over-complicated).
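For OSS users, the opt-in described in the rollout plan further down would look roughly like this. This is a minimal sketch: the field names inside AutoScalerOptions (Enabled, PollerMinCount, PollerMaxCount, Cooldown) are assumptions for illustration, not the confirmed final API.

package main

import (
	"time"

	"go.uber.org/cadence/worker"
)

func main() {
	// Hypothetical field names; check the released WorkerOptions for the exact shape.
	opts := worker.Options{
		AutoScalerOptions: worker.AutoScalerOptions{
			Enabled:        true,            // opt-in: leaving this false keeps the fixed poller count
			PollerMinCount: 2,               // lower bound on concurrent pollers
			PollerMaxCount: 16,              // upper bound on concurrent pollers
			Cooldown:       2 * time.Minute, // minimum gap between scaling decisions
		},
	}
	_ = opts
}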

Impact Analysis

  • Backward Compatibility: No. Existing autoscaling will be stopped, but this should not have a big impact since the feature was never rolled out in production. OSS users should follow the instructions in the rollout plan below.
  • Forward Compatibility: Yes; introduces the new AutoScalerOptions.

Testing Plan

  • Unit Tests: Yes
  • Persistence Tests: Not related
  • Integration Tests: No
  • Compatibility Tests: No, because the autoscaler is a feature that was never rolled out in production.

Rollout Plan

  • What is the rollout plan?
    For Uber services, standard client release steps.
    For OSS users, turn off the autoscaler feature before upgrading the client.

  • Does the order of deployment matter? No

  • Is it safe to rollback? Does the order of rollback matter? Yes

  • Is there a kill switch to mitigate the impact immediately? Yes, the new autoscaler feature is an opt-in feature.


codecov bot commented Dec 21, 2024

Codecov Report

Attention: Patch coverage is 94.41860% with 12 lines in your changes missing coverage. Please review.

Project coverage is 82.72%. Comparing base (6e22a27) to head (6b028ad).

Files with missing lines | Patch % | Lines
internal/internal_task_handlers.go | 20.00% | 7 Missing and 1 partial ⚠️
internal/worker/concurrency_auto_scaler.go | 98.08% | 2 Missing and 1 partial ⚠️
internal/internal_task_pollers.go | 50.00% | 1 Missing ⚠️

Files with missing lines | Coverage | Δ
internal/internal_worker_base.go | 86.08% <100.00%> | (+3.45%) ⬆️
internal/internal_task_pollers.go | 82.76% <50.00%> | (ø)
internal/worker/concurrency_auto_scaler.go | 98.08% <98.08%> | (ø)
internal/internal_task_handlers.go | 81.08% <20.00%> | (-0.54%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
@Groxx
Member

Groxx commented Jan 2, 2025

to stick it in here too: overall looks pretty good. simpler and the overall goal (and why it achieves it) is clearer too. seems like just minor tweaks (many optional) and it's probably good to go

Contributor

@3vilhamster 3vilhamster left a comment

Overall looks good, but I left some nits

@@ -301,7 +308,7 @@ func (bw *baseWorker) pollTask() {
var err error
var task interface{}

if bw.pollerAutoScaler != nil {
if bw.concurrencyAutoScaler != nil {
if pErr := bw.concurrency.PollerPermit.Acquire(bw.limiterContext); pErr == nil {
Contributor

nit: this looks like a leaking abstraction. This should be handled inside concurrencyAutoScaler.
I suggest moving all concurrencyAutoScaler != nil checks inside the methods where they are required.
This code should be simpler: just call methods on the autoscaler; if it is nil, do nothing.
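A sketch of the suggested shape, assuming nil-receiver-safe methods (method names are illustrative, not necessarily the ones in this PR):

// The caller in baseWorker just calls the method; a nil autoscaler does nothing.
type concurrencyAutoScaler struct {
	// fields elided
}

func (c *concurrencyAutoScaler) Start() {
	if c == nil {
		return // autoscaler not configured: no-op
	}
	// start the background ticker loop
}

func (c *concurrencyAutoScaler) Stop() {
	if c == nil {
		return
	}
	// signal shutdown and wait for the loop to exit
}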

Member Author

makes sense

Member

we are guarding calls against bw.concurrency based on the nilness of bw.concurrencyAutoScaler, which indicates that these two should be abstracted behind a single interface to avoid additional complexity in this file

Member Author

I've removed this check in all places. Regarding the comment "these two should be abstracted behind a single interface to avoid additional complexity in this file": I still think these are two separate entities. The client still needs concurrency whether the autoscaler is enabled or not.

return
case <-ticker.Chan():
c.logEvent(autoScalerEventMetrics)
c.lock.Lock()
Contributor

nit: push lock/unlock into updatePollerPermit; then you can use defer inside the function and ensure that the unlock happens even if something panics.

Member Author

right, it's simpler

Member Author

I later found a race condition. I actually need to lock around both logEvent and updatePollerPermit. The way I avoid deadlocks is to only lock/unlock in exported methods; locking in helper methods would easily lead to deadlocks.
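A minimal sketch of that locking convention, with illustrative names (lock once at the entry point; the *Locked helpers assume the lock is already held):

package worker

import "sync"

type scaler struct {
	lock        sync.Mutex
	pollerQuota int
}

// Tick is the entry point called from the ticker loop; it owns the lock.
func (s *scaler) Tick() {
	s.lock.Lock()
	defer s.lock.Unlock()
	s.logEventLocked("tick")
	s.updatePollerPermitLocked(s.pollerQuota)
}

// *Locked helpers must only be called with s.lock held, so they never
// re-acquire it and can never deadlock against the entry point.
func (s *scaler) logEventLocked(event string)    { /* emit the event log + metrics */ }
func (s *scaler) updatePollerPermitLocked(n int) { /* resize the poller permit */ }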

c.wg.Add(1)

go func() {
defer c.wg.Done()
Contributor

@3vilhamster 3vilhamster Jan 8, 2025

nit: any call that starts a goroutine should have a panic handler.
If a bug exists, it will crash the worker process, significantly impacting customer service.
This is optional functionality that should be safe to break; worst case, it won't update concurrency.
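A minimal sketch of the requested guard (type and function names are illustrative):

package worker

import (
	"log"
	"sync"
)

type autoScaler struct {
	wg sync.WaitGroup
}

// start runs the scaling loop in a goroutine and recovers any panic so a bug
// in this optional feature cannot crash the whole worker process.
func (c *autoScaler) start(loop func()) {
	c.wg.Add(1)
	go func() {
		defer c.wg.Done()
		defer func() {
			if r := recover(); r != nil {
				log.Printf("concurrency auto scaler panic recovered: %v", r)
			}
		}()
		loop() // worst case a panic here only stops concurrency updates
	}()
}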

Member Author

good catch

@shijiesheng shijiesheng force-pushed the autoscaler-rework branch 2 times, most recently from a9d3781 to 52dd229 on January 17, 2025 at 17:34
@@ -153,6 +165,20 @@ type (
}
)

func (t *workflowTask) getAutoConfigHint() *s.AutoConfigHint {
if t.task != nil {
return t.task.AutoConfigHint
Member

do we need to check whether t.task.AutoConfigHint is nil?

Member Author

good catch. I changed the order so it makes more sense.
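Presumably the reordered accessor looks roughly like this (illustrative; the exact final code may differ, and `s` is the existing alias for the shared thrift types):

func (t *workflowTask) getAutoConfigHint() *s.AutoConfigHint {
	if t.task == nil {
		return nil // no task, no hint
	}
	return t.task.AutoConfigHint // the hint itself may still be nil; callers must handle that
}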

return t.autoConfigHint
default:
return nil
}
Member

instead of this switch case (which is not future-proof), we can cast the task to an autoConfigHintAwareTask interface and get the auto config hint

Member Author

I've removed this switch in favor of autoConfigHintAwareTask
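A sketch of that approach; the interface name comes from the comment above, while the helper function is illustrative (again assuming the existing `s` alias for the shared thrift types):

// Any task type that carries a hint implements this small interface, so the
// caller does a single type assertion instead of a per-type switch.
type autoConfigHintAwareTask interface {
	getAutoConfigHint() *s.AutoConfigHint
}

func autoConfigHintOf(task interface{}) *s.AutoConfigHint {
	if aware, ok := task.(autoConfigHintAwareTask); ok {
		return aware.getAutoConfigHint()
	}
	return nil // task types without hints simply don't implement the interface
}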

Comment on lines 38 to 39
lowerPollerWaitTime = 16 * time.Millisecond
upperPollerWaitTime = 256 * time.Millisecond
Member

it looks like we would want to iterate on these to adjust sensitivity. consider exposing these to worker config

Member Author

@shijiesheng shijiesheng Jun 17, 2025

The poller wait time is an invariant; the user doesn't need to tune it. The sensitivity (time-to-react) is actually controlled by the Cooldown, which is already a parameter.
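For context, the cooldown gate being referred to amounts to something like this (type and field names assumed for illustration):

package worker

import "time"

type cooldownGate struct {
	cooldown   time.Duration
	lastUpdate time.Time
}

// allowUpdate skips scaling decisions until Cooldown has elapsed since the
// previous change; this is the knob that controls how quickly the scaler reacts.
func (g *cooldownGate) allowUpdate(now time.Time) bool {
	if now.Sub(g.lastUpdate) < g.cooldown {
		return false // the "skip update, cooldown" event in the tests below
	}
	g.lastUpdate = now
	return true
}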

},
},
{
"idl pollers waiting for tasks",
Member

nit: typo idle. same in other cases below

Member Author

fixed

name string
pollAutoConfigHint []*shared.AutoConfigHint
expectedEvents []eventLog
}{
Member

would be nice to add a case where it scales up and down a few times

Member Author

added

@shijiesheng
Member Author

coverage failed due to deprecation changes

Member

@Groxx Groxx left a comment

dropping notes for now, while reading tests carefully 👍

overall looks pretty good I think - fairly easy to follow, behavior looks good (e.g. up to 4x growth when "instant", 0.5x shrink when slow, one scale change every 10 seconds sounds reasonable), everything's pretty close.
so just a small pile of minor stuff, some nits some not.
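A rough illustration of the bounds described in the comment above (explicitly not the actual formula in this PR): however the wait-time signal maps to a new target, each decision is capped at roughly 4x growth or a halving, then clamped to the configured poller bounds.

func clampQuota(current, proposed, minCount, maxCount int) int {
	if proposed > current*4 {
		proposed = current * 4 // at most 4x growth per decision
	}
	if proposed < current/2 {
		proposed = current / 2 // at most a halving per decision
	}
	if proposed < minCount {
		proposed = minCount
	}
	if proposed > maxCount {
		proposed = maxCount
	}
	return proposed
}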

autoScalerEventStart autoScalerEvent = "auto-scaler-start"
autoScalerEventStop autoScalerEvent = "auto-scaler-stop"
autoScalerEventLogMsg string = "concurrency auto scaler event"
testTimeFormat string = "15:04:05"
Member

Suggested change
testTimeFormat string = "15:04:05"

shutdownChan: make(chan struct{}),
concurrency: input.Concurrency,
cooldown: input.Cooldown,
log: input.Logger.Named(metrics.ConcurrencyAutoScalerScope),
Member

I think this might be our first use of Named 🤔

since this isn't a concept in log/slog I kinda feel like we might drop it eventually, but for now I think it makes sense 👍

Member Author

Just curious what the alternative would be. I'll change it once we want to remove it.

Member

@Groxx Groxx Jun 20, 2025

the main alternative is probably either:

  • logger.WithGroup: https://pkg.go.dev/log/slog#Logger.WithGroup
    • i.e. pushing everything from "this logger" into a sub-field
  • or logger.With("logger", "concurrency-auto-scaler")
    • setting a top-level field ("logger") and leaving everything else flat / possibly conflicting in meaning.

which is not a clear win in either direction. flatter is much easier to query for shared fields, structured is much easier to be unambiguous and is often more efficient to index, etc.

I bring it up mostly because I think a move to log/slog is inevitable, and we'll have to decide [something] at that point. It'll probably just be .With("logger", "concurrency-auto-scaler") tho, since I don't think we'll care much about dot.separated.names at that point (and none exist now).
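In log/slog terms, the two alternatives look roughly like this:

package main

import (
	"log/slog"
	"os"
)

func main() {
	base := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// Option 1: push everything from this component into a sub-group.
	grouped := base.WithGroup("concurrency-auto-scaler")
	grouped.Info("poller quota updated", "poller_quota", 42)
	// poller_quota ends up nested under the "concurrency-auto-scaler" group.

	// Option 2: a flat top-level field identifying the logger.
	flat := base.With("logger", "concurrency-auto-scaler")
	flat.Info("poller quota updated", "poller_quota", 42)
	// poller_quota stays a flat top-level field alongside "logger".
}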

eventType: autoScalerEvent(event.ContextMap()["event"].(string)),
enabled: event.ContextMap()["enabled"].(bool),
pollerQuota: event.ContextMap()["poller_quota"].(int64),
time: event.ContextMap()["time"].(time.Time).Format(testTimeFormat),
Member

Suggested change
time: event.ContextMap()["time"].(time.Time).Format(testTimeFormat),
time: event.Time.Format(testTimeFormat),

with a logger.WithClock, I think this handles the "logs are hard to identify uniquely" thing that seems to be the intent here?

Member Author

Yes, I just didn't find a good way to assert event logs and found this easier (or quicker) with a test entity.

Member

yea, the "make a simplified struct for comparison" is a good choice I think, this was just for the time-context-map-field.

Member

Huh. zap.WithClock is unusable, given its definition 🤔

still seems like an odd addition here tbh

Member

@Groxx Groxx Jun 25, 2025

a possible option that appears to work: sort the logs by event.Time, and just make sure the other values occur in the same order. you can get rid of the time field entirely from eventLog then.

(afaict they are always in order in these tests, but an explicit sort is probably a good idea)

there is technically room for some out-of-order-ness to occur by doing that, but I don't think these tests really run that risk. and ensuring unique values in all logs would take care of it too.
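A sketch of that, assuming `entries` holds the observed zap entries (zaptest/observer) and `eventLog` drops its time field as suggested; the surrounding test harness (`tt`, testify's `assert`, the `sort` import) is assumed:

// Sort observed entries by timestamp, then compare just the interesting fields in order.
sort.Slice(entries, func(i, j int) bool {
	return entries[i].Time.Before(entries[j].Time)
})
var got []eventLog
for _, e := range entries {
	got = append(got, eventLog{
		eventType:   autoScalerEvent(e.ContextMap()["event"].(string)),
		enabled:     e.ContextMap()["enabled"].(bool),
		pollerQuota: e.ContextMap()["poller_quota"].(int64),
	})
}
assert.Equal(t, tt.expectedEvents, got)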

Comment on lines +129 to +140
"busy pollers, scale up to maximum",
[]*shared.AutoConfigHint{
{common.PtrOf(true), common.PtrOf(int64(0))}, // <- tick, in cool down
{common.PtrOf(true), common.PtrOf(int64(0))}, // <- tick, scale down to minimum
},
[]eventLog{
{autoScalerEventStart, false, 100, "00:00:00"},
{autoScalerEventEnable, true, 100, "00:00:00"},
{autoScalerEventPollerSkipUpdateCooldown, true, 100, "00:00:01"},
{autoScalerEventPollerScaleUp, true, 200, "00:00:02"},
{autoScalerEventStop, true, 200, "00:00:02"},
},
Member

might be easier to follow the actual behavior of this one with a less-than-1/2-maximum set of values, e.g. start with 10 rather than 100 -> it won't scale to maximum, it'll scale to 42.

kinda similar for others below, e.g. pollers, scale up and down multiple times becomes:

{autoScalerEventStart, false, 10, "00:00:00"},
{autoScalerEventEnable, true, 10, "00:00:00"},
{autoScalerEventPollerSkipUpdateCooldown, true, 10, "00:00:01"},
{autoScalerEventPollerScaleUp, true, 42, "00:00:02"},
{autoScalerEventPollerSkipUpdateCooldown, true, 42, "00:00:03"},
{autoScalerEventPollerScaleDown, true, 25, "00:00:04"},
{autoScalerEventPollerSkipUpdateCooldown, true, 25, "00:00:05"},
{autoScalerEventPollerScaleUp, true, 104, "00:00:06"},
{autoScalerEventPollerSkipUpdateCooldown, true, 104, "00:00:07"},
{autoScalerEventPollerScaleDown, true, 63, "00:00:08"},
{autoScalerEventStop, true, 63, "00:00:08"},

which seems a bit more informative than "to max, down, back to max, back to same down value"

"busy pollers, scale up to maximum",
[]*shared.AutoConfigHint{
{common.PtrOf(true), common.PtrOf(int64(0))}, // <- tick, in cool down
{common.PtrOf(true), common.PtrOf(int64(0))}, // <- tick, scale down to minimum
Member

Suggested change
{common.PtrOf(true), common.PtrOf(int64(0))}, // <- tick, scale down to minimum
{common.PtrOf(true), common.PtrOf(int64(0))}, // <- tick, scale up significantly

"idle pollers waiting for tasks",
[]*shared.AutoConfigHint{
{common.PtrOf(true), common.PtrOf(int64(1000))}, // <- tick, in cool down
{common.PtrOf(true), common.PtrOf(int64(1000))}, // <- tick, scale up
Member

Suggested change
{common.PtrOf(true), common.PtrOf(int64(1000))}, // <- tick, scale up
{common.PtrOf(true), common.PtrOf(int64(1000))}, // <- tick, scale down

"idle pollers, scale down to minimum",
[]*shared.AutoConfigHint{
{common.PtrOf(true), common.PtrOf(int64(60000))}, // <- tick, in cool down
{common.PtrOf(true), common.PtrOf(int64(60000))}, // <- tick, scale up
Member

Suggested change
{common.PtrOf(true), common.PtrOf(int64(60000))}, // <- tick, scale up
{common.PtrOf(true), common.PtrOf(int64(60000))}, // <- tick, scale down

"idle pollers but disabled scaling",
[]*shared.AutoConfigHint{
{common.PtrOf(false), common.PtrOf(int64(100))}, // <- tick, in cool down
{common.PtrOf(false), common.PtrOf(int64(100))}, // <- tick, scale up
Member

@Groxx Groxx Jun 25, 2025

Suggested change
{common.PtrOf(false), common.PtrOf(int64(100))}, // <- tick, scale up
{common.PtrOf(false), common.PtrOf(int64(100))}, // <- tick, no update

also this one isn't really "idle", that'd be ~60k / something larger than the scale-down value, yea?

"idle pollers but disabled scaling at a later time",
[]*shared.AutoConfigHint{
{common.PtrOf(true), common.PtrOf(int64(1000))}, // <- tick, in cool down
{common.PtrOf(true), common.PtrOf(int64(1000))}, // <- tick, scale up
Member

Suggested change
{common.PtrOf(true), common.PtrOf(int64(1000))}, // <- tick, scale up
{common.PtrOf(true), common.PtrOf(int64(1000))}, // <- tick, scale down

Comment on lines +49 to +55
PollerPermit: NewResizablePermit(100),
TaskPermit: NewResizablePermit(1000),
},
Cooldown: 2 * testTickTime,
Tick: testTickTime,
PollerMaxCount: 200,
PollerMinCount: 50,
Member

somewhat odd that PollerPermit starts at a different value than PollerMinCount, since I don't believe that'll ever be the case in practice?

it does seem harmless though, and kinda simplifies the "scale down" tests... just not sure that's worth breaking the normal pattern to achieve.

Member

@Groxx Groxx left a comment

Just minor stuff as optional cleanups, I think - looks good to go to me 👍

Labels: None yet
Projects: None yet
5 participants