rework on poller auto scaler #1411
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Force-pushed 82e6bb6 to 636c433
Codecov Report. Attention: Patch coverage is …
Continue to review full report in Codecov by Sentry.
to stick it in here too: overall looks pretty good. Simpler, and the overall goal (and why it achieves it) is clearer too. Seems like just minor tweaks (many optional) and it's probably good to go.
8438696
to
69f7e3f
Compare
Overall looks good, but I left some nits
internal/internal_worker_base.go
Outdated
@@ -301,7 +308,7 @@ func (bw *baseWorker) pollTask() {
	var err error
	var task interface{}

-	if bw.pollerAutoScaler != nil {
+	if bw.concurrencyAutoScaler != nil {
		if pErr := bw.concurrency.PollerPermit.Acquire(bw.limiterContext); pErr == nil {
nit: this looks like a leaking abstraction. This should be handled inside concurrencyAutoScaler.
I suggest moving all concurrencyAutoScaler != nil checks inside the methods where they are required. This code should be simpler: just call methods on the autoscaler; if it is nil, do nothing.
makes sense
we are guarding calls against bw.concurrency based on the nilness of bw.concurrencyAutoScaler, which indicates that these two should be abstracted behind a single interface to avoid additional complexity in this file
I've removed this check in all places. Regarding the comment "these two should be abstracted behind a single interface to avoid additional complexity in this file": I still think these are two separate entities. The client still needs concurrency whether the autoscaler is enabled or not.
			return
		case <-ticker.Chan():
			c.logEvent(autoScalerEventMetrics)
			c.lock.Lock()
nit: push lock/unlock into updatePollerPermit, then you can use defer inside the function and ensure that the unlock happens even if anything panics.
right, it's simpler
I later found a race condition. I actually need to lock in both logEvent and updatePollerPermit. The way I avoid deadlocks is to lock/unlock only in exported methods; locking in helper methods would easily lead to deadlock for me.
	c.wg.Add(1)

	go func() {
		defer c.wg.Done()
nit: any call that starts a goroutine should have a panic handler. If a bug exists, it will crash the worker process, significantly impacting customer service. This is optional functionality that should be safe to break; worst case, it won't update concurrency.
good catch
Force-pushed a9d3781 to 52dd229
Force-pushed 1d733a5 to e7776be
@@ -153,6 +165,20 @@ type (
	}
)

+func (t *workflowTask) getAutoConfigHint() *s.AutoConfigHint {
+	if t.task != nil {
+		return t.task.AutoConfigHint
do we need to check whether t.task.AutoConfigHint is nil?
good catch. I changed the order so it makes more sense.
internal/internal_worker_base.go
Outdated
		return t.autoConfigHint
	default:
		return nil
	}
instead of this switch case (which is not future-proof), we can cast the task to the autoConfigHintAwareTask interface and get the auto config hint
I've removed the switch in favor of autoConfigHintAwareTask.
	lowerPollerWaitTime = 16 * time.Millisecond
	upperPollerWaitTime = 256 * time.Millisecond
it looks like we would want to iterate on these to adjust sensitivity. consider exposing these to worker config
The poller wait time is an invariant; the user doesn't need to tune it. The sensitivity (time-to-react) is actually controlled by the Cooldown, which is already a parameter.
		},
	},
	{
		"idl pollers waiting for tasks",
nit: typo idle. same in other cases below
fixed
	name               string
	pollAutoConfigHint []*shared.AutoConfigHint
	expectedEvents     []eventLog
}{
would be nice to add a case where it scales up and down a few times
added
coverage failed due to deprecation changes
dropping notes for now, while reading tests carefully 👍
overall looks pretty good I think - fairly easy to follow, behavior looks good (e.g. up to 4x growth when "instant", 0.5x shrink when slow, one scale change every 10 seconds sounds reasonable), everything's pretty close.
so just a small pile of minor stuff, some nits some not.
	autoScalerEventStart  autoScalerEvent = "auto-scaler-start"
	autoScalerEventStop   autoScalerEvent = "auto-scaler-stop"
	autoScalerEventLogMsg string          = "concurrency auto scaler event"
	testTimeFormat        string          = "15:04:05"
-	testTimeFormat string = "15:04:05"
	shutdownChan: make(chan struct{}),
	concurrency:  input.Concurrency,
	cooldown:     input.Cooldown,
	log:          input.Logger.Named(metrics.ConcurrencyAutoScalerScope),
I think this might be our first use of Named 🤔 since this isn't a concept in log/slog. I kinda feel like we might drop it eventually, but for now I think it makes sense 👍
Just curious, what would the alternative be? I'll change it once we want to remove it.
the main alternative is probably either:
- logger.WithGroup (https://pkg.go.dev/log/slog#Logger.WithGroup), i.e. pushing everything from "this logger" into a sub-field
- or logger.With("logger", "concurrency-auto-scaler"), setting a top-level field ("logger") and leaving everything else flat / possibly conflicting in meaning.

which is not a clear win in either direction. Flatter is much easier to query for shared fields; structured is much easier to keep unambiguous, and is often more efficient to index, etc.

I bring it up mostly because I think a move to log/slog is inevitable, and we'll have to decide [something] at that point. It'll probably just be .With("logger", "concurrency-auto-scaler") tho, since I don't think we'll care much about dot.separated.names at that point (and none exist now).
	eventType:   autoScalerEvent(event.ContextMap()["event"].(string)),
	enabled:     event.ContextMap()["enabled"].(bool),
	pollerQuota: event.ContextMap()["poller_quota"].(int64),
	time:        event.ContextMap()["time"].(time.Time).Format(testTimeFormat),
-	time: event.ContextMap()["time"].(time.Time).Format(testTimeFormat),
+	time: event.Time.Format(testTimeFormat),

with a logger.WithClock, I think this handles the "logs are hard to identify uniquely" thing that seems to be the intent here?
Yes, I just didn't find a good way to assert event logs, and found this easier (or quicker) with a test entity.
yea, the "make a simplified struct for comparison" is a good choice I think, this was just for the time-context-map-field.
Huh. zap.WithClock is unusable, given its definition 🤔 still seems like an odd addition here tbh
a possible option that appears to work: sort the logs by event.Time, and just make sure the other values occur in the same order. You can then get rid of the time field entirely from eventLog.

(afaict they are always in order in these tests, but an explicit sort is probably a good idea)

there is technically room for some out-of-order-ness to occur by doing that, but I don't think these tests really run that risk, and ensuring unique values in all logs would take care of it too.
	"busy pollers, scale up to maximum",
	[]*shared.AutoConfigHint{
		{common.PtrOf(true), common.PtrOf(int64(0))}, // <- tick, in cool down
		{common.PtrOf(true), common.PtrOf(int64(0))}, // <- tick, scale down to minimum
	},
	[]eventLog{
		{autoScalerEventStart, false, 100, "00:00:00"},
		{autoScalerEventEnable, true, 100, "00:00:00"},
		{autoScalerEventPollerSkipUpdateCooldown, true, 100, "00:00:01"},
		{autoScalerEventPollerScaleUp, true, 200, "00:00:02"},
		{autoScalerEventStop, true, 200, "00:00:02"},
	},
might be easier to follow the actual behavior of this one with a less-than-1/2-maximum set of values, e.g. start with 10 rather than 100 -> it won't scale to maximum, it'll scale to 42.
kinda similar for others below, e.g. "pollers, scale up and down multiple times" becomes:
{autoScalerEventStart, false, 10, "00:00:00"},
{autoScalerEventEnable, true, 10, "00:00:00"},
{autoScalerEventPollerSkipUpdateCooldown, true, 10, "00:00:01"},
{autoScalerEventPollerScaleUp, true, 42, "00:00:02"},
{autoScalerEventPollerSkipUpdateCooldown, true, 42, "00:00:03"},
{autoScalerEventPollerScaleDown, true, 25, "00:00:04"},
{autoScalerEventPollerSkipUpdateCooldown, true, 25, "00:00:05"},
{autoScalerEventPollerScaleUp, true, 104, "00:00:06"},
{autoScalerEventPollerSkipUpdateCooldown, true, 104, "00:00:07"},
{autoScalerEventPollerScaleDown, true, 63, "00:00:08"},
{autoScalerEventStop, true, 63, "00:00:08"},
which seems a bit more informative than "to max, down, back to max, back to same down value"
	"busy pollers, scale up to maximum",
	[]*shared.AutoConfigHint{
		{common.PtrOf(true), common.PtrOf(int64(0))}, // <- tick, in cool down
		{common.PtrOf(true), common.PtrOf(int64(0))}, // <- tick, scale down to minimum
-		{common.PtrOf(true), common.PtrOf(int64(0))}, // <- tick, scale down to minimum
+		{common.PtrOf(true), common.PtrOf(int64(0))}, // <- tick, scale up significantly
	"idle pollers waiting for tasks",
	[]*shared.AutoConfigHint{
		{common.PtrOf(true), common.PtrOf(int64(1000))}, // <- tick, in cool down
		{common.PtrOf(true), common.PtrOf(int64(1000))}, // <- tick, scale up
-		{common.PtrOf(true), common.PtrOf(int64(1000))}, // <- tick, scale up
+		{common.PtrOf(true), common.PtrOf(int64(1000))}, // <- tick, scale down
	"idle pollers, scale down to minimum",
	[]*shared.AutoConfigHint{
		{common.PtrOf(true), common.PtrOf(int64(60000))}, // <- tick, in cool down
		{common.PtrOf(true), common.PtrOf(int64(60000))}, // <- tick, scale up
-		{common.PtrOf(true), common.PtrOf(int64(60000))}, // <- tick, scale up
+		{common.PtrOf(true), common.PtrOf(int64(60000))}, // <- tick, scale down
	"idle pollers but disabled scaling",
	[]*shared.AutoConfigHint{
		{common.PtrOf(false), common.PtrOf(int64(100))}, // <- tick, in cool down
		{common.PtrOf(false), common.PtrOf(int64(100))}, // <- tick, scale up
-		{common.PtrOf(false), common.PtrOf(int64(100))}, // <- tick, scale up
+		{common.PtrOf(false), common.PtrOf(int64(100))}, // <- tick, no update
also this one isn't really "idle", that'd be ~60k / something larger than the scale-down value, yea?
	"idle pollers but disabled scaling at a later time",
	[]*shared.AutoConfigHint{
		{common.PtrOf(true), common.PtrOf(int64(1000))}, // <- tick, in cool down
		{common.PtrOf(true), common.PtrOf(int64(1000))}, // <- tick, scale up
-		{common.PtrOf(true), common.PtrOf(int64(1000))}, // <- tick, scale up
+		{common.PtrOf(true), common.PtrOf(int64(1000))}, // <- tick, scale down
		PollerPermit: NewResizablePermit(100),
		TaskPermit:   NewResizablePermit(1000),
	},
	Cooldown:       2 * testTickTime,
	Tick:           testTickTime,
	PollerMaxCount: 200,
	PollerMinCount: 50,
somewhat odd that PollerPermit starts at a different value than PollerMinCount, since I don't believe that'll ever be the case in practice?
it does seem harmless though, and kinda simplifies the "scale down" tests... just not sure that's worth breaking the normal pattern to achieve.
Just minor stuff as optional cleanups, I think - looks good to go to me 👍
Detailed Description
Improve performance of the poller auto scaler by using more accurate scaling signals and several implementation changes.
Changes
AutoScalerOptions is introduced.
Impact Analysis
Testing Plan
Rollout Plan
What is the rollout plan?
For Uber services, standard client release steps.
For OSS users, turn off the autoscaler feature before the client upgrade.
Does the order of deployment matter? No.
Is it safe to roll back? Does the order of rollback matter? Yes.
Is there a kill switch to mitigate the impact immediately? Yes, the new autoscaler is an opt-in feature.