-
Notifications
You must be signed in to change notification settings - Fork 238
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A84: PID LB policy #430
base: master
Are you sure you want to change the base?
A84: PID LB policy #430
Changes from all commits
9ded058
f1a4661
7e14d72
ff1e037
684ed20
530beba
0d6f76e
71cb41d
1786b29
f1d8921
12110b6
dbc7adc
a1fe191
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,290 @@ | ||
A68: PID LB policy. | ||
---- | ||
* Author(s): @s-matyukevich | ||
* Approver: | ||
* Status: Draft | ||
* Implemented in: PoC in Go | ||
* Last updated: 2024-04-26 | ||
* Discussion at: https://groups.google.com/g/grpc-io/c/eD2bE2JzQ2w | ||
|
||
## Abstract | ||
|
||
This document proposes a design for a new load balancing policy called pid. The term pid stands for [Proportional–integral–derivative controller](https://en.wikipedia.org/wiki/Proportional%E2%80%93integral%E2%80%93derivative_controller). This policy builds upon the [A58: weighted_round_robin LB policy (WRR)][A58] and requires direct load reporting from backends to clients. Similar to wrr, it utilizes client-side weighted round robin load balancing. However, unlike wrr, it does not determine weights deterministically. Instead, it employs a feedback loop with the pid controller to adjust the weights in a manner that allows the load on all backends to converge to the same value. The policy supports either per-call or periodic out-of-band load reporting as per [gRFC A51][A51]. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd appreciate making it more clear earlier this is a PD controller, not full PID. |
||
|
||
## Background | ||
|
||
The `wrr` policy uses the following formula to calculate subchannel weights, which is described in more details in the "Subchannel Weights" section of [gRFC A58][A58]: | ||
|
||
$$weight = \dfrac{qps}{utilization + \dfrac{eps}{qps} * error\\_utilization\\_penalty}$$ | ||
|
||
This formula is effective when backends, which have different average CPU costs per request, receive an identical number of connections. In such scenarios, `wrr` aids in fairly distributing requests between backends, ensuring that more powerful backends receive more requests, and less powerful backends receive fewer requests. However, `wrr` is not effective in correcting imbalances generated by the use of random subsetting, as described in [gRFC A68][A68]. This is because random subsetting leads to a situation where some backends receive more connections than others. The number of connections a server receives does not impact its CPU cost per request metric, so more connected backends will end up receiving more requests than less connected ones. | ||
|
||
The `pid` balancer takes a different approach: instead of deterministically calculating weights based on a backend metric, it continuously adjusts weights at runtime. It utilizes a feedback loop based on the backend CPU metric to determine the direction and magnitude of every weight update. | ||
|
||
|
||
### Related Proposals: | ||
* [gRFC A51: Custom Backend Metrics Support][A51] | ||
* [gRFC A58: `weighted_round_robin` LB policy][A58] | ||
* [gRFC A68: Random subsetting with rendezvous hashing LB policy.][A68] | ||
|
||
## Proposal | ||
|
||
Introduce a new LB policy `pid`. This policy implements client-side load balancing with direct load reports from backends. It utilizes a feedback loop with a PID controller to dynamically adjust the weights. The policy is otherwise largely a superset of the existing policy `weighted_round_robin`. | ||
|
||
### LB Policy Config and Parameters | ||
|
||
The `pid` LB policy config will be as follows. | ||
|
||
```textproto | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is "protobuf" as it it the schema. Textproto is textual representation for a message. |
||
message LoadBalancingConfig { | ||
oneof policy { | ||
PIDLbConfig pid = 20 [json_name = "pid"]; | ||
} | ||
} | ||
|
||
message PIDLbConfig { | ||
// Configuration for the WRR load balancer as defined in [gRFC A58][A58]. | ||
// The PID balancer is an extension of WRR and all settings applicable to WRR also apply to PID identically. | ||
WeightedRoundRobinLbConfig wrr_config = 1; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think we want to include the WRR config here. Instead, let's just duplicate the fields from that proto in this one, so that the two are independent. |
||
|
||
// Threshold beyond which the balancer starts considering the ErrorUtilizationPenalty. | ||
// This helps avoid oscillations in cases where the server experiences a very high and spiky error rate. | ||
// We avoid eliminating the error_utilization_penalty entirely to prevent redirecting all traffic to an instance | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If this is the only thing that you need the error utilization penalty for, an alternative would be to set the penalty to zero and instead use outlier detection (see gRFC A50) to avoid sending traffic to such a backend. Then we presumably wouldn't need this knob. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good point, we can use outlier detection with zero utilization penalty for servers with very high and spiky error rates. I can remove this knob. |
||
// that has low CPU usage but rejects all requests. Default is 0.5. | ||
google.protobuf.FloatValue error_utilization_threshold = 2; | ||
|
||
// Controls the convergence speed of the PID controller. Higher values accelerate convergence but may induce oscillations, | ||
// especially if server load changes more rapidly than the PID controller can adjust. Oscillations might also occur due to | ||
// significant delays in load report propagation or extremely spiky server load. To mitigate spiky loads, server owners should | ||
// employ a moving average to smooth the load reporting. Default is 0.1. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It would be very hard for us to change these values later. We're thinking it'd be better for each service to define these. That'd mean we wouldn't define defaults for PD and require they be specified in the service config. We'd still provide a suggestion. But we'd have a way to change that suggestion over time. |
||
google.protobuf.FloatValue proportional_gain = 2; | ||
|
||
// Adjusts the smoothness of the PID controller convergence. Higher values enhance smoothness but can decelerate convergence. | ||
// Default is 1. | ||
google.protobuf.FloatValue derivative_gain = 4; | ||
|
||
// Maximum allowable weight. Weights proposed by the PID controller exceeding this value will be capped. | ||
// This prevents infinite weight growth, which could occur if only a subset of clients uses PID and increasing weights | ||
// no longer effectively corrects the imbalance. Default is 10. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. FYI: We were looking to see if there were ways to make the policy very safe by default. One option was to strongly restrict the range of weights which would limit the possible damage. However, WRR's main purpose is to balance load with non-homogeneous servers. Heavily restricting the range of weights only works with homogeneous servers, so we dismissed this idea. |
||
google.protobuf.FloatValue max_weight = 5; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There's no inherent "anchor" or "center" point to these weights. Because of D the average weight will stray from 1. (As well as scaling We can't avoid the drift by giving a convenient meaning to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is a very good point, I verified that the drift actually happens in practice. It is not causing issues and we didn't notice this before, since weights are re-centered around 1 after every client redeployment and, as you said, drift stops once a significant portion of the clients reach the limit. Anyway, I'll update gRFC to include your proposal. |
||
|
||
// Minimum allowable weight. Weights proposed by the PID controller falling below this value will be capped. | ||
// This prevents weights from dropping to zero, which could occur if only a subset of clients uses PID and decreasing weights | ||
// no longer effectively corrects the imbalance. Default is 0.1. | ||
google.protobuf.FloatValue min_weight = 6; | ||
} | ||
``` | ||
|
||
### PID controller | ||
|
||
A PID controller is a control loop feedback mechanism that continuously calculates an error value as the difference between a desired setpoint (`referenceSignal`) and a measured process variable (`actualSignal`). It then applies a correction based on proportional, integral, and derivative terms (denoted P, I, and D respectively), hence the name. | ||
|
||
In our implementation, we will not be using the integral part. The integral component is useful for speeding up convergence when the `referenceSignal` changes sharply. In our case, we will be converging the load on the subchannels to a mean value, which is mostly stable. | ||
|
||
Here is a sample implementation in pseudo-code: | ||
|
||
``` | ||
pidController class { | ||
proportionalGain float | ||
derivativeGain float | ||
|
||
controlError float | ||
|
||
update(referenceSignal float, actualSignal float, samplingInterval duration) float { | ||
previousError = this.controlError | ||
// Save last controlError so we can use it to calculate derivative during next update | ||
this.controlError = referenceSignal - actualSignal | ||
controlErrorDerivative = (this.controlError - previousError) / samplingInterval.Seconds() | ||
controlSignal = this.proportionalGain * this.controlError + | ||
this.derivativeGain * controlErrorDerivative | ||
return controlSignal | ||
} | ||
} | ||
``` | ||
|
||
The `update` method is expected to be called on a regular basis, with `samplingInterval` being the duration since the last update. The return value is the control signal which, if applied to the system, should minimize the control error. In the next section, we'll discuss how this control signal is converted to `wrr` weight. | ||
|
||
The `proportionalGain` and `derivativeGain` parameters are taken from the LB config. `proportionalGain` should be additionally scaled by the `WeightUpdatePeriod` value. This is necessary because derivative error is calculated like `controlErrorDerivative = (this.controlError - previousError) / samplingInterval.Seconds()` and dividing by a very small `samplingInterval` value makes the result too big. `WeightUpdatePeriod` is roughly equal to `samplingInterval` as we will be updating the PID state once per `WeightUpdatePeriod`. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "scaled" in which way? Let's not assume the implementer will get that right. (Let's explicitly say "multiply by weight_update_period as a floating point number of seconds".) I think I understand (Future comments about weight_expiration_period may change exactly what we do here.) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The proposal was to scale only proportionalGain, not derivativeGain. Also I'll update the doc to explicitly mention multiplication instead of scaling. |
||
|
||
### Extending WRR balancer | ||
|
||
The `pid` balancer reuses 90% of the wrr code. The proposal is to refactor the `wrr` codebase and introduce several hooks that allow other balancers, like `pid`, to reuse the code efficiently without the need for duplication. This approach is mostly language-specific, but the general plan is as follows: | ||
|
||
* Add a `callbacks` object to the `wrr` balancer: This object will contain a series of callback functions that wrr will invoke at various stages of its lifecycle. | ||
* Introduce a `callbackData` object: This will be utilized by the callbacks to store any data that is reused across different callback functions. The `wrr` balancer will pass this object to all callbacks and treat it as an opaque blob of data. | ||
* Add `callbackConfig` object to the `wrr` balancer: This object will contain the PID specific part of the user provided config as defined in the `LB Policy Config and Parameters` section. The `wrr` balancer will pass this object to all callbacks and treat it as an opaque blob of data. | ||
|
||
The `callbacks` object, which is to be provided by the balancer builder, will implement the following interface (expressed in pseudo-code): | ||
|
||
``` | ||
wrrCallbacks interface { | ||
onSubchannelAdded(subchannelID int, data callbackData, conf callbackConfig) | ||
onSubchannelRemoved(subchannelID int, data callbackData, conf callbackConfig) | ||
|
||
// onLoadReport is called when a new load report is received for a given subchannel. | ||
// This function returns the new weight for a subchannel. If the returned value is -1, | ||
// the subchannel should keep using the old value. | ||
// onLoadReport won't be called during the blackout period. | ||
onLoadReport(subchannelId int, report loadReport, data callbackData, conf callbackConfig) float | ||
|
||
// onEDFSchedulerUpdate is called after the wrr balancer recreates the EDF scheduler. | ||
onEDFSchedulerUpdate(data callbackData, conf callbackConfig) | ||
} | ||
``` | ||
|
||
Here is how `pid` balancer implements `wrrCallbacks` interface. | ||
``` | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It'd be great to define this as ```go, just to have some level of syntax highlighting. |
||
func onSubchannelAdded(subchannelID int, data callbackData, conf callbackConfig) { | ||
// Do nothing | ||
} | ||
|
||
func onSubchannelRemoved(subchannelID int, data callbackData, conf callbackConfig) { | ||
// Remove subchannelID from two maps that store the value of last utilization | ||
// and last applied weight per subchannel | ||
delete(data.utilizationPerSubchannel, subchannelID) | ||
delete(data.lastAppliedWeightPerSubchannel, subchannelID) | ||
} | ||
|
||
|
||
func onLoadReport(subchannelId int, load loadReport, data callbackData, conf callbackConfig) float { | ||
utilization = load.ApplicationUtilization | ||
if utilization == 0 { | ||
utilization = load.CpuUtilization | ||
} | ||
if utilization == 0 || load.RpsFractional == 0 { | ||
// Ignore empty load | ||
return -1 | ||
} | ||
errorRate = load.Eps / load.RpsFractional | ||
useErrPenalty = errorRate > conf.ErrorUtilizationThreshold | ||
if useErrPenalty { | ||
utilization += errorRate * conf.ErrorUtilizationPenalty | ||
} | ||
|
||
// Ensure at least WeightUpdatePeriod has passed since the last update. | ||
// Prevents corruption of PID controller's internal state, which could happen in the following cases: | ||
// * If 2 updates are very close to each other in time, samplingInterval ~= 0 and signal ~= infinity. | ||
// * If multiple updates happened during a single WeightUpdatePeriod, the actual weights are not applied, | ||
// but the PID controller keeps growing the weights and it may easily pass the balancing point. | ||
if time.Since(lastApplied) < conf.WeightUpdatePeriod { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In the implementation we have we indeed added a lock. I am leaning toward implementing your proposal and tracking only most recent utilization on the data plane, and I think it is the same discussion as https://github.com/grpc/proposal/pull/430/files#r1760128032 |
||
return -1 | ||
} | ||
|
||
// use value calculated in the onEDFSchedulerUpdate method | ||
meanUtilization = data.meanUtilization | ||
|
||
// call the PID controller to get the value of the control signal. | ||
controlSignal = data.pidController.update({ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it is wrong to update the PD controller in response to a load report. P and D are expected to be used at a certain update frequency and having it vary based on RPC rate means we lose the benefit prior art. P could be corrected for in the PID controller, but the feedback loop sees delay and it would require high, steady QPS. Updating the PID should be done as part of the weight updates, which is the control update. I agree that moving the update away from the utilization update creates new problems with sharing code, but I think they are easier to reason about. Hear me out and tell me what you think: The data plane just updates utilization (stored in As an option to improve utilization: Data plane sums received utilizations and increments a counter for the number received. Control plane computes the average utilization by Only updating utilization on data plane does make more obvious the problem of "low QPS." This is actually a pretty big existing problem as 1) the weights aren't actually being utilized and 2) we aren't getting (non-OOB) load reports. That means both parts of the feedback loop are busted. For WRR low QPS is just stale. But PID can accumulate error and make very bad choices if there is a burst. PID needs QPS to be high enough for at least one RPC per endpoint each Should we also disable/discourage OOB reporting? That's useful for WRR for long idle periods, but long idle periods are broken for PID. It is useful for reduced network and CPU cost, but I don't think people would be choosing it for that. If we keep it, we'd need to track whether an RPC has been received in the last period (or rely just on There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I like this idea, but I think it also depends on the decision we make about my proposal in this thread https://github.com/grpc/proposal/pull/430/files#r1761294507 If we update load reports to give us monotonically increasing CPU counter this should solve the problem with low QPS clients. Also, if we implement what I am proposing here we can easily overwrite last load report on the data-plane without any processing and keep it lock free. Otherwise we'll need to update EWMA on every load report update in order to keep the data as accurate as possible. Also I am fine with disabling/discouraging OOB. That's what we do as well in our infrastructure. Since we pair this solution with heavy subsetting it also helps us to ensure every client has fresh data about server utilization, even if the client has low QPS. |
||
referenceSignal: meanUtilization, | ||
actualSignal: utilization, | ||
samplingInterval: time.Since(lastApplied), | ||
}) | ||
|
||
// Normalize the signal. | ||
// If meanUtilization ~= 0 the signal will be ~= 0 as well, and convergence will become painfully slow. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think the reasoning here is wrong. If meanUtilization is ~= 0 then to correct a 5%pt error you need to move a higher percentage of traffic than when meanUtilization is high. 'Weight' moves percentage of traffic, so the imbalance is higher at lower utilization and thus we'd expect it to take longer to converge. If we were generating traffic, the control could be QPS and "increase by X QPS" produces the same adjustment at high and low utilization. So this is a property of weight being relative to the mean instead of an absolute. The effect of increasing the weight by 5 depends on the average weight and increasing it by 5% has a different effect on the signal at different utilizations. So I follow what you're trying to do. But I think this becomes more clear if we use the typical PID addition to adjust the control: weight = lastAppliedWeight + controlSignal / meanUtilization * weight; Dividing controlSignal by meanUtilization is the same as dividing the input to controlSignal = data.pidController.update({
referenceSignal: 1,
actualSignal: utilization / meanUtilization,
samplingInterval: time.Since(lastApplied),
})
// weight could be moved earlier as well
weight = lastAppliedWeight + controlSignal * weight; This is much easier for me to reason about. I can see now the unit for P is "% of weight/util%" and thus put the default (I think we could use There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok, this make sense, I'll update the doc to divide PID input signal by meanUtilization |
||
// If, meanUtilization >> 1 the signal may become very high, which could lead to oscillations. | ||
if meanUtilization > 0 { | ||
controlSignal *= 1 / meanUtilization | ||
} | ||
|
||
lastAppliedWeight = data.lastAppliedWeightPerSubchannel[subchannelID] | ||
|
||
// Use controlSignal to adjust the weight. | ||
// First calculate a multiplier that will be used to determine how much weight should be changed. | ||
// The higher is the absolute value of the controlSignal the more we need to adjust the weight. | ||
if controlSignal >= 0 { | ||
// in this case mult should belong to the [1,inf) interval, so we will be increasing the weight. | ||
mult = 1.0 + controlSignal | ||
} else { | ||
// in this case mult should belong to (0, 1) interval, so we will be decreasing the weight. | ||
mult = -1.0 / (controlSignal - 1.0) | ||
} | ||
weight = lastAppliedWeight * mult | ||
|
||
// Clamp weight | ||
if weight > conf.MaxWeight { | ||
weight = conf.MaxWeight | ||
} | ||
if weight < conf.MinWeight { | ||
weight = conf.MinWeight | ||
} | ||
|
||
// Save resulting utilization and weight. | ||
data.utilizationPerSubchannel[subchannelId] = utilization | ||
data.lastAppliedWeightPerSubchannel[subchannelID] = weight | ||
|
||
return weight | ||
} | ||
|
||
func onEDFSchedulerUpdate(data callbackData) { | ||
// Calculate mean utilization across all subchannels | ||
totalUtilization = 0 | ||
count = len(data.utilizationPerSubchannel) | ||
for _, utilization in data.utilizationPerSubchannel { | ||
totalUtilization += utilization | ||
} | ||
data.meanUtilization = totalUtilization / count | ||
} | ||
``` | ||
|
||
The proposal is to make `wrrCallbacks` public. This has a number of significant benefits. Besides PID, there are other cases where one might need to extend `wrr`. For example, Spotify [demonstrates](https://www.youtube.com/watch?v=8E5zVdEfwi0) a gRPC load balancer to reduce cross-zone traffic – this can be implemented nicely in terms of `wrr` weights. We are also considering the same and incorporating things like latency into our load balancing decisions. Existing ORCA extension points don't cover these use cases. We leverage ORCA for custom server utilization metrics, but we also need the ability to combine server and client metrics to generate the resulting weight. The alternative is to write our own balancer with custom EDF scheduler and handle details related to subchannel management and interactions with resolvers. With this new API, use cases like this can be covered naturally, users have full control over the end-to-end definition of weights. | ||
s-matyukevich marked this conversation as resolved.
Show resolved
Hide resolved
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am hesitant to make this a public API. This seems like a valuable internal thing to allow us to reuse LB policy implementations, but making it public implies a stability commitment that I'm not super comfortable with. It's worth noting that the example of using latency would require that this API surface be even broader, because we'd need a way for this to trigger the LB policy to measure the latency on each RPC. (We do have that mechanism internally, but we'd need to expose it via this API in addition to the existing mechanism as part of the LB policy API itself.) I'd like to hear thoughts from @ejona86 and @dfawley on this. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I expected this answer but decided to give it a try anyway. Making this interface private works for us as well - next time we need to extend PID of WRR we'll come back with more gRFCs. I assume that it is much easier to make an interface public than otherwise, so maybe we can revisit this decision in the future when we have more data and better defined use-cases where such interface might be useful. For now I can remove this paragraph and replace it with a note that the interface will be private. @markdroth does this work for you? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure that I'd want exactly this interface in C-core, even as a private API. I think it would be better to avoid being proscriptive about how the implementation needs to be structured, so that each implementation can make its own choice. That having been said, I do agree that it's useful to point out that there's so much overlap between this new policy and the existing WRR policy and to suggest that implementations consider how to avoid code duplication. And we can even describe this specific approach as one possible way to do that. Let's just leave the final decision up to the implementation. FWIW, I'm going to be putting together a gRFC soon to allow WRR to configure which ORCA utilization metrics it looks at, to make it a bit more flexible. I don't think that actually affects this design, though, because it still doesn't make the weight calculation function pluggable. |
||
|
||
### Dealing with Oscillations | ||
|
||
One of the main challenges with the pid balancer is the potential for oscillations. Several factors influence this likelihood: | ||
|
||
1. **Propagation Delay of Load Reports:** | ||
* **Direct Load Reporting**: Here, the delay depends on the request frequency and the `WeightUpdatePeriod` setting. Typically, with the default `WeightUpdatePeriod` of 1 second, propagation is very fast, making this the preferred option when using the pid balancer. | ||
* **OOB Load Reporting**: Users can control the delay by adjusting the `OobReportingPeriod` setting. While the delay is usually larger compared to direct reporting, achieving perfect convergence with OOB reporting is still possible on workloads with stable loads. | ||
2. **Proportional Gain:** | ||
* A high `ProportionalGain` can lead to significant weight adjustments, potentially overshooting the balancing point. The default value of 0.1 generally allows for fast convergence (typically faster than 30 seconds on workloads that are not spiky) while not generating oscillations. | ||
3. **Stability of Server Load:** | ||
* The pid balancer struggles with servers that exhibit spiky loads because the mean utilization is not stable, which disrupts the convergence direction for all subchannels. Unfortunately, this is one aspect users cannot directly control from the client side. To address this, the proposal includes implementing an "average window" mechanism on the server, which will be discussed in the next section. | ||
4. **Number of Subchannels:** | ||
* A larger number of subchannels generally stabilizes the mean utilization, leading to faster convergence and reducing the likelihood of oscillations. This is particularly relevant when considering the use of random subsetting, as discussed in [gRFC A68][A68]. A small subset size can hinder the `pid` balancer’s ability to converge load across backends, particularly if mean utilization is unstable and clients connect only to either overloaded or underloaded servers. However, we have achieved acceptable convergence on a spiky workload with a subset size as small as four, using a three-minute moving average window size for load reporting. A proposed default subset size of 20 typically ensures good convergence on any workload. | ||
|
||
By understanding and addressing these factors, the `pid` balancer can be more effectively tuned to manage load balancing across different environments and usage scenarios, minimizing the risks associated with oscillations. | ||
|
||
### Moving Average Window for Load Reporting | ||
|
||
As outlined in the previous section, smoothing the utilization measurements in server load reports is essential for the `pid` balancer to achieve convergence on spiky workloads. To address this, we propose integrating a moving average window mechanism into the `MetricRecorder` component, as described in [gRFC A51][A51]. This involves adding a `MovingAverageWindowSize` parameter to the component. Instead of storing a single value per metric, `MetricRecorder` will now maintain the last MovingAverageWindowSize reported values in a circular buffer. The process is detailed in the following pseudo-code: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's not clear to me that this is something we need to build directly into gRPC APIs. Can't the application do this smoothing itself before it reports the data to the gRPC There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, that's what we are doing now. However, this is not a trivial amount of code and PID almost certainly requires it. If we provide the balancer but don't provide the smoothing implementation the UX won't be ideal, as users most likely will start using it without any smoothing and soon come to the conclusion that it doesn't work. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If this is really critical to the behavior of the PID policy, wouldn't it be better to build this directly into the PID policy, so that it's not possible to have a misconfiguration between the client and the server? In other words, why not do this smoothing on the client side? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We considered this but decided to implement it on the server for the following reasons:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Bullets 1-3: these are resolved by this being configurable via xds/service config. The service is still in control. We do agree the service should be in control, but that's a bit different than which side calculates the average. Bullet 4: We agree it is very helpful to have the higher sampling when producing the utilization. That would matter a lot for something like memory utilization, which can be measured at any instant. CPU utilization however is always an average over some time period (cpu seconds used / time period). There is no instantaneous value (other than a weak boolean (running/not running) per core). WRR assumes the cpu utilization is roughly the same time period as rps, so I'd hope server-side is already averaging over at least a second-ish. We'd be very interested if this is wildly inaccurate. It seems load for short periods should be covered. For longer time periods, an exponential moving average on client-side would seem sufficient (updated each weight recalculation). The biggest concern would be too-infrequent of utilization updates, but even with server-side smoothing PD won't be able to function in such a case as there is no feedback loop. Part of the concern about server-side smoothing is the smoothing period matters to the D behavior. Having it server-side would make it harder to guarantee that all the knobs are self-consistent and harder to change the smoothing period. Aside: Understanding the server utilization monitoring period is essential for monitoring spikiness. If the PD oscillates at a rate faster than the server's utilization monitoring frequency, then you won't the see oscillations even though they are happening. This is a risk I've seen in multiple utilization-based LB schemes that prioritize fast updates. (I've done further review/consideration since this conversation so I'll have additional comments on this that I'll post separately. Specifically I'm considering slower/larger RPCs and thus longer weight_expiration_period cases, which will have interplay here.) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I like this idea, we can simplify things a lot if we rely only on smoothing and proportional gain. We recently bumped into issue with a service that was using plain WRR (not PID) with round-robin and without any subsetting. The server was measuring CPU every 1 sec, and this is what we get after we added server-side smoothing with 1 min sliding window. I still don't fully understand what is going on here (why load per server pod was stable, but deviation between pods was so high) and whether we could achieve the same result using just longer measuring interval on the server. I want to reproduce this in a synthetic benchmark and will report back when I have more data. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's not immediately obvious to me precisely what happened, but WRR assumes utilization and rps are for the same rough time period. If you smooth one without smoothing the other, I wouldn't be surprised to see weird results. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In both cases (before and after smoothing was added) utilization and rps were measured for the same period. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh, you're pointing out how the banding went away, not the initial spikes. Which is interesting because we wouldn't expect banding at all. If that was with DNS, I would question if clients aren't polling for address updates, so the high outliers would be older servers from before a scale-up. To get a hint as to the cause, I suggest watching that dashboard and seeing if the bands re-form after a period without restarting all backends. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We use a custom resolver that refreshed DNS every 5 sec and use 5 min MaxConnectionAge, so I am pretty sure DNS didn't play a role here (also double-checked that the server load imbalance was persistent and was present after fresh server deployments) Since looks like you are interested in the issue I can give you a little bit more context: The load imbalance was introduced after the team that owns the service decided to reduce the number of cores per pod from 4 to 2 (which resulted in doubling the number of server pods and the resulting rps per pod also reduced by half) You can see this on the following graph. As you can see, immediately after upscale we started seeing this problem, and it was fixed only after we added load smoothing on the server. My gut feeling is that this is related to the fact that the server has low RPS (2-9 req/sec before we added load smoothing, now it is 3-4 req/sec) and requests could be expensive. Still, if RPS is larger than 1 req/sec I would expect WRR to work if server is averaging CPU over 1 sec without any additional smoothing. Also, I think that just increasing this interval could as well solve the problem, but we didn't test it yet. |
||
|
||
``` | ||
func recordMetricXXX(value float) { | ||
// Ensure updates are atomic to avoid corruption of the circular buffer | ||
lock.Lock() | ||
|
||
if circularBufferForMetricXXX.isFull() { | ||
sum -= circularBufferForMetricXXX.last() | ||
} | ||
sum += val | ||
|
||
// Add the new value to the circular buffer, which automatically removes the oldest value if the buffer is full | ||
circularBufferForMetricXXX.add(value) | ||
|
||
// Calculate the average of the values in the circular buffer | ||
cXXXvalue = sum / circularBufferForMetricXXX.size() | ||
lock.Unlock() | ||
} | ||
``` | ||
|
||
Setting `MovingAverageWindowSize` to 1 mimics the current behavior and should remain the default setting. | ||
This modification allows for more stable load reporting by averaging fluctuations over the specified window, thus providing the pid balancer with more consistent data to inform weight adjustments. | ||
|
||
|
||
## Rationale | ||
### Alternatives Considered: | ||
|
||
The main driver for this proposal was the need to implement subsetting. We explored the possibility of using deterministic subsetting in https://github.com/grpc/proposal/pull/383 and got push-back on this for the reasons explained [here](https://github.com/grpc/proposal/pull/383#discussion_r1334587561) | ||
|
||
Additionally, we considered the "scaled wrr" approach, which would adjust the imbalance created by random subsetting by multiplying the server utilization by the number of connections a server receives. Feedback on this approach suggested that it might be more beneficial to pursue more generic solutions that focus on achieving load convergence rather than attempting to tailor the `wrr` method specifically to fit subsetting use cases. | ||
|
||
This feedback led us to explore broader, more adaptable strategies that could better address the complexities introduced by subsetting, culminating in the current proposal. | ||
|
||
## Implementation | ||
DataDog will provide Go and Java implementations. | ||
|
||
|
||
[A51]: A51-custom-backend-metrics.md | ||
[A58]: A58-client-side-weighted-round-robin-lb-policy.md | ||
[A68]: https://github.com/grpc/proposal/pull/423 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This says A68, and the filename says A80, and both of those numbers are already taken. :)
Looks like the next available number is A84, so let's use that.