[sled-agent] Don't block InstanceManager on full MPSCs
Sled-agent's `InstanceManager` task is responsible for managing the
table of all instances presently running on the sled. When the
sled-agent receives a request relating to an individual instance on the
sled, it's sent to the `InstanceManager` over a `tokio::sync::mpsc`
channel, and is then dispatched by the `InstanceManager` to the
`InstanceRunner` task responsible for that individual instance by
sending it over a *second* `tokio::sync::mpsc` channel. This is where
things start to get interesting.[^1]
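
For orientation, here is a heavily pared-down sketch of that dispatch path. All names, types, and fields below are illustrative, not the actual sled-agent definitions:

```rust
use std::collections::HashMap;
use tokio::sync::mpsc;

// Hypothetical, pared-down request types; the real ones carry state
// targets, response channels, and so on.
enum InstanceRequest {
    PutState,
    Terminate,
}

struct ManagerRequest {
    instance_id: u64,
    request: InstanceRequest,
}

// The InstanceManager receives requests on one channel and holds a second,
// per-instance channel for each InstanceRunner task it manages.
struct InstanceManager {
    rx: mpsc::Receiver<ManagerRequest>,
    runners: HashMap<u64, mpsc::Sender<InstanceRequest>>,
}
```
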
The `tokio::sync::mpsc` channel used here is a *bounded* channel: there is a
maximum number of messages that may be queued in it at any given time.
The `mpsc::Sender::send` method is an `async fn`, and if the channel is
at capacity, it will _wait_ until there is space in the channel again
before sending the message. Presently,
`mpsc::Sender::send` is called by the `InstanceManager`'s main run loop
when dispatching a request to an individual instance. As you may have
already started to piece together, this means that if a given
`InstanceRunner` task cannot process requests fast enough to drain its
channel, the entire `InstanceManager` loop will wait when dispatching a
request to that instance until space opens up in its queue.
This means that if one instance's runner task has gotten stuck on
something, like waiting for a Crucible flush that will never complete
(as seen in #6911), that instance will prevent requests from being
dispatched to *any other instance* managed by the sled-agent. This is quite
unfortunate!
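
Continuing the sketch above, the old dispatch loop looked roughly like this; the `.await` on `send` is the head-of-line blocking point:

```rust
// Before this change (in sketch form): dispatch awaits `send`, so one full
// per-instance queue stalls dispatch for every instance on the sled.
async fn run(mut mgr: InstanceManager) {
    while let Some(msg) = mgr.rx.recv().await {
        if let Some(runner_tx) = mgr.runners.get(&msg.instance_id) {
            // If this runner's channel is at capacity, this await parks the
            // whole manager loop; requests for healthy instances queue up
            // behind the stuck one.
            let _ = runner_tx.send(msg.request).await;
        }
    }
}
```
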
This commit fixes this behavior by changing the functions that send
requests to an individual instance's task to instead *shed load* when
that instance's request queue is full. We now use the
`mpsc::Sender::try_send` method, rather than `mpsc::Sender::send`, which
does not wait and instead immediately returns an error when the channel
is full. This allows the `InstanceManager` to instead return an error to
the client indicating the channel is full, and move on to processing
requests to other instances which may not be stuck. Thus, a single stuck
instance can no longer block requests from being dispatched to other,
perfectly fine instances.
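
In sketch form, and still using the illustrative types from above, the new dispatch path looks something like this (the error names here are made up for the example):

```rust
use tokio::sync::mpsc::{error::TrySendError, Sender};

// Illustrative error type; the real code distinguishes more cases.
enum DispatchError {
    QueueFull,
    InstanceGone,
}

// `try_send` returns immediately, so a full queue for one instance no
// longer blocks dispatch to the others.
fn dispatch(
    runner_tx: &Sender<InstanceRequest>,
    request: InstanceRequest,
) -> Result<(), DispatchError> {
    match runner_tx.try_send(request) {
        Ok(()) => Ok(()),
        // Channel at capacity: shed the load and report it to the caller
        // instead of waiting for the runner to drain its queue.
        Err(TrySendError::Full(_)) => Err(DispatchError::QueueFull),
        // The runner task has gone away entirely.
        Err(TrySendError::Closed(_)) => Err(DispatchError::InstanceGone),
    }
}
```
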
The error returned when the channel is at capacity is converted to an
HTTP 503 Service Unavailable error by the API. This indicates to the
client that their request to that instance could not be processed at this
time, but that it may be processed successfully in the
future.[^2] Now, we can shed load while allowing clients to retry later,
which seems much better than the present situation.
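
Roughly, the mapping looks like this (shown with plain `http` status codes rather than the API's actual error types, and continuing the illustrative `DispatchError` from the sketch above):

```rust
use http::StatusCode;

// Illustrative mapping only; the real API layer constructs its errors
// through its own error types rather than raw status codes.
fn status_for(err: &DispatchError) -> StatusCode {
    match err {
        // The queue may well have drained by the time the client retries,
        // so report a transient failure.
        DispatchError::QueueFull => StatusCode::SERVICE_UNAVAILABLE,
        // A vanished runner task is not transient; 410 Gone is used here
        // purely for illustration.
        DispatchError::InstanceGone => StatusCode::GONE,
    }
}
```
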
[^1]: In the sense of "may you live in interesting times", naturally.
[^2]: I also considered returning 429 Too Many Requests here, but my
understanding is that that status code is supposed to indicate that
too many requests have been received from *that specific client*. In
this case, we haven't hit a per-client rate limit; we're just
overloaded by requests more broadly, so it's not that particular
client's fault.