Stacked on top of #6913
Presently, sled-agent sends requests to terminate an instance to the
`InstanceRunner` task over the same `tokio::sync::mpsc` request channel
as all other requests sent to that instance. This means that the
`InstanceRunner` will attempt to terminate the instance only once other
requests received before the termination request have been processed,
and an instance cannot be terminated if its request channel has filled
up. Similarly, if an instance's `InstanceRunner` task is waiting for an
in-flight request to the VMM to complete, the request to terminate the
instance will not be seen until the current request to Propolis has
returned. This means that if the instance has gotten stuck for some
reason --- e.g., because it is attempting a Crucible snapshot that
cannot complete because a physical disk has gone missing, as seen in
#6911 --- the instance cannot be terminated. Sadly, in this case, the
only way to resolve the stuck request is to terminate the instance, but
we cannot do so *because* the instance is stuck.
This seems unfortunate: if we try to kill an instance because it's doing
something that it will never be able to finish, it shouldn't be able to
say "no, you can't kill me, I'm too *busy* to die!". Instead, requests
to terminate the instance should be prioritized over other requests.
This commit does that.
Rather than sending termination requests to the `InstanceRunner` over
the same channel as all other requests, we instead introduce a separate
channel that's *just* for termination requests, which is preferred over
the request channel in the biased `tokio::select!` in the
`InstanceRunner` run loop. This means that a full request channel cannot
stop a termination request from being sent. When a request to the VMM is
in flight, the future that awaits that request's completion is now one
branch of a similar `tokio::select!` with the termination channel. This
way, if a termination request comes in while the `InstanceRunner` is
awaiting an in-flight instance operation, it will still be notified
immediately of the termination request, cancel whatever operation it's
waiting for, and go ahead and terminate the VMM immediately. This is the
correct behavior here, since the terminate operation is intended to
forcefully terminate the VMM *now*, and is used internally for purposes
such as `use_only_these_disks` killing instances that are using a
no-longer-extant disk, or the control plane requesting that the
sled-agent forcibly unregister the instance. "Normal" requests to stop
the instance gracefully will go through the `instance_put_state` API
instead, sending requests through the normal request channel and
allowing in-flight operations to complete.