server: refactor instance initialization #715
Conversation
Despite the draft status, this is more or less ready to review, and comments are very welcome. I noticed at least one more bolt I want to tighten (in external request queue cleanup) and am going to fix it before making the PR official.
I'm still trying to wrap my head around this change --- there's a lot going on here. Here's a cursory review of the first couple files in the diff, as I continue reading the rest.
I need to spend more time going through the server bits, but here are some round-1 comments.
Sorry, I had a lingering unpublished comment
This is going to need some extra work to rebase on top of 99fa564; trying to sort that out now.
Getting rid of all the `block_on` calls and making everything async turns out to have created a somewhat substantial hassle: now certain locks have to be tokio locks (to avoid holding a lock over an await point), which means new routines have to be async, and some of those are in traits, so now everything is returning a `BoxFuture` and arghhh. Maybe this would be less of a hassle if we weren't using trait objects?
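To make the `BoxFuture` hassle concrete, here is a minimal sketch (invented names, not code from this PR) of the pattern that falls out when an async method has to stay callable through a trait object:

```rust
use futures::future::BoxFuture;

// Invented stand-in for a lifecycle-style trait that is used via `dyn`; the
// real traits in propolis-server are richer than this.
trait LifecycleOps: Send + Sync {
    // `async fn` in a trait can't be called through `dyn LifecycleOps`, so an
    // object-safe async method ends up returning a boxed future instead.
    fn pause(&self) -> BoxFuture<'_, ()>;
}

struct Component;

impl LifecycleOps for Component {
    fn pause(&self) -> BoxFuture<'_, ()> {
        Box::pin(async move {
            // A tokio (rather than std) lock would be taken here so that it
            // can safely be held across an await point.
        })
    }
}

// The caller has to become async too, which is how the async-ness spreads.
async fn pause_all(components: &[Box<dyn LifecycleOps>]) {
    for c in components {
        c.pause().await;
    }
}
```

Dropping the trait objects (or switching to generics) would allow plain `async fn`s and remove the boxing, which is the tradeoff the comment above is weighing.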
Lose the `dyn VmLifecycle` in favor of a fake-based testing scheme to be created later. Reorganize structures and state driver startup to clean things up some more. This makes the changes feel like much less of a hacky mess than before.
This doesn't happen automatically on ensure unless the target is migrating in, so fix that up.
- use wrapper types to hide the tokio-ness of the underlying reader-writer lock (a rough sketch of the idea follows below)
- provide direct access to some subcomponents of `Machine` for brevity
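As an illustration of the first bullet, a wrapper along these lines (names and signatures invented here, not taken from the PR) keeps `tokio::sync::RwLock` as an implementation detail while still handing out guards that are safe to hold across await points:

```rust
use tokio::sync::{RwLock, RwLockReadGuard, RwLockWriteGuard};

/// Hypothetical wrapper: callers see only these methods, not the tokio lock
/// itself, so the locking strategy can change without touching call sites.
pub struct GuardedObjects<T> {
    inner: RwLock<T>,
}

impl<T> GuardedObjects<T> {
    pub fn new(value: T) -> Self {
        Self { inner: RwLock::new(value) }
    }

    /// Shared access; the tokio guard is safe to hold across `.await`.
    pub async fn lock_shared(&self) -> RwLockReadGuard<'_, T> {
        self.inner.read().await
    }

    /// Exclusive access, for the few callers that actually mutate.
    pub async fn lock_exclusive(&self) -> RwLockWriteGuard<'_, T> {
        self.inner.write().await
    }
}
```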
Run live migration protocols on the state driver's tokio task without using `block_on`:
- Define `SourceProtocol` and `DestinationProtocol` traits that describe the routines the state driver uses to run a generic migration irrespective of its protocol version. (This will be useful for protocol versioning later.)
- Move the `migrate::source_start` and `migrate::dest_initiate` routines into factory functions that connect to the peer Propolis, negotiate protocol versions, and return an appropriate protocol impl.
- Use the protocol impls to run migration on the state driver task. Remove all the types and constructs used to pass messages between it and migration tasks.

Also, improve the interface between the `vm` and `migrate` modules for inbound migrations by defining some objects that migrations can use either to fully initialize a VM or to unwind correctly if migration fails. This allows migration to take control of when precisely a VM's components get created (and from what spec) without exposing to the migration task all the complexity of unwinding from a failed attempt to create a VM.

Tested via full PHD run with a Debian 11 guest.
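A rough sketch of the shape this describes (all names, signatures, and the version-negotiation details are illustrative, not the PR's actual code): the state driver only sees a protocol trait, and a factory hands back whichever impl the peers negotiate.

```rust
use futures::future::BoxFuture;

// Minimal stand-ins so the sketch is self-contained.
pub struct VmObjects;
#[derive(Debug)]
pub struct MigrationError;

/// What the state driver needs from an outbound migration, independent of the
/// wire protocol version.
pub trait SourceProtocol: Send {
    fn run<'a>(
        &'a mut self,
        vm: &'a mut VmObjects,
    ) -> BoxFuture<'a, Result<(), MigrationError>>;
}

struct SourceV0 {
    // connection state, offered spec, etc. would live here
}

impl SourceProtocol for SourceV0 {
    fn run<'a>(
        &'a mut self,
        _vm: &'a mut VmObjects,
    ) -> BoxFuture<'a, Result<(), MigrationError>> {
        Box::pin(async move {
            // Drive the v0 phases (pause, export device state, ...) here.
            Ok(())
        })
    }
}

/// Hypothetical factory in the spirit of `migrate::source_start`: connect to
/// the peer, negotiate a version, and return a matching protocol impl.
pub async fn source_start(
    _peer_addr: &str,
) -> Result<Box<dyn SourceProtocol>, MigrationError> {
    // Version negotiation would happen on the connection; this sketch always
    // settles on v0.
    Ok(Box::new(SourceV0 {}))
}
```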
Force-pushed from 994cdf3 to e0ed049.
@hawkw I've taken the liberty of leaving a few comments on the LHS of the bigger files that got broken up (server.rs, vm/mod.rs) to try to help point out where code has moved to (since there are a lot of things that just moved without really changing). As always, let me know if I can answer any questions!
MigratedIn,
ExplicitRequest,
}

/// A wrapper around all the data needed to describe the status of a live
External state publishing structs now live in state_publisher.rs.
@@ -86,164 +74,11 @@ pub struct StaticConfig {
metrics: Option<MetricsEndpointConfig>,
}

/// The state of the current VM controller in this server, if there is one, or |
This is subsumed by the state machine in vm/mod.rs.
}
}

/// Objects related to Propolis's Oximeter metric production. |
The structs and procedures we use to manage the Oximeter producer are more or less unchanged, but the logic now lives in vm/services.rs.
@@ -1024,66 +632,29 @@ async fn instance_issue_crucible_vcr_request(
let request = request.into_inner();
let new_vcr_json = request.vcr_json;
let disk_name = request.name;
let log = rqctx.log.clone(); |
The process for reconfiguring a Crucible volume is basically unchanged, but the work now happens on the state driver.
HttpError::for_internal_error(format!("failed to create instance: {e}"))
})?;

if let Some(ramfb) = vm.framebuffer() { |
The setup logic for the VNC and serial tasks is unchanged but now lives in `VmServices::new`.
//! state (and as to the steps that are required to move it to a different
//! state). Operations like live migration that require components to pause and
//! resume coordinate directly with the state driver thread.
//! ```text
This state machine is one of the biggest sources of new code in this PR. The old method of tearing down a VM involved the following:
- when a new `VmController` is created, create a "VM stopped" oneshot in the API server layer and a task that listens on it
- move the sender half of the oneshot into the `VmController` (and then into its state driver) so that it can be signaled when the VM is destroyed
- when the task is signaled, tear down all the services (VNC, serial, Oximeter) that the server created and move its state machine to the "VM was here but is gone now" state

The underlying principles in this state machine are similar: when the guest stops, the VM moves from the "active" state to the "rundown" state and then waits for all references to the VM's objects to be dropped, at which point the VM is destroyed, the state machine moves to "rundown complete," and a new VM can be created. The main difference is that instead of spreading objects and their lifetimes across the VM module and the Dropshot server, this model concentrates everything in the VM module and makes the server layer much dumber (as it should be, IMO).
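For readers following along, here is a very rough sketch of that lifecycle (state names loosely follow the comment; none of this is the PR's actual code, and the real state machine in vm/mod.rs carries more states and data):

```rust
use std::sync::Arc;

// Stand-in for the VM's component collection; in the real code, rundown
// finishes once the last reference to this is gone.
struct VmObjects;

/// Hypothetical rendering of the lifecycle described above.
enum VmState {
    /// No VM has been created in this server yet.
    NoVm,
    /// A VM is running; API callers clone the Arc to use its objects.
    Active(Arc<VmObjects>),
    /// The guest has stopped; waiting for outstanding references to the VM's
    /// objects to be dropped.
    Rundown,
    /// Everything is gone; a new VM may now be created.
    RundownComplete,
}

impl VmState {
    /// Called when the guest stops.
    fn begin_rundown(&mut self) {
        if matches!(self, VmState::Active(_)) {
            *self = VmState::Rundown;
        }
    }

    /// Called once the last reference to the VM's objects has been dropped.
    fn finish_rundown(&mut self) {
        if matches!(self, VmState::Rundown) {
            *self = VmState::RundownComplete;
        }
    }
}
```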
Lovely, thanks for writing up this comment! And, I think the changes you've described here are the Right Thing :)
Ok(controller)
}

pub fn properties(&self) -> &InstanceProperties { |
The vast majority of the accessors on `VmController` are now accessors on `ActiveVm` or `VmObjects` instead.
///
/// On success, clients may query the instance's migration status to
/// determine how the migration has progressed.
pub fn request_migration_from< |
A lot of the migration-related song and dance in the old controller was needed because migration tasks were async but the state driver was not. Now that the state driver is async almost all of this can go away (especially now that migrations run directly on the driver).
}
}

impl Drop for VmController { |
This impl is load-bearing in that it's what publishes the final "Destroyed" state for a VM that has gone away. In the new code this happens on the transition to the `RundownComplete` state. However, the extra cleanup tasks in this function aren't needed and were removed:
- dropping a `Machine` implicitly calls `destroy` on it
- block backends get detached during VM halt
/// having to supply mock implementations of the rest of the VM controller's
/// functionality.
#[cfg_attr(test, mockall::automock)]
trait StateDriverVmController { |
The functions in this trait moved to `VmObjects`, but the trait itself is gone--it existed to provide a way to mock these functions out for state driver testing. I haven't lost a lot of sleep over this because (a) PHD gives us a lot of signal as to whether these state transitions are working properly, and (b) mocks as test doubles in Rust have never sparked joy for me (compared to, say, the C++ codebase I worked in at $OLDJOB). I'm open to suggestions as to how to replace this (maybe some kind of `FakeVmObjects` is warranted, though I'm not sure exactly how it would look).
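Purely as speculation about what a fake-based scheme could look like (none of this is in the PR), a hand-rolled fake that records lifecycle calls might replace the mock without pulling in mockall:

```rust
/// Hypothetical fake: the state driver would be written against either a
/// small trait or an enum of "real objects or fake", and tests would assert
/// on the recorded calls instead of configuring mock expectations.
#[derive(Default)]
struct FakeVmObjects {
    /// Every lifecycle call the state driver made, in order.
    calls: Vec<&'static str>,
}

impl FakeVmObjects {
    fn pause(&mut self) {
        self.calls.push("pause");
    }

    fn resume(&mut self) {
        self.calls.push("resume");
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn records_calls_in_order() {
        let mut fake = FakeVmObjects::default();
        fake.pause();
        fake.resume();
        assert_eq!(fake.calls, vec!["pause", "resume"]);
    }
}
```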
Overall, this looks good --- most of my suggestions were on fairly minor style nits. The one thing I actually had some concerns about was the implementation of the `InputQueue` in `state_driver.rs`, which appears to be an attempt to bridge between synchronous producers and an asynchronous consumer --- IIUC, there is probably a much easier way to do this that avoids the footgunny `tokio::task::block_in_place`. If there isn't, it would be nice to know more about why we have to do it this way.
Besides that, this looks good! Much nicer overall.
Sorry to hold up merging this a second time, but I think there is a potential race in the new implementation of `InputQueue::wait_for_next_event`. I left a comment describing how we can fix it.
Besides that, everything else looks good --- thanks for addressing my remaining feedback! I'll happily approve this once the race has been addressed.
loop {
{
let mut guard = self.inner.lock().unwrap();
if let Some(guest_event) = guard.guest_events.pop_front() {
return InputQueueEvent::GuestEvent(guest_event);
} else if let Some(req) = guard.external_requests.pop_front() {
return InputQueueEvent::ExternalRequest(req);
}
}

/// Sets the published instance and/or migration state and increases the
/// state generation number.
fn set_published_state(&mut self, state: PublishedState) {
let (instance_state, migration_state) = match state {
PublishedState::Instance(i) => (Some(i), None),
PublishedState::Migration(m) => (None, Some(m)),
PublishedState::Complete(i, m) => (Some(i), Some(m)),
};

// Notifiers in this module must use `notify_one` so that their
// notifications will not be lost if they arrive after this routine
// checks the queues but before it actually polls the notify.
self.notify.notified().await;
I think we should be using `Notified::enable` here, to subscribe to a potential wakeup before checking the event queues. This way, if there's a race condition where new elements are pushed to the queue but we have not yet started to wait for a notification, we won't miss the wakeup.
I think the correct code would look something like this:
```rust
let notified = self.notify.notified();
tokio::pin!(notified);
loop {
    // Pre-subscribe to a wakeup from `self.notify` so that we don't
    // miss any notification that is sent while checking the event queues.
    notified.as_mut().enable();
    // Check whether there are any messages in either queue.
    {
        let mut guard = self.inner.lock().unwrap();
        if let Some(guest_event) = guard.guest_events.pop_front() {
            return InputQueueEvent::GuestEvent(guest_event);
        } else if let Some(req) = guard.external_requests.pop_front() {
            return InputQueueEvent::ExternalRequest(req);
        }
    }
    // Await a notification if the queues were empty.
    notified.as_mut().await;
    // Reset the `notified` future in case we must await an additional
    // notification.
    notified.set(self.notify.notified());
}
```
let mut guard = self.inner.lock().unwrap();
if guard
.guest_events
.enqueue(guest_event::GuestEvent::VcpuSuspendHalt(when))
{
self.notify.notify_one();
}
nit, take it or leave it: this gets repeated enough times that maybe we should have a helper like this:
```rust
impl InputQueue {
    // ...
    fn enqueue_guest_event(&self, event: guest_event::GuestEvent) {
        if self.inner.lock().unwrap().guest_events.enqueue(event) {
            self.notify.notify_one()
        }
    }
    // ...
}
```
not a big deal either way, though.
///
/// `None` if all the tasks completed successfully. `Some` if at least one
/// task failed; the wrapped value is a `Vec` of all of the returned errors.
pub async fn join_all(&self) -> Option<Vec<task::JoinError>> { |
Now that this is async, I think we could conceivably replace it with a new implementation using `tokio::task::JoinSet` instead of a `Vec<JoinHandle<()>>`. This would be more efficient, as it wouldn't have to poll every `JoinHandle` in a loop. I'm not sure if this is hot enough that it matters if we do that or not, though, and it would be fine to do it in a follow-up.
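For reference, a minimal sketch of the `JoinSet` variant being suggested (not the PR's code; note that `JoinSet::join_next` needs `&mut`, so unlike the existing `&self` method this sketch takes the set by value):

```rust
use tokio::task::{JoinError, JoinSet};

/// Drain a JoinSet, collecting any task failures. Tasks resolve in completion
/// order, so nothing is polled repeatedly the way a Vec<JoinHandle<()>> is.
async fn join_all(mut tasks: JoinSet<()>) -> Option<Vec<JoinError>> {
    let mut errors = Vec::new();
    while let Some(result) = tasks.join_next().await {
        if let Err(e) = result {
            errors.push(e);
        }
    }
    if errors.is_empty() {
        None
    } else {
        Some(errors)
    }
}
```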
Nevermind, I was wrong about that --- the race I was concerned about only exists in a multi-consumer use case, which isn't what this is. Sorry for the false alarm!
Refactor propolis-server to prepare for more work on instance specs and the live migration protocol. This change does a few things; in rough order of importance, they are:
Initialize migration targets' VM components after starting the migration task
Before this change, a request to create a VM that included a migration source would initialize all the VM's components first (using the parameters passed to the ensure API), then run the live migration protocol to import state into those components. Now the instance init code starts the migration protocol first and creates VM components at the LM task's behest. This enables a couple of things:
Clarify who is able to change a VM's configuration
Before this change, calls to change a Crucible volume's configuration were not synchronized with other VM state changes (e.g. ongoing live migrations). They also took a lock that protected a running VM's instance spec, but not any of its component collections; this is fine when configuring a backend in-place, but dangerous for other (future) operations like device hotplug. To help stabilize this area:
- The VM's instance spec and its component collections now live together in a single structure (`VmObjects`).
- API calls can lock `VmObjects` for shared access if they just need to read or query state.
- Only the state driver can lock `VmObjects` exclusively (the `ActiveVm` and `VmObjects` structs' impls and visibility markers prevent APIs from acquiring an exclusive lock directly). This allows the driver task to apply disposition rules to new requests (e.g. "no new requests to mutate if they follow a request to migrate out"), as sketched below.
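A minimal sketch of the disposition idea mentioned in the last bullet (names and variants invented for illustration; the real queue handles more request kinds and richer error reporting):

```rust
/// Requests that outside callers can make of a running VM (illustrative only).
enum ExternalRequest {
    MigrateOut,
    ReconfigureCrucibleVolume,
    Stop,
}

enum Disposition {
    Enqueue,
    Deny(&'static str),
}

/// The state driver owns this queue, so it can refuse mutating requests once
/// a migration out has been queued ahead of them.
#[derive(Default)]
struct RequestQueue {
    migrate_out_queued: bool,
}

impl RequestQueue {
    fn disposition(&mut self, req: &ExternalRequest) -> Disposition {
        match req {
            ExternalRequest::MigrateOut => {
                self.migrate_out_queued = true;
                Disposition::Enqueue
            }
            // The destination will own the VM's configuration from here on,
            // so configuration changes queued behind a migrate-out are denied.
            ExternalRequest::ReconfigureCrucibleVolume
                if self.migrate_out_queued =>
            {
                Disposition::Deny("a request to migrate out is already queued")
            }
            _ => Disposition::Enqueue,
        }
    }
}
```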
No state machine in the Dropshot server layer

Before this change, the propolis-server Dropshot server context contained a `VmControllerState` type, which tracked whether the server thought there was an extant VM, and a `ServiceProviders` type that owned references to the "external" services offered from the VM's components (serial console access, VNC access, and Oximeter metrics). Move all this logic down into the `vm` module so that the API layer is responsible just for dispatching incoming calls and translating responses into appropriate HTTP errors.

General cleanup
The old `vm` module was a huge mess of types and associated functions. Break this up into several sub-modules for readability.

Testing notes
Tested with `cargo test` and a full PHD run with a Debian 11 guest and Crucible and migrate-from-base tests enabled.

Fixes #432.