
server: refactor instance initialization #715

Merged
merged 55 commits into master from gjcolombo/server-refactor-2024 on Jul 18, 2024

Conversation

gjcolombo (Contributor)

Refactor propolis-server to prepare for more work on instance specs and the live migration protocol. This change does a few things; in rough order of importance, they are:

Initialize migration targets' VM components after starting the migration task

Before this change, a request to create a VM that included a migration source would initialize all the VM's components first (using the parameters passed to the ensure API), then run the live migration protocol to import state into those components. Now the instance init code starts the migration protocol first and creates VM components at the live migration (LM) task's behest; a minimal sketch of the new flow follows the list below. This enables a couple of things:

  • The path to initializing devices from their migration payloads is now somewhat straighter; see Instance (re)creation from migration payload #706.
  • It's also much easier to have migration sources tell migration targets what components to create (instead of requiring the control plane to specify identical component manifests) if we want to do so.
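The sketch below illustrates the reordered flow, assuming tokio. The names here (build_vm_objects, run_destination_protocol, and the InstanceSpec/VmObjects shapes) are placeholders rather than this PR's exact API; the key point is that the migration task starts before any components exist and decides when to create them.

use tokio::sync::oneshot;

struct InstanceSpec {
    vcpus: u8,
}

struct VmObjects {
    spec: InstanceSpec,
}

fn build_vm_objects(spec: InstanceSpec) -> VmObjects {
    // In the real server this is where devices, backends, etc. get created.
    VmObjects { spec }
}

async fn run_destination_protocol(objects_tx: oneshot::Sender<VmObjects>) {
    // The migration task talks to the source, determines the spec to use,
    // and only then asks for components to be created...
    let imported_spec = InstanceSpec { vcpus: 4 };
    let objects = build_vm_objects(imported_spec);
    // ...then imports device state into those components before handing
    // them back to the ensure path.
    let _ = objects_tx.send(objects);
}

#[tokio::main]
async fn main() {
    let (tx, rx) = oneshot::channel();

    // Start the migration task *before* any VM components exist.
    let lm_task = tokio::spawn(run_destination_protocol(tx));

    // The ensure path waits for the migration task to hand back fully
    // initialized, state-imported VM objects (or to fail and unwind).
    let objects = rx.await.expect("migration task produced VM objects");
    println!("VM created with {} vCPUs", objects.spec.vcpus);

    lm_task.await.unwrap();
}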

Clarify who is able to change a VM's configuration

Before this change, calls to change a Crucible volume's configuration were not synchronized with other VM state changes (e.g. ongoing live migrations). They also took a lock that protected a running VM's instance spec, but not any of its component collections; this is fine when configuring a backend in-place, but dangerous for other (future) operations like device hotplug. To help stabilize this area:

  • A server's Propolis components and its instance spec are now encapsulated in a single struct (VmObjects).
  • Dropshot API handlers can lock the active VmObjects for shared access if they just need to read or query state.
  • APIs that mutate state must instead submit a request to the VM's state driver task (the ActiveVm and VmObjects structs' impls and visibility markers prevent API handlers from acquiring an exclusive lock directly). This allows the driver task to apply disposition rules to new requests (e.g. "no new mutation requests after a request to migrate out"); a minimal sketch of this arrangement follows the list.
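The sketch below illustrates the ownership arrangement described in this list. VmObjects, ActiveVm, and the idea of a request channel to the state driver come from the PR description; the fields, method names, and signatures are illustrative assumptions.

use std::sync::Arc;
use tokio::sync::{mpsc, RwLock, RwLockReadGuard};

struct InstanceSpec;
struct Machine;

// The spec and the component collections now live behind a single lock.
struct VmObjects {
    spec: InstanceSpec,
    machine: Machine,
}

enum StateChangeRequest {
    ReconfigureCrucibleVolume { disk_name: String, vcr_json: String },
    // hotplug, migrate-out, etc. would be further variants
}

struct ActiveVm {
    objects: Arc<RwLock<VmObjects>>,
    // API handlers can only *send* mutation requests; the state driver task
    // owns the receiver and decides whether and when to apply each one.
    to_driver: mpsc::Sender<StateChangeRequest>,
}

impl ActiveVm {
    /// Shared access for handlers that only need to read or query state.
    async fn objects(&self) -> RwLockReadGuard<'_, VmObjects> {
        self.objects.read().await
    }

    /// Mutations go through the state driver, which can refuse them based on
    /// the VM's disposition (e.g. after a request to migrate out).
    async fn request_change(
        &self,
        req: StateChangeRequest,
    ) -> Result<(), mpsc::error::SendError<StateChangeRequest>> {
        self.to_driver.send(req).await
    }
}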

No state machine in the Dropshot server layer

Before this change, the propolis-server Dropshot server context contained a VmControllerState type, which tracked whether the server thought there was an extant VM, and a ServiceProviders type that owned references to the "external" services offered from the VM's components (serial console access, VNC access, and Oximeter metrics). This change moves all of that logic down into the vm module, so that the API layer is responsible only for dispatching incoming calls and translating responses into appropriate HTTP errors.

General cleanup

The old vm module was a huge mess of types and associated functions. This change breaks it up into several sub-modules for readability.

Testing notes

Tested with cargo test and a full PHD run with a Debian 11 guest and Crucible and migrate-from-base tests enabled.

Fixes #432.

@gjcolombo (Contributor, Author)

Despite the draft status, this is more or less ready to review, and comments are very welcome. I noticed at least one more bolt I want to tighten (in external request queue cleanup) and am going to fix it before making the PR official.

@pfmooney pfmooney self-assigned this Jul 1, 2024
@gjcolombo gjcolombo marked this pull request as ready for review July 1, 2024 17:26
@gjcolombo gjcolombo requested review from hawkw and pfmooney July 1, 2024 17:27
@hawkw (Member) left a comment

I'm still trying to wrap my head around this change --- there's a lot going on here. Here's a cursory review of the first couple files in the diff, as I continue reading the rest.

Review threads (resolved): bin/propolis-server/src/lib/migrate/destination.rs (2), bin/propolis-server/src/lib/migrate/source.rs (3)
@pfmooney (Collaborator) left a comment

I need to spend more time going through the server bits, but here are some round-1 comments.

Review threads (resolved): lib/propolis/src/tasks.rs, lib/propolis/src/block/mod.rs (2), bin/propolis-server/src/lib/initializer.rs, bin/propolis-server/src/lib/vm/objects.rs (3)
@pfmooney (Collaborator) left a comment

Sorry, I had a lingering unpublished comment.

Review thread: bin/propolis-server/src/lib/server.rs
@gjcolombo (Contributor, Author)

This is going to need some extra work to rebase on top of 99fa564; trying to sort that out now.

Getting rid of all the `block_on` calls and making everything async
turns out to have created a somewhat substantial hassle: now certain
locks have to be tokio locks (to avoid holding a lock over an await
point), which means new routines have to be async, and some of those are
in traits, so now everything is returning a `BoxFuture` and arghhh.
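An illustrative sketch of the pattern this commit message is complaining about, assuming the futures and tokio crates: once an object-safe trait method has to await (for example, to take a tokio lock), it can no longer be a plain async fn and ends up returning a boxed future. The VmLifecycle name appears later in this PR's history, but this trait body is invented for illustration.

use futures::future::BoxFuture;
use futures::FutureExt;
use std::sync::Arc;
use tokio::sync::Mutex;

trait VmLifecycle: Send + Sync {
    // Object-safe traits (before async-fn-in-trait) can't contain `async fn`,
    // so every async operation ends up boxed.
    fn pause(&self) -> BoxFuture<'_, ()>;
}

struct Vm {
    paused: Arc<Mutex<bool>>,
}

impl VmLifecycle for Vm {
    fn pause(&self) -> BoxFuture<'_, ()> {
        async move {
            // A tokio mutex, so a guard could be held across await points.
            *self.paused.lock().await = true;
        }
        .boxed()
    }
}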

Maybe this would be less of a hassle if we weren't using trait objects?
Lose the `dyn VmLifecycle` in favor of a fake-based testing scheme to be
created later. Reorganize structures and state driver startup to clean
things up some more. This makes the changes feel like much less of a
hacky mess than before.
This doesn't happen automatically on ensure unless the target is
migrating in, so fix that up.
- use wrapper types to hide the tokio-ness of the underlying
  reader-writer lock
- provide direct access to some subcomponents of `Machine` for brevity
Run live migration protocols on the state driver's tokio task without using
`block_on`:

- Define `SourceProtocol` and `DestinationProtocol` traits that describe
  the routines the state driver uses to run a generic migration
  irrespective of its protocol version. (This will be useful for protocol
  versioning later.)
- Move the `migrate::source_start` and `migrate::dest_initiate` routines
  into factory functions that connect to the peer Propolis, negotiate
  protocol versions, and return an appropriate protocol impl.
- Use the protocol impls to run migration on the state driver task. Remove
  all the types and constructs used to pass messages between it and
  migration tasks.

Also, improve the interface between the `vm` and `migrate` modules for
inbound migrations by defining some objects that migrations can use either
to fully initialize a VM or to unwind correctly if migration fails. This
allows migration to take control of when precisely a VM's components get
created (and from what spec) without exposing to the migration task all the
complexity of unwinding from a failed attempt to create a VM.

Tested via full PHD run with a Debian 11 guest.
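A hedged sketch of the trait/factory arrangement this commit message describes. SourceProtocol, DestinationProtocol, and dest_initiate are names taken from the message itself; the signatures, the MigrateError type, and Version0Destination are illustrative assumptions (this also assumes the async-trait crate).

use async_trait::async_trait;
use std::net::SocketAddr;

#[derive(Debug)]
struct MigrateError(String);

/// Drives an outbound migration, whatever protocol version was negotiated.
#[async_trait]
trait SourceProtocol: Send {
    async fn run(&mut self) -> Result<(), MigrateError>;
}

/// Drives an inbound migration; on success, the VM's components exist and
/// have had their state imported.
#[async_trait]
trait DestinationProtocol: Send {
    async fn run(&mut self) -> Result<(), MigrateError>;
}

struct Version0Destination {
    // connection to the source Propolis, a handle for building VM objects, etc.
}

#[async_trait]
impl DestinationProtocol for Version0Destination {
    async fn run(&mut self) -> Result<(), MigrateError> {
        // Run the phases of the (hypothetical) version-0 protocol here,
        // asking the vm module to create components at the appropriate point.
        Ok(())
    }
}

/// Factory: connect to the peer, negotiate a protocol version, and return
/// whichever impl matches it. The state driver then just calls `run()`.
async fn dest_initiate(
    _source_addr: SocketAddr,
) -> Result<Box<dyn DestinationProtocol>, MigrateError> {
    Ok(Box::new(Version0Destination {}))
}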
@gjcolombo gjcolombo force-pushed the gjcolombo/server-refactor-2024 branch from 994cdf3 to e0ed049 Compare July 16, 2024 19:01
@gjcolombo (Contributor, Author)

> This is going to need some extra work to rebase on top of 99fa564; trying to sort that out now.

This is done in e0ed049. I tested it by running pstack on a few Propolis servers and verifying that pstack reports the expected number of tokio-rt-vmm threads.

@gjcolombo gjcolombo requested a review from hawkw July 17, 2024 16:53
@gjcolombo (Contributor, Author) left a comment

@hawkw I've taken the liberty of leaving a few comments on the LHS of the bigger files that got broken up (server.rs, vm/mod.rs) to try to help point out where code has moved to (since there are a lot of things that just moved without really changing). As always, let me know if I can answer any questions!

MigratedIn,
ExplicitRequest,
}

/// A wrapper around all the data needed to describe the status of a live
@gjcolombo (Contributor, Author):

External state publishing structs now live in state_publisher.rs.

@@ -86,164 +74,11 @@ pub struct StaticConfig {
metrics: Option<MetricsEndpointConfig>,
}

/// The state of the current VM controller in this server, if there is one, or
@gjcolombo (Contributor, Author):

This is subsumed by the state machine in vm/mod.rs.

}
}

/// Objects related to Propolis's Oximeter metric production.
@gjcolombo (Contributor, Author):

The structs and procedures we use to manage the Oximeter producer are more or less unchanged, but the logic now lives in vm/services.rs.

@@ -1024,66 +632,29 @@ async fn instance_issue_crucible_vcr_request(
let request = request.into_inner();
let new_vcr_json = request.vcr_json;
let disk_name = request.name;
let log = rqctx.log.clone();
@gjcolombo (Contributor, Author):

The process for reconfiguring a Crucible volume is basically unchanged, but the work now happens on the state driver.

HttpError::for_internal_error(format!("failed to create instance: {e}"))
})?;

if let Some(ramfb) = vm.framebuffer() {
@gjcolombo (Contributor, Author):

The setup logic for the VNC and serial tasks is unchanged but now lives in VmServices::new.

//! state (and as to the steps that are required to move it to a different
//! state). Operations like live migration that require components to pause and
//! resume coordinate directly with the state driver thread.
//! ```text
@gjcolombo (Contributor, Author):

This state machine is one of the biggest sources of new code in this PR. The old method of tearing down a VM involved the following:

  • when a new VmController is created, create a "VM stopped" oneshot in the API server layer and a task that listens on it
  • move the sender half of the oneshot into the VmController (and then into its state driver) so that it can be signaled when the VM is destroyed
  • when the task is signaled, tear down all the services (VNC, serial, Oximeter) that the server created and move its state machine to the "VM was here but is gone now" state

The underlying principles in this state machine are similar: when the guest stops, the VM moves from the "active" state to the "rundown" state and then waits for all references to the VM's objects to be dropped, at which point the VM is destroyed, the state machine moves to "rundown complete," and a new VM can be created. The main difference is that instead of spreading objects and their lifetimes across the VM module and the Dropshot server, this model concentrates everything in the VM module and makes the server layer much dumber (as it should be, IMO).
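A hedged sketch of that lifecycle, with state names taken from the comment above; the enum shape and the Arc/Weak bookkeeping are illustrative, not the PR's exact code.

use std::sync::{Arc, Weak};

struct VmObjects; // instance spec + components, shared with API handlers

enum VmState {
    /// No VM has been created in this server yet.
    NoVm,
    /// A VM exists; API handlers may hold strong references to its objects.
    Active { objects: Arc<VmObjects> },
    /// The guest has stopped; wait for outstanding references to drop.
    Rundown { objects: Weak<VmObjects> },
    /// Everything is gone; a new VM may now be created.
    RundownComplete,
}

impl VmState {
    /// Guest halt: move Active -> Rundown, downgrading the strong reference
    /// so rundown can complete once every other holder lets go.
    fn begin_rundown(&mut self) {
        let weak = match self {
            VmState::Active { objects } => Arc::downgrade(objects),
            _ => return,
        };
        *self = VmState::Rundown { objects: weak };
    }

    /// Move Rundown -> RundownComplete once the last reference is gone. The
    /// real code is notified rather than polled, and this is where the final
    /// "Destroyed" instance state gets published.
    fn check_rundown_complete(&mut self) {
        let done = matches!(self, VmState::Rundown { objects } if objects.strong_count() == 0);
        if done {
            *self = VmState::RundownComplete;
        }
    }
}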

@hawkw (Member):

Lovely, thanks for writing up this comment! And, I think the changes you've described here is the Right Thing :)

Ok(controller)
}

pub fn properties(&self) -> &InstanceProperties {
@gjcolombo (Contributor, Author):

The vast majority of the accessors on VmController are now accessors on ActiveVm or VmObjects instead.

///
/// On success, clients may query the instance's migration status to
/// determine how the migration has progressed.
pub fn request_migration_from<
@gjcolombo (Contributor, Author):

A lot of the migration-related song and dance in the old controller was needed because migration tasks were async but the state driver was not. Now that the state driver is async almost all of this can go away (especially now that migrations run directly on the driver).

}
}

impl Drop for VmController {
@gjcolombo (Contributor, Author):

This impl is load-bearing in that it's what publishes the final "Destroyed" state for a VM that has gone away. In the new code, this happens on the transition to the RundownComplete state. However, the extra cleanup tasks in this function aren't needed and were removed:

  • dropping a Machine implicitly calls destroy on it
  • block backends get detached during VM halt

/// having to supply mock implementations of the rest of the VM controller's
/// functionality.
#[cfg_attr(test, mockall::automock)]
trait StateDriverVmController {
@gjcolombo (Contributor, Author):

The functions in this trait moved to VmObjects, but the trait itself is gone--it existed to provide a way to mock these functions out for state driver testing. I haven't lost a lot of sleep over this because (a) PHD gives us a lot of signal as to whether these state transitions are working properly, and (b) mocks as test doubles in Rust have never sparked joy for me (compared to, say, the C++ codebase I worked in at $OLDJOB). I'm open to suggestions as to how to replace this (maybe some kind of FakeVmObjects is warranted, though I'm not sure exactly how it would look).
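For what it's worth, one possible shape for such a fake, sketched purely as an illustration (FakeVmObjects and everything in it is hypothetical, not anything in this PR): a hand-written test double that records lifecycle calls so tests can assert on their order, with no mocking framework involved.

use std::sync::Mutex;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LifecycleCall {
    Pause,
    Resume,
}

/// A hand-written test double: records the lifecycle calls the state driver
/// makes so a test can assert on their order.
#[derive(Default)]
struct FakeVmObjects {
    calls: Mutex<Vec<LifecycleCall>>,
}

impl FakeVmObjects {
    fn pause(&self) {
        self.calls.lock().unwrap().push(LifecycleCall::Pause);
    }

    fn resume(&self) {
        self.calls.lock().unwrap().push(LifecycleCall::Resume);
    }

    fn recorded_calls(&self) -> Vec<LifecycleCall> {
        self.calls.lock().unwrap().clone()
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn pause_then_resume_is_recorded_in_order() {
        let fake = FakeVmObjects::default();
        fake.pause();
        fake.resume();
        assert_eq!(
            fake.recorded_calls(),
            vec![LifecycleCall::Pause, LifecycleCall::Resume]
        );
    }
}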

@hawkw (Member) left a comment

Overall, this looks good --- most of my suggestions were on fairly minor style nits. The one thing I actually had some concerns about was the implementation of the InputQueue in state_driver.rs, which appears to be an attempt to bridge between synchronous producers and an asynchronous consumer --- IIUC, there is probably a much easier way to do this that avoids the footgunny tokio::task::block_in_place. If there isn't, it would be nice to know more about why we have to do it this way.

Besides that, this looks good! Much nicer overall.

Review threads (resolved): bin/propolis-server/src/lib/migrate/destination.rs (2), bin/propolis-server/src/lib/vm/guest_event.rs, bin/propolis-server/src/lib/vm/mod.rs, bin/propolis-server/src/lib/vm/objects.rs (3), bin/propolis-server/src/lib/vm/state_driver.rs, lib/propolis/src/tasks.rs
@gjcolombo gjcolombo requested a review from hawkw July 18, 2024 16:38
@hawkw (Member) left a comment

Sorry to hold up merging this a second time, but I think there is a potential race in the new implementation of InputQueue::wait_for_next_event. I left a comment describing how we can fix it.

Besides that, everything else looks good --- thanks for addressing my remaining feedback! I'll happily approve this once the race has been addressed.

Comment on lines 121 to 134
        loop {
            {
                let mut guard = self.inner.lock().unwrap();
                if let Some(guest_event) = guard.guest_events.pop_front() {
                    return InputQueueEvent::GuestEvent(guest_event);
                } else if let Some(req) = guard.external_requests.pop_front() {
                    return InputQueueEvent::ExternalRequest(req);
                }
            }

            // Notifiers in this module must use `notify_one` so that their
            // notifications will not be lost if they arrive after this routine
            // checks the queues but before it actually polls the notify.
            self.notify.notified().await;
@hawkw (Member):

I think we should be using Notified::enable here, to subscribe to a potential wakeup before checking the event queues. This way, if there's a race condition where new elements are pushed to the queue but we have not yet started to wait for a notification, we won't miss the wakeup.

I think the correct code would look something like this:

        let notified = self.notify.notified();
        tokio::pin!(notified);
        loop {
            // Pre-subscribe to a wakeup from `self.notify` so that we don't
            // miss any notification that is sent while checking the event queues.
            notified.as_mut().enable();

            // Check whether there are any messages in either queue.
            {
                let mut guard = self.inner.lock().unwrap();
                if let Some(guest_event) = guard.guest_events.pop_front() {
                    return InputQueueEvent::GuestEvent(guest_event);
                } else if let Some(req) = guard.external_requests.pop_front() {
                    return InputQueueEvent::ExternalRequest(req);
                }
            }

            // Await a notification if the queues were empty.
            notified.as_mut().await;
            // Reset the `notified` future in case we must await an additional
            // notification.
            notified.set(self.notify.notified());
        }

Comment on lines +165 to +171
let mut guard = self.inner.lock().unwrap();
if guard
.guest_events
.enqueue(guest_event::GuestEvent::VcpuSuspendHalt(when))
{
self.notify.notify_one();
}
@hawkw (Member):

nit, take it or leave it: this gets repeated enough times that maybe we should have a

impl InputQueue {
    // ...
    fn enqueue_guest_event(&self, event: guest_event::GuestEvent) {
        if self.inner.lock().unwrap().guest_events.enqueue(event) {
             self.notify.notify_one()
        }
    }
    // ...
}

not a big deal either way, though.

///
/// `None` if all the tasks completed successfully. `Some` if at least one
/// task failed; the wrapped value is a `Vec` of all of the returned errors.
pub async fn join_all(&self) -> Option<Vec<task::JoinError>> {
@hawkw (Member):

Now that this is async, I think we could conceivably replace it with a new implementation using tokio::task::JoinSet instead of a Vec<JoinHandle<()>>. This would be more efficient, as it wouldn't have to poll every JoinHandle in a loop. I'm not sure if this is hot enough that it matters if we do that or not, though, and it would be fine to do it in a follow-up.
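An illustrative sketch of the JoinSet-based approach being suggested; TaskGroup is a placeholder name and this is not the crate's actual implementation, just a demonstration of how join_next avoids polling every handle.

use tokio::sync::Mutex;
use tokio::task::{JoinError, JoinSet};

struct TaskGroup {
    tasks: Mutex<JoinSet<()>>,
}

impl TaskGroup {
    /// Wait for every task in the group. Returns `None` if all tasks
    /// completed successfully, or `Some(errors)` with every join error.
    async fn join_all(&self) -> Option<Vec<JoinError>> {
        let mut tasks = self.tasks.lock().await;
        let mut errors = Vec::new();
        // `join_next` resolves as soon as *any* task finishes, so this never
        // busy-polls the whole handle list the way a Vec<JoinHandle> loop would.
        while let Some(result) = tasks.join_next().await {
            if let Err(e) = result {
                errors.push(e);
            }
        }
        if errors.is_empty() {
            None
        } else {
            Some(errors)
        }
    }
}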

@hawkw (Member) left a comment

Nevermind, I was wrong about that --- the race I was concerned about only exists in a multi-consumer use case, which isn't what this is. Sorry for the false alarm!

@gjcolombo gjcolombo merged commit 4708ab2 into master Jul 18, 2024
10 checks passed
@gjcolombo gjcolombo deleted the gjcolombo/server-refactor-2024 branch July 18, 2024 23:54