
server: refactor instance initialization #715

Merged
merged 55 commits into master from gjcolombo/server-refactor-2024 on Jul 18, 2024

Conversation

gjcolombo (Contributor)

Refactor propolis-server to prepare for more work on instance specs and the live migration protocol. This change does a few things; in rough order of importance, they are:

Initialize migration targets' VM components after starting the migration task

Before this change, a request to create a VM that included a migration source would initialize all the VM's components first (using the parameters passed to the ensure API), then run the live migration protocol to import state into those components. Now the instance init code starts the migration protocol first and creates VM components at the live migration (LM) task's behest; a minimal sketch of the new flow follows the list below. This enables a couple of things:

  • The path to initializing devices from their migration payloads is now somewhat straighter; see Instance (re)creation from migration payload #706.
  • It's also much easier to have migration sources tell migration targets what components to create (instead of requiring the control plane to specify identical component manifests) if we want to do so.
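The sketch below illustrates the reordered flow, assuming tokio. The names here (build_vm_objects, run_destination_protocol, and the InstanceSpec/VmObjects shapes) are placeholders rather than this PR's exact API; the key point is that the migration task starts before any components exist and decides when to create them.

use tokio::sync::oneshot;

struct InstanceSpec {
    vcpus: u8,
}

struct VmObjects {
    spec: InstanceSpec,
}

fn build_vm_objects(spec: InstanceSpec) -> VmObjects {
    // In the real server this is where devices, backends, etc. get created.
    VmObjects { spec }
}

async fn run_destination_protocol(objects_tx: oneshot::Sender<VmObjects>) {
    // The migration task talks to the source, determines the spec to use,
    // and only then asks for components to be created...
    let imported_spec = InstanceSpec { vcpus: 4 };
    let objects = build_vm_objects(imported_spec);
    // ...then imports device state into those components before handing
    // them back to the ensure path.
    let _ = objects_tx.send(objects);
}

#[tokio::main]
async fn main() {
    let (tx, rx) = oneshot::channel();

    // Start the migration task *before* any VM components exist.
    let lm_task = tokio::spawn(run_destination_protocol(tx));

    // The ensure path waits for the migration task to hand back fully
    // initialized, state-imported VM objects (or to fail and unwind).
    let objects = rx.await.expect("migration task produced VM objects");
    println!("VM created with {} vCPUs", objects.spec.vcpus);

    lm_task.await.unwrap();
}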

Clarify who is able to change a VM's configuration

Before this change, calls to change a Crucible volume's configuration were not synchronized with other VM state changes (e.g. ongoing live migrations). They also took a lock that protected a running VM's instance spec, but not any of its component collections; this is fine when configuring a backend in-place, but dangerous for other (future) operations like device hotplug. To help stabilize this area:

  • A server's Propolis components and its instance spec are now encapsulated in a single struct (VmObjects).
  • Dropshot API handlers can lock the active VmObjects for shared access if they just need to read or query state.
  • APIs that mutate state must instead submit a request to the VM's state driver task (the ActiveVm and VmObjects structs' impls and visibility markers prevent API handlers from acquiring an exclusive lock directly). This allows the driver task to apply disposition rules to new requests (e.g. "no new mutation requests after a request to migrate out"); a minimal sketch of this arrangement follows the list.
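The sketch below illustrates the ownership arrangement described in this list. VmObjects, ActiveVm, and the idea of a request channel to the state driver come from the PR description; the fields, method names, and signatures are illustrative assumptions.

use std::sync::Arc;
use tokio::sync::{mpsc, RwLock, RwLockReadGuard};

struct InstanceSpec;
struct Machine;

// The spec and the component collections now live behind a single lock.
struct VmObjects {
    spec: InstanceSpec,
    machine: Machine,
}

enum StateChangeRequest {
    ReconfigureCrucibleVolume { disk_name: String, vcr_json: String },
    // hotplug, migrate-out, etc. would be further variants
}

struct ActiveVm {
    objects: Arc<RwLock<VmObjects>>,
    // API handlers can only *send* mutation requests; the state driver task
    // owns the receiver and decides whether and when to apply each one.
    to_driver: mpsc::Sender<StateChangeRequest>,
}

impl ActiveVm {
    /// Shared access for handlers that only need to read or query state.
    async fn objects(&self) -> RwLockReadGuard<'_, VmObjects> {
        self.objects.read().await
    }

    /// Mutations go through the state driver, which can refuse them based on
    /// the VM's disposition (e.g. after a request to migrate out).
    async fn request_change(
        &self,
        req: StateChangeRequest,
    ) -> Result<(), mpsc::error::SendError<StateChangeRequest>> {
        self.to_driver.send(req).await
    }
}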

No state machine in the Dropshot server layer

Before this change, the propolis-server Dropshot server context contained a VmControllerState type, which tracked whether the server thought there was an extant VM, and a ServiceProviders type that owned references to the "external" services offered from the VM's components (serial console access, VNC access, and Oximeter metrics). This change moves all of that logic down into the vm module, so that the API layer is responsible only for dispatching incoming calls and translating responses into appropriate HTTP errors.

General cleanup

The old vm module was a huge mess of types and associated functions. This change breaks it up into several sub-modules for readability.

Testing notes

Tested with cargo test and a full PHD run with a Debian 11 guest and Crucible and migrate-from-base tests enabled.

Fixes #432.

@gjcolombo (Contributor, Author)

Despite the draft status, this is more or less ready to review, and comments are very welcome. I noticed at least one more bolt I want to tighten (in external request queue cleanup) and am going to fix it before making the PR official.

@pfmooney pfmooney self-assigned this Jul 1, 2024
@gjcolombo gjcolombo marked this pull request as ready for review July 1, 2024 17:26
@gjcolombo gjcolombo requested review from hawkw and pfmooney July 1, 2024 17:27
@hawkw (Member) left a comment

I'm still trying to wrap my head around this change --- there's a lot going on here. Here's a cursory review of the first couple files in the diff, as I continue reading the rest.

Review threads (resolved): bin/propolis-server/src/lib/migrate/destination.rs (2), bin/propolis-server/src/lib/migrate/source.rs (3)
@pfmooney (Collaborator) left a comment

I need to spend more time going through the server bits, but here are some round-1 comments.

Review threads (resolved): lib/propolis/src/tasks.rs, lib/propolis/src/block/mod.rs (2), bin/propolis-server/src/lib/initializer.rs, bin/propolis-server/src/lib/vm/objects.rs (3)
@pfmooney (Collaborator) left a comment

Sorry, I had a lingering unpublished comment.

Review thread: bin/propolis-server/src/lib/server.rs
@gjcolombo (Contributor, Author)

This is going to need some extra work to rebase on top of 99fa564; trying to sort that out now.

Getting rid of all the `block_on` calls and making everything async
turns out to have created a somewhat substantial hassle: now certain
locks have to be tokio locks (to avoid holding a lock over an await
point), which means new routines have to be async, and some of those are
in traits, so now everything is returning a `BoxFuture` and arghhh.
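An illustrative sketch of the pattern this commit message is complaining about, assuming the futures and tokio crates: once an object-safe trait method has to await (for example, to take a tokio lock), it can no longer be a plain async fn and ends up returning a boxed future. The VmLifecycle name appears later in this PR's history, but this trait body is invented for illustration.

use futures::future::BoxFuture;
use futures::FutureExt;
use std::sync::Arc;
use tokio::sync::Mutex;

trait VmLifecycle: Send + Sync {
    // Object-safe traits (before async-fn-in-trait) can't contain `async fn`,
    // so every async operation ends up boxed.
    fn pause(&self) -> BoxFuture<'_, ()>;
}

struct Vm {
    paused: Arc<Mutex<bool>>,
}

impl VmLifecycle for Vm {
    fn pause(&self) -> BoxFuture<'_, ()> {
        async move {
            // A tokio mutex, so a guard could be held across await points.
            *self.paused.lock().await = true;
        }
        .boxed()
    }
}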

Maybe this would be less of a hassle if we weren't using trait objects?
Lose the `dyn VmLifecycle` in favor of a fake-based testing scheme to be
created later. Reorganize structures and state driver startup to clean
things up some more. This makes the changes feel like much less of a
hacky mess than before.
This doesn't happen automatically on ensure unless the target is
migrating in, so fix that up.
- use wrapper types to hide the tokio-ness of the underlying
  reader-writer lock
- provide direct access to some subcomponents of `Machine` for brevity
Run live migration protocols on the state driver's tokio task without using
`block_on`:

- Define `SourceProtocol` and `DestinationProtocol` traits that describe
  the routines the state driver uses to run a generic migration
  irrespective of its protocol version. (This will be useful for protocol
  versioning later.)
- Move the `migrate::source_start` and `migrate::dest_initiate` routines
  into factory functions that connect to the peer Propolis, negotiate
  protocol versions, and return an appropriate protocol impl.
- Use the protocol impls to run migration on the state driver task. Remove
  all the types and constructs used to pass messages between it and
  migration tasks.

Also, improve the interface between the `vm` and `migrate` modules for
inbound migrations by defining some objects that migrations can use either
to fully initialize a VM or to unwind correctly if migration fails. This
allows migration to take control of when precisely a VM's components get
created (and from what spec) without exposing to the migration task all the
complexity of unwinding from a failed attempt to create a VM.

Tested via full PHD run with a Debian 11 guest.
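A hedged sketch of the trait/factory arrangement this commit message describes. SourceProtocol, DestinationProtocol, and dest_initiate are names taken from the message itself; the signatures, the MigrateError type, and Version0Destination are illustrative assumptions (this also assumes the async-trait crate).

use async_trait::async_trait;
use std::net::SocketAddr;

#[derive(Debug)]
struct MigrateError(String);

/// Drives an outbound migration, whatever protocol version was negotiated.
#[async_trait]
trait SourceProtocol: Send {
    async fn run(&mut self) -> Result<(), MigrateError>;
}

/// Drives an inbound migration; on success, the VM's components exist and
/// have had their state imported.
#[async_trait]
trait DestinationProtocol: Send {
    async fn run(&mut self) -> Result<(), MigrateError>;
}

struct Version0Destination {
    // connection to the source Propolis, a handle for building VM objects, etc.
}

#[async_trait]
impl DestinationProtocol for Version0Destination {
    async fn run(&mut self) -> Result<(), MigrateError> {
        // Run the phases of the (hypothetical) version-0 protocol here,
        // asking the vm module to create components at the appropriate point.
        Ok(())
    }
}

/// Factory: connect to the peer, negotiate a protocol version, and return
/// whichever impl matches it. The state driver then just calls `run()`.
async fn dest_initiate(
    _source_addr: SocketAddr,
) -> Result<Box<dyn DestinationProtocol>, MigrateError> {
    Ok(Box::new(Version0Destination {}))
}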
@gjcolombo gjcolombo force-pushed the gjcolombo/server-refactor-2024 branch from 994cdf3 to e0ed049 Compare July 16, 2024 19:01
@gjcolombo (Contributor, Author)

> This is going to need some extra work to rebase on top of 99fa564; trying to sort that out now.

This is done in e0ed049. I tested it by running pstack on a few Propolis servers and verifying that pstack reports the expected number of tokio-rt-vmm threads.

@gjcolombo gjcolombo requested a review from hawkw July 17, 2024 16:53
@gjcolombo (Contributor, Author) left a comment

@hawkw I've taken the liberty of leaving a few comments on the LHS of the bigger files that got broken up (server.rs, vm/mod.rs) to try to help point out where code has moved to (since there are a lot of things that just moved without really changing). As always, let me know if I can answer any questions!

MigratedIn,
ExplicitRequest,
}

/// A wrapper around all the data needed to describe the status of a live
@gjcolombo (Contributor, Author):

External state publishing structs now live in state_publisher.rs.

@@ -86,164 +74,11 @@ pub struct StaticConfig {
metrics: Option<MetricsEndpointConfig>,
}

/// The state of the current VM controller in this server, if there is one, or
@gjcolombo (Contributor, Author):

This is subsumed by the state machine in vm/mod.rs.

}
}

/// Objects related to Propolis's Oximeter metric production.
@gjcolombo (Contributor, Author):

The structs and procedures we use to manage the Oximeter producer are more or less unchanged, but the logic now lives in vm/services.rs.

@@ -1024,66 +632,29 @@ async fn instance_issue_crucible_vcr_request(
let request = request.into_inner();
let new_vcr_json = request.vcr_json;
let disk_name = request.name;
let log = rqctx.log.clone();
@gjcolombo (Contributor, Author):

The process for reconfiguring a Crucible volume is basically unchanged, but the work now happens on the state driver.

HttpError::for_internal_error(format!("failed to create instance: {e}"))
})?;

if let Some(ramfb) = vm.framebuffer() {
@gjcolombo (Contributor, Author):

The setup logic for the VNC and serial tasks is unchanged but now lives in VmServices::new.

//! state (and as to the steps that are required to move it to a different
//! state). Operations like live migration that require components to pause and
//! resume coordinate directly with the state driver thread.
//! ```text
@gjcolombo (Contributor, Author):

This state machine is one of the biggest sources of new code in this PR. The old method of tearing down a VM involved the following:

  • when a new VmController is created, create a "VM stopped" oneshot in the API server layer and a task that listens on it
  • move the sender half of the oneshot into the VmController (and then into its state driver) so that it can be signaled when the VM is destroyed
  • when the task is signaled, tear down all the services (VNC, serial, Oximeter) that the server created and move its state machine to the "VM was here but is gone now" state

The underlying principles in this state machine are similar: when the guest stops, the VM moves from the "active" state to the "rundown" state and then waits for all references to the VM's objects to be dropped, at which point the VM is destroyed, the state machine moves to "rundown complete," and a new VM can be created. The main difference is that instead of spreading objects and their lifetimes across the VM module and the Dropshot server, this model concentrates everything in the VM module and makes the server layer much dumber (as it should be, IMO).
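A hedged sketch of that lifecycle, with state names taken from the comment above; the enum shape and the Arc/Weak bookkeeping are illustrative, not the PR's exact code.

use std::sync::{Arc, Weak};

struct VmObjects; // instance spec + components, shared with API handlers

enum VmState {
    /// No VM has been created in this server yet.
    NoVm,
    /// A VM exists; API handlers may hold strong references to its objects.
    Active { objects: Arc<VmObjects> },
    /// The guest has stopped; wait for outstanding references to drop.
    Rundown { objects: Weak<VmObjects> },
    /// Everything is gone; a new VM may now be created.
    RundownComplete,
}

impl VmState {
    /// Guest halt: move Active -> Rundown, downgrading the strong reference
    /// so rundown can complete once every other holder lets go.
    fn begin_rundown(&mut self) {
        let weak = match self {
            VmState::Active { objects } => Arc::downgrade(objects),
            _ => return,
        };
        *self = VmState::Rundown { objects: weak };
    }

    /// Move Rundown -> RundownComplete once the last reference is gone. The
    /// real code is notified rather than polled, and this is where the final
    /// "Destroyed" instance state gets published.
    fn check_rundown_complete(&mut self) {
        let done = matches!(self, VmState::Rundown { objects } if objects.strong_count() == 0);
        if done {
            *self = VmState::RundownComplete;
        }
    }
}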

@hawkw (Member):

Lovely, thanks for writing up this comment! And, I think the changes you've described here is the Right Thing :)

Ok(controller)
}

pub fn properties(&self) -> &InstanceProperties {
@gjcolombo (Contributor, Author):

The vast majority of the accessors on VmController are now accessors on ActiveVm or VmObjects instead.

///
/// On success, clients may query the instance's migration status to
/// determine how the migration has progressed.
pub fn request_migration_from<
@gjcolombo (Contributor, Author):

A lot of the migration-related song and dance in the old controller was needed because migration tasks were async but the state driver was not. Now that the state driver is async almost all of this can go away (especially now that migrations run directly on the driver).

}
}

impl Drop for VmController {
@gjcolombo (Contributor, Author):

This impl is load-bearing in that it's what publishes the final "Destroyed" state for a VM that has gone away. In the new code, this happens on the transition to the RundownComplete state. However, the extra cleanup tasks in this function aren't needed and were removed:

  • dropping a Machine implicitly calls destroy on it
  • block backends get detached during VM halt

/// having to supply mock implementations of the rest of the VM controller's
/// functionality.
#[cfg_attr(test, mockall::automock)]
trait StateDriverVmController {
@gjcolombo (Contributor, Author):

The functions in this trait moved to VmObjects, but the trait itself is gone--it existed to provide a way to mock these functions out for state driver testing. I haven't lost a lot of sleep over this because (a) PHD gives us a lot of signal as to whether these state transitions are working properly, and (b) mocks as test doubles in Rust have never sparked joy for me (compared to, say, the C++ codebase I worked in at $OLDJOB). I'm open to suggestions as to how to replace this (maybe some kind of FakeVmObjects is warranted, though I'm not sure exactly how it would look).
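For what it's worth, one possible shape for such a fake, sketched purely as an illustration (FakeVmObjects and everything in it is hypothetical, not anything in this PR): a hand-written test double that records lifecycle calls so tests can assert on their order, with no mocking framework involved.

use std::sync::Mutex;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LifecycleCall {
    Pause,
    Resume,
}

/// A hand-written test double: records the lifecycle calls the state driver
/// makes so a test can assert on their order.
#[derive(Default)]
struct FakeVmObjects {
    calls: Mutex<Vec<LifecycleCall>>,
}

impl FakeVmObjects {
    fn pause(&self) {
        self.calls.lock().unwrap().push(LifecycleCall::Pause);
    }

    fn resume(&self) {
        self.calls.lock().unwrap().push(LifecycleCall::Resume);
    }

    fn recorded_calls(&self) -> Vec<LifecycleCall> {
        self.calls.lock().unwrap().clone()
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn pause_then_resume_is_recorded_in_order() {
        let fake = FakeVmObjects::default();
        fake.pause();
        fake.resume();
        assert_eq!(
            fake.recorded_calls(),
            vec![LifecycleCall::Pause, LifecycleCall::Resume]
        );
    }
}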

@hawkw (Member) left a comment

Overall, this looks good --- most of my suggestions were on fairly minor style nits. The one thing I actually had some concerns about was the implementation of the InputQueue in state_driver.rs, which appears to be an attempt to bridge between synchronous producers and an asynchronous consumer --- IIUC, there is probably a much easier way to do this that avoids the footgunny tokio::task::block_in_place. If there isn't, it would be nice to know more about why we have to do it this way.

Besides that, this looks good! Much nicer overall.

Review threads (resolved): bin/propolis-server/src/lib/migrate/destination.rs (2), bin/propolis-server/src/lib/vm/guest_event.rs, bin/propolis-server/src/lib/vm/mod.rs, bin/propolis-server/src/lib/vm/objects.rs (3), bin/propolis-server/src/lib/vm/state_driver.rs, lib/propolis/src/tasks.rs
@gjcolombo gjcolombo requested a review from hawkw July 18, 2024 16:38
@hawkw (Member) left a comment

Sorry to hold up merging this a second time, but I think there is a potential race in the new implementation of InputQueue::wait_for_next_event. I left a comment describing how we can fix it.

Besides that, everything else looks good --- thanks for addressing my remaining feedback! I'll happily approve this once the race has been addressed.

Comment on lines 121 to 134
        loop {
            {
                let mut guard = self.inner.lock().unwrap();
                if let Some(guest_event) = guard.guest_events.pop_front() {
                    return InputQueueEvent::GuestEvent(guest_event);
                } else if let Some(req) = guard.external_requests.pop_front() {
                    return InputQueueEvent::ExternalRequest(req);
                }
            }

            // Notifiers in this module must use `notify_one` so that their
            // notifications will not be lost if they arrive after this routine
            // checks the queues but before it actually polls the notify.
            self.notify.notified().await;
@hawkw (Member):

I think we should be using Notified::enable here, to subscribe to a potential wakeup before checking the event queues. This way, if there's a race condition where new elements are pushed to the queue but we have not yet started to wait for a notification, we won't miss the wakeup.

I think the correct code would look something like this:

        let notified = self.notify.notified();
        tokio::pin!(notified);
        loop {
            // Pre-subscribe to a wakeup from `self.notify` so that we don't
            // miss any notification that is sent while checking the event queues.
            notified.as_mut().enable();

            // Check whether there are any messages in either queue.
            {
                let mut guard = self.inner.lock().unwrap();
                if let Some(guest_event) = guard.guest_events.pop_front() {
                    return InputQueueEvent::GuestEvent(guest_event);
                } else if let Some(req) = guard.external_requests.pop_front() {
                    return InputQueueEvent::ExternalRequest(req);
                }
            }

            // Await a notification if the queues were empty.
            notified.as_mut().await;
            // Reset the `notified` future in case we must await an additional
            // notification.
            notified.set(self.notify.notified());
        }

Comment on lines +165 to +171
let mut guard = self.inner.lock().unwrap();
if guard
.guest_events
.enqueue(guest_event::GuestEvent::VcpuSuspendHalt(when))
{
self.notify.notify_one();
}
@hawkw (Member):

nit, take it or leave it: this gets repeated enough times that maybe we should have a

impl InputQueue {
    // ...
    fn enqueue_guest_event(&self, event: guest_event::GuestEvent) {
        if self.inner.lock().unwrap().guest_events.enqueue(event) {
             self.notify.notify_one()
        }
    }
    // ...
}

not a big deal either way, though.

///
/// `None` if all the tasks completed successfully. `Some` if at least one
/// task failed; the wrapped value is a `Vec` of all of the returned errors.
pub async fn join_all(&self) -> Option<Vec<task::JoinError>> {
@hawkw (Member):

Now that this is async, I think we could conceivably replace it with a new implementation using tokio::task::JoinSet instead of a Vec<JoinHandle<()>>. This would be more efficient, as it wouldn't have to poll every JoinHandle in a loop. I'm not sure if this is hot enough that it matters if we do that or not, though, and it would be fine to do it in a follow-up.
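An illustrative sketch of the JoinSet-based approach being suggested; TaskGroup is a placeholder name and this is not the crate's actual implementation, just a demonstration of how join_next avoids polling every handle.

use tokio::sync::Mutex;
use tokio::task::{JoinError, JoinSet};

struct TaskGroup {
    tasks: Mutex<JoinSet<()>>,
}

impl TaskGroup {
    /// Wait for every task in the group. Returns `None` if all tasks
    /// completed successfully, or `Some(errors)` with every join error.
    async fn join_all(&self) -> Option<Vec<JoinError>> {
        let mut tasks = self.tasks.lock().await;
        let mut errors = Vec::new();
        // `join_next` resolves as soon as *any* task finishes, so this never
        // busy-polls the whole handle list the way a Vec<JoinHandle> loop would.
        while let Some(result) = tasks.join_next().await {
            if let Err(e) = result {
                errors.push(e);
            }
        }
        if errors.is_empty() {
            None
        } else {
            Some(errors)
        }
    }
}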

@hawkw (Member) left a comment

Nevermind, I was wrong about that --- the race I was concerned about only exists in a multi-consumer use case, which isn't what this is. Sorry for the false alarm!

@gjcolombo gjcolombo merged commit 4708ab2 into master Jul 18, 2024
10 checks passed
@gjcolombo gjcolombo deleted the gjcolombo/server-refactor-2024 branch July 18, 2024 23:54