Plan zone updates for target release #8024

Open: wants to merge 18 commits into main from plan-target-release

Conversation

@plotnick (Contributor) commented Apr 22, 2025

Plumb the TUF repo representing the current target_release through the policy and planner. Update one zone at a time to the control-plane artifacts in that repo. Non-Nexus zones are updated first, then Nexus.

TODO:

  • Correctly stage updates (including new zones) according to RFD 565 §9. The current test asserts the opposite of what should happen with, e.g., a new Nexus zone.
  • Plumb the target release generation number through the planner so that we can verify our idea of current and previous TUF repos. Instead, we'll restrict changes to target_release during updates: Restrict changes to target_release during an update #8056.
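The non-Nexus-first update order described above can be sketched as a minimal model. This is illustrative only: `ZoneKind`, `ImageSource`, `Zone`, and `next_zone_to_update` are made-up stand-ins, not the planner's real types.

```rust
// Hypothetical, simplified model of the update order: non-Nexus zones
// are moved to the new repo's artifacts one at a time; Nexus goes last.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum ZoneKind {
    Nexus,
    CruciblePantry,
    InternalDns,
}

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum ImageSource {
    OldArtifact,
    NewArtifact,
}

struct Zone {
    kind: ZoneKind,
    image_source: ImageSource,
}

/// Choose at most one zone to update in this planning pass:
/// non-Nexus zones first, Nexus only once the rest are done.
fn next_zone_to_update(zones: &[Zone]) -> Option<usize> {
    // First preference: any out-of-date non-Nexus zone.
    if let Some(i) = zones.iter().position(|z| {
        z.kind != ZoneKind::Nexus && z.image_source != ImageSource::NewArtifact
    }) {
        return Some(i);
    }
    // Only when all non-Nexus zones are current do we touch Nexus.
    zones.iter().position(|z| {
        z.kind == ZoneKind::Nexus && z.image_source != ImageSource::NewArtifact
    })
}
```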

@plotnick plotnick force-pushed the plan-target-release branch 4 times, most recently from 8019b67 to b254f32 Compare April 25, 2025 15:23
@plotnick plotnick marked this pull request as ready for review April 28, 2025 19:00
@plotnick plotnick force-pushed the plan-target-release branch from c6ece7d to d6a6734 Compare May 2, 2025 19:24
@plotnick plotnick force-pushed the plan-target-release branch from d6a6734 to 50c1169 Compare May 2, 2025 19:25
@plotnick plotnick force-pushed the plan-target-release branch from 276ae84 to ad7103e Compare May 5, 2025 17:52
@jgallagher jgallagher self-requested a review May 6, 2025 21:09
@@ -106,8 +107,41 @@ impl DataStore {
})
}

/// Returns a TUF repo description.
pub async fn update_tuf_repo_get_by_id(
Contributor:

I realize you're following the naming conventions already present in this file, but I wonder if others have the same reaction I did: when I see update_tuf_repo_* my first thought is "I'm updating a tuf repo in some way", not "the noun I'm acting on is an update tuf repo".

Contributor Author:

Dave had similar thoughts on #7518, and I agreed but didn't finish the job. 55b8e0e renames all the TUF datastore methods.

let conn = self.pool_connection_authorized(opctx).await?;
let repo_id = repo_id.into_untyped_uuid();
let repo = dsl::tuf_repo
.filter(dsl::id.eq(repo_id))
Contributor:

Tiny nit, feel free to ignore: this would be marginally safer if we removed the into_untyped_uuid() a few lines up and instead used

Suggested change
.filter(dsl::id.eq(repo_id))
.filter(dsl::id.eq(nexus_db_model::to_db_typed_uuid(repo_id)))

because to_db_typed_uuid checks that the type matches. (Although we'd still have to have a repo_id.into_untyped_uuid() in the error path below, so not a huge win.)

Contributor Author:

Thank you, fixed in a7c1e17.
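For readers unfamiliar with the typed-UUID pattern under discussion, here is a minimal, hypothetical sketch of why the typed form is safer. `TypedUuid`, `TufRepoId`, `SledId`, and `filter_by_repo_id` are illustrative stand-ins, not the real nexus_db_model API.

```rust
use std::marker::PhantomData;

// A phantom type parameter distinguishes IDs of different kinds at
// compile time, which is the safety property `to_db_typed_uuid`
// preserves and `into_untyped_uuid` discards.
struct TypedUuid<T> {
    raw: u128, // stand-in for a real uuid::Uuid
    _marker: PhantomData<T>,
}

enum TufRepoKind {}
enum SledKind {}

type TufRepoId = TypedUuid<TufRepoKind>;
#[allow(dead_code)]
type SledId = TypedUuid<SledKind>;

// Accepts only TUF repo IDs; passing a `SledId` here would fail to
// compile, whereas an untyped uuid would be accepted silently.
fn filter_by_repo_id(id: &TufRepoId) -> u128 {
    id.raw
}
```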

pub enum OrderedComponent {
HostOs,
SpRot,
ControlPlaneZone,
Contributor:

Naming nit - maybe - NonNexusControlPlaneZone or NonNexusOmicronZone? Nexus is a control plane zone, so it seems important to clarify the name somehow.

Contributor Author:

Thank you, renamed in 57f7154.

let new_artifact = Self::zone_image_artifact(new_repo, zone_kind);
let old_artifact = Self::zone_image_artifact(old_repo, zone_kind);
if let Some(prev) = OrderedComponent::from(zone_kind).prev() {
if prev >= OrderedComponent::ControlPlaneZone
Contributor:

This conditional can only be true if zone_kind is Nexus, right? (For any other zone kind, the previous component would be SpRot?)

I wonder if this would be clearer broken out more explicitly. Untested, but something like:

        match OrderedComponent::from(zone_kind).prev() {
            // If our previous component isn't a control plane zone at all, it's
            // safe to always use the new artifact.
            Some(OrderedComponent::HostOs | OrderedComponent::SpRot) | None => {
                new_artifact
            }
            // If our previous component is "any non-Nexus control plane zone",
            // it's only safe to use the new artifact if _all_ of the other
            // non-Nexus control plane zones are themselves using new artifacts.
            Some(OrderedComponent::ControlPlaneZone) => {
                if /* any zone is using an old artifact */ {
                    old_artifact
                } else {
                    new_artifact
                }
            }
            // We called `.prev()`; we can't get back the maximal component.
            Some(OrderedComponent::NexusZone) => {
                unreachable!("NexusZone is the last component")
            }
        }

Contributor:

Reading back over this, I wonder if it would be even clearer if we matched on the current kind instead of prev()? Still untested but something like

        match OrderedComponent::from(zone_kind) {
            // Nexus can only be updated if all non-Nexus zones have been updated
            OrderedComponent::NexusZone => {
                // all the complicated checks
            }
            // It's always safe to use the newer artifacts for newer non-Nexus zones
            OrderedComponent::ControlPlaneZone => {
                new_artifact
            }
            OrderedComponent::HostOs | OrderedComponent::SpRot => {
                unreachable!("zone_kind isn't an OS or SP/RoT")
            }
        }

Contributor Author:

I think I was trying too hard to make this code generic and easy to modify for hypothetical futures in which we have more than just Nexus/non-Nexus components. But it's much simpler as you suggest, and really no less future-proof, so 77b2156 implements this idea (and your next two) and drops a bunch of needless complexity. Thanks!

let old_artifact =
Self::zone_image_artifact(old_repo, kind);
OrderedComponent::from(kind) == prev
&& z.image_source == old_artifact
Contributor:

Will z.image_source == old_artifact prevent us from ever upgrading a Nexus zone if some non-Nexus zone has the same hash in both TUF repos? I think we settled on "that should never happen, even for releases candidate respins". But I'm wondering if this check would be more correct as z.image_source != new_artifact anyway (assuming we look up the new_artifact for this zone inside this any closure); i.e., what we really care about is "are all the zones running their new artifacts"?

I was also going to ask "what if z.image_source is the install dataset" (because then this check would always be false, which I think would let us upgrade Nexus when we shouldn't?). Maybe that's guarded by other planner bits (haven't gotten to that yet)? But if we changed this to z.image_source != new_artifact that seems safer in the install dataset case too, since we'd refuse to upgrade Nexus until the other zones were back on a known artifact from the current release.

Contributor Author:

Indeed, also incorporated into 77b2156. Thank you once again for helping to dramatically simplify and improve this core logic.
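A minimal sketch of the check as discussed, with hypothetical `ZoneKind`, `ImageSource`, `Zone`, and `nexus_can_update` names (the real planner types differ): each non-Nexus zone is compared against the artifact for its own kind from the new repo, so a zone still on the install dataset matches neither artifact and keeps blocking the Nexus update.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum ZoneKind {
    Nexus,
    CruciblePantry,
    InternalDns,
}

#[derive(Clone, PartialEq, Eq, Debug)]
enum ImageSource {
    InstallDataset,
    Artifact(&'static str),
}

struct Zone {
    kind: ZoneKind,
    image_source: ImageSource,
}

/// Nexus may be updated only when every non-Nexus zone is running the
/// new repo's artifact for *its own* kind (not Nexus's artifact).
fn nexus_can_update(
    zones: &[Zone],
    new_repo: &HashMap<ZoneKind, &'static str>,
) -> bool {
    zones
        .iter()
        .filter(|z| z.kind != ZoneKind::Nexus)
        .all(|z| {
            // Look up the new artifact for this zone's own kind; a zone
            // on the install dataset never matches, so it blocks Nexus.
            match new_repo.get(&z.kind) {
                Some(hash) => z.image_source == ImageSource::Artifact(*hash),
                None => false,
            }
        })
}
```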

Contributor:

An idle thought I had over the weekend - there's upcoming work to add checks to cockroach for "is it okay to update this based on the cluster status" (underreplicated ranges, etc.). Instead of viewing Nexus as a separate OrderedComponent variant, we could phrase it as:

  • we have a pile of control plane zones
  • some zones have "can this be updated yet" checks

Then the operation becomes: filter out the zones whose "can it be updated yet" check is false, and apply some stable (but presumably unimportant-to-correctness) ordering to the zones that are left when choosing which to update. Then Nexus's "can it be updated yet" check is "are all the non-Nexus zones running off the new version", and we have a ready-made spot to insert the Cockroach "can it be updated yet check" once that work is ready.

Contributor:

(I realize that would be a nontrivial change to the way this is implemented! Mostly wanted to float the idea out there and see what you think.)
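The readiness-check idea floated above might look roughly like this. All names (`ZoneKind`, `Zone`, `can_update`, `next_update`) are hypothetical; this is a sketch of the design, not an implementation.

```rust
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum ZoneKind {
    CockroachDb,
    CruciblePantry,
    Nexus,
}

struct Zone {
    kind: ZoneKind,
    updated: bool,
}

/// Per-kind "can this be updated yet?" check.
fn can_update(kind: ZoneKind, zones: &[Zone]) -> bool {
    match kind {
        // Nexus: every non-Nexus zone must already be updated.
        ZoneKind::Nexus => zones
            .iter()
            .filter(|z| z.kind != ZoneKind::Nexus)
            .all(|z| z.updated),
        // CockroachDB: placeholder for a future "cluster healthy, no
        // underreplicated ranges" check.
        ZoneKind::CockroachDb => true,
        _ => true,
    }
}

/// Filter out zones whose readiness check fails, then pick the next
/// zone in a stable (but correctness-irrelevant) order.
fn next_update(zones: &[Zone]) -> Option<ZoneKind> {
    let mut candidates: Vec<ZoneKind> = zones
        .iter()
        .filter(|z| !z.updated && can_update(z.kind, zones))
        .map(|z| z.kind)
        .collect();
    candidates.sort();
    candidates.first().copied()
}
```

Under this framing, Nexus-last stops being a special ordering rule and becomes just another readiness check, with an obvious slot for the CockroachDB check later.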

match zone_kind {
ZoneKind::Crucible
| ZoneKind::CockroachDb
| ZoneKind::Clickhouse => {
Contributor:

Should ClickhouseKeeper and/or ClickhouseServer be in this list too? (Those and DNS are the other kinds that have durable datasets, and I know DNS's data is small enough we don't need to update it in place. But I assume at least one of those multinode clickhouse variants will have a lot of data we don't want to expunge.)

Contributor Author:

Sounds right to me, fixed in 050933e.

let zone_kind = zone.zone_type.kind();
let image_source = self.blueprint.zone_image_source(zone_kind);
if zone.image_source == image_source {
// This should only happen in the event of a planning error above.
Contributor:

Can we return an Err in this case? If the planner has internal inconsistencies, we should emit something that makes us investigate.

Contributor Author:

Indeed! 7f1f522 logs and returns an error.
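A sketch of the pattern agreed on here, with hypothetical names (`PlannerError`, `plan_zone_update`): a zone selected for update that already matches the target image is treated as a planning bug and surfaced as an error, rather than silently skipped.

```rust
#[derive(Debug, PartialEq)]
enum PlannerError {
    // An internal inconsistency: the planner chose a zone to update
    // whose image source already matches the target.
    AlreadyAtTarget { zone: String },
}

fn plan_zone_update(
    zone_id: &str,
    current: &str,
    target: &str,
) -> Result<String, PlannerError> {
    if current == target {
        // Surface the inconsistency so it gets logged and investigated.
        return Err(PlannerError::AlreadyAtTarget { zone: zone_id.to_string() });
    }
    Ok(format!("update {zone_id}: {current} -> {target}"))
}
```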


// We should start with no specified TUF repo and nothing to do.
assert!(example.input.tuf_repo().is_none());
// assert_planning_makes_no_changes(
Contributor:

Nit - can we uncomment this check? (Seems like it should be valid if there's no TUF repo yet, right?)

Contributor Author:

We can indeed, that was left over from a debugging session where I didn't want the unchanged diff output. Uncommented in 2daabf5.

let artifacts = vec![
TufArtifactMeta {
id: ArtifactId {
name: String::from("crucible-pantry-zone"),
Contributor:

Nit - maybe

Suggested change
name: String::from("crucible-pantry-zone"),
name: ZoneKind::CruciblePantry.artifact_name().to_string(),

instead (and similar for other names below)? Slightly longer, but avoids any concerns about typos or what the string should be.

Contributor Author:

Thank you, fixed in 88733d2 for this test and the next by refactoring its little macro, now called fake_zone_artifact.

}

#[test]
fn test_update_all_zones() {
Contributor:

This is a nice test 👍

Contributor Author:

Thanks, I like it too! It's how I knew I was (more or less) done.

I did have one question: do you think we should test the exact number of iterations required to converge, or leave it as-is with a maximum? The former seems more in line with, e.g., expectorate tests (and indeed, we could use expectorate for this and match the exact update sequence); but the latter is more resilient to changes in planner behavior that aren't exactly the focus of the test.

Contributor:

Good question, and honestly I don't have a strong feeling either way. I think if we made it an exact count, the way it fails would be important: if the error is just "didn't converge by iteration N" that would kinda suck. What if we did something like:

  • Keep the loop as-is
  • When it converges, break out
  • After the loop, add an assert for the number of iterations it took to converge, with a comment noting that (a) it's okay to update this if incidental planner work changed the number of iterations or (b) it's okay to remove this if it's needing to change so much the assertion isn't valuable

That way the failure mode would be something like "expected 23 iterations, but converged in 26" instead of just "didn't converge after 23 iterations"?

All that said, I'd also be fine with just leaving it as-is. Certainly "number of iterations needed to converge has changed" is way less important than "converges within some reasonable number of iterations".
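The proposed test shape can be illustrated with a hypothetical standalone sketch, where `converge` stands in for repeated planner passes: loop with a generous cap, break on convergence, then assert the exact iteration count separately so a failure reads "converged in N, expected M" rather than "didn't converge".

```rust
/// Stand-in for the planner loop: each pass updates one of `remaining`
/// zones. Returns the iteration on which we converged, or None if we
/// hit the cap first.
fn converge(mut remaining: u32, max_iters: u32) -> Option<u32> {
    for iter in 1..=max_iters {
        if remaining > 0 {
            remaining -= 1; // one planning pass, one zone updated
        }
        if remaining == 0 {
            return Some(iter); // converged: break out of the loop
        }
    }
    None
}
```

In the real test, the follow-up assertion on the returned count would carry a comment noting it's fine to update the number when incidental planner changes shift it.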

OrderedComponent::from(z.zone_type.kind())
== OrderedComponent::NonNexusOmicronZone
})
.any(|z| z.image_source != new_artifact)
Contributor:

I don't think this is the correct new_artifact, right? This new_artifact is the Nexus zone we're trying to add, but we want to check z.zone_type's artifact from new_repo?

@@ -110,14 +113,11 @@ impl<'a> Planner<'a> {
}

fn do_plan(&mut self) -> Result<(), Error> {
// We perform planning in two loops: the first one turns expunged sleds
// into expunged zones, and the second one adds services.
Contributor:

Hah, thanks for trimming this; it was already badly out of date 🤦
