Plan zone updates for target release #8024

Open: wants to merge 18 commits into main from plan-target-release

Conversation

@plotnick (Contributor) commented Apr 22, 2025

Plumb the TUF repo representing the current target_release through the policy and planner. Update one zone at a time to the control-plane artifacts in that repo. Non-Nexus zones are updated first, then Nexus.

TODO:

  • Correctly stage updates (including new zones) according to RFD 565 §9. The current test asserts the opposite of what should happen with, e.g., a new Nexus zone.
  • Plumb the target release generation number through the planner so that we can verify our idea of current and previous TUF repos. Instead, we'll restrict changes to target_release during updates: Restrict changes to target_release during an update #8056.
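The non-Nexus-first update order described above can be sketched as a minimal model. This is illustrative only: `ZoneKind`, `ImageSource`, `Zone`, and `next_zone_to_update` are made-up stand-ins, not the planner's real types.

```rust
// Hypothetical, simplified model of the update order: non-Nexus zones
// are moved to the new repo's artifacts one at a time; Nexus goes last.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum ZoneKind {
    Nexus,
    CruciblePantry,
    InternalDns,
}

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum ImageSource {
    OldArtifact,
    NewArtifact,
}

struct Zone {
    kind: ZoneKind,
    image_source: ImageSource,
}

/// Choose at most one zone to update in this planning pass:
/// non-Nexus zones first, Nexus only once the rest are done.
fn next_zone_to_update(zones: &[Zone]) -> Option<usize> {
    // First preference: any out-of-date non-Nexus zone.
    if let Some(i) = zones.iter().position(|z| {
        z.kind != ZoneKind::Nexus && z.image_source != ImageSource::NewArtifact
    }) {
        return Some(i);
    }
    // Only when all non-Nexus zones are current do we touch Nexus.
    zones.iter().position(|z| {
        z.kind == ZoneKind::Nexus && z.image_source != ImageSource::NewArtifact
    })
}
```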

@plotnick plotnick force-pushed the plan-target-release branch 4 times, most recently from 8019b67 to b254f32 Compare April 25, 2025 15:23
@plotnick plotnick marked this pull request as ready for review April 28, 2025 19:00
@plotnick plotnick force-pushed the plan-target-release branch from c6ece7d to d6a6734 Compare May 2, 2025 19:24
@plotnick plotnick force-pushed the plan-target-release branch from d6a6734 to 50c1169 Compare May 2, 2025 19:25
@plotnick plotnick force-pushed the plan-target-release branch from 276ae84 to ad7103e Compare May 5, 2025 17:52
@jgallagher jgallagher self-requested a review May 6, 2025 21:09
@@ -106,8 +107,41 @@ impl DataStore {
})
}

/// Returns a TUF repo description.
pub async fn update_tuf_repo_get_by_id(
Contributor:

I realize you're following the naming conventions already present in this file, but I wonder if others have the same reaction I did: when I see update_tuf_repo_* my first thought is "I'm updating a tuf repo in some way", not "the noun I'm acting on is an update tuf repo".

Contributor Author:

Dave had similar thoughts on #7518, and I agreed but didn't finish the job. 55b8e0e renames all the TUF datastore methods.

let conn = self.pool_connection_authorized(opctx).await?;
let repo_id = repo_id.into_untyped_uuid();
let repo = dsl::tuf_repo
.filter(dsl::id.eq(repo_id))
Contributor:

Tiny nit, feel free to ignore: this would be marginally safer if we removed the into_untyped_uuid() a few lines up and instead used

Suggested change
.filter(dsl::id.eq(repo_id))
.filter(dsl::id.eq(nexus_db_model::to_db_typed_uuid(repo_id)))

because to_db_typed_uuid checks that the type matches. (Although we'd still have to have a repo_id.into_untyped_uuid() in the error path below, so not a huge win.)

Contributor Author:

Thank you, fixed in a7c1e17.
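For readers unfamiliar with the typed-UUID pattern under discussion, here is a minimal, hypothetical sketch of why the typed form is safer. `TypedUuid`, `TufRepoId`, `SledId`, and `filter_by_repo_id` are illustrative stand-ins, not the real nexus_db_model API.

```rust
use std::marker::PhantomData;

// A phantom type parameter distinguishes IDs of different kinds at
// compile time, which is the safety property `to_db_typed_uuid`
// preserves and `into_untyped_uuid` discards.
struct TypedUuid<T> {
    raw: u128, // stand-in for a real uuid::Uuid
    _marker: PhantomData<T>,
}

enum TufRepoKind {}
enum SledKind {}

type TufRepoId = TypedUuid<TufRepoKind>;
#[allow(dead_code)]
type SledId = TypedUuid<SledKind>;

// Accepts only TUF repo IDs; passing a `SledId` here would fail to
// compile, whereas an untyped uuid would be accepted silently.
fn filter_by_repo_id(id: &TufRepoId) -> u128 {
    id.raw
}
```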

pub enum OrderedComponent {
HostOs,
SpRot,
ControlPlaneZone,
Contributor:

Naming nit - maybe - NonNexusControlPlaneZone or NonNexusOmicronZone? Nexus is a control plane zone, so it seems important to clarify the name somehow.

Contributor Author:

Thank you, renamed in 57f7154.

let new_artifact = Self::zone_image_artifact(new_repo, zone_kind);
let old_artifact = Self::zone_image_artifact(old_repo, zone_kind);
if let Some(prev) = OrderedComponent::from(zone_kind).prev() {
if prev >= OrderedComponent::ControlPlaneZone
Contributor:

This conditional can only be true if zone_kind is Nexus, right? (For any other zone kind, the previous component would be SpRot?)

I wonder if this would be clearer broken out more explicitly. Untested, but something like:

        match OrderedComponent::from(zone_kind).prev() {
            // If our previous component isn't a control plane zone at all, it's
            // safe to always use the new artifact.
            Some(OrderedComponent::HostOs | OrderedComponent::SpRot) | None => {
                new_artifact
            }
            // If our previous component is "any non-Nexus control plane zone",
            // it's only safe to use the new artifact if _all_ of the other
            // non-Nexus control plane zones are themselves using new artifacts.
            Some(OrderedComponent::ControlPlaneZone) => {
                if /* any zone is using an old artifact */ {
                    old_artifact
                } else {
                    new_artifact
                }
            }
            // We called `.prev()`; we can't get back the maximal component.
            Some(OrderedComponent::NexusZone) => {
                unreachable!("NexusZone is the last component")
            }
        }

Contributor:

Reading back over this, I wonder if it would be even clearer if we matched on the current kind instead of prev()? Still untested but something like

        match OrderedComponent::from(zone_kind) {
            // Nexus can only be updated if all non-Nexus zones have been updated
            OrderedComponent::NexusZone => {
                // all the complicated checks
            }
            // It's always safe to use the newer artifacts for newer non-Nexus zones
            OrderedComponent::ControlPlaneZone => {
                new_artifact
            }
            OrderedComponent::HostOs | OrderedComponent::SpRot => {
                unreachable!("zone_kind isn't an OS or SP/RoT")
            }
        }

Contributor Author:

I think I was trying too hard to make this code generic and easy to modify for hypothetical futures in which we have more than just Nexus/non-Nexus components. But it's much simpler as you suggest, and really no less future-proof, so 77b2156 implements this idea (and your next two) and drops a bunch of needless complexity. Thanks!

let old_artifact =
Self::zone_image_artifact(old_repo, kind);
OrderedComponent::from(kind) == prev
&& z.image_source == old_artifact
Contributor:

Will z.image_source == old_artifact prevent us from ever upgrading a Nexus zone if some non-Nexus zone has the same hash in both TUF repos? I think we settled on "that should never happen, even for releases candidate respins". But I'm wondering if this check would be more correct as z.image_source != new_artifact anyway (assuming we look up the new_artifact for this zone inside this any closure); i.e., what we really care about is "are all the zones running their new artifacts"?

I was also going to ask "what if z.image_source is the install dataset" (because then this check would always be false, which I think would let us upgrade Nexus when we shouldn't?). Maybe that's guarded by other planner bits (haven't gotten to that yet)? But if we changed this to z.image_source != new_artifact that seems safer in the install dataset case too, since we'd refuse to upgrade Nexus until the other zones were back on a known artifact from the current release.

Contributor Author:

Indeed, also incorporated into 77b2156. Thank you once again for helping to dramatically simplify and improve this core logic.
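A minimal sketch of the check as discussed, with hypothetical `ZoneKind`, `ImageSource`, `Zone`, and `nexus_can_update` names (the real planner types differ): each non-Nexus zone is compared against the artifact for its own kind from the new repo, so a zone still on the install dataset matches neither artifact and keeps blocking the Nexus update.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum ZoneKind {
    Nexus,
    CruciblePantry,
    InternalDns,
}

#[derive(Clone, PartialEq, Eq, Debug)]
enum ImageSource {
    InstallDataset,
    Artifact(&'static str),
}

struct Zone {
    kind: ZoneKind,
    image_source: ImageSource,
}

/// Nexus may be updated only when every non-Nexus zone is running the
/// new repo's artifact for *its own* kind (not Nexus's artifact).
fn nexus_can_update(
    zones: &[Zone],
    new_repo: &HashMap<ZoneKind, &'static str>,
) -> bool {
    zones
        .iter()
        .filter(|z| z.kind != ZoneKind::Nexus)
        .all(|z| {
            // Look up the new artifact for this zone's own kind; a zone
            // on the install dataset never matches, so it blocks Nexus.
            match new_repo.get(&z.kind) {
                Some(hash) => z.image_source == ImageSource::Artifact(*hash),
                None => false,
            }
        })
}
```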

Contributor:

An idle thought I had over the weekend - there's upcoming work to add checks to cockroach for "is it okay to update this based on the cluster status" (underreplicated ranges, etc.). Instead of viewing Nexus as a separate OrderedComponent variant, we could phrase it as:

  • we have a pile of control plane zones
  • some zones have "can this be updated yet" checks

Then the operation becomes: filter out the zones whose "can it be updated yet" check is false, and apply some stable (but presumably unimportant-to-correctness) ordering to the zones that are left when choosing which to update. Then Nexus's "can it be updated yet" check is "are all the non-Nexus zones running off the new version", and we have a ready-made spot to insert the Cockroach "can it be updated yet check" once that work is ready.

Contributor:

(I realize that would be a nontrivial change to the way this is implemented! Mostly wanted to float the idea out there and see what you think.)
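The readiness-check idea floated above might look roughly like this. All names (`ZoneKind`, `Zone`, `can_update`, `next_update`) are hypothetical; this is a sketch of the design, not an implementation.

```rust
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum ZoneKind {
    CockroachDb,
    CruciblePantry,
    Nexus,
}

struct Zone {
    kind: ZoneKind,
    updated: bool,
}

/// Per-kind "can this be updated yet?" check.
fn can_update(kind: ZoneKind, zones: &[Zone]) -> bool {
    match kind {
        // Nexus: every non-Nexus zone must already be updated.
        ZoneKind::Nexus => zones
            .iter()
            .filter(|z| z.kind != ZoneKind::Nexus)
            .all(|z| z.updated),
        // CockroachDB: placeholder for a future "cluster healthy, no
        // underreplicated ranges" check.
        ZoneKind::CockroachDb => true,
        _ => true,
    }
}

/// Filter out zones whose readiness check fails, then pick the next
/// zone in a stable (but correctness-irrelevant) order.
fn next_update(zones: &[Zone]) -> Option<ZoneKind> {
    let mut candidates: Vec<ZoneKind> = zones
        .iter()
        .filter(|z| !z.updated && can_update(z.kind, zones))
        .map(|z| z.kind)
        .collect();
    candidates.sort();
    candidates.first().copied()
}
```

Under this framing, Nexus-last stops being a special ordering rule and becomes just another readiness check, with an obvious slot for the CockroachDB check later.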

match zone_kind {
ZoneKind::Crucible
| ZoneKind::CockroachDb
| ZoneKind::Clickhouse => {
Contributor:

Should ClickhouseKeeper and/or ClickhouseServer be in this list too? (Those and DNS are the other kinds that have durable datasets, and I know DNS's data is small enough we don't need to update it in place. But I assume at least one of those multinode clickhouse variants will have a lot of data we don't want to expunge.)

Contributor Author:

Sounds right to me, fixed in 050933e.

let zone_kind = zone.zone_type.kind();
let image_source = self.blueprint.zone_image_source(zone_kind);
if zone.image_source == image_source {
// This should only happen in the event of a planning error above.
Contributor:

Can we return an Err in this case? If the planner has internal inconsistencies, we should emit something that makes us investigate.

Contributor Author:

Indeed! 7f1f522 logs and returns an error.
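A sketch of the pattern agreed on here, with hypothetical names (`PlannerError`, `plan_zone_update`): a zone selected for update that already matches the target image is treated as a planning bug and surfaced as an error, rather than silently skipped.

```rust
#[derive(Debug, PartialEq)]
enum PlannerError {
    // An internal inconsistency: the planner chose a zone to update
    // whose image source already matches the target.
    AlreadyAtTarget { zone: String },
}

fn plan_zone_update(
    zone_id: &str,
    current: &str,
    target: &str,
) -> Result<String, PlannerError> {
    if current == target {
        // Surface the inconsistency so it gets logged and investigated.
        return Err(PlannerError::AlreadyAtTarget { zone: zone_id.to_string() });
    }
    Ok(format!("update {zone_id}: {current} -> {target}"))
}
```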


// We should start with no specified TUF repo and nothing to do.
assert!(example.input.tuf_repo().is_none());
// assert_planning_makes_no_changes(
Contributor:

Nit - can we uncomment this check? (Seems like it should be valid if there's no TUF repo yet, right?)

Contributor Author:

We can indeed, that was left over from a debugging session where I didn't want the unchanged diff output. Uncommented in 2daabf5.

let artifacts = vec![
TufArtifactMeta {
id: ArtifactId {
name: String::from("crucible-pantry-zone"),
Contributor:

Nit - maybe

Suggested change
name: String::from("crucible-pantry-zone"),
name: ZoneKind::CruciblePantry.artifact_name().to_string(),

instead (and similar for other names below)? Slightly longer, but avoids any concerns about typos or what the string should be.

Contributor Author:

Thank you, fixed in 88733d2 for this test and the next by refactoring its little macro, now called fake_zone_artifact.

}

#[test]
fn test_update_all_zones() {
Contributor:

This is a nice test 👍

Contributor Author:

Thanks, I like it too! It's how I knew I was (more or less) done.

I did have one question: do you think we should test the exact number of iterations required to converge, or leave it as-is with a maximum? The former seems more in line with, e.g., expectorate tests (and indeed, we could use expectorate for this and match the exact update sequence); but the latter is more resilient to changes in planner behavior that aren't exactly the focus of the test.

Contributor:

Good question, and honestly I don't have a strong feeling either way. I think if we made it an exact count, the way it fails would be important: if the error is just "didn't converge by iteration N" that would kinda suck. What if we did something like:

  • Keep the loop as-is
  • When it converges, break out
  • After the loop, add an assert for the number of iterations it took to converge, with a comment noting that (a) it's okay to update this if incidental planner work changed the number of iterations or (b) it's okay to remove this if it's needing to change so much the assertion isn't valuable

That way the failure mode would be something like "expected 23 iterations, but converged in 26" instead of just "didn't converge after 23 iterations"?

All that said, I'd also be fine with just leaving it as-is. Certainly "number of iterations needed to converge has changed" is way less important than "converges within some reasonable number of iterations".
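The proposed test shape can be illustrated with a hypothetical standalone sketch, where `converge` stands in for repeated planner passes: loop with a generous cap, break on convergence, then assert the exact iteration count separately so a failure reads "converged in N, expected M" rather than "didn't converge".

```rust
/// Stand-in for the planner loop: each pass updates one of `remaining`
/// zones. Returns the iteration on which we converged, or None if we
/// hit the cap first.
fn converge(mut remaining: u32, max_iters: u32) -> Option<u32> {
    for iter in 1..=max_iters {
        if remaining > 0 {
            remaining -= 1; // one planning pass, one zone updated
        }
        if remaining == 0 {
            return Some(iter); // converged: break out of the loop
        }
    }
    None
}
```

In the real test, the follow-up assertion on the returned count would carry a comment noting it's fine to update the number when incidental planner changes shift it.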

OrderedComponent::from(z.zone_type.kind())
== OrderedComponent::NonNexusOmicronZone
})
.any(|z| z.image_source != new_artifact)
Contributor:

I don't think this is the correct new_artifact, right? This new_artifact is the Nexus zone we're trying to add, but we want to check z.zone_type's artifact from new_repo?

@@ -110,14 +113,11 @@ impl<'a> Planner<'a> {
}

fn do_plan(&mut self) -> Result<(), Error> {
// We perform planning in two loops: the first one turns expunged sleds
// into expunged zones, and the second one adds services.
Contributor:

Hah, thanks for trimming this; it was already badly out of date 🤦
