Allow configuring control_plane_storage_buffer #8099


Open

jmpesp wants to merge 3 commits into main

Conversation

@jmpesp (Contributor) commented on May 6, 2025

Instead of a hard-coded const, allow configuring the control plane storage buffer as a tunable.

Fixes #7979

Instead of a hard-coded const, allow configuring control plane storage
buffer as a tunable.

Fixes oxidecomputer#7979
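In config terms, the change amounts to something like the sketch below. This is a simplified stand-in rather than omicron's actual config code: the two field names come from the diffs reviewed further down, but the `Tunables` struct name and derive are assumptions here.

    // Before: a hard-coded constant, identical in every deployment.
    // const CONTROL_PLANE_STORAGE_BUFFER_GB: u32 = 250;

    // After: a tunable read from the Nexus config file, so development
    // configs can set 0 while shipping racks keep 250.
    #[derive(serde::Deserialize)]
    pub struct Tunables {
        pub max_vpc_ipv4_subnet_prefix: u8,
        pub control_plane_storage_buffer_gb: u32,
    }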
jmpesp requested a review from davepacheco on May 6, 2025 16:04
@@ -429,18 +429,6 @@ do
ACTUAL_ZPOOL_COUNT=$(pfexec zlogin oxz_switch /opt/oxide/omdb/bin/omdb db zpool list -i | wc -l)
done

# The bootstrap command creates a disk, so before that: adjust the control plane
Collaborator:

🚀

/// The amount of disk space to reserve for non-Crucible / control plane
/// storage in gibibytes. This amount represents a buffer that the region
/// allocation query will not use for each U2.
pub control_plane_storage_buffer_gb: u32,
Collaborator:

If I'm reading this right: this is part of PackageConfig, which is documented as "configuration parameters known at compile-time"; as opposed to DeploymentConfig, which are runtime, deployment-specific tunables. But isn't the point of doing it this way to make it runtime?

Edit: I think this is basically fine -- see my long comment below for more.
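For readers who don't know the codebase, a rough sketch of the split being referenced, with illustrative fields only (the real structs in nexus_config carry many more):

    /// "Configuration parameters known at compile-time": baked into the
    /// package, so every deployment built from that package shares them.
    /// This revision of the PR puts the new field here.
    pub struct PackageConfig {
        pub control_plane_storage_buffer_gb: u32,
        // ...
    }

    /// Runtime, deployment-specific parameters, written when the Nexus zone
    /// is set up on a particular rack; the sled-agent snippet later in this
    /// thread constructs one of these.
    pub struct DeploymentConfig {
        pub rack_id: uuid::Uuid,
        // ...
    }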

Comment on lines 308 to 309
/// The amount of disk space to reserve for non-Crucible / control plane
/// storage in gibibytes. This amount represents a buffer that the region
Collaborator:

Suggested change
/// The amount of disk space to reserve for non-Crucible / control plane
/// storage in gibibytes. This amount represents a buffer that the region
/// The amount of space to reserve per-disk for non-Crucible storage (i.e., control plane
/// storage) in gibibytes. This amount represents a buffer that the region

Trying to be really clear for people not already familiar with this.

@@ -86,6 +86,9 @@ default_request_body_max_bytes = 1048576
# IPv4 subnetwork. This size allows for ~60 hosts.
max_vpc_ipv4_subnet_prefix = 26

# For development environments, choose a zero sized storage buffer
Collaborator:

Suggested change
# For development environments, choose a zero sized storage buffer
# In development environments, reserve no space for control plane storage.

@@ -72,6 +72,9 @@ url = "postgresql://root@[::1]:32221/omicron?sslmode=disable"
# IPv4 subnetwork. This size allows for ~60 hosts.
max_vpc_ipv4_subnet_prefix = 26

# For development environments, choose a zero sized storage buffer
Collaborator:

Suggested change
# For development environments, choose a zero sized storage buffer
# In development environments, reserve no space for control plane storage.

@@ -959,6 +961,10 @@ pub struct BackgroundTasksData {
pub webhook_delivery_client: reqwest::Client,
/// Channel for configuring pending MGS updates
pub mgs_updates_tx: watch::Sender<PendingMgsUpdates>,
/// The amount of disk space to reserve for non-Crucible / control plane
Collaborator:

Same edit as above.

(Mainly, I think it's easy to misparse this as "non-(Crucible/control-plane) storage", which is very wrong.)

/// The amount of disk space to reserve for non-Crucible / control plane
/// storage in gibibytes. This amount represents a buffer that the region
/// allocation query will not use for each U2.
pub control_plane_storage_buffer_gb: u32,
Collaborator:

Suggested change
pub control_plane_storage_buffer_gb: u32,
pub control_plane_storage_buffer_gib: u32,
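The rename matters because GB and GiB differ by roughly 7% at this scale; a quick illustration (not code from the PR):

    const GIB: u64 = 1 << 30; // gibibyte: 1024^3 bytes
    const GB: u64 = 1_000_000_000; // gigabyte: 10^9 bytes

    fn main() {
        // The doc comment says "gibibytes", so a value of 250 reserves
        // 250 * 2^30 = 268,435,456,000 bytes -- roughly 18.4 GB more than
        // a reader assuming gigabytes would expect.
        assert_eq!(250 * GIB - 250 * GB, 18_435_456_000);
    }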

Comment on lines 30 to 32
# Reserve space for non-Crucible storage
# See oxidecomputer/omicron#7875 for the 250G determination.
control_plane_storage_buffer_gb = 250
Collaborator:

tl;dr: I think what you've got here is probably right but it took me a while to work through some other options and I want to record that here.


This is a different approach than I intended, but it might be just as good for what we want!

What I was trying to propose was something like:

  1. RSS accepts this value as configuration here, where it takes other deployment-time parameters:

    /// Configuration for the "rack setup service".
    ///
    /// The Rack Setup Service should be responsible for one-time setup actions,
    /// such as CockroachDB placement and initialization. Without operator
    /// intervention, however, these actions need a way to be automated in our
    /// deployment.
    #[derive(Clone, Deserialize, Serialize, PartialEq, JsonSchema)]
    #[serde(try_from = "UnvalidatedRackInitializeRequest")]
    pub struct RackInitializeRequest {
        /// The set of peer_ids required to initialize trust quorum
        ///
        /// The value is `None` if we are not using trust quorum
        pub trust_quorum_peers: Option<Vec<Baseboard>>,
        /// Describes how bootstrap addresses should be collected during RSS.
        pub bootstrap_discovery: BootstrapAddressDiscovery,
        /// The external NTP server addresses.
        pub ntp_servers: Vec<String>,
        /// The external DNS server addresses.
        pub dns_servers: Vec<IpAddr>,
        /// Ranges of the service IP pool which may be used for internal services.
        // TODO(https://github.com/oxidecomputer/omicron/issues/1530): Eventually,
        // we want to configure multiple pools.
        pub internal_services_ip_pool_ranges: Vec<IpRange>,
        /// Service IP addresses on which we run external DNS servers.
        ///
        /// Each address must be present in `internal_services_ip_pool_ranges`.
        pub external_dns_ips: Vec<IpAddr>,
        /// DNS name for the DNS zone delegated to the rack for external DNS
        pub external_dns_zone_name: String,
        /// initial TLS certificates for the external API
        pub external_certificates: Vec<Certificate>,
        /// Configuration of the Recovery Silo (the initial Silo)
        pub recovery_silo: RecoverySiloConfig,
        /// Initial rack network configuration
        pub rack_network_config: RackNetworkConfig,
        /// IPs or subnets allowed to make requests to user-facing services
        #[serde(default = "default_allowed_source_ips")]
        pub allowed_source_ips: AllowedSourceIps,
    }

  2. Wicket provides it with a hardcoded value of 250 here, where it constructs that struct:

    let request = RackInitializeRequest {
        trust_quorum_peers,
        bootstrap_discovery: BootstrapAddressDiscovery::OnlyThese(
            bootstrap_ips,
        ),
        ntp_servers: self.ntp_servers.clone(),
        dns_servers: self.dns_servers.clone(),
        internal_services_ip_pool_ranges,
        external_dns_ips: self.external_dns_ips.clone(),
        external_dns_zone_name: self.external_dns_zone_name.clone(),
        external_certificates: self.external_certificates.clone(),
        recovery_silo: RecoverySiloConfig {
            silo_name: Name::try_from(RECOVERY_SILO_NAME).unwrap(),
            user_name: UserId::try_from(RECOVERY_SILO_USERNAME).unwrap(),
            user_password_hash,
        },
        rack_network_config,
        allowed_source_ips: self
            .allowed_source_ips
            .clone()
            .unwrap_or(AllowedSourceIps::Any),
    };

  3. Sled agent would write it into the "deployment" section of the Nexus config file, here:

    let deployment_config = DeploymentConfig {
        id: *id,
        rack_id: info.rack_id,
        techport_external_server_port: NEXUS_TECHPORT_EXTERNAL_PORT,
        dropshot_external: ConfigDropshotWithTls {
            tls: *external_tls,
            dropshot: dropshot::ConfigDropshot {
                bind_address: SocketAddr::new(*opte_ip, nexus_port),
                default_request_body_max_bytes: 1048576,
                default_handler_task_mode: HandlerTaskMode::Detached,
                log_headers: vec![],
            },
        },
        dropshot_internal: dropshot::ConfigDropshot {
            bind_address: (*internal_address).into(),
            default_request_body_max_bytes: 1048576,
            default_handler_task_mode: HandlerTaskMode::Detached,
            log_headers: vec![],
        },
        internal_dns: nexus_config::InternalDns::FromSubnet {
            subnet: Ipv6Subnet::<RACK_PREFIX>::new(
                info.underlay_address,
            ),
        },
        database: nexus_config::Database::FromDns,
        external_dns_servers: external_dns_servers.clone(),
        // TCP connections bound by HTTP clients of external services
        // should always be bound on our OPTE interface.
        external_http_clients: nexus_config::ExternalHttpClientConfig {
            interface: Some(opte_iface_name.to_string()),
        },
    };

  4. For non-Wicket deployments (which include helios-deploy, the "how-to-run" flow, and a4x2): these use the RSS config files in smf/sled-agent, and those would all specify 0 (or some other suitable number).

I thought this was similar to how some other config works (e.g., whether Nexus is configured for TLS), but it's slightly different:

  • re: Nexus TLS: Sled Agent does write it into the config here, but it's based on the Nexus zone config, which itself is presumably determined by RSS based on the presence of certificates
  • the name of the recovery silo and its user follow a similar flow except they get provided to Nexus with the RSS-to-Nexus handoff instead of being written into the Nexus config file
  • the rack id is written into the Nexus config by Sled Agent but it's based on how the Sled was initialized, not the rack initialization request.

If we want this to come in as an RSS tunable, I think there are basically three flows:

  1. Put it into the sled config so every sled agent has it. Use that when Sled Agent writes Nexus's config file. Easy, works, but kind of gross because this isn't really sled config.
  2. Put it into the Nexus zone config and have RSS fill it in there, then have sled agent use that similarly to how it fills in external_dns_servers, external_tls, etc. This is fine but a fair bit of work, and it also makes "Sled Agent uses implicit interfaces with components it provisions" (#3407) slightly worse.
  3. Put it into the RSS-to-Nexus handoff. This is fine but means that it's actually dynamic in Nexus -- it'd be stored in the database and presumably re-read periodically or something. This too is a lot more work (though less than (2), and it avoids making #3407 any worse).

The main advantage to one of these approaches is that it's truly a deployment-time config. What's in this PR is still compile-time. It's just being determined late enough that we know which deployment we're targeting. This works okay right now because we already need different sets of packages for these deployments. I think we want to get away from needing different packages. But that will already be a bunch of work and this PR doesn't make it much harder.

jmpesp (Author):

I just finished updating and testing this branch with a4x2, and you were spot on with your suspicion below: the 250 GiB storage buffer is applied in the a4x2 case.

My reasoning here was that the value(s) were known at "compile-time": it's either 0 or 250 GiB. But this doesn't work for a4x2, which uses the multi-sled nexus config-partial. The distinction I was missing was that multi-sled does not equal prod, and single-sled does not equal dev.

I'm going to change this PR to use the method recommended in this comment, which will require a follow-up testbed PR to change the config-rss.toml over there.

jmpesp (Author):

@davepacheco I started implementing (2), but ended up implementing (3) -- I think it works better to store the value in the database.
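A minimal sketch of what flow (3) amounts to, using an in-memory stand-in for the database; every name below is illustrative, since the real change persists the value through the Nexus datastore:

    use std::sync::atomic::{AtomicU64, Ordering};

    /// Stand-in for the persisted, database-resident setting.
    static BUFFER_BYTES: AtomicU64 = AtomicU64::new(0);

    const GIB: u64 = 1 << 30;

    /// Seeded once, during the RSS-to-Nexus handoff at rack initialization.
    fn seed_from_rack_init(buffer_gib: u64) {
        BUFFER_BYTES.store(buffer_gib * GIB, Ordering::SeqCst);
    }

    /// Consulted each time the region allocation query runs, so the value is
    /// dynamic in Nexus rather than fixed in its config file.
    fn usable_space(u2_size_bytes: u64) -> u64 {
        u2_size_bytes.saturating_sub(BUFFER_BYTES.load(Ordering::SeqCst))
    }

    fn main() {
        seed_from_rack_init(250);
        assert_eq!(usable_space(1000 * GIB), 750 * GIB);
    }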

# Reserve space for non-Crucible storage
# See oxidecomputer/omicron#7875 for the 250G determination.
control_plane_storage_buffer_gb = 250

Collaborator:

The bigger question I have is: does a4x2 not use this file? or: have you tested that this does the right thing on a4x2?

Introduce idea of a database resident Nexus setting that is configured
through rack initialization. This can be changed or removed later.

Revert changes adding control plane storage buffer to tunables list.

Successfully merging this pull request may close these issues.

CONTROL_PLANE_STORAGE_BUFFER is larger than virtual-hardware vdevs and isn't configurable