Trust Quorum: Handle prepare messages + Alarms #8062

Open: wants to merge 6 commits into base: ajs/realtq-4
80 changes: 80 additions & 0 deletions trust-quorum/src/alarm.rs
@@ -0,0 +1,80 @@
// This Source Code Form is subject to the terms of the Mozilla Public
// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! A mechanism for reporting protocol invariant violations
//!
//! Invariant violations should _never_ occur. They represent a critical bug in
//! the implementation of the system. In certain scenarios we can detect these
//! invariant violations and record them. This allows reporting them to higher
//! levels of control plane software so that we can debug them and fix them in
//! future releases, as well as rectify outstanding issues on systems where such
//! an alarm arose.

use crate::{Epoch, PlatformId};
use omicron_uuid_kinds::RackUuid;
use serde::{Deserialize, Serialize};

/// A critical invariant violation that should never occur.
///
/// Many invariant violations are only possible on receipt of peer messages,
/// and are _not_ a result of API calls. This means that there isn't a good
/// way to directly inform the rest of the control plane. Instead we provide a
/// queryable API for `crate::Node` status that includes alarms.
///
/// If an `Alarm` is ever seen by an operator then support should be contacted
/// immediately.
#[derive(
    Debug, Clone, thiserror::Error, PartialEq, Eq, Serialize, Deserialize,
)]
pub enum Alarm {
    #[error(
        "TQ Alarm: commit attempted with invalid rack_id. Expected {expected}, got {got}."
    )]
    CommitWithInvalidRackId { expected: RackUuid, got: RackUuid },
    #[error(
        "TQ Alarm: prepare for a later configuration exists: \
         last_prepared_epoch = {last_prepared_epoch:?}, \
         commit_epoch = {commit_epoch}"
    )]
    OutOfOrderCommit { last_prepared_epoch: Epoch, commit_epoch: Epoch },

    #[error(
        "TQ Alarm: commit attempted, but missing prepare message: \
         epoch = {epoch}. Latest seen epoch = {latest_seen_epoch:?}."
    )]
    MissingPrepare { epoch: Epoch, latest_seen_epoch: Option<Epoch> },

    #[error(
        "TQ Alarm: prepare received from {from} with mismatched \
         last_committed_epoch: prepare's last committed epoch = \
         {prepare_last_committed_epoch:?}, \
         persisted prepare's last_committed_epoch = \
         {persisted_prepare_last_committed_epoch:?}"
    )]
    PrepareLastCommittedEpochMismatch {
        from: PlatformId,
        prepare_last_committed_epoch: Option<Epoch>,
        persisted_prepare_last_committed_epoch: Option<Epoch>,
    },

    #[error(
        "TQ Alarm: prepare received with invalid rack_id from {from}. \
         Expected {expected}, got {got}."
    )]
    PrepareWithInvalidRackId {
        from: PlatformId,
        expected: RackUuid,
        got: RackUuid,
    },

    #[error(
        "TQ Alarm: different nodes coordinating same epoch = {epoch}: \
         them = {them}, us = {us}"
    )]
    DifferentNodesCoordinatingSameEpoch {
        epoch: Epoch,
        them: PlatformId,
        us: PlatformId,
    },
}
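
The doc comments in `alarm.rs` say that invariant violations triggered by peer messages cannot be returned to an API caller, so they are recorded and exposed through a queryable node status. A minimal sketch of that pattern follows. It is an illustration only: `SketchNode`, its fields, and the ordering checks in `handle_commit` are assumptions and not the crate's real `Node`, and `Epoch` and `Alarm` here are simplified std-only stand-ins for the types in the diff.

```rust
// Hypothetical sketch: these are stand-ins, not the crate's definitions.
use std::collections::BTreeSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Epoch(pub u64);

/// Simplified stand-in for two of the `Alarm` variants shown in the diff.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Alarm {
    OutOfOrderCommit { last_prepared_epoch: Epoch, commit_epoch: Epoch },
    MissingPrepare { epoch: Epoch, latest_seen_epoch: Option<Epoch> },
}

/// Assumed node shape: tracks prepared epochs and any alarms raised so far.
#[derive(Debug, Default)]
pub struct SketchNode {
    prepared: BTreeSet<Epoch>,
    committed: Option<Epoch>,
    alarms: Vec<Alarm>,
}

impl SketchNode {
    /// Remember that a prepare for `epoch` was received.
    pub fn handle_prepare(&mut self, epoch: Epoch) {
        self.prepared.insert(epoch);
    }

    /// Handle a peer-driven commit. There is no caller to hand an error to,
    /// so a violation is recorded as an alarm instead of being returned.
    pub fn handle_commit(&mut self, epoch: Epoch) {
        let latest = self.prepared.iter().next_back().copied();
        let alarm = match latest {
            // Assumed rule: a prepare for a later epoch already exists.
            Some(last) if last > epoch => Some(Alarm::OutOfOrderCommit {
                last_prepared_epoch: last,
                commit_epoch: epoch,
            }),
            // Assumed rule: no prepare was ever seen for this epoch.
            _ if !self.prepared.contains(&epoch) => Some(Alarm::MissingPrepare {
                epoch,
                latest_seen_epoch: latest,
            }),
            _ => None,
        };
        match alarm {
            // Record each distinct alarm once; do not commit.
            Some(a) => {
                if !self.alarms.contains(&a) {
                    self.alarms.push(a);
                }
            }
            None => self.committed = Some(epoch),
        }
    }

    /// Queryable status: every alarm recorded so far.
    pub fn alarms(&self) -> &[Alarm] {
        &self.alarms
    }

    /// The last successfully committed epoch, if any.
    pub fn committed(&self) -> Option<Epoch> {
        self.committed
    }
}
```

In this sketch a higher layer would poll `alarms()` periodically. Keeping alarms as data rather than errors matches the derive list on the real enum (`Serialize`, `Deserialize`), which suggests they are meant to be stored or reported, though how the crate actually persists them is not shown in this diff.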
2 changes: 1 addition & 1 deletion trust-quorum/src/coordinator_state.rs
@@ -211,7 +211,7 @@ impl CoordinatorState {

    /// Record a `PrepareAck` from another node as part of tracking
    /// quorum for the prepare phase of the trust quorum protocol.
-    pub fn ack_prepare(&mut self, from: PlatformId) {
+    pub fn record_prepare_ack(&mut self, from: PlatformId) {
        match &mut self.op {
            CoordinatorOperation::Prepare {
                prepares, prepare_acks, ..
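
The body of `record_prepare_ack` is cut off in this hunk. As a rough illustration of what "tracking quorum for the prepare phase" can look like, here is a std-only sketch. `PrepareTracker`, its `members` and `threshold` fields, and `has_quorum` are hypothetical names, the `PlatformId` below is a stand-in, and the rule "quorum is reached once `threshold` distinct members have acked" is an assumption rather than the crate's actual logic.

```rust
// Hypothetical sketch: not the crate's `CoordinatorState`.
use std::collections::BTreeSet;

/// Stand-in for the crate's `PlatformId`.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct PlatformId(pub String);

/// Assumed bookkeeping for the prepare phase of one configuration.
pub struct PrepareTracker {
    members: BTreeSet<PlatformId>,
    prepare_acks: BTreeSet<PlatformId>,
    threshold: usize,
}

impl PrepareTracker {
    pub fn new(members: BTreeSet<PlatformId>, threshold: usize) -> Self {
        Self { members, prepare_acks: BTreeSet::new(), threshold }
    }

    /// Record an ack. Acks from non-members are ignored, and duplicate
    /// acks are harmless because a set holds each sender at most once.
    pub fn record_prepare_ack(&mut self, from: PlatformId) {
        if self.members.contains(&from) {
            self.prepare_acks.insert(from);
        }
    }

    /// Assumed quorum rule: at least `threshold` distinct members acked.
    pub fn has_quorum(&self) -> bool {
        self.prepare_acks.len() >= self.threshold
    }
}
```

Using a set for acks makes the operation idempotent, which matters when a peer retransmits the same `PrepareAck`.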
4 changes: 2 additions & 2 deletions trust-quorum/src/crypto.rs
@@ -29,8 +29,8 @@ use zeroize::{Zeroize, ZeroizeOnDrop, Zeroizing};
const LRTQ_SHARE_SIZE: usize = 33;

/// We don't distinguish whether this is an Ed25519 Scalar or set of GF(256)
-/// polynomials points with an x-coordinate of 0. Both can be treated as 32 byte
-/// blobs when decrypted, as they are immediately fed into HKDF.
+/// polynomials' points with an x-coordinate of 0. Both can be treated as 32
+/// byte blobs when decrypted, as they are immediately fed into HKDF.
#[derive(
    Debug, Clone, PartialEq, Eq, PartialOrd, Ord, Serialize, Deserialize,
)]
16 changes: 0 additions & 16 deletions trust-quorum/src/errors.rs
@@ -8,22 +8,6 @@ use crate::configuration::ConfigurationError;
use crate::{Epoch, PlatformId, Threshold};
use omicron_uuid_kinds::RackUuid;

-#[derive(Debug, Clone, thiserror::Error, PartialEq, Eq)]
-pub enum CommitError {
-    #[error("invalid rack id")]
-    InvalidRackId(
-        #[from]
-        #[source]
-        MismatchedRackIdError,
-    ),
-
-    #[error("missing prepare msg")]
-    MissingPrepare,
-
-    #[error("prepare for a later configuration exists")]
-    OutOfOrderCommit,
-}
-
#[derive(Debug, Clone, thiserror::Error, PartialEq, Eq)]
#[error(
    "sled was decommissioned on msg from {from:?} at epoch {epoch:?}: last prepared epoch = {last_prepared_epoch:?}"
6 changes: 4 additions & 2 deletions trust-quorum/src/lib.rs
@@ -12,6 +12,7 @@
use derive_more::Display;
use serde::{Deserialize, Serialize};

+mod alarm;
mod configuration;
mod coordinator_state;
pub(crate) mod crypto;
@@ -20,9 +21,10 @@ mod messages;
mod node;
mod persistent_state;
mod validators;
-pub use configuration::Configuration;
+pub use alarm::Alarm;
+pub use configuration::{Configuration, PreviousConfiguration};
pub(crate) use coordinator_state::CoordinatorState;
-pub use crypto::RackSecret;
+pub use crypto::{EncryptedRackSecret, RackSecret, Salt, Sha3_256Digest};
pub use messages::*;
pub use node::Node;
pub use persistent_state::{PersistentState, PersistentStateSummary};