storage controller: don't hold detached tenants in memory #10264

Open
wants to merge 10 commits into main
51 changes: 46 additions & 5 deletions storage_controller/src/service.rs
@@ -415,8 +415,8 @@ pub struct Service {
/// Queue of tenants who are waiting for concurrency limits to permit them to reconcile
/// Send into this queue to promptly attempt to reconcile this shard next time units are available.
///
/// Note that this state logically lives inside ServiceInner, but carrying Sender here makes the code simpler
/// by avoiding needing a &mut ref to something inside the ServiceInner. This could be optimized to
/// Note that this state logically lives inside ServiceState, but carrying Sender here makes the code simpler
/// by avoiding needing a &mut ref to something inside the ServiceState. This could be optimized to
/// use a VecDeque instead of a channel to reduce synchronization overhead, at the cost of some code complexity.
delayed_reconcile_tx: tokio::sync::mpsc::Sender<TenantShardId>,

@@ -1162,6 +1162,11 @@ impl Service {
}
}

// If we just finished detaching all shards for a tenant, it might be time to drop it from memory.
if tenant.policy == PlacementPolicy::Detached {
self.maybe_drop_tenant(tenant.tenant_shard_id.tenant_id, &mut locked);
}
Contributor
This is a bit scary. Since service state sits behind a sync lock, we follow this pattern:

  1. acquire lock and get some in-mem state
  2. do something async
  3. acquire lock again and update something

If step (3) doesn't expect the removal, then we run into trouble. I couldn't find any place with problematic expect or unwrap calls. Generally this should be pretty safe, since we wait on the reconcile spawned by the detach in tenant_location_config and that holds the tenant exclusive lock, but we might run into issues if detaches end up taking long.
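
As an illustration of that failure mode, here is a minimal sketch with stand-in types (not the actual Service/ServiceState definitions): if step (3) assumes the shard read in step (1) is still present, dropping detached tenants from memory in between turns that assumption into a panic.

```rust
// Minimal sketch of the pattern above with stand-in types (not the real
// Service/ServiceState). The danger: state read in step (1) may be gone by
// step (3) once detached tenants can be dropped from memory in between.
use std::collections::BTreeMap;
use std::sync::{Arc, RwLock};

type TenantShardId = u64;

#[derive(Default)]
struct ServiceState {
    tenants: BTreeMap<TenantShardId, String>, // stand-in for TenantShard
}

async fn update_after_async_work(state: Arc<RwLock<ServiceState>>, id: TenantShardId) {
    // (1) acquire the sync lock and read some in-memory state
    let snapshot = state.read().unwrap().tenants.get(&id).cloned();

    // (2) do something async with the lock released
    tokio::task::yield_now().await;

    // (3) re-acquire the lock and update. If another code path dropped the
    // tenant from memory in the meantime, this expect() panics.
    let mut locked = state.write().unwrap();
    let shard = locked
        .tenants
        .get_mut(&id)
        .expect("shard seen in step (1) is still present");
    *shard = format!("updated, previous value: {snapshot:?}");
}
```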

Contributor

This comment isn't really actionable, but I'm curious about your thoughts on this.

Collaborator Author @jcsp, Jan 6, 2025

In request handlers, the tenant_op_locks for the tenant should prevent any shenanigans like this.

However, you make an excellent point in this particular context: process_result does not hold that lock. If a request handler took the lock, read the existence of the tenant, and then raced with the processing of a reconciler completion, this could violate the assumption that while holding the lock, a tenant that is in memory should remain in memory.

I think the neatest solution is to take an exclusive lock around this, and to make our maybe_load and maybe_drop functions take refs to lock handles to prove the callers aren't using them outside the lock. Let's try that...
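
A minimal sketch of that idea, using hypothetical stand-in types rather than the controller's real op-lock machinery: requiring a borrowed guard in the signature means maybe_drop_tenant simply cannot be called by anyone who does not hold the per-tenant exclusive lock.

```rust
// Hypothetical sketch of the "take a ref to the lock handle" idea; the real
// code uses the storage controller's per-tenant op locks, not these stand-ins.
use std::collections::BTreeMap;
use std::sync::{Mutex, MutexGuard};

type TenantId = u64;

// Stand-in for the per-tenant exclusive op-lock guard.
struct TenantExclusiveGuard<'a>(MutexGuard<'a, ()>);

struct Service {
    op_lock: Mutex<()>, // stand-in for the per-tenant op lock map
    tenants: Mutex<BTreeMap<TenantId, String>>,
}

impl Service {
    fn lock_tenant_exclusive(&self) -> TenantExclusiveGuard<'_> {
        TenantExclusiveGuard(self.op_lock.lock().unwrap())
    }

    // Borrowing the guard proves the caller holds the exclusive lock for the
    // whole call, so no request handler can observe the tenant and then race
    // the drop.
    fn maybe_drop_tenant(&self, tenant_id: TenantId, _guard: &TenantExclusiveGuard<'_>) {
        self.tenants.lock().unwrap().remove(&tenant_id);
    }
}
```

A caller would then do something like let guard = svc.lock_tenant_exclusive(); followed by svc.maybe_drop_tenant(id, &guard);, so any handler that observed the tenant under the same lock cannot race the drop.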

Collaborator Author @jcsp

Tightened up the use of op locks in 61bf182

In the process I made the interesting observation that tenant_location_conf holds the lock much longer than it needs to; really it should drop it before waiting for reconcilers. I don't want to make that change inline here in case it has any spooky side effects.


// Maybe some other work can proceed now that this job finished.
if self.reconciler_concurrency.available_permits() > 0 {
while let Ok(tenant_shard_id) = locked.delayed_reconcile_rx.try_recv() {
@@ -1176,9 +1181,6 @@
}
}
}

// If the shard we just finished with is configured to be detached and is fully idle, then
// we may drop it from memory.
}

async fn process_results(
@@ -2472,16 +2474,55 @@ impl Service {
tenant_shards.len(),
tenant_id
);

locked.tenants.extend(tenant_shards.into_iter().map(|p| {
let intent = IntentState::new();
let shard =
TenantShard::from_persistent(p, intent).expect("Corrupt shard row in database");

// Sanity check: when loading on-demand, we should only ever be loading something Detached
debug_assert!(shard.policy == PlacementPolicy::Detached);
if shard.policy != PlacementPolicy::Detached {
tracing::error!(
"Tenant shard {} loaded on-demand, but has non-Detached policy {:?}",
shard.tenant_shard_id,
shard.policy
);
}

(shard.tenant_shard_id, shard)
}));

Ok(())
}

/// If all shards for a tenant are detached, and in a fully quiescent state (no observed locations on pageservers),
/// and have no reconciler running, then we can drop the tenant from memory. It will be reloaded on-demand
/// if we are asked to attach it again (see [`Self::maybe_load_tenant`]).
fn maybe_drop_tenant(
&self,
tenant_id: TenantId,
locked: &mut std::sync::RwLockWriteGuard<ServiceState>,
) {
let mut tenant_shards = locked.tenants.range(TenantShardId::tenant_range(tenant_id));
if tenant_shards.all(|(_id, shard)| {
shard.policy == PlacementPolicy::Detached
&& shard.reconciler.is_none()
&& shard.observed.is_empty()
}) {
let keys = locked
.tenants
.range(TenantShardId::tenant_range(tenant_id))
.map(|(id, _)| id)
.copied()
.collect::<Vec<_>>();
for key in keys {
tracing::info!("Dropping detached tenant shard {} from memory", key);
locked.tenants.remove(&key);
}
}
}

/// This API is used by the cloud control plane to migrate unsharded tenants that it created
/// directly with pageservers into this service.
///
4 changes: 4 additions & 0 deletions storage_controller/src/tenant_shard.rs
@@ -465,6 +465,10 @@ impl ObservedState {
locations: HashMap::new(),
}
}

pub(crate) fn is_empty(&self) -> bool {
self.locations.is_empty()
}
}

impl TenantShard {