[sled-agent-config-reconciler] Flesh out internal disks task #8103
Conversation
Looks good.
// short transient errors without constantly retrying a doomed
// operation.
//
// We could be smarter here and check for retryable vs non-retryable
I think it's reasonable to set a limit here and just report the disk as failed after a few attempts. However, as long as we can accurately get access to the errors and push them up to a higher level, it probably doesn't matter.
I'm not even sure we can accurately tell the difference between a transient and permanent failure, so this seems fine for now.
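If we ever did want to make that distinction, a minimal sketch might look like the following. The error type and variants here are hypothetical, not what this PR actually defines, which is part of why classifying errors may not be practical:

```rust
// Hypothetical split of adoption errors into retryable vs. permanent.
// The real errors produced by disk adoption may not map onto this
// distinction cleanly.
#[derive(Debug)]
enum AdoptionError {
    // e.g., a transient I/O hiccup; retrying might succeed.
    Transient(String),
    // e.g., the disk isn't formatted correctly; retrying is pointless.
    Permanent(String),
}

impl AdoptionError {
    fn is_retryable(&self) -> bool {
        matches!(self, AdoptionError::Transient(_))
    }
}
```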
// Output channel summarizing any adoption errors.
I was trying to think about how to order the errors and adoptions, so we can know whether an error occurred after an adoption (not sure how this could actually happen) or before it, and therefore whether a disk that's in `disks` is healthy or has just become unhealthy.
I think a simple way to do this would be to bump a counter/generation number each time through the loop. We could then store that counter as part of the value in each watch channel. If we saw an error and a disk at the same time, we could compare the counters to see which is newer. If the value of the counter is the same and both the disk and an error for it exist, then there's almost certainly a bug.
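For illustration, a rough sketch of that counter idea using `tokio::sync::watch` (the value types here are placeholders, not the actual types in this change):

```rust
use tokio::sync::watch;

// Placeholder value stored in each watch channel: the payload plus the
// generation of the loop iteration that produced it.
#[derive(Clone)]
struct Versioned<T> {
    generation: u64,
    value: T,
}

struct TaskOutputs {
    disks_tx: watch::Sender<Versioned<Vec<String>>>,
    errors_tx: watch::Sender<Versioned<Vec<String>>>,
}

impl TaskOutputs {
    // Called once per pass through the adoption loop with a freshly
    // bumped generation number.
    fn publish(&self, generation: u64, disks: Vec<String>, errors: Vec<String>) {
        self.disks_tx.send_replace(Versioned { generation, value: disks });
        self.errors_tx.send_replace(Versioned { generation, value: errors });
    }
}

// A reader that sees both a disk and an error for it compares the two
// `generation` fields: the larger one is newer, and equal generations
// with both present would indicate a bug.
```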
I think this has the same problem as storing the errors in the same watch channel as the disks, right? If the disks channel has a generation that gets bumped every attempt, we wake up anyone watching for changes to the disks just to bump the generation.
@andrewjstone and I chatted about this briefly. Storing a generation in both channels does indeed have the same problem. We talked over a few options:
- Move the errors and disks into a single channel, and live with `changed()` being fired for changes to either of them. That means if we had an internal disk that we couldn't adopt and were periodically retrying, we'd wake up any listeners every time we failed (~once a minute, after the backoff grows to its max). The listeners we know of are:
  - The ledger task (would cause slightly more frequent rewrites of the config ledgers, but not a big deal)
  - The dump device setup task (I'm not familiar with it, so not sure what impact this would have)
- Store two separate counters, one in each of the two channels, and keep a third value under a lock that stores the most recent combo of "current disk counter, current error counter". All three locked values could be sampled independently and therefore be out of sync, but this third one would tell you if you were in that case.
- Just ignore this problem entirely, and live with the fact that it's possible in some rare cases a disk might (briefly!) be in both or neither channel.
Given `main` today doesn't have a way of reporting internal disk errors at all, it seems okay to go with option 3. We can include any errors in the inventory moving forward, and if we have problems due to the errors possibly being slightly out of sync with in-use disks, we can revisit other options.
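As a sketch of what option 3 looks like in practice (types here are placeholders, not the actual ones in this PR), the two channels stay independent and listeners are only woken when their channel's contents actually change:

```rust
use tokio::sync::watch;

// Option 3, roughly: separate channels for disks and errors, updated
// independently, accepting that a reader may briefly see them out of sync.
struct InternalDiskChannels {
    disks_tx: watch::Sender<Vec<String>>,
    errors_tx: watch::Sender<Vec<String>>,
}

impl InternalDiskChannels {
    fn publish(&self, disks: Vec<String>, errors: Vec<String>) {
        // `send_if_modified` only notifies listeners if the closure reports
        // a change, so a failed retry that produces the same error set (or
        // the same disk set) doesn't wake anyone up.
        self.disks_tx.send_if_modified(|current| {
            if *current == disks {
                false
            } else {
                *current = disks;
                true
            }
        });
        self.errors_tx.send_if_modified(|current| {
            if *current == errors {
                false
            } else {
                *current = errors;
                true
            }
        });
    }
}
```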
Nothing too fancy here. Much of the code is around funneling results out through watch channels.
The only new behavior beyond what `sled-storage` does is that this will periodically retry calling `Disk::new()` on internal disks for which `Disk::new()` previously failed. I'm very open to feedback here - maybe this is wrong because the most common errors are fatal anyway (e.g., the disk isn't formatted correctly)? Maybe it's okay but should be less frequent? Maybe it's okay but we should check for specific kinds of errors we believe to be fatal and not retry in those cases?
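For context, that retry behavior amounts to something like the following sketch (function names and backoff constants here are illustrative, not taken from the PR):

```rust
use std::time::Duration;

// Hypothetical wrapper around the real adoption logic (`Disk::new()` etc.).
async fn try_adopt_disk(disk_id: &str) -> Result<(), String> {
    let _ = disk_id;
    Ok(())
}

// Keep retrying a previously failed adoption, backing off until we're only
// retrying roughly once a minute.
async fn adopt_with_retries(disk_id: &str) {
    let mut delay = Duration::from_secs(1);
    let max_delay = Duration::from_secs(60);

    loop {
        match try_adopt_disk(disk_id).await {
            Ok(()) => return,
            Err(err) => {
                // In the real task this error would be reported out through
                // the errors watch channel rather than just logged.
                eprintln!("failed to adopt {disk_id}: {err}");
                tokio::time::sleep(delay).await;
                delay = (delay * 2).min(max_delay);
            }
        }
    }
}
```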