[sled-agent-config-reconciler] Flesh out internal disks task #8103
Conversation
Looks good.
// short transient errors without constantly retrying a doomed
// operation.
//
// We could be smarter here and check for retryable vs non-retryable
I think it's reasonable to set a limit here and just report the disk as failed after a few attempts. However, as long as we can accurately get access to the errors and push them up to a higher level, it probably doesn't matter.
I'm not even sure we can accurately tell the difference between a transient and permanent failure, so this seems fine for now.
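If we ever did want to make that distinction, a minimal sketch might look like the following. The error type and variants here are hypothetical, not what this PR actually defines, which is part of why classifying errors may not be practical:

```rust
// Hypothetical split of adoption errors into retryable vs. permanent.
// The real errors produced by disk adoption may not map onto this
// distinction cleanly.
#[derive(Debug)]
enum AdoptionError {
    // e.g., a transient I/O hiccup; retrying might succeed.
    Transient(String),
    // e.g., the disk isn't formatted correctly; retrying is pointless.
    Permanent(String),
}

impl AdoptionError {
    fn is_retryable(&self) -> bool {
        matches!(self, AdoptionError::Transient(_))
    }
}
```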
// Output channel summarizing any adoption errors.
I was trying to think about how to order the errors and adoptions, so we can know whether an error occurred after an adoption (not sure how this could actually happen) or before it, and therefore whether a disk that's in `disks` is healthy or has just become unhealthy.
I think a simple way to do this would be to bump a counter/generation number each time through the loop. We could then store that counter as part of the value in each watch channel. If we saw an error and a disk at the same time, we could compare the counters to see which is newer. If the value of the counter is the same and both the disk and an error for it exist, then there's almost certainly a bug.
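For illustration, a rough sketch of that counter idea using `tokio::sync::watch` (the value types here are placeholders, not the actual types in this change):

```rust
use tokio::sync::watch;

// Placeholder value stored in each watch channel: the payload plus the
// generation of the loop iteration that produced it.
#[derive(Clone)]
struct Versioned<T> {
    generation: u64,
    value: T,
}

struct TaskOutputs {
    disks_tx: watch::Sender<Versioned<Vec<String>>>,
    errors_tx: watch::Sender<Versioned<Vec<String>>>,
}

impl TaskOutputs {
    // Called once per pass through the adoption loop with a freshly
    // bumped generation number.
    fn publish(&self, generation: u64, disks: Vec<String>, errors: Vec<String>) {
        self.disks_tx.send_replace(Versioned { generation, value: disks });
        self.errors_tx.send_replace(Versioned { generation, value: errors });
    }
}

// A reader that sees both a disk and an error for it compares the two
// `generation` fields: the larger one is newer, and equal generations
// with both present would indicate a bug.
```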
I think this has the same problem as storing the errors in the same watch channel as the disks, right? If the disks channel has a generation that gets bumped every attempt, we wake up anyone watching for changes to the disks just to bump the generation.
@andrewjstone and I chatted about this briefly. Storing a generation in both channels does indeed have the same problem. We talked over a few options:
- Move the errors and disks into a single channel, and live with `changed()` being fired for changes to either of them. That means if we had an internal disk that we couldn't adopt and were periodically retrying, we'd wake up any listeners every time we failed (~once a minute, after the backoff grows to its max). The listeners we know of are:
  - The ledger task (would cause slightly more frequent rewrites of the config ledgers, but not a big deal)
  - The dump device setup task (I'm not familiar with it, so not sure what impact this would have)
- Store two separate counters, one in each of the two channels, and keep a third value under a lock that stores the most recent combo of "current disk counter, current error counter". All three locked values could be sampled independently and therefore be out of sync, but this third one would tell you if you were in that case.
- Just ignore this problem entirely, and live with the fact that it's possible in some rare cases a disk might (briefly!) be in both or neither channel.
Given `main` today doesn't have a way of reporting internal disk errors at all, it seems okay to go with option 3. We can include any errors in the inventory moving forward, and if we have problems due to the errors possibly being slightly out of sync with in-use disks, we can revisit other options.
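As a sketch of what option 3 looks like in practice (types here are placeholders, not the actual ones in this PR), the two channels stay independent and listeners are only woken when their channel's contents actually change:

```rust
use tokio::sync::watch;

// Option 3, roughly: separate channels for disks and errors, updated
// independently, accepting that a reader may briefly see them out of sync.
struct InternalDiskChannels {
    disks_tx: watch::Sender<Vec<String>>,
    errors_tx: watch::Sender<Vec<String>>,
}

impl InternalDiskChannels {
    fn publish(&self, disks: Vec<String>, errors: Vec<String>) {
        // `send_if_modified` only notifies listeners if the closure reports
        // a change, so a failed retry that produces the same error set (or
        // the same disk set) doesn't wake anyone up.
        self.disks_tx.send_if_modified(|current| {
            if *current == disks {
                false
            } else {
                *current = disks;
                true
            }
        });
        self.errors_tx.send_if_modified(|current| {
            if *current == errors {
                false
            } else {
                *current = errors;
                true
            }
        });
    }
}
```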
Nothing too fancy here. Much of the code is around funneling results out through watch channels.
The only new behavior beyond what `sled-storage` does is that this will periodically retry calling `Disk::new()` on internal disks for which `Disk::new()` previously failed. I'm very open to feedback here - maybe this is wrong because the most common errors are fatal anyway (e.g., the disk isn't formatted correctly)? Maybe it's okay but should be less frequent? Maybe it's okay but we should check for specific kinds of errors we believe to be fatal and not retry in those cases?
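For context, that retry behavior amounts to something like the following sketch (function names and backoff constants here are illustrative, not taken from the PR):

```rust
use std::time::Duration;

// Hypothetical wrapper around the real adoption logic (`Disk::new()` etc.).
async fn try_adopt_disk(disk_id: &str) -> Result<(), String> {
    let _ = disk_id;
    Ok(())
}

// Keep retrying a previously failed adoption, backing off until we're only
// retrying roughly once a minute.
async fn adopt_with_retries(disk_id: &str) {
    let mut delay = Duration::from_secs(1);
    let max_delay = Duration::from_secs(60);

    loop {
        match try_adopt_disk(disk_id).await {
            Ok(()) => return,
            Err(err) => {
                // In the real task this error would be reported out through
                // the errors watch channel rather than just logged.
                eprintln!("failed to adopt {disk_id}: {err}");
                tokio::time::sleep(delay).await;
                delay = (delay * 2).min(max_delay);
            }
        }
    }
}
```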