daemon: Respect systemd inhibitor locks #2862
Conversation
Force-pushed from 654f797 to ce553a0
Can you link to coreos/zincati#245 somewhere in the commit message?

It's not obvious to me that we want to default to failing on block inhibitors for all reboot operations. I think the behaviour makes sense for `finalize-deployment`, since it's tailor-made for Zincati and we do want that there. But otherwise, ISTM like we should do something like matching the systemd behaviour of respecting inhibitors if interactive, and not respecting them otherwise. (This implies having a similar `--check-inhibitors` flag and proxying that via a new `check-inhibitors` option in the D-Bus methods.) This also technically breaks any script/service which relies on the current behaviour of ignoring inhibitors.

I also think that even in Zincati we should enforce some time limit after which it should just ignore inhibitors. That will prevent some buggy service from breaking updates entirely.
So my strawman is:

- make `FinalizeDeployment` fail on block inhibitors by default
- have `rpm-ostree finalize-deployment` exit with e.g. code 66 if inhibitors blocked us
- add a `finalize-deployment --ignore-inhibitors` to force finalization and reboot even if inhibitors are present
- teach Zincati to use `--ignore-inhibitors` if we've been getting error code 66 after X amount of time (probably would have to make this configurable; and then users can opt into blocking forever if they really want that using e.g. `<newknob>=-1`)
- leave all other rpm-ostree APIs alone for now

WDYT?
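For illustration, here is a minimal sketch (GLib C) of how the CLI side of that strawman might map an "inhibitors blocked us" failure to the proposed exit code 66 so a driver like Zincati can tell it apart from other failures. The error domain, the `finalize_deployment()` stand-in, and the flag handling are assumptions made for the sketch, not rpm-ostree's actual code:

```c
/* Illustrative only: map "reboot blocked by inhibitors" to the proposed
 * exit code 66 on the CLI side.  The error domain and the stand-in helper
 * below are hypothetical, not rpm-ostree's real API. */
#include <gio/gio.h>

#define EXIT_INHIBITED 66   /* proposed "inhibitors blocked us" exit code */

/* Hypothetical error domain for "blocked by a systemd inhibitor lock". */
static GQuark
inhibited_error_quark (void)
{
  return g_quark_from_static_string ("example-inhibited-error");
}
#define INHIBITED_ERROR (inhibited_error_quark ())

/* Stand-in for invoking the FinalizeDeployment D-Bus method. */
static gboolean
finalize_deployment (gboolean ignore_inhibitors, GError **error)
{
  if (!ignore_inhibitors)
    {
      /* Pretend the daemon reported a block-mode shutdown inhibitor. */
      g_set_error (error, INHIBITED_ERROR, 0,
                   "Reboot blocked by a systemd inhibitor lock");
      return FALSE;
    }
  return TRUE;
}

int
main (int argc, char **argv)
{
  gboolean ignore = (argc > 1 && g_str_equal (argv[1], "--ignore-inhibitors"));
  g_autoptr(GError) error = NULL;

  if (!finalize_deployment (ignore, &error))
    {
      g_printerr ("error: %s\n", error->message);
      /* A distinct exit code lets the caller retry for a while and only
       * later re-run with --ignore-inhibitors. */
      if (g_error_matches (error, INHIBITED_ERROR, 0))
        return EXIT_INHIBITED;
      return 1;
    }
  return 0;
}
```

Under the strawman above, Zincati would retry on exit code 66 until its configured deadline passes, and only then re-invoke the command with `--ignore-inhibitors`.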
@jlebon Thanks for the thorough feedback and review! Your strawman makes sense to me. Just a comment on the last note though: there currently isn't any way to block rpm-ostree using systemd inhibitors. For example, if I hold a lock with the root user and then do an […]
This is indeed a concern, but should we at least provide a way to get interactive rpm-ostree CLI calls to respect inhibitors as well?
I think we can debate this more, though offhand, I think it would make sense to match systemd like I mentioned above (not ignore it if interactive, ignore otherwise). This has the least chance of breaking workflows.
Right yeah, if we go with matching systemd, we could have a similar `--check-inhibitors` flag. But honestly, unless there's a request for this outside of the Zincati context, I would just punt on this for now for the other APIs and focus on the Zincati path.
I think the systemd default behavior is basically just broken. It makes […]. The problem here really revolves around what happens when one just writes […]. If we make the default in rpm-ostree to honor them, we bypass all of that. I just can't think of a real world scenario where: […]
But that's creating a whole new kind of […]. And things get worse if e.g. multiple components start implementing this […]. I feel pretty strongly that if an admin (or privileged code) adds a […]. Our default is already to reboot autonomously. This mechanism comes into play when an admin wants more control over their computer. And we should respect that.
Yeah, it does seem weird to me how the default upstream is that there are cases (e.g. scripts, non-interactive code) that are immune to inhibitors in […]
My core argument here for not being consistent with systemd's default […]. The semantics and control over it are hence greatly magnified in importance. The […] I think we just fix this at our level, and then, related to the concern around "what if there is a […]"
My concern is more with systemd services which e.g. take a lock and then hang. That would potentially create dead ends in the update graph and require manual intervention to unbreak. ISTM like at some point, we should be able to make the call and say "no reasonable service should be inhibiting updates for X days; ignore it". For the interactive admin case, we already have coreos/zincati#485 which helps there, and I think we can expect admins to simply do […]
I think for the desktop case that's true. For the server case/non-interactive cases, it's not unreasonable IMO to have whatever is driving updates provide a grace period before going ahead and rebooting anyway. Kinda like systemd will just kill processes after some time if they're blocking reboot.
Do you know of a scenario where that's happened? And furthermore, one where the admin would actually prefer us to ignore it and reboot anyway after some period of time? I can say for sure I know of one RH customer who really, really, really wanted absolute control over when individual machines rebooted. They have once-a-month maintenance windows and need some manual action taken around managing workloads. In this scenario, I don't think there's any default value for a "timeout" that they'd want. I'd love to be able to tell them "Hey, we fixed systemd inhibitors, so you can now use those and e.g. the MCO will honor them".
I think that analogy doesn't quite apply here because there the admin (or a privileged process) has explicitly initiated a reboot; we're trying to honor its will. We necessarily only give (mostly unprivileged) processes a grace period to delay. Speaking of "honoring its will", that's the simple way to say this: who is in charge? Is it the admin, or code initiating a […]?
This is hard to answer because that's exactly the behaviour we're changing now (going from not caring to caring about systemd inhibitors). So we may have hit this 0 times or 10 times in the past. I realize this isn't a convincing argument. But I'm also very conservative when it comes to avoiding deadends.
IMO, that doesn't really sound like a good use case for systemd inhibitors. In a Zincati-controlled scenario, they'd be better served just having coreos/zincati#498 and doing […]
Yeah, that's a good way to look at it. If I had to order them, it'd be: admins, Zincati/MCO, and everything else. Since we can't easily tell apart admins from "everything else", we're left having to make this call. I think it's a tough decision, and in the end I'm fine going with whatever the team decides! 👍
I think we also want to guard against other things that might invoke […]
No, but it could learn to check inhibitors in the same way we're doing it here.
Random note from chat; I remembered that […]. It's not installed in FCOS by default, but seems to be in e.g. FSB.
Ah OK, finally discovered that the CI errors are caused by rpm-ostree/tests/common/libvm.sh line 102 (in 15e945c): we need systemd-logind.service to be enabled since we need to call systemd-logind's D-Bus interface's ListInhibitors method.
Ahh yes, this dates from a while ago now. Feel free to drop that and let's see if that bug is fixed now!
Force-pushed from ce553a0 to fb7ae9a
Sample error output when trying to reboot with inhibitor locks present.

Common case: […]

Multiple […]
Force-pushed from b7c682b to 8806593
@@ -2582,6 +2598,10 @@ finalize_deployment_transaction_execute (RpmostreedTransaction *transaction,
   OstreeSysroot *sysroot = rpmostreed_transaction_get_sysroot (transaction);
   OstreeRepo *repo = ostree_sysroot_repo (sysroot);

+  // Check for inhibitor locks before unlocking staged deployment.
+  if (!check_sd_inhibitor_locks (cancellable, error))
We're doing this twice in this function?
Ah, oops. I had this for ease of manual testing and forgot to remove it :(
src/daemon/rpmostreed-utils.cxx (Outdated)

if (!inhibitors_array_tuple)
  return glnx_prefix_error (error, "Checking systemd inhibitor locks");
else if (g_variant_n_children (inhibitors_array_tuple) < 1)
  return glnx_throw (error, "LoadInhibitors returned empty tuple");
Suggested change:
- return glnx_throw (error, "LoadInhibitors returned empty tuple");
+ return glnx_throw (error, "ListInhibitors returned empty tuple");
GError **error)
{
  GDBusConnection *connection = rpmostreed_daemon_connection ();
  g_autoptr(GVariant) inhibitors_array_tuple =
Maybe a link around here to the systemd D-Bus API docs for this?
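For reference, here is a self-contained sketch of querying that API: logind's `org.freedesktop.login1.Manager.ListInhibitors` method returns an `a(ssssuu)` array of (what, who, why, mode, uid, pid) tuples. This version opens its own system-bus connection with `g_bus_get_sync` instead of reusing the daemon's connection helper, so treat it as an illustration of the D-Bus call rather than the PR's actual code:

```c
/* Sketch only: count block-mode shutdown inhibitors via logind's
 * ListInhibitors D-Bus method (org.freedesktop.login1.Manager).
 * Docs: https://www.freedesktop.org/software/systemd/man/org.freedesktop.login1.html */
#include <gio/gio.h>
#include <string.h>

static gboolean
count_block_shutdown_inhibitors (guint *out_count,
                                 GCancellable *cancellable,
                                 GError **error)
{
  g_autoptr(GDBusConnection) connection =
    g_bus_get_sync (G_BUS_TYPE_SYSTEM, cancellable, error);
  if (!connection)
    return FALSE;

  /* Reply type is (a(ssssuu)): what, who, why, mode, uid, pid. */
  g_autoptr(GVariant) reply =
    g_dbus_connection_call_sync (connection,
                                 "org.freedesktop.login1",
                                 "/org/freedesktop/login1",
                                 "org.freedesktop.login1.Manager",
                                 "ListInhibitors", NULL,
                                 G_VARIANT_TYPE ("(a(ssssuu))"),
                                 G_DBUS_CALL_FLAGS_NONE, -1,
                                 cancellable, error);
  if (!reply)
    return FALSE;

  g_autoptr(GVariant) inhibitors = g_variant_get_child_value (reply, 0);
  GVariantIter viter;
  g_variant_iter_init (&viter, inhibitors);

  guint count = 0;
  const char *what = NULL;
  const char *mode = NULL;
  /* "&s" borrows the strings; NULL skips the fields we don't care about. */
  while (g_variant_iter_loop (&viter, "(&s&s&s&suu)",
                              &what, NULL, NULL, &mode, NULL, NULL))
    {
      /* `what` is a colon-separated list, e.g. "shutdown:sleep"; only
       * block-mode shutdown inhibitors should block us. */
      if (strstr (what, "shutdown") && g_str_equal (mode, "block"))
        count++;
    }

  *out_count = count;
  return TRUE;
}
```

Filtering on `what` containing "shutdown" and `mode` equal to "block" is what lets `delay`-mode inhibitors (which only postpone shutdown briefly) be ignored, per the commit message's intent.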
src/daemon/rpmostreed-utils.cxx (Outdated)

while (g_variant_iter_loop (&viter, "(ssssuu)", &what, &who, &why, &mode, NULL, NULL))
  {
    // Only consider shutdown inhibitors in `block` mode.
    if (strstr (what, "shutdown") && strcmp (mode, "block") == 0)
Minor/optional: could also do:

- if (strstr (what, "shutdown") && strcmp (mode, "block") == 0)
+ if (strstr (what, "shutdown") && g_str_equal (mode, "block"))
src/daemon/rpmostreed-utils.cxx (Outdated)

char *what;
char *who;
char *why;
char *mode;
Probably better to initialize all those to NULL in case it triggers static analyzers.
src/daemon/rpmostreed-utils.cxx (Outdated)

while (g_variant_iter_loop (&viter, "(ssssuu)", &what, &who, &why, &mode, NULL, NULL))
  {
    // Only consider shutdown inhibitors in `block` mode.
    if (strstr (what, "shutdown") && strcmp (mode, "block") == 0)
Let's add braces here to avoid nesting single statements like this?
src/daemon/rpmostreed-utils.cxx (Outdated)

  {
    // Only consider shutdown inhibitors in `block` mode.
    if (strstr (what, "shutdown") && strcmp (mode, "block") == 0)
      if (++num_block_inhibitors == 1)
Heh, prefix increment, don't see those often nowadays. Combined with the previous comment to make it a block, I'd suggest just splitting that out to be easier to read/harder to mess up, e.g.:

{
  if (num_block_inhibitors == 0)
    ...
  num_block_inhibitors++;
}

?
Haha, yeah I tried pretty hard to not use braces there.
I thought it'd look simpler to not have those two extra lines.
tests/common/libvm.sh (Outdated)

- # XXX: hack around https://github.com/systemd/systemd/issues/14328
- vm_cmd systemctl mask --now systemd-logind
+ # # XXX: hack around https://github.com/systemd/systemd/issues/14328
+ # vm_cmd systemctl mask --now systemd-logind
Looks like this works now then? Feel free to just delete these lines entirely.
Oh man, my bad. I think I've been debugging these bits of code for so long that my eyes just got fatigued, leading to this bunch of errors 😿
tests/vmcheck/test-misc-2.sh (Outdated)

assert_file_has_content err.txt 'Reboot blocked'
vm_cmd test -f /run/ostree/staged-deployment-locked # Make sure that staged deployment is still locked.
vm_cmd touch /run/wakeup
sleep 0.1 # Wait a bit for the process holding the lock to exit.
Should we just bump it to 1s to be tolerant of e.g. busy CI systems?
Force-pushed from 8806593 to 5249272
Today, non-interactive processes, including rpm-ostree, are not inhibited by systemd inhibitor locks. systemd 248 added the `--check-inhibitors` option to `systemctl` commands, which can be used to make non-interactive processes check for shutdown/sleep inhibitors. However, it also checks for `delay` mode inhibitors, which isn't what we want. Replicate this logic prior to reboots, but ignore `delay` mode inhibitors.
Force-pushed from 5249272 to a7e7095
if (strstr (what, "shutdown") && g_str_equal (mode, "block"))
  {
    num_block_inhibitors++;
    if (num_block_inhibitors == 1)
This is fine as is, though I think we could simplify the logic by doing `const guint num_block_inhibitors = g_variant_n_children (inhibitors_array);` and dispatch on that. But fine as is!