
daemon fails to start when inotify initialization fails (out of watches?) #944

Closed
aendi123 opened this issue Aug 29, 2021 · 5 comments

@aendi123

Describe the bug
The rpm-ostreed.service fails: sometimes directly on boot, sometimes it runs for a few minutes and then crashes. If it stays up for a few minutes after booting, rollbacks work normally. Updates are applied correctly by Zincati, but the new version is never activated on the next reboot, and after rebooting I can no longer see it in rpm-ostree status.

The following logs are produced when I try to run systemctl start rpm-ostreed after a crash:

Aug 29 22:28:52 inuc-srv-001 systemd[1]: Starting rpm-ostree System Management Daemon...
Aug 29 22:28:52 inuc-srv-001 rpm-ostree[50957]: Reading config file '/etc/rpm-ostreed.conf'
Aug 29 22:28:53 inuc-srv-001 rpm-ostree[50957]: error: Couldn't start daemon: Error setting up sysroot: Unable to find default local file monitor type
Aug 29 22:28:53 inuc-srv-001 systemd[1]: rpm-ostreed.service: Main process exited, code=exited, status=1/FAILURE
Aug 29 22:28:53 inuc-srv-001 systemd[1]: rpm-ostreed.service: Failed with result 'exit-code'.

Reproduction steps
I have done nothing special. The machine was set up about a year ago with Fedora CoreOS 33 and now runs 34. Updates have always been installed automatically, and it is used as a Rancher RKE node. Two other machines with the exact same config and function don't have this problem.

Expected behavior
rpm-ostree should start and keep running, and updates should be activated on reboot.

Actual behavior
rpm-ostree crashes on boot or after a few minutes; I am unable to install updates, though rollbacks work.

System details

  • Bare Metal
  • Fedora CoreOS 34.20210711.3.0 and 34.20210725.3.0 (neither works)
@lucab
Contributor

lucab commented Aug 30, 2021

Thanks for the report.
This is the place where the failure surfaces: https://github.com/coreos/rpm-ostree/blob/v2021.10/src/daemon/rpmostreed-sysroot.cxx#L782-L785. However, the underlying error comes from the GIO file-monitor (i.e. inotify) library.

I suspect that something bad is going on on your nodes. Specifically, I think the rpm-ostree daemon is now unable to set up an inotify watcher because the service (or the system as a whole) is hitting a resource limit.

The inotify_init(2) manpage lists several resources that can get exhausted.
It would be good if you could double-check your system resources and verify whether the node is running close to any of those limits.
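As a concrete way to check this (a diagnostic sketch using standard Linux /proc interfaces; nothing here is rpm-ostree specific), you can read the current limits and count open inotify instances per process:

```shell
# Current inotify limits
cat /proc/sys/fs/inotify/max_user_instances
cat /proc/sys/fs/inotify/max_user_watches

# Count open inotify file descriptors per PID; a total close to
# max_user_instances for a single user suggests instance exhaustion
find /proc/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null \
  | cut -d/ -f3 | sort | uniq -c | sort -rn | head
```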

Another quick smoke test would be to temporarily stop/drain all the workloads on the node (especially all the ones running as root) and see whether the service starts reacting to a simple rpm-ostree status at that point.

Self-note: this failure mode is quite opaque, but I did some quick code walking in gio and it seems to be compatible with a failure when calling inotify_init() at https://github.com/GNOME/glib/blob/2.68.4/gio/inotify/inotify-kernel.c#L390.

@cgwalters cgwalters changed the title rpm-ostree not working anymore daemon fails to start when inotify initialization fails (out of watches?) Aug 30, 2021
@aendi123
Author

Thank you so much for the great input; I think that was the reason. I stopped docker (since the node only runs container workloads) and rpm-ostreed worked again. After that I increased the inotify limits in sysctl.conf, and it then worked even with docker running:

fs.inotify.max_user_watches = 999999
fs.inotify.max_queued_events = 999999
fs.inotify.max_user_instances = 999999
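A note on persistence: editing /etc/sysctl.conf works, but the more conventional place for such overrides is a drop-in under /etc/sysctl.d/ (the filename below is just an example), which is applied at boot and can be reloaded with `sudo sysctl --system`:

```
# /etc/sysctl.d/90-inotify.conf (example filename)
fs.inotify.max_user_watches = 999999
fs.inotify.max_queued_events = 999999
fs.inotify.max_user_instances = 999999
```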

However, a new problem appeared: updating still doesn't work. Every time the host starts, Zincati installs the newest update correctly and I can see it in rpm-ostree status. But when I reboot, the new deployment is no longer listed.

This is the log starting at a fresh boot until Zincati tries to update again after a reboot.
log.txt

@lucab
Contributor

lucab commented Aug 31, 2021

@aendi123 it would be helpful to see the logs from the zincati.service journal too, to understand what's happening.

However, as you said "when I reboot", I suspect you are forcing a manual reboot instead of letting Zincati reboot the machine to finalize the update.
If that is the case, the behavior you see is expected: FCOS is designed so that a random reboot at any time (e.g. a power loss or a manual trigger) does not switch to a different OS version.

@dustymabe
Member

However, as you said "when I reboot", I suspect you are forcing a manual reboot instead of letting Zincati reboot the machine to finalize the update.

I think this expected "workflow" ties back into coreos/zincati#498

@aendi123
Author

aendi123 commented Sep 2, 2021

Oops, you are absolutely right. It works correctly when Zincati reboots the host.

@aendi123 aendi123 closed this as completed Sep 2, 2021