
daemon fails to start when inotify initialization fails (out of watches?) #944

Closed
aendi123 opened this issue Aug 29, 2021 · 5 comments

@aendi123

Describe the bug
The rpm-ostreed.service fails: sometimes directly on boot, sometimes it runs for a few minutes and then crashes. If it stays up for a few minutes after booting, rollbacks work normally. Updates are applied correctly by Zincati, but the new version is never activated on the next reboot, and after rebooting I can no longer see it in rpm-ostree status.

The following logs are produced when I try to run systemctl start rpm-ostreed after a crash:

Aug 29 22:28:52 inuc-srv-001 systemd[1]: Starting rpm-ostree System Management Daemon...
Aug 29 22:28:52 inuc-srv-001 rpm-ostree[50957]: Reading config file '/etc/rpm-ostreed.conf'
Aug 29 22:28:53 inuc-srv-001 rpm-ostree[50957]: error: Couldn't start daemon: Error setting up sysroot: Unable to find default local file monitor type
Aug 29 22:28:53 inuc-srv-001 systemd[1]: rpm-ostreed.service: Main process exited, code=exited, status=1/FAILURE
Aug 29 22:28:53 inuc-srv-001 systemd[1]: rpm-ostreed.service: Failed with result 'exit-code'.

Reproduction steps
I have done nothing special. The machine was set up about a year ago with Fedora CoreOS 33 and now runs 34. Updates have always been installed automatically, and it is used as a Rancher RKE node. Two other machines with the exact same config and function don't have this problem.

Expected behavior
rpm-ostree should start and keep running, and updates should be activated on reboot.

Actual behavior
rpm-ostree crashes on boot or after a few minutes; I am unable to install updates, though rollbacks work.

System details

  • Bare Metal
  • Fedora CoreOS 34.20210711.3.0 and 34.20210725.3.0 (neither works)
@lucab
Contributor

lucab commented Aug 30, 2021

Thanks for the report.
This is the place where the failure surfaces: https://github.com/coreos/rpm-ostree/blob/v2021.10/src/daemon/rpmostreed-sysroot.cxx#L782-L785. However, the underlying error comes from the GIO file-monitor (i.e. inotify) library.

I suspect that something bad is going on on your nodes. Specifically, I think the rpm-ostree daemon is now unable to set up an inotify watcher because the service (or the system as a whole) is hitting a resource limit.

The inotify_init(2) manpage lists several resources that can get exhausted.
It would be good if you could double-check your system resources and verify whether the node is running close to any of those limits.
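As a concrete way to check this (a diagnostic sketch using standard Linux /proc interfaces; nothing here is rpm-ostree specific), you can read the current limits and count open inotify instances per process:

```shell
# Current inotify limits
cat /proc/sys/fs/inotify/max_user_instances
cat /proc/sys/fs/inotify/max_user_watches

# Count open inotify file descriptors per PID; a total close to
# max_user_instances for a single user suggests instance exhaustion
find /proc/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null \
  | cut -d/ -f3 | sort | uniq -c | sort -rn | head
```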

Another quick smoke test would be to temporarily stop/drain all the workloads on the node (especially all the ones running as root) and see whether the service starts reacting to a simple rpm-ostree status at that point.

Self-note: this failure mode is quite opaque, but I did some quick code walking in gio and it seems to be compatible with a failure when calling inotify_init() at https://github.com/GNOME/glib/blob/2.68.4/gio/inotify/inotify-kernel.c#L390.

@cgwalters cgwalters changed the title rpm-ostree not working anymore daemon fails to start when inotify initialization fails (out of watches?) Aug 30, 2021
@aendi123
Author

Thank you so much for the great input; I think that was the reason. I stopped docker (since the node only runs container workloads) and rpm-ostreed worked again. After that I increased the inotify limits in sysctl.conf, and it then worked even with docker running:

fs.inotify.max_user_watches = 999999
fs.inotify.max_queued_events = 999999
fs.inotify.max_user_instances = 999999
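A note on persistence: editing /etc/sysctl.conf works, but the more conventional place for such overrides is a drop-in under /etc/sysctl.d/ (the filename below is just an example), which is applied at boot and can be reloaded with `sudo sysctl --system`:

```
# /etc/sysctl.d/90-inotify.conf (example filename)
fs.inotify.max_user_watches = 999999
fs.inotify.max_queued_events = 999999
fs.inotify.max_user_instances = 999999
```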

However, a new problem appeared: updating still doesn't work. Every time the host starts, Zincati installs the newest update correctly and I can see it in rpm-ostree status. But when I reboot, the new deployment is no longer listed.

This is the log starting at a fresh boot until Zincati tries to update again after a reboot.
log.txt

@lucab
Contributor

lucab commented Aug 31, 2021

@aendi123 it would be helpful to see the logs from the zincati.service journal too, to understand what's happening.

However, as you said "when I reboot", I suspect you are forcing a manual reboot instead of letting Zincati reboot the machine to finalize the update.
If that is the case, the behavior you see is expected: FCOS is designed so that a random reboot at any time (e.g. a power loss or a manual trigger) does not switch to a different OS version.

@dustymabe
Member

However, as you said "when I reboot", I suspect you are forcing a manual reboot instead of letting Zincati reboot the machine to finalize the update.

I think this expected "workflow" ties back into coreos/zincati#498

@aendi123
Author

aendi123 commented Sep 2, 2021

Oops, you are absolutely right. It works correctly when Zincati reboots the host.

@aendi123 aendi123 closed this as completed Sep 2, 2021