Restarting the Zincati Service fails randomly #671

Closed
redRolf opened this issue Nov 4, 2021 · 7 comments

redRolf commented Nov 4, 2021

Bug Report

The following happened:

Our CI/CD pipeline updates the Zincati configuration whenever changes are made. After the new .toml files have been uploaded, the zincati service is restarted to load the latest configuration, using the command: sudo systemctl restart zincati.service. Lately, however, the restart has been failing; the error message is shown below. Not only does this cause the CI/CD pipeline to fail, it also leaves the server in a deadlocked state in which no applications running on it respond and even SSH connections cannot be established. What is causing this issue, and how can I prevent it?

● zincati.service - Zincati Update Agent
     Loaded: loaded (/usr/lib/systemd/system/zincati.service; enabled; vendor preset: enabled)
     Active: activating (start) since Tue 2021-11-02 19:47:47 UTC; 1s ago
       Docs: https://github.com/coreos/zincati
   Main PID: 940074 (zincati)
      Tasks: 7 (limit: 9430)
     Memory: 1.4M
        CPU: 58ms
     CGroup: /system.slice/zincati.service
             └─940074 /usr/libexec/zincati agent -v

Nov 02 19:47:47 re.intra.redguard.ch-fcos systemd[1]: Starting Zincati Update Agent...
Nov 02 19:47:47 re.intra.redguard.ch-fcos zincati[940074]: [INFO  zincati::cli::agent] starting update agent (zincati 0.0.23)
Nov 02 19:47:47 re.intra.redguard.ch-fcos zincati[940074]: [INFO  zincati::cincinnati] Cincinnati service: https://updates.coreos.fedoraproject.org
Nov 02 19:47:47 re.intra.redguard.ch-fcos zincati[940074]: [INFO  zincati::cli::agent] agent running on node '14c0d09360844ae5a6bed8904e81eefa', in update group 'default'
Nov 02 19:47:47 re.intra.redguard.ch-fcos zincati[940074]: [INFO  zincati::update_agent::actor] registering as the update driver for rpm-ostree
Nov 02 19:47:48 re.intra.redguard.ch-fcos zincati[940074]: [ERROR zincati::rpm_ostree::cli_deploy] rpm-ostree deploy --register-driver failed:
Nov 02 19:47:48 re.intra.redguard.ch-fcos zincati[940074]:     error: Transaction in progress: deploy --lock-finalization revision=a44c3b4d10b94db300d420cba76249b6c6de368fa1f93613796e50d3ee8b3568 --disallow-downgrade
Nov 02 19:47:48 re.intra.redguard.ch-fcos zincati[940074]:      You can cancel the current transaction with `rpm-ostree cancel`
Nov 02 19:47:48 re.intra.redguard.ch-fcos zincati[940074]:
Nov 02 19:47:48 re.intra.redguard.ch-fcos zincati[940074]:     retrying in 1s

Environment

What hardware/cloud provider/hypervisor is being used?

Exoscale FCOS Template

Expected Behavior

The command sudo systemctl restart zincati.service restarts the service without failing.

Actual Behavior

Reproduction Steps

  1. Run Fedora CoreOS.
  2. Run the command: sudo systemctl restart zincati.service. This may randomly fail and cause a timeout.

Other Information

I wasn't quite sure what information would be helpful, so I tried to include only what I thought was most relevant. If you would like any other logs, or want me to test things, I am more than happy to oblige.

Contributor

lucab commented Nov 4, 2021

It looks like you are restarting Zincati in the middle of an upgrade, which then leaves rpm-ostreed.service busy dealing with the new deployment. It should be possible to recover from that state by canceling the transaction or restarting the rpm-ostree daemon.
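The recovery described above could be scripted roughly as follows. This is only a sketch: `rpm-ostree cancel` is the command suggested by the error message in the log, and `rpm-ostreed.service` is the daemon unit named in this thread. The function name `recover_rpm_ostree` and the `RPM_OSTREE`/`SYSTEMCTL` override variables are hypothetical and exist only so the logic can be exercised without touching a real host. It assumes the script runs as root.

```shell
#!/bin/sh
# Sketch: clear a leftover rpm-ostree transaction (run as root).
# RPM_OSTREE and SYSTEMCTL are hypothetical overrides, useful for dry runs.
RPM_OSTREE="${RPM_OSTREE:-rpm-ostree}"
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

recover_rpm_ostree() {
    # First choice: cancel the in-flight transaction, as the error message suggests.
    if "$RPM_OSTREE" cancel; then
        return 0
    fi
    # Fallback: restart the rpm-ostree daemon to drop its pending state.
    "$SYSTEMCTL" restart rpm-ostreed.service
}
```

Calling `recover_rpm_ostree` before `systemctl restart zincati.service` is one possible ordering; whether cancel alone is enough in every case is not confirmed in this thread.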

Taking a step back, I think what you need is a Reload method, which we currently don't have.
Though it's unclear to me how you ended up in a scenario where you want to change the Zincati configuration on the fly. As you can see from this bug, you are racing with upgrades that are happening under your current/previous configuration.

Contributor

lucab commented Nov 4, 2021

For reference, the underlying bug is that we leave behind a transaction running in the rpm-ostree daemon, even after the client has disappeared. This came up already in coreos/rpm-ostree#3194 (comment) and we should enhance the daemon so that the lifetime of the transaction is automatically bound to the caller.

Author

redRolf commented Nov 4, 2021

@lucab thank you very much for the clarifications. I was suspecting that something along those lines is happening.

To your question on how I ended up in this situation:

  • We have a maintenance window every Tuesday evening, so I created a .toml file that only allows Zincati to reboot the server on Tuesday evenings between 23:00 and 00:00.
  • Once a month, a fresh deployment is made to the server using a CI/CD pipeline (this pipeline runs before 23:00). During the deployment, the .toml file for the Zincati service configuration is updated (just in case modifications were made to it).
  • Naturally, after the .toml config file is updated, the CI/CD pipeline tries to restart the zincati service.

So that is how I arrived at this situation. The problem appears to be that Zincati is trying to finalize an update and is waiting for its time window to reboot the server. Then, once a month, my CI/CD pipeline comes along and tries to restart the service, which fails because Zincati is in the middle of finalizing the update (understandable).

So if I swap the order, i.e. allow Zincati to finalize updates before running the CI/CD pipeline, I should be able to mitigate this problem to a large extent.

A Reload function would be awesome 😍 but I do fully understand that these features take time, effort and resources.

Aside: by "finalizing the update" I mean that Zincati is either actively installing updates or just waiting to reboot the server 😇 For my case and the suggested approach, it is not relevant which one it is.

I hope I understood you correctly :)

Contributor

lucab commented Nov 5, 2021

Thanks for the additional context.

Yes, it looks like you are currently racing with Zincati trying to eagerly fetch/stage updates beforehand (so that they are ready to be applied as soon as your configuration allows it).
But you could as well race with the finalization of updates (and rebooting), depending on the specific configuration and timings.

Unfortunately I don't currently have a perfect solution to suggest. Some mitigations could be:

  • having the CD job check rpm-ostree status for pending transactions before doing any restart
  • having the CD job restart both zincati and rpm-ostreed so that all previous pending state is flushed
  • implementing some graceful reload/restart in zincati itself (though in general not all old→new configuration combinations are possible)
  • implementing client-bound transactions in rpm-ostreed
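The first mitigation in the list above could look something like the sketch below for a CD job. It assumes that `rpm-ostree status` prints a `State: idle` line when no transaction is running; that output format is an assumption worth verifying on the target host. The function names, the polling defaults, and the `RPM_OSTREE`/`SYSTEMCTL` override variables are all hypothetical. It assumes the job runs as root.

```shell
#!/bin/sh
# Sketch: restart Zincati only once rpm-ostree reports no running transaction.
# Assumption: `rpm-ostree status` prints "State: idle" when nothing is in progress.
# RPM_OSTREE and SYSTEMCTL are hypothetical overrides, useful for dry runs.
RPM_OSTREE="${RPM_OSTREE:-rpm-ostree}"
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

rpm_ostree_is_idle() {
    # A failing or empty status output counts as "not idle".
    "$RPM_OSTREE" status | grep -q '^State: idle'
}

restart_zincati_when_idle() {
    attempts="${1:-30}"   # how many times to poll
    delay="${2:-10}"      # seconds to wait between polls
    while [ "$attempts" -gt 0 ]; do
        if rpm_ostree_is_idle; then
            "$SYSTEMCTL" restart zincati.service
            return $?
        fi
        attempts=$((attempts - 1))
        sleep "$delay"
    done
    echo "rpm-ostree still busy; not restarting zincati" >&2
    return 1
}
```

A pipeline step could call `restart_zincati_when_idle 30 10` and fail the job on a non-zero exit. This narrows the race rather than removing it, since a new transaction can still start between the check and the restart.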

Author

redRolf commented Nov 5, 2021

Good morning :) Thank you very much for your help and your suggested mitigations. I will try these :)
From my standpoint, we can close this issue, as these mitigations should solve my problem.

Thank you again and have a great day and weekend.

Contributor

lucab commented Nov 5, 2021

Ack, thanks! I will forward the last two bullet items to separate tickets (no ETA though, both of them may require quite a bit of work) and then close this.

Contributor

lucab commented Nov 5, 2021

Followup tickets at #673 and coreos/rpm-ostree#3206.

lucab closed this as completed Nov 5, 2021