zincati sticks with staged deployments even if newer is available #928

dustymabe · 2023-01-13T19:07:25Z

Bug Report

If I have zincati set to only update on, say, the weekends:

# cat /etc/zincati/config.d/51-weekend-updates.toml 
# start at 11:00 UTC - 6AM EST
[updates]
strategy = "periodic"
[[updates.periodic.window]]
days = [ "Sat", "Sun" ]
start_time = "11:00"
length_minutes = 60

I'd expect that if a new build becomes available before my update window opens, my system would delete the pending/staged one and move on to the newer one.
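To make the timing concrete, here is a minimal sketch of how the window in the config above translates into "when can this node next reboot". It is not Zincati's implementation: the function names are made up for illustration, times are assumed to be UTC (as the config comment says), and windows that cross midnight are ignored.

```python
from datetime import datetime, timedelta, timezone

# Mirrors the [[updates.periodic.window]] entry from the config above.
WINDOWS = [{"days": ["Sat", "Sun"], "start_time": "11:00", "length_minutes": 60}]

# Same order as datetime.weekday(): Monday is 0.
DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def window_bounds(window, day):
    """Return (start, end) of `window` on the calendar date of `day`."""
    hour, minute = map(int, window["start_time"].split(":"))
    start = day.replace(hour=hour, minute=minute, second=0, microsecond=0)
    return start, start + timedelta(minutes=window["length_minutes"])


def next_window_start(now, windows=WINDOWS):
    """Return `now` if a window is open, else the start of the next window."""
    for offset in range(8):  # a weekly schedule repeats within 7 days
        day = now + timedelta(days=offset)
        candidates = []
        for window in windows:
            if DAY_NAMES[day.weekday()] not in window["days"]:
                continue
            start, end = window_bounds(window, day)
            if start <= now < end:
                return now  # already inside an open window
            if start >= now:
                candidates.append(start)
        if candidates:
            return min(candidates)
    return None  # no window configured


# Time the update was staged, taken from the journal output in this report.
staged_at = datetime(2023, 1, 11, 7, 8, 56, tzinfo=timezone.utc)
print(next_window_start(staged_at))              # 2023-01-14 11:00:00+00:00
print(next_window_start(staged_at) - staged_at)  # 3 days, 3:51:04
```

With this schedule, an update staged on Wednesday morning waits more than three days before the node may reboot, which leaves plenty of time for a newer release to appear.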

For example, this week we released two testing builds: 37.20230107.2.0 on Tuesday and 37.20230110.2.0 on Thursday. My system saw and staged 37.20230107.2.0 on Wednesday. Here is the current status:

# systemctl status zincati | cat
● zincati.service - Zincati Update Agent
     Loaded: loaded (/usr/lib/systemd/system/zincati.service; enabled; preset: enabled)
     Active: active (running) since Sat 2023-01-07 11:06:01 UTC; 6 days ago
       Docs: https://github.com/coreos/zincati
   Main PID: 1151 (zincati)
     Status: "update staged: 37.20230107.2.0; reboot pending due to update strategy"
      Tasks: 8 (limit: 4581)
     Memory: 16.4M
        CPU: 3min 48.607s
     CGroup: /system.slice/zincati.service
             └─1151 /usr/libexec/zincati agent -v

Jan 10 16:43:12 apu2 zincati[1151]: [ERROR zincati::cincinnati] failed to check Cincinnati for updates: server-side error, code 502: (unknown/generic server error)
Jan 11 00:24:33 apu2 zincati[1151]: [ERROR zincati::cincinnati] failed to check Cincinnati for updates: server-side error, code 502: (unknown/generic server error)
Jan 11 05:39:54 apu2 zincati[1151]: [ERROR zincati::cincinnati] failed to check Cincinnati for updates: server-side error, code 502: (unknown/generic server error)
Jan 11 07:03:39 apu2 zincati[1151]: [INFO  zincati::update_agent::actor] target release '37.20230107.2.0' selected, proceeding to stage it
Jan 11 07:08:56 apu2 zincati[1151]: [INFO  zincati::update_agent::actor] update staged: 37.20230107.2.0

I would expect zincati to keep checking the update graph and, if the graph allows it, throw away the pending deployment and go straight to the newer one.

Environment

Local bare metal x86_64 machine.

Expected Behavior

Pending deployment gets thrown away and newer update gets staged.

Actual Behavior

Pending (older) deployment appears to continue to be staged.

Reproduction Steps

This is hard to reproduce because it requires the remote update server to be in certain states at different times. In summary:

Deploy a node with a periodic update strategy that only lets it update on certain days of the week.
Have a new release happen and have the node stage an update.
Have another release happen before the node's configured update window.
Notice that the system sticks with the old update and doesn't switch to the new one.

Other Information

[root@apu2 ~]# rpm-ostree status 
State: idle
AutomaticUpdatesDriver: Zincati
  DriverState: active; update staged: 37.20230107.2.0; reboot pending due to update strategy
Deployments:
  fedora:fedora/x86_64/coreos/testing
                  Version: 37.20230107.2.0 (2023-01-09T18:09:12Z)
               BaseCommit: 181c145a3c9e200439016bbc78ac3cce501f596c20f37fe927af5096f38b00fd
             GPGSignature: Valid signature by ACB5EE4E831C74BB7C168D27F55AD3FB5323552A
                     Diff: 52 upgraded, 1 removed, 1 added
          LayeredPackages: bridge-utils dmidecode firewalld flashrom iwd libimobiledevice libimobiledevice-utils lshw NetworkManager-wifi
                           pciutils speedtest-cli systemd-oomd-defaults tmux usbmuxd

● fedora:fedora/x86_64/coreos/testing
                  Version: 37.20221225.2.2 (2023-01-03T16:06:54Z)
               BaseCommit: e339f79de0d679296a875d8cb0c9d2fe39089f516ed14fb29705f472a85ccbd0
             GPGSignature: Valid signature by ACB5EE4E831C74BB7C168D27F55AD3FB5323552A
          LayeredPackages: bridge-utils dmidecode firewalld flashrom iwd libimobiledevice libimobiledevice-utils lshw NetworkManager-wifi
                           pciutils speedtest-cli systemd-oomd-defaults tmux usbmuxd

  fedora:fedora/x86_64/coreos/testing
                  Version: 37.20221225.2.1 (2022-12-26T16:01:30Z)
               BaseCommit: 5f6f5e6ec7ad1ad7c49f29a44bce2b8432dfecb876ad174e1cd29566eacf2da1
             GPGSignature: Valid signature by ACB5EE4E831C74BB7C168D27F55AD3FB5323552A
          LayeredPackages: bridge-utils dmidecode firewalld flashrom iwd libimobiledevice libimobiledevice-utils lshw NetworkManager-wifi
                           pciutils speedtest-cli systemd-oomd-defaults tmux usbmuxd
[root@apu2 ~]# 
[root@apu2 ~]# rpm -q zincati
zincati-0.0.25-1.fc37.x86_64


dustymabe · 2023-01-13T19:10:41Z

One particular reason this is important is that we typically only do ad hoc, out-of-cycle releases when bugs/regressions have been introduced. The current behavior means we can't prevent systems with periodic update strategies from booting into the buggy release.

Furthermore, their update window might not allow another update for some time, so they'd be on the buggy release for even longer.

jlebon · 2023-01-13T20:48:07Z

I think this is how the state machine was designed. As much work as possible is done upfront so that when the strategy says "go", it's just a simple reboot. Changing this sounds reasonable. E.g. in the worst case, if an update node's metadata changes to a deadend, that should absolutely block finalization and reset the state machine to go back to looking for the next update. In the case where the preferred node changed but the old node is still valid, maybe it should be up to the strategy logic whether swapping them out is permitted. For the periodic strategy, I could see an argument for not allowing it if the next window is in e.g. 10 minutes.
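The decision described in this comment can be sketched roughly as follows. This is an illustration of the proposal only, not Zincati code: `Release`, `Action`, `decide`, and `min_swap_margin` are hypothetical names, and the 10-minute default simply echoes the example figure above.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum


class Action(Enum):
    KEEP_STAGED = "keep the staged deployment"
    RESET = "discard the staged deployment and look for the next update"
    SWAP = "discard the staged deployment and stage the preferred release"


@dataclass
class Release:
    version: str
    deadend: bool = False  # hypothetical flag: graph metadata marks this node a deadend


def decide(staged, preferred, time_to_window, min_swap_margin=timedelta(minutes=10)):
    """Decide what to do with a staged deployment after re-checking the graph.

    staged: the Release currently staged.
    preferred: the Release the graph now prefers, or None if unknown.
    time_to_window: timedelta until the next reboot window opens.
    """
    # Worst case: the staged release became a deadend, so never finalize it.
    if staged.deadend:
        return Action.RESET
    # Nothing newer is preferred: the staged deployment is still the target.
    if preferred is None or preferred.version == staged.version:
        return Action.KEEP_STAGED
    # Strategy-specific rule: too close to the window, keep what is ready.
    if time_to_window < min_swap_margin:
        return Action.KEEP_STAGED
    return Action.SWAP


staged = Release("37.20230107.2.0")
preferred = Release("37.20230110.2.0")
print(decide(staged, preferred, timedelta(days=2)))     # Action.SWAP
print(decide(staged, preferred, timedelta(minutes=5)))  # Action.KEEP_STAGED
```

Keeping the margin check separate from the deadend check reflects the distinction drawn above: a deadend always blocks finalization, while swapping a still-valid staged release is left to the strategy.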

cgwalters added area/updates kind/enhancement triaged labels May 31, 2023

cgwalters mentioned this issue Aug 24, 2023

Unable to trigger a manual update with rpm-ostree when a zincati update strategy is active #1072

Closed

dustymabe mentioned this issue Nov 8, 2023

Zincati fails to update nodes: Too many open files coreos/fedora-coreos-tracker#1608

Closed
