fix: use throttle: 1 for localhost unarchive #498

travisdowns · 2024-12-20T16:38:34Z

During the common install tasks, we download the binary to install and unpack it on localhost, then eventually upload it to each host (among many other things). Currently the "unpack" step, which extracts the gzipped tar archive, is performed N times if there are N hosts in the inventory, but the target directory (something like /tmp/node_exporter-linux-arm64/1.8.2) is the same for every host with the same architecture.

This means that all the unarchive tasks extract in parallel, unsynchronized, into the same directory. It's a miracle that this works at all. Usually it seems to, but sometimes tar complains about a missing file and the install fails. I've confirmed locally that this is due to the race described above.

To fix this, we use throttle: 1 on the unpack task. The first task (for each architecture) then does the unpack, and the other tasks are effectively no-ops: they are reported as "ok" rather than "changed" in the output.
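
For illustration, a throttled unpack task might look roughly like the sketch below. This is not the actual diff from this PR: the task name and the `local_cache_path` and `archive_name` variables are hypothetical placeholders. `throttle`, `delegate_to`, and `ansible.builtin.unarchive` are standard Ansible.

```yaml
# Sketch only: the variable names are hypothetical, not taken from the role.
- name: Unpack the downloaded archive on the controller
  ansible.builtin.unarchive:
    src: "{{ local_cache_path }}/{{ archive_name }}"  # hypothetical variables
    dest: "{{ local_cache_path }}"
    remote_src: true        # the archive already lives on localhost
  delegate_to: localhost    # runs once per inventory host, always on localhost
  throttle: 1               # at most one host executes this task at a time
```

Since the per-host runs of the task then execute one at a time, only the first run for a given destination should extract anything. Later runs should find the directory already matching the archive and report "ok".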

Fixes #457.

travisdowns · 2024-12-20T16:49:03Z

I tested this locally with a file that is more likely to trigger the race: a tar archive of 10,000 small files, all at the top level. Without the fix this triggers the error 100% of the time. With the fix it passes 1,000 iterations locally with no failures.

I also verified that deploying prom to a 4-host inventory works: 100 iterations without failure. This inventory has 2 hosts for each of 2 different architectures (amd64 and arm64).

gardar

That's a clean and simple solution - I like it! Thanks!

roles/_common/tasks/install.yml

During the common install tasks, we download the binary to install and unpack it on localhost, then eventually upload it to each host (among many other things). Currently the "unpack" step, which extracts the gzipped tar archive, is performed N times if there are N hosts in the inventory, but the target directory (something like /tmp/node_exporter-linux-arm64/1.8.2) is the same for every host with the same architecture.

This means that all the unarchive tasks extract in parallel, unsynchronized, into the same directory. It's a miracle that this works at all. Usually it seems to, but sometimes tar complains about a missing file and the install fails. I've confirmed locally that this is due to the race described above.

To fix this, we use `throttle: 1` on the unpack task. The first task (for each architecture) then does the unpack, and the other tasks are effectively no-ops: they are reported as "ok" rather than "changed" in the output.

Fixes prometheus-community#457.

Co-authored-by: Ben Kochie <[email protected]>
Signed-off-by: Travis Downs <[email protected]>

travisdowns · 2025-01-08T21:30:47Z

The test failures look like this:

TASK [prometheus.prometheus._common : Get checksum list for fail2ban_exporter_0.10.2_linux_amd64.tar.gz] ***
fatal: [almalinux-9]: FAILED! => {"msg": "An unhandled exception occurred while running the lookup plugin 'url'. Error was a <class 'ansible.errors.AnsibleError'>, original message: Received HTTP error for https://gitlab.com/hectorjsmith/fail2ban-prometheus-exporter/-/releases/v0.10.2/downloads/fail2ban_exporter_0.10.2_checksums.txt : HTTP Error 308: Permanent Redirect"}

I believe this means the external resource has changed. It looks like the user changed their name and the correct link is:

https://gitlab.com/hctrdev/fail2ban-prometheus-exporter/-/releases/v0.10.2/downloads/fail2ban_exporter_0.10.2_checksums.txt

SuperQ · 2025-01-10T17:20:11Z

I think the 308 error is related to #507.

gardar · 2025-01-11T15:53:48Z

Please rebase now that the 308 error should be resolved

travisdowns force-pushed the td-fix-457 branch from 6582161 to f140430 Compare December 20, 2024 16:41

travisdowns changed the title ~~common/install: throttle 1 on unpack task~~ fix: use throttle: 1 for localhost unarchive Dec 20, 2024

travisdowns force-pushed the td-fix-457 branch from f140430 to d0e0f24 Compare December 20, 2024 16:45

github-actions bot removed the bugfix label Dec 20, 2024

github-actions bot added the bugfix label Dec 20, 2024

travisdowns mentioned this pull request Jan 2, 2025

Concurrent local extraction of binaries fails sometimes #457

Open

gardar approved these changes Jan 6, 2025

SuperQ reviewed Jan 6, 2025

github-actions bot added bugfix and removed bugfix labels Jan 6, 2025

travisdowns force-pushed the td-fix-457 branch from 8079f7f to be70705 Compare January 6, 2025 19:11

github-actions bot added bugfix and removed bugfix labels Jan 6, 2025
