fix: use throttle: 1 for localhost unarchive #498
base: main
Conversation
Force-pushed from 6582161 to f140430
Force-pushed from f140430 to d0e0f24
I tested this locally with a file that is more likely to trigger the race: a tar of 10,000 small files, all at the top level of the archive. This triggers the error 100% of the time without the fix, and passes 1,000 iterations locally with no failures with it. I also verified that deploying prometheus to a 4-host inventory works, 100 iterations without failure. This inventory has 2 hosts each for 2 different architectures (amd64 and arm64).
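To make the collision concrete, here is a minimal sketch of a playbook that sets up the same race (hypothetical code: the paths, archive name, and inventory are invented for illustration and are not the collection's actual tasks). Every host's copy of the unpack task is delegated to localhost and extracts into the same directory, so with several forks the extractions run concurrently:

```yaml
# race-repro.yml -- hypothetical reproduction, not code from this repo.
# Run with several forks against a multi-host inventory, e.g.:
#   ansible-playbook -i hosts -f 10 race-repro.yml
- hosts: all
  gather_facts: false
  tasks:
    - name: Ensure the shared destination exists on the controller
      ansible.builtin.file:
        path: /tmp/unpack-race
        state: directory
        mode: "0755"
      delegate_to: localhost
      run_once: true

    - name: Unpack the same archive once per host, all in parallel
      ansible.builtin.unarchive:
        src: /tmp/many-small-files.tar.gz   # e.g. a tar of 10,000 small files
        dest: /tmp/unpack-race              # same directory for every host
        remote_src: true                    # archive already sits on localhost
      delegate_to: localhost
```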
That's a clean and simple solution - I like it! Thanks!
During the common install tasks, we download and unpack the binary to install on localhost and eventually upload it to each host (among many other things). Currently the "unpack" step, which extracts the gzipped tar archive, is performed N times if there are N hosts in the inventory, but the target directory (something like /tmp/node_exporter-linux-arm64/1.8.2) is the same for every host with the same architecture.

This means that all the unarchive tasks are extracting in parallel, in an unsynchronized manner, to the same directory. It's a miracle that this works at all, and usually it does, but sometimes tar will complain about a missing file, failing the install. I've confirmed locally that this is due to the race described above.

To fix this, we use `throttle: 1` on the unpack task. This means that the first task (for each architecture) does the unpack and the other tasks are effectively no-ops; they are reported as "ok" rather than "changed" in the output.

Fixes prometheus-community#457.

Co-authored-by: Ben Kochie <[email protected]>
Signed-off-by: Travis Downs <[email protected]>
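As a sketch of what this looks like in task form (illustrative only: the paths and the `arch` variable are placeholders, not the collection's exact code), the unarchive still runs once per host, but `throttle: 1` limits the task to one concurrent worker:

```yaml
# Illustrative sketch: the paths and the `arch` variable are placeholders,
# not the collection's actual task.
- name: Unpack binary archive on the controller
  ansible.builtin.unarchive:
    src: "/tmp/node_exporter-linux-{{ arch }}-1.8.2.tar.gz"
    dest: "/tmp/node_exporter-linux-{{ arch }}/1.8.2"
    remote_src: true
  delegate_to: localhost
  # At most one copy of this task runs at a time across the play: the first
  # run per architecture extracts the files, and later runs find them
  # already in place and report "ok" rather than "changed".
  throttle: 1
```

Note that `throttle` is a task-level keyword available since Ansible 2.9. Unlike `run_once`, it still runs the task for every host, which is what lets each architecture get its own extraction.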
Force-pushed from 8079f7f to be70705
The test failures look like this:
I believe this means the external resource has changed. It looks like the user changed their name and the correct link is:
I think the 308 error is related to #507
Please rebase now that the 308 error should be resolved |