Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Performance] Delay per-job of 0.87 seconds due from capturing EE metadata #11699

Closed
AlanCoding opened this issue Feb 8, 2022 · 8 comments
Closed

Comments

@AlanCoding
Copy link
Member

ISSUE TYPE
  • Feature Idea
SUMMARY

To reproduce:

podman run --tty --interactive --rm quay.io/ansible/awx-ee:latest /bin/bash
bash-4.4$ time entrypoint
Usage: /usr/local/bin/dumb-init [option] program [args]
Try /usr/local/bin/dumb-init --help for full usage.

real	0m0.058s
user	0m0.016s
sys	0m0.040s
bash-4.4$ time AWX_ISOLATED_DATA_DIR=/tmp/alan entrypoint
Usage: /usr/local/bin/dumb-init [option] program [args]
Try /usr/local/bin/dumb-init --help for full usage.

real	0m0.870s
user	0m0.697s
sys	0m0.135s

If you use the AWX_ISOLATED_DATA_DIR, then you trigger special bash scripts in the entrypoint:

https://github.com/ansible/ansible-runner/blob/devel/utils/entrypoint.sh

The time delay comes from two commands

ansible-galaxy collection list --format json 2> /dev/null
ansible --version 2> /dev/null | head -n 1

The point of these 2 commands is to save the collection list and the Ansible version. These are saved to the job record, but not presented in the API.

I argue that these would be better associated with execution environments, and as such, this metadata does not need to be collected each job run.

@AlanCoding
Copy link
Member Author

I hear that @nitzmahone may be effectively accomplishing this with the builder 3.0 refactor!

@nitzmahone
Copy link
Member

I'm really torn on how best to handle this- the entrypoint is almost certainly not the right place, since in almost all cases project-embedded collections wouldn't be included (which IIUC was part of the original design expectations). Further- even if you do tell it all the right places to look, ansible-galaxy collection list won't tell you which collections will actually be loaded at runtime when they're available in multiple locations (eg overridden by the project, config, user collection path, etc)- that's not what the underlying functionality was designed for.

We could preserve the existing behavior and get rid of the container startup penalty (at least temporarily) by having builder squirrel this away during the build, then have builder's default entrypoint script conditionally copy the memoized files into the job output instead of running core and Galaxy CLI at every container startup- that wouldn't fix any of the correctness issues, but would maintain the existing behavior.

I haven't yet heard what's actually being done with any of this stuff. IIUC at least for AWX,it's stored in the DB per job execution (not per EE definition/JT/whatever), so it seems like capturing the collections that actually loaded during the execution would be much more useful than "here's the collections we found in the EE, they might or might not actually be what got used by the job". That info could probably be captured by awx_display by subscribing to the collection loader's "on load" event...

@AlanCoding
Copy link
Member Author

by having builder squirrel this away during the build, then have builder's default entrypoint script conditionally copy the memoized files into the job output instead of running core and Galaxy CLI at every container startup- that wouldn't fix any of the correctness issues, but would maintain the existing behavior.

that did occur to me before. I feel fairly ambivalent about it, it could get the job done.

I haven't yet heard what's actually being done with any of this stuff.

It's not surfaced to AWX users in any way I know of. But we still have an obligation to collect it.

My thought (which I was hoping would be easier and faster) would be to produce this somewhere in the awx_display callback. Because in that case, we could import some of the same stuff used by the Galaxy CLI command, but doing so wouldn't incur the penalty of starting up a new process. Again, I feel indifferent about the exact means by which this happens, but I just want to avoid the current slow/spammy entrypoint commands. This would be a very substantial reduction in the time before the first event.

@nitzmahone
Copy link
Member

nitzmahone commented Jun 1, 2023

Just for giggles, I tried wiring up a collection load event handler on awx_display, but right now the core playbook parse occurs before the callbacks are loaded, so most of the collections have already been loaded by the time the callback is able to capture those events. That's probably not that hard to change (and there are lots of other reasons to change it), but at least at the moment with shipping core, that option won't fly. 😞

(plugin init timing aside though, the concept worked great- two lines of code in awx_display to get notified with the name and path of every collection that gets loaded)

@AlanCoding
Copy link
Member Author

I totally get your point about displaying the loaded collections, but I officially don't care either. What we currently get isn't much more than a few os.listdir calls for the expected collection paths and looking for certain identifying files. I 💯 want to see this performance improvement, and believe almost any mostly functioning replacement would work.

I will give a heads up to some people for performance testing, and also try to give a notice to anyone who might be receiving and processing this data to be braced for format/content changes.

@nitzmahone
Copy link
Member

Sounds like we're still struggling to come up with a valid answer to "who's actually using this info?". With that, my current inclination is to assume YAGNI and ignore it in builder unless/until there's actually a documented end-to-end use case for it that also acknowledges the shortcomings of collecting it during container startup.

@shanemcd
Copy link
Member

shanemcd commented Jun 5, 2023

@nitzmahone We discussed this in chat last week, but also capturing here for posterity:

main_unifiedjob.installed_collections,

I'm not saying we should keep the existing (broken) implementation, but just dropping it altogether doesn't seem like the right answer either.

@AlanCoding
Copy link
Member Author

Closing, as this was broken, leading to ansible/ansible-runner#1273 but no longer running this logic in the entrypoint.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants