Test multiple nodes and usermode #84
Basic usermode testing is working. It doesn't cover all cases, but it can help us identify the most basic issues in usermode.
While working on testing I noticed some issues that I want to review. For example, Prometheus takes a surprisingly long time to start scraping data from Omnistat: after Prometheus and the Omnistat monitor are up and responding to requests, it takes approximately 5 seconds for Prometheus to start scraping values. I also think we need a way to (optionally) store the output of the exporters for easier debugging.
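One possible way to optionally keep exporter output for debugging is to redirect the child process's stdout and stderr to a log file. The sketch below is only an illustration: `start_with_log` and its `log_path` parameter are hypothetical names, not part of Omnistat, and it assumes the exporters are launched as subprocesses.

```python
import subprocess


def start_with_log(command, log_path=None):
    """Start `command`; if `log_path` is given, append its output there.

    Hypothetical helper, not Omnistat's API.
    """
    if log_path is None:
        # No log requested: discard the output.
        return subprocess.Popen(
            command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    log_file = open(log_path, "ab")
    try:
        # stderr is merged into stdout so both end up in the same file.
        return subprocess.Popen(
            command, stdout=log_file, stderr=subprocess.STDOUT
        )
    finally:
        # The child holds its own copy of the descriptor, so the parent
        # can close its handle right away.
        log_file.close()
```

Calling `start_with_log(cmd, "exporter.log")` would keep the output, while omitting the path keeps the default of discarding it.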
Prometheus can take some time to start scraping targets. Starting Prometheus first allows overlapping Prometheus and exporter initialization (instead of having an additional wait for Prometheus).
A lot of changes in this PR, but most of it is in the test environment: adding a second compute node and supporting usermode execution.
There is room for improvement (packaged usermode execution, testing different configuration options, etc.), but this initial version can already be useful to identify the most basic issues in usermode. We can continue improving the tests after this PR is merged.
  elif args.start:
-     userUtils.startExporters()
      userUtils.startPromServer()
+     userUtils.startExporters()
@koomie Any objections to changing the order here? I've noticed Prometheus takes a few seconds to start scraping data (after the Prometheus server is up and running and accepting requests). We can add a more complex wait after Prometheus to make sure it's scraping data, but I thought we can overlap the initialization of Prometheus and Omnistat to minimize waiting time.
Not sure about this. From a purist point of view, I feel like we should start the data collectors before trying to ping them with prometheus.
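For reference, the "more complex wait" mentioned above could be sketched by polling Prometheus' `/api/v1/targets` HTTP endpoint until every active target reports `health` as `up`. This is only an illustration: the function names, the default URL (Prometheus' standard port 9090), and the timeout values are assumptions, not what this PR implements.

```python
import json
import time
import urllib.request


def all_targets_up(payload):
    """Return True if the targets response lists at least one active
    target and all of them report health "up"."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return bool(targets) and all(t.get("health") == "up" for t in targets)


def wait_for_scraping(base_url="http://localhost:9090", timeout=30.0,
                      interval=0.5, fetch=None):
    """Poll Prometheus until all targets are healthy or `timeout` expires.

    `fetch` is an injectable function (hypothetical, for testing) that
    takes a URL and returns the decoded JSON response.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=5) as response:
                return json.load(response)

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if all_targets_up(fetch(f"{base_url}/api/v1/targets")):
                return True
        except (OSError, ValueError):
            # Server not reachable yet, or response not valid JSON.
            pass
        time.sleep(interval)
    return False
```

With a wait like this in place, the exporters could still be started first, at the cost of the extra polling time after Prometheus comes up.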
numactl = shutil.which("numactl")
if numactl:
    command = ["numactl", f"--physcpubind={ps_corebinding}"] + command
else:
    logging.info("Ignoring Prometheus corebinding; unable to find numactl")
This is the only other change outside of the testing environment: making numactl optional.
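The same pattern, pulled out into a standalone function so it can be exercised without `numactl` installed, might look like the sketch below. The function name and the injectable `which` parameter are hypothetical and only serve the illustration; the snippet above is the actual change.

```python
import logging
import shutil


def with_optional_corebinding(command, corebinding, which=shutil.which):
    """Prefix `command` with numactl core binding when numactl is available.

    Hypothetical helper mirroring the snippet above; `which` is injectable
    so the fallback path can be tested.
    """
    if corebinding is None:
        return list(command)
    if which("numactl"):
        return ["numactl", f"--physcpubind={corebinding}"] + list(command)
    # numactl is missing: run unbound instead of failing.
    logging.info("Ignoring Prometheus corebinding; unable to find numactl")
    return list(command)
```

Falling back to an unbound command keeps usermode runs working on systems, such as minimal test containers, where `numactl` is not installed.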
Improvements to the Docker Compose environment for testing.
Tasks: