Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test multiple nodes and usermode #84

Merged
merged 27 commits into from
Sep 9, 2024
Merged

Test multiple nodes and usermode #84

merged 27 commits into from
Sep 9, 2024

Conversation

jordap
Copy link
Collaborator

@jordap jordap commented Aug 27, 2024

Improvements to the Docker Compose environment for testing.

Tasks:

  • Add a second compute node to test environment.
  • Enable SSH to compute nodes to allow more complex multi-node jobs.
  • Test multiple nodes in system mode.
  • Run integration test in usermode.

@jordap jordap force-pushed the jorda/usermode-test branch from be5e6df to d652c86 Compare August 27, 2024 21:35
@jordap jordap force-pushed the jorda/usermode-test branch from 200e168 to 211f3ef Compare August 29, 2024 04:21
@jordap
Copy link
Collaborator Author

jordap commented Aug 29, 2024

Basic usermode testing is working. It doesn't cover all cases, but it can help us identify if omnistat-usermode is not working. It doesn't cover omnistat-query yet because that won't work on Github, but we can potentially have a test for omnistat-query that only runs locally in machines with GPUs.

While working on testing I noticed some issues that I want to review. For example, Prometheus takes a surprisingly long time to start scraping data from Omnistat. After Prometheus and the Omnistat monitor are up and responding to requests, it takes approximately 5 seconds for Prometheus to start scraping values.

I also think we need a way to (optionally) store the output of the exporters for easier debugging.

jordap added 7 commits August 29, 2024 13:38
Prometheus can take some time to start scraping targets. Starting
Prometheus first allows overlapping Prometheus and exporter
initialization (instead of having an additional wait for Prometheus).
@jordap jordap requested a review from koomie August 29, 2024 22:11
@jordap jordap marked this pull request as ready for review August 29, 2024 22:12
Copy link
Collaborator Author

@jordap jordap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of changes in this PR, but most of it is in the test environment: adding a second compute node and supporting usermode execution.

There is room for improvement (packaged usermode execution, testing different configuration options, etc.), but this initial version can already be useful to identify the most basic issues in usermode. We can continue improving the tests after this PR is merged.

Comment on lines 274 to +276
elif args.start:
userUtils.startExporters()
userUtils.startPromServer()
userUtils.startExporters()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@koomie Any objections to changing the order here? I've noticed Prometheus takes a few seconds to start scraping data (after the Prometheus server is up and running and accepting requests). We can add a more complex wait after Prometheus to make sure it's scraping data, but I thought we can overlap the initialization of Prometheus and Omnistat to minimize waiting time.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about this. From a purist point of view, I feel like we should start the data collectors before trying to ping them with prometheus.

Comment on lines +143 to +147
numactl = shutil.which("numactl")
if numactl:
command = ["numactl", f"--physcpubind={ps_corebinding}"] + command
else:
logging.info("Ignoring Prometheus corebinding; unable to find numactl")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only other change outside of the testing environment: making numactl optional.

@koomie koomie merged commit 9cebce7 into main Sep 9, 2024
5 checks passed
@jordap jordap deleted the jorda/usermode-test branch September 10, 2024 22:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants