-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Test multiple nodes and usermode #84
Changes from all commits
44057ce
0bd3cf5
5ebf53b
47c0f8a
e179206
d4e8832
d652c86
39f9e54
a9cc481
dab7e3a
9841733
b519e54
f5c0bf3
9ee7039
211f3ef
86f6731
1c6aa76
df0cd03
ad84d4c
92dec28
74a7880
47f70db
05943d4
a5d7a31
b610893
219cfa7
a4dac8a
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,40 @@ | ||
name: Test | ||
on: | ||
push: | ||
branches: [ main, dev ] | ||
pull_request: | ||
branches: [ main, dev ] | ||
jobs: | ||
test: | ||
name: User-level Omnistat | ||
runs-on: ubuntu-22.04 | ||
strategy: | ||
matrix: | ||
execution: [ source ] | ||
steps: | ||
- name: Check out repository code | ||
uses: actions/checkout@v4 | ||
- name: Comment out GPU devices (not available in GitHub) | ||
run: sed -i "/devices:/,+2 s/^/#/" test/docker/slurm/compose.yaml | ||
- name: Disable SMI collector (won't work in GitHub) | ||
run: > | ||
sed -i "s/enable_rocm_smi = True/enable_rocm_smi = False/" \ | ||
test/docker/slurm/omnistat-user.config | ||
- name: Set execution type | ||
run: export TEST_OMNISTAT_EXECUTION=${{ matrix.execution }} | ||
- name: Start containerized environment | ||
run: docker compose -f test/docker/slurm/compose.yaml -f test/docker/slurm/compose-user.yaml up -d | ||
- name: Wait for user-level Omnistat | ||
run: > | ||
timeout 5m bash -c \ | ||
'for i in controller node1 node2; do \ | ||
until [[ $(docker logs -n 1 slurm-$i) == READY ]]; do \ | ||
echo "Waiting for $i..."; \ | ||
sleep 5; \ | ||
done \ | ||
done' | ||
- name: Install test dependencies | ||
run: pip3 install -r test/requirements.txt | ||
- name: Run tests | ||
working-directory: ./test | ||
run: pytest -v test_job_user.py |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -7,4 +7,4 @@ data_prom | |
docker/prometheus-data | ||
build | ||
omnistat.egg-info/ | ||
|
||
test/slurm-job-user.sh |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -135,12 +135,17 @@ def startPromServer(self): | |
yaml.dump(prom_config, yaml_file, sort_keys=False) | ||
|
||
command = [ | ||
"numactl", | ||
"--physcpubind=%s" % ps_corebinding, | ||
ps_binary, | ||
"--config.file=%s" % "prometheus.yml", | ||
"--storage.tsdb.path=%s" % ps_datadir, | ||
] | ||
|
||
numactl = shutil.which("numactl") | ||
if numactl: | ||
command = ["numactl", f"--physcpubind={ps_corebinding}"] + command | ||
else: | ||
logging.info("Ignoring Prometheus corebinding; unable to find numactl") | ||
|
||
logging.debug("Server start command: %s" % command) | ||
utils.runBGProcess(command, outputFile=ps_logfile) | ||
else: | ||
|
@@ -267,8 +272,8 @@ def main(): | |
elif args.stopexporters: | ||
userUtils.stopExporters() | ||
elif args.start: | ||
userUtils.startExporters() | ||
userUtils.startPromServer() | ||
userUtils.startExporters() | ||
Comment on lines
274
to
+276
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @koomie Any objections to changing the order here? I've noticed Prometheus takes a few seconds to start scraping data (after the Prometheus server is up and running and accepting requests). We can add a more complex wait after Prometheus to make sure it's scraping data, but I thought we can overlap the initialization of Prometheus and Omnistat to minimize waiting time. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not sure about this. From a purist point of view, I feel like we should start the data collectors before trying to ping them with prometheus. |
||
elif args.stop: | ||
userUtils.stopPromServer() | ||
userUtils.stopExporters() | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
import shutil | ||
|
||
# Variable used to skip tests that depend on a ROCm installation; assume | ||
# ROCm is installed if we can find `rocminfo' in the host. | ||
rocm_host = True if shutil.which("rocminfo") else False | ||
|
||
# List of available nodes in the test environment. | ||
nodes = ["node1", "node2"] | ||
|
||
# Prometheus URL and query configuration. | ||
prometheus_url = "http://localhost:9090/" | ||
time_range = "30m" | ||
|
||
# Omnistat monitor port; same port is used for system and user tests. | ||
port = "8000" | ||
|
||
# Path to prometheus data for user-level executions; needs to match datadir | ||
# as defined in docker/slurm/omnistat-user.config. | ||
prometheus_data_user = "/jobs/prometheus-data" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
services: | ||
node1: | ||
command: node-user | ||
|
||
node2: | ||
command: node-user | ||
|
||
controller: | ||
command: controller-user |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,43 +1,63 @@ | ||
# Use compose extension fields to keep the common definition of services. | ||
# Meant to be used to instantiate multiple node containers with pre-defined | ||
# hostnames. Deploy replicas are avoided due to issues with container name | ||
# resolution and SLURM. | ||
x-node: &node | ||
build: | ||
context: ../../../ | ||
dockerfile: test/docker/slurm/Dockerfile | ||
image: slurm | ||
command: node-system | ||
volumes: | ||
- jobs_dir:/jobs | ||
- ssh_dir:/root/.ssh | ||
- ../../../:/host-source | ||
expose: | ||
- 6818 | ||
- 8000 | ||
devices: | ||
- /dev/kfd | ||
- /dev/dri | ||
security_opt: | ||
- seccomp=unconfined | ||
depends_on: | ||
- controller | ||
links: | ||
- controller | ||
environment: | ||
- TEST_OMNISTAT_EXECUTION | ||
|
||
x-controller: &controller | ||
build: | ||
context: ../../../ | ||
dockerfile: test/docker/slurm/Dockerfile | ||
image: slurm | ||
command: controller-system | ||
volumes: | ||
- jobs_dir:/jobs | ||
- ssh_dir:/root/.ssh | ||
- ../../../:/host-source | ||
expose: | ||
- 6817 | ||
ports: | ||
- 9090:9090 | ||
|
||
services: | ||
node1: | ||
<<: *node | ||
hostname: node1 | ||
container_name: slurm-node1 | ||
|
||
node2: | ||
<<: *node | ||
hostname: node2 | ||
container_name: slurm-node2 | ||
|
||
controller: | ||
build: | ||
context: ../../../ | ||
dockerfile: test/docker/slurm/Dockerfile | ||
command: controller | ||
image: slurm | ||
<<: *controller | ||
hostname: controller | ||
volumes: | ||
- jobs_dir:/jobs | ||
- ../../../:/host-source | ||
expose: | ||
- 6817 | ||
ports: | ||
- 9090:9090 | ||
|
||
node: | ||
build: | ||
context: ../../../ | ||
dockerfile: test/docker/slurm/Dockerfile | ||
command: node | ||
image: slurm | ||
hostname: node | ||
volumes: | ||
- jobs_dir:/jobs | ||
- ../../../:/host-source | ||
expose: | ||
- 6818 | ||
- 8000 | ||
devices: | ||
- /dev/kfd | ||
- /dev/dri | ||
security_opt: | ||
- seccomp=unconfined | ||
depends_on: | ||
- controller | ||
links: | ||
- controller | ||
environment: | ||
- TEST_OMNISTAT_EXECUTION | ||
container_name: slurm-controller | ||
|
||
volumes: | ||
jobs_dir: | ||
ssh_dir: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the only other change outside of the testing environment: making numactl optional.