
Test multiple nodes and usermode #84

Merged · 27 commits · Sep 9, 2024
44057ce
Enable SSH in compute nodes
jordap Aug 26, 2024
0bd3cf5
Install numactl
jordap Aug 26, 2024
5ebf53b
Add compute node to test environment
jordap Aug 27, 2024
47c0f8a
Wait until node1 is available
jordap Aug 27, 2024
e179206
Add usermode configuration file for testing
jordap Aug 27, 2024
d4e8832
Test execution in multiple nodes
jordap Aug 27, 2024
d652c86
Fix query to wait for Omnistat
jordap Aug 27, 2024
39f9e54
Tweak entrypoint to allow system and usermode
jordap Aug 27, 2024
a9cc481
Tweak usermode configuration
jordap Aug 28, 2024
dab7e3a
Define port
jordap Aug 28, 2024
9841733
Split test configuration to support multiple test files
jordap Aug 28, 2024
b519e54
Test usermode job execution
jordap Aug 28, 2024
f5c0bf3
Fix format
jordap Aug 28, 2024
9ee7039
Run only a subset of tests
jordap Aug 28, 2024
211f3ef
Test usermode in Github
jordap Aug 29, 2024
86f6731
Check Prometheus data after executing job
jordap Aug 29, 2024
1c6aa76
Install test dependencies
jordap Aug 29, 2024
df0cd03
Make numactl optional for Prometheus server
jordap Aug 29, 2024
ad84d4c
Start Prometheus before exporters
jordap Aug 29, 2024
92dec28
Shorter test job time
jordap Aug 29, 2024
74a7880
Consolidate test dependencies
jordap Aug 29, 2024
47f70db
Split job execution in system mode
jordap Aug 29, 2024
05943d4
Describe user-level test deployment and environment details
jordap Aug 29, 2024
a5d7a31
Tweak exposed data paragraph
jordap Aug 29, 2024
b610893
Set PrologFlags to match expected SLURM configuration
jordap Aug 29, 2024
219cfa7
Provide overview of how user-level tests work
jordap Aug 30, 2024
a4dac8a
Add more expected patterns in the log file
jordap Aug 30, 2024
40 changes: 40 additions & 0 deletions .github/workflows/test-user.yml
@@ -0,0 +1,40 @@
name: Test
on:
push:
branches: [ main, dev ]
pull_request:
branches: [ main, dev ]
jobs:
test:
name: User-level Omnistat
runs-on: ubuntu-22.04
strategy:
matrix:
execution: [ source ]
steps:
- name: Check out repository code
uses: actions/checkout@v4
- name: Comment out GPU devices (not available in GitHub)
run: sed -i "/devices:/,+2 s/^/#/" test/docker/slurm/compose.yaml
- name: Disable SMI collector (won't work in GitHub)
run: >
sed -i "s/enable_rocm_smi = True/enable_rocm_smi = False/" \
test/docker/slurm/omnistat-user.config
- name: Set execution type
run: export TEST_OMNISTAT_EXECUTION=${{ matrix.execution }}
- name: Start containerized environment
run: docker compose -f test/docker/slurm/compose.yaml -f test/docker/slurm/compose-user.yaml up -d
- name: Wait for user-level Omnistat
run: >
timeout 5m bash -c \
'for i in controller node1 node2; do \
until [[ $(docker logs -n 1 slurm-$i) == READY ]]; do \
echo "Waiting for $i..."; \
sleep 5; \
done \
done'
- name: Install test dependencies
run: pip3 install -r test/requirements.txt
- name: Run tests
working-directory: ./test
run: pytest -v test_job_user.py
15 changes: 7 additions & 8 deletions .github/workflows/test.yml
@@ -1,27 +1,25 @@
name: ubuntu/slurm
name: Test
on:
push:
branches: [ main, dev ]
pull_request:
branches: [ main, dev ]
jobs:
test:
name: Test Omnistat client
name: System-level Omnistat
runs-on: ubuntu-22.04
strategy:
matrix:
execution: [ source, package ]
steps:
- name: Check out repository code
uses: actions/checkout@v4
- name: Install pytest
run: sudo apt-get install -y python3-pytest
- name: Comment out GPU devices (not available in GitHub)
run: sed -i "/devices:/,+2 s/^/#/" test/docker/slurm/compose.yaml
- name: Disable SMI collector (won't work in GitHub)
run: >
sed -i "s/enable_rocm_smi = True/enable_rocm_smi = False/" \
test/docker/slurm/omnistat.slurm
test/docker/slurm/omnistat-system.config
- name: Set execution type
run: export TEST_OMNISTAT_EXECUTION=${{ matrix.execution }}
- name: Start containerized environment
@@ -36,11 +34,12 @@ jobs:
- name: Wait for Omnistat
run: >
timeout 15m bash -c \
'until [[ $(curl -s -g "localhost:9090/api/v1/series?match[]={instance=\"node:8000\"}" | jq ".data|length") != 0 ]]; do \
'until [[ $(curl -s -g "localhost:9090/api/v1/query?query=up{instance=\"node1:8000\"}>0" | jq ".data.result|length") != 0 ]]; do \
echo "Waiting for Omnistat..."; \
sleep 15; \
done'
- name: Install test dependencies
run: pip3 install prometheus_api_client
run: pip3 install -r test/requirements.txt
- name: Run tests
run: pytest-3 -v test
working-directory: ./test
run: pytest -v test_integration.py test_job_system.py
2 changes: 1 addition & 1 deletion .gitignore
@@ -7,4 +7,4 @@ data_prom
docker/prometheus-data
build
omnistat.egg-info/

test/slurm-job-user.sh
11 changes: 8 additions & 3 deletions omnistat/omni_util.py
@@ -135,12 +135,17 @@ def startPromServer(self):
yaml.dump(prom_config, yaml_file, sort_keys=False)

command = [
"numactl",
"--physcpubind=%s" % ps_corebinding,
ps_binary,
"--config.file=%s" % "prometheus.yml",
"--storage.tsdb.path=%s" % ps_datadir,
]

numactl = shutil.which("numactl")
if numactl:
command = ["numactl", f"--physcpubind={ps_corebinding}"] + command
else:
logging.info("Ignoring Prometheus corebinding; unable to find numactl")
Comment on lines +143 to +147 (Collaborator, Author): This is the only other change outside of the testing environment: making numactl optional.

logging.debug("Server start command: %s" % command)
utils.runBGProcess(command, outputFile=ps_logfile)
else:
@@ -267,8 +272,8 @@ def main():
elif args.stopexporters:
userUtils.stopExporters()
elif args.start:
userUtils.startExporters()
userUtils.startPromServer()
userUtils.startExporters()
Comment on lines 274 to +276 (Collaborator, Author): @koomie Any objections to changing the order here? I've noticed Prometheus takes a few seconds to start scraping data (after the Prometheus server is up and running and accepting requests). We can add a more complex wait after Prometheus to make sure it's scraping data, but I thought we can overlap the initialization of Prometheus and Omnistat to minimize waiting time.

Reply (Collaborator): Not sure about this. From a purist point of view, I feel like we should start the data collectors before trying to ping them with prometheus.

elif args.stop:
userUtils.stopPromServer()
userUtils.stopExporters()
72 changes: 65 additions & 7 deletions test/README.md
@@ -7,7 +7,7 @@ and installs the working copy of Omnistat in the container at run time. It is
meant to help make development easier, and enables testing without relying on
access to real clusters.

### Deploy
### Deploy System-level Omnistat

From the root directory of the project:

@@ -23,15 +23,73 @@ From the root directory of the project:
TEST_OMNISTAT_EXECUTION=package docker compose -f test/docker/slurm/compose.yaml up -d
```

2. Submit a SLURM job.
2. Run tests with `pytest`:
```
docker exec slurm-controller-1 bash -c "cd /jobs; sbatch --wrap='sleep 10'"
cd test
pytest test/test_integration.py test/test_job_system.py
```

3. Run tests with `pytest`, or check Prometheus data, which is exposed to the
host and can be accessed at [http://localhost:9090](http://localhost:9090).
3. Stop containers.
```
docker compose -f test/docker/slurm/compose.yaml down -v
```

### Deploy User-level Omnistat

User-level deployments are very similar and only require passing an additional
file to `docker compose`:

1. Start containers.
```
docker compose -f test/docker/slurm/compose.yaml -f test/docker/slurm/compose-user.yaml up -d
```

2. Run tests with `pytest`:
```
cd test
pytest test_job_user.py
```

4. Stop containers.
3. Stop containers.
```
docker compose -f test/docker/slurm/compose.yaml down
docker compose -f test/docker/slurm/compose.yaml -f test/docker/slurm/compose-user.yaml down -v
```

### Additional Information for Testing and Debugging

The test environment includes a controller node (`controller`) and two compute
nodes (`node1` and `node2`). These nodes are launched as different containers
with Docker Compose. All containers use the same base image and are configured
at run time by launching the container with different commands. Currently
supported commands include:
- `controller-system`
- `node-system`
- `controller-user`
- `node-user`

The main difference between `-system` and `-user` variants is that the latter
won't start the Prometheus server and the Omnistat monitor.

Inside of the container, these are the most relevant paths for testing:
- `/host-source`: Omnistat source in the host exposed to the containers.
- `/source`: Copy of the Omnistat source used for installation and/or
execution. When `TEST_OMNISTAT_EXECUTION` is set to `package`, this directory
will be removed after the installation completes.
- `/jobs`: Shared directory across all nodes. Executing jobs from this
directory is recommended to make sure job logs and data are easily accessible
from the controller.
- `/opt/omnistat`: Python virtual environment containing Omnistat dependencies.
When `TEST_OMNISTAT_EXECUTION` is set to `package`, this virtual environment
will also include Omnistat.

Jobs can be submitted from the controller:
```
docker exec slurm-controller bash -c "cd /jobs; sbatch --wrap='sleep 10'"
```

Compute nodes are reachable using SSH from any of the containers in the
network; the controller is not reachable using SSH.

In system-level deployments, Prometheus data is exposed to the host and can be
accessed at [http://localhost:9090](http://localhost:9090). The Omnistat monitor
is only exposed to the internal network.
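The same readiness check the CI workflow performs with `curl` can be reproduced from the host. A minimal sketch (the helper name is hypothetical; it assumes only the stock Prometheus HTTP API and the `localhost:9090` port mapping above) that builds the `up{instance=...}` instant query used in the workflow:

```python
from urllib.parse import urlencode

PROMETHEUS_URL = "http://localhost:9090"  # host-mapped port from compose.yaml


def up_query_url(node, port="8000"):
    """Build the Prometheus instant-query URL that checks whether the
    Omnistat exporter on `node` reports as up (mirrors the CI wait loop)."""
    query = 'up{instance="%s:%s"}>0' % (node, port)
    return PROMETHEUS_URL + "/api/v1/query?" + urlencode({"query": query})


# A poller would fetch this URL and wait until data.result is non-empty,
# e.g. requests.get(up_query_url("node1")).json()["data"]["result"]
```

Swapping the node name gives the equivalent check for `node2`.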
19 changes: 19 additions & 0 deletions test/config.py
@@ -0,0 +1,19 @@
import shutil

# Variable used to skip tests that depend on a ROCm installation; assume
# ROCm is installed if we can find `rocminfo' in the host.
rocm_host = True if shutil.which("rocminfo") else False

# List of available nodes in the test environment.
nodes = ["node1", "node2"]

# Prometheus URL and query configuration.
prometheus_url = "http://localhost:9090/"
time_range = "30m"

# Omnistat monitor port; same port is used for system and user tests.
port = "8000"

# Path to prometheus data for user-level executions; needs to match datadir
# as defined in docker/slurm/omnistat-user.config.
prometheus_data_user = "/jobs/prometheus-data"
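These settings are meant to be imported by the test modules. A hypothetical sketch of how the `rocm_host` flag could gate a GPU-dependent test (shown with the standard library's `unittest` skip marker to keep the sketch dependency-free; pytest, as used by the project, also collects unittest-style tests):

```python
import shutil
import unittest

# Same detection logic as test/config.py: assume ROCm is installed only
# when `rocminfo` can be found on the PATH.
rocm_host = shutil.which("rocminfo") is not None


class RocmDependentTests(unittest.TestCase):
    @unittest.skipUnless(rocm_host, "requires ROCm (rocminfo not found)")
    def test_rocm_metric_present(self):
        # Hypothetical body: a real test would query Prometheus for a
        # ROCm-specific metric here.
        self.assertTrue(rocm_host)
```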
3 changes: 2 additions & 1 deletion test/docker/slurm/Dockerfile
@@ -4,10 +4,11 @@ ARG DEBIAN_FRONTEND=noninteractive

RUN apt-get update -y && \
apt-get install -y --no-install-recommends \
bind9-dnsutils \
build-essential \
git \
bind9-dnsutils \
munge \
numactl \
openssh-server \
prometheus \
python3-venv \
9 changes: 9 additions & 0 deletions test/docker/slurm/compose-user.yaml
@@ -0,0 +1,9 @@
services:
node1:
command: node-user

node2:
command: node-user

controller:
command: controller-user
94 changes: 57 additions & 37 deletions test/docker/slurm/compose.yaml
@@ -1,43 +1,63 @@
# Use compose extension fields to keep the common definition of services.
# Meant to be used to instantiate multiple node containers with pre-defined
# hostnames. Deploy replicas are avoided due to issues with container name
# resolution and SLURM.
x-node: &node
build:
context: ../../../
dockerfile: test/docker/slurm/Dockerfile
image: slurm
command: node-system
volumes:
- jobs_dir:/jobs
- ssh_dir:/root/.ssh
- ../../../:/host-source
expose:
- 6818
- 8000
devices:
- /dev/kfd
- /dev/dri
security_opt:
- seccomp=unconfined
depends_on:
- controller
links:
- controller
environment:
- TEST_OMNISTAT_EXECUTION

x-controller: &controller
build:
context: ../../../
dockerfile: test/docker/slurm/Dockerfile
image: slurm
command: controller-system
volumes:
- jobs_dir:/jobs
- ssh_dir:/root/.ssh
- ../../../:/host-source
expose:
- 6817
ports:
- 9090:9090

services:
node1:
<<: *node
hostname: node1
container_name: slurm-node1

node2:
<<: *node
hostname: node2
container_name: slurm-node2

controller:
build:
context: ../../../
dockerfile: test/docker/slurm/Dockerfile
command: controller
image: slurm
<<: *controller
hostname: controller
volumes:
- jobs_dir:/jobs
- ../../../:/host-source
expose:
- 6817
ports:
- 9090:9090

node:
build:
context: ../../../
dockerfile: test/docker/slurm/Dockerfile
command: node
image: slurm
hostname: node
volumes:
- jobs_dir:/jobs
- ../../../:/host-source
expose:
- 6818
- 8000
devices:
- /dev/kfd
- /dev/dri
security_opt:
- seccomp=unconfined
depends_on:
- controller
links:
- controller
environment:
- TEST_OMNISTAT_EXECUTION
container_name: slurm-controller

volumes:
jobs_dir:
ssh_dir: