Setup a SLURM cluster in the GitHub CI for integration tests [MT-34] (#…
…84)

* Test out GH Action to setup a fake SLURM cluster

Signed-off-by: Fabrice Normandin <[email protected]>

* Change the scope to also run on PRs

Signed-off-by: Fabrice Normandin <[email protected]>

* Change the command to use `srun (...) hostname`

Signed-off-by: Fabrice Normandin <[email protected]>

* Test out running tests that call srun over ssh

Signed-off-by: Fabrice Normandin <[email protected]>

* Use `poetry run pytest` instead of `pytest`

Signed-off-by: Fabrice Normandin <[email protected]>

* Try to test the `ensure_allocation` method

Signed-off-by: Fabrice Normandin <[email protected]>

* Simplify to avoid hanging on test setup

Signed-off-by: Fabrice Normandin <[email protected]>

* Skip making a Connection (hopefully fixes hang)

Signed-off-by: Fabrice Normandin <[email protected]>

* Try using a custom version of setup-slurm action

Signed-off-by: Fabrice Normandin <[email protected]>

* Rename custom action file

Signed-off-by: Fabrice Normandin <[email protected]>

* Try to fix the path to the custom action file

Signed-off-by: Fabrice Normandin <[email protected]>

* Fix role number in custom action file

Signed-off-by: Fabrice Normandin <[email protected]>

* Only mark one partition with Default: YES

Signed-off-by: Fabrice Normandin <[email protected]>

* Only have `localhost` as a node

Signed-off-by: Fabrice Normandin <[email protected]>

* Re-simplify test to check that slurm works

Signed-off-by: Fabrice Normandin <[email protected]>

* Put the slurm playbook in a file

Signed-off-by: Fabrice Normandin <[email protected]>

* Add main and unkillable partitions

Signed-off-by: Fabrice Normandin <[email protected]>

* Trying to add tests using the local SLURM cluster

Signed-off-by: Fabrice Normandin <[email protected]>

* Add `in_stream=False` to `run` and `simple_run`

Signed-off-by: Fabrice Normandin <[email protected]>

* Simplify tests: greatly reduce need for -s flag

Signed-off-by: Fabrice Normandin <[email protected]>

* `SlurmRemote.ensure_allocation` test works on Mila

Signed-off-by: Fabrice Normandin <[email protected]>

* Try to make tests timeout instead of hang in CI

Signed-off-by: Fabrice Normandin <[email protected]>

* Make slurm tests the integration tests in build

Signed-off-by: Fabrice Normandin <[email protected]>

* Skip some tests for now to debug the CI issues

Signed-off-by: Fabrice Normandin <[email protected]>

* Only run integration tests with slurm on linux :(

Signed-off-by: Fabrice Normandin <[email protected]>

* Debugging hanging integration test

Signed-off-by: Fabrice Normandin <[email protected]>

* Test if hanging test is due to nested sallocs

Signed-off-by: Fabrice Normandin <[email protected]>

* Skip tests that use salloc/sbatch in GitHub CI :(

Signed-off-by: Fabrice Normandin <[email protected]>

* Minor typing/docstring improvements to Remote class

Signed-off-by: Fabrice Normandin <[email protected]>

* Add some tests for SlurmRemote.run and such

Signed-off-by: Fabrice Normandin <[email protected]>

* Don't actually extract jobid from salloc for now

Signed-off-by: Fabrice Normandin <[email protected]>

* Add sleeps so sacct can update to show recent jobs

Signed-off-by: Fabrice Normandin <[email protected]>

* Mark tests that cause a hang in GitHub CI

Signed-off-by: Fabrice Normandin <[email protected]>

* Add timeout of 3 minutes to integration tests step

Signed-off-by: Fabrice Normandin <[email protected]>

* Remove check that fails in GitHub CI

Signed-off-by: Fabrice Normandin <[email protected]>

* Update tests/cli/test_slurm_remote.py

Co-authored-by: satyaog <[email protected]>

---------

Signed-off-by: Fabrice Normandin <[email protected]>
Signed-off-by: Fabrice Normandin <[email protected]>
Co-authored-by: satyaog <[email protected]>
lebrice and satyaog authored Jan 23, 2024
1 parent 4adedbc commit aa953ab
Showing 8 changed files with 674 additions and 38 deletions.
54 changes: 54 additions & 0 deletions .github/custom_setup_slurm_action/action.yml
@@ -0,0 +1,54 @@
name: "setup-slurm-action"
description: "Setup slurm cluster on GitHub Actions using https://github.com/galaxyproject/ansible-slurm"
branding:
  icon: arrow-down-circle
  color: blue
runs:
  using: "composite"
  steps:
    # prior to slurm-setup we need the podmand-correct command
    # see https://github.com/containers/podman/issues/13338
    - name: Download slurm ansible roles
      shell: bash -e {0}
      # ansible-galaxy role install https://github.com/galaxyproject/ansible-slurm/archive/1.0.1.tar.gz
      run: |
        ansible-galaxy role install https://github.com/mila-iqia/ansible-slurm/archive/1.1.2.tar.gz
    - name: Apt prerequisites
      shell: bash -e {0}
      run: |
        sudo apt-get update
        sudo apt-get install retry
    - name: Set XDG_RUNTIME_DIR
      shell: bash -e {0}
      run: |
        mkdir -p /tmp/1002-runtime # work around podman issue (https://github.com/containers/podman/issues/13338)
        echo XDG_RUNTIME_DIR=/tmp/1002-runtime >> $GITHUB_ENV
    - name: Setup slurm
      shell: bash -e {0}
      run: |
        ansible-playbook ./.github/custom_setup_slurm_action/slurm-playbook.yml || (journalctl -xe && exit 1)
    - name: Add Slurm Account
      shell: bash -e {0}
      run: |
        sudo retry --until=success -- sacctmgr -i create account "Name=runner"
        sudo sacctmgr -i create user "Name=runner" "Account=runner"
    - name: Test srun submission
      shell: bash -e {0}
      run: |
        srun -vvvv echo "hello world"
        sudo cat /var/log/slurm/slurmd.log
    - name: Show partition info
      shell: bash -e {0}
      run: |
        scontrol show partition
    - name: Test sbatch submission
      shell: bash -e {0}
      run: |
        sbatch -vvvv -N 1 --mem 5 --wrap "echo 'hello world'"
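The "Add Slurm Account" step above wraps the first `sacctmgr` call in `retry --until=success`, presumably because the accounting daemon may not be accepting requests right after the playbook finishes. The following is a minimal Python sketch of that retry-until-success idea. The helper names (`command_succeeds`, `retry_until_success`) and the attempt count and delay are assumptions made for illustration; they do not reproduce the `retry` tool's own defaults or options.

```python
import subprocess
import time
from typing import Callable, Sequence


def command_succeeds(argv: Sequence[str]) -> bool:
    """Run a command and report whether it exited with status 0."""
    return subprocess.run(list(argv), check=False).returncode == 0


def retry_until_success(
    attempt: Callable[[], bool],
    max_attempts: int = 10,
    delay_seconds: float = 1.0,
) -> int:
    """Call `attempt` until it returns True and return how many calls were made.

    Raises RuntimeError if every attempt fails (hypothetical policy: the
    real `retry` tool may behave differently when it gives up).
    """
    for attempt_number in range(1, max_attempts + 1):
        if attempt():
            return attempt_number
        if attempt_number < max_attempts:
            # Wait before trying again, but not after the final failure.
            time.sleep(delay_seconds)
    raise RuntimeError(f"no success after {max_attempts} attempts")
```

A call mirroring the step would look like `retry_until_success(lambda: command_succeeds(["sudo", "sacctmgr", "-i", "create", "account", "Name=runner"]))`. Bounding the number of attempts keeps a misconfigured cluster from stalling the job indefinitely.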
74 changes: 74 additions & 0 deletions .github/custom_setup_slurm_action/slurm-playbook.yml
@@ -0,0 +1,74 @@
- name: Slurm all in One
  hosts: localhost
  roles:
    - role: 1.1.2
      become: true
  vars:
    slurm_upgrade: true
    slurm_roles: ["controller", "exec", "dbd"]
    slurm_config_dir: /etc/slurm
    slurm_config:
      ClusterName: cluster
      SlurmctldLogFile: /var/log/slurm/slurmctld.log
      SlurmctldPidFile: /run/slurmctld.pid
      SlurmdLogFile: /var/log/slurm/slurmd.log
      SlurmdPidFile: /run/slurmd.pid
      SlurmdSpoolDir: /tmp/slurmd # the default /var/lib/slurm/slurmd does not work because of noexec mounting in github actions
      StateSaveLocation: /var/lib/slurm/slurmctld
      AccountingStorageType: accounting_storage/slurmdbd
      SelectType: select/cons_res
    slurmdbd_config:
      StorageType: accounting_storage/mysql
      PidFile: /run/slurmdbd.pid
      LogFile: /var/log/slurm/slurmdbd.log
      StoragePass: root
      StorageUser: root
      StorageHost: 127.0.0.1 # see https://stackoverflow.com/questions/58222386/github-actions-using-mysql-service-throws-access-denied-for-user-rootlocalh
      StoragePort: 8888
      DbdHost: localhost
    slurm_create_user: yes
    #slurm_munge_key: "../../../munge.key"
    slurm_nodes:
      - name: localhost
        State: UNKNOWN
        Sockets: 1
        CoresPerSocket: 2
        RealMemory: 2000
      # - name: cn-a[001-011]
      #   NodeAddr: localhost
      #   Gres: gpu:rtx8000:8
      #   CPUs: 40
      #   Boards: 1
      #   SocketsPerBoard: 2
      #   CoresPerSocket: 20
      #   ThreadsPerCore: 1
      #   RealMemory: 386618
      #   TmpDisk: 3600000
      #   State: UNKNOWN
      #   Feature: x86_64,turing,48gb
      # - name: "cn-c[001-010]"
      #   CoresPerSocket: 18
      #   Gres: "gpu:rtx8000:8"
      #   Sockets: 2
      #   ThreadsPerCore: 2
    slurm_partitions:
      - name: long
        Default: YES
        MaxTime: UNLIMITED
        Nodes: "localhost"
      - name: main
        Default: NO
        MaxTime: UNLIMITED
        Nodes: "localhost"
      - name: unkillable
        Default: NO
        MaxTime: UNLIMITED
        Nodes: "localhost"
    slurm_user:
      comment: "Slurm Workload Manager"
      gid: 1002
      group: slurm
      home: "/var/lib/slurm"
      name: slurm
      shell: "/bin/bash"
      uid: 1002
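To make the node and partition variables above more concrete, here is a rough Python sketch of how such entries map onto `NodeName=...` and `PartitionName=...` lines in the `slurm.conf` syntax. This is not the Ansible role's actual template, and the file the role generates may differ in ordering and content; `render_entries` is a hypothetical helper, and the values are copied from the playbook as plain strings.

```python
from typing import List, Mapping, Sequence


def render_entries(keyword: str, entries: Sequence[Mapping[str, object]]) -> List[str]:
    """Render each entry as `<keyword>=<name> Key=Value ...` on a single line."""
    lines = []
    for entry in entries:
        options = " ".join(
            f"{option}={value}" for option, value in entry.items() if option != "name"
        )
        lines.append(f"{keyword}={entry['name']} {options}".rstrip())
    return lines


# Values copied from the playbook above.
slurm_nodes = [
    {"name": "localhost", "State": "UNKNOWN", "Sockets": 1, "CoresPerSocket": 2, "RealMemory": 2000},
]
slurm_partitions = [
    {"name": "long", "Default": "YES", "MaxTime": "UNLIMITED", "Nodes": "localhost"},
    {"name": "main", "Default": "NO", "MaxTime": "UNLIMITED", "Nodes": "localhost"},
    {"name": "unkillable", "Default": "NO", "MaxTime": "UNLIMITED", "Nodes": "localhost"},
]

node_lines = render_entries("NodeName", slurm_nodes)
partition_lines = render_entries("PartitionName", slurm_partitions)
print("\n".join(node_lines + partition_lines))
```

All three partitions point at the single `localhost` node, and only `long` carries `Default=YES`, matching the "Only mark one partition with Default: YES" and "Only have `localhost` as a node" commits.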
82 changes: 78 additions & 4 deletions .github/workflows/build.yml
@@ -3,8 +3,8 @@ name: Python package
 on: [push, pull_request]

 jobs:
-  pre-commit:
-    name: Run pre-commit checks
+  linting:
+    name: Run linting/pre-commit checks
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v4
@@ -16,8 +16,8 @@ jobs:
       - run: pre-commit install
       - run: pre-commit run --all-files

-  test:
-    needs: [pre-commit]
+  unit-tests:
+    needs: [linting]
     runs-on: ${{ matrix.platform }}
     strategy:
       max-parallel: 4
@@ -70,3 +70,77 @@ jobs:
           env_vars: PLATFORM,PYTHON
           name: codecov-umbrella
           fail_ci_if_error: false
+
+  integration-tests:
+    name: integration tests
+    needs: [unit-tests]
+    runs-on: ${{ matrix.platform }}
+
+    strategy:
+      max-parallel: 5
+      matrix:
+        # TODO: We should ideally also run this with Windows/Mac clients and a Linux
+        # server. Unsure how to set that up with GitHub Actions though.
+        platform: [ubuntu-latest]
+        python-version: ['3.7', '3.8', '3.9', '3.10', '3.11']
+
+    # For the action to work, you have to supply a mysql
+    # service as defined below.
+    services:
+      mysql:
+        image: mysql:8.0
+        env:
+          MYSQL_ROOT_PASSWORD: root
+        ports:
+          - "8888:3306"
+        options: --health-cmd="mysqladmin ping" --health-interval=10s --health-timeout=5s --health-retries=3
+
+    steps:
+      - uses: actions/checkout@v3
+
+      # NOTE: Replacing this with our customized version of
+      # - uses: koesterlab/setup-slurm-action@v1
+      - uses: ./.github/custom_setup_slurm_action
+
+      - name: Test if the slurm cluster is setup correctly
+        run: srun --nodes=1 --ntasks=1 --cpus-per-task=1 --mem=1G --time=00:01:00 hostname
+
+      - name: Setup passwordless SSH access to localhost for tests
+        # Adapted from https://stackoverflow.com/a/60367309/6388696
+        run: |
+          ssh-keygen -t ed25519 -f ~/.ssh/testkey -N ''
+          cat > ~/.ssh/config <<EOF
+          Host localhost
+            User $USER
+            HostName 127.0.0.1
+            IdentityFile ~/.ssh/testkey
+          EOF
+          echo -n 'from="127.0.0.1" ' | cat - ~/.ssh/testkey.pub > ~/.ssh/authorized_keys
+          chmod og-rw ~
+          ssh -o 'StrictHostKeyChecking no' localhost id
+
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v3
+        with:
+          python-version: ${{ matrix.python-version }}
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install poetry
+          poetry install --with=dev
+
+      - name: Launch integration tests
+        run: poetry run pytest tests/cli/test_slurm_remote.py --cov=milatools --cov-report=xml --cov-append -s -vvv --log-level=DEBUG
+        timeout-minutes: 3
+        env:
+          SLURM_CLUSTER: localhost
+
+      - name: Upload coverage reports to Codecov
+        uses: codecov/codecov-action@v3
+        with:
+          file: ./coverage.xml
+          flags: integrationtests
+          env_vars: PLATFORM,PYTHON
+          name: codecov-umbrella
+          fail_ci_if_error: false
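The integration job above exports `SLURM_CLUSTER: localhost` and relies on passwordless SSH to localhost, and several commits mention making tests time out instead of hanging. The contents of `tests/cli/test_slurm_remote.py` are not shown in this diff, so the following is only a hedged sketch of how test code could use that environment variable and bound an `srun` call made over SSH. All three function names are hypothetical and are not taken from milatools.

```python
import os
import shlex
import subprocess
from typing import List, Mapping, Optional


def slurm_cluster_from_env(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the hostname in SLURM_CLUSTER, or None when it is unset or empty."""
    return env.get("SLURM_CLUSTER") or None


def srun_over_ssh(cluster: str, command: List[str]) -> List[str]:
    """Build the local `ssh` argv that runs `srun <command>` on the cluster."""
    remote = ["srun", "--nodes=1", "--ntasks=1", "--time=00:01:00", *command]
    # Quote each argument so the remote shell sees the same words.
    return ["ssh", cluster, " ".join(shlex.quote(arg) for arg in remote)]


def run_with_timeout(argv: List[str], timeout_seconds: float = 60.0) -> str:
    """Run argv and return its stdout.

    Raises subprocess.TimeoutExpired instead of blocking forever, and
    subprocess.CalledProcessError on a non-zero exit status.
    """
    result = subprocess.run(
        argv, capture_output=True, text=True, check=True, timeout=timeout_seconds
    )
    return result.stdout
```

A test could skip itself when `slurm_cluster_from_env()` returns `None` and otherwise call `run_with_timeout(srun_over_ssh(cluster, ["hostname"]))`. The per-call timeout complements the step-level `timeout-minutes: 3`, which only stops the job as a whole.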