Feature/slurm remote support #250
Draft
mhrtmnn wants to merge 30 commits into esa-tu-darmstadt:develop from mhrtmnn:feature/SlurmRemoteSupport
Conversation
Job scheduling logic shall be moved from the {Compose,HLS}-Task into the Slurm object. For this, some additional information is required.
The preamble copies all files required for the current job to the SLURM node; the postamble copies all generated artefacts back from the node.
The absolute paths to both scripts may be supplied via the keys "PreambleScript" and "PostambleScript" in the SLURM JSON config file.
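For illustration, the two entries might look like this in the config file (the key names are taken from this PR; the paths are made-up placeholders):

```json
{
  "PreambleScript": "/home/user/slurm/preamble.sh",
  "PostambleScript": "/home/user/slurm/postamble.sh"
}
```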
mhrtmnn force-pushed the feature/SlurmRemoteSupport branch from f1e7781 to b7c0a20 on December 14, 2020 12:37
Previously, a job would be broken into its tasks, and a new tapasco job would be created for each task. These jobs were then executed on the SLURM cluster. Refactor this so that the original job is executed on the SLURM cluster as-is, which simplifies the SLURM logic.
Since the SLURM cluster now processes whole jobs (instead of single tasks), the dependencies (preamble) and produced artefacts (postamble) of multiple platform/architecture pairs may need to be transferred.
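As a rough sketch of the idea, a preamble of this kind could pull the dependencies for each platform/architecture pair from the workstation onto the SLURM node. The directory layout, host name, and use of rsync below are assumptions for illustration only, not taken from this PR:

```bash
#!/bin/bash
# Hypothetical preamble: fetch job dependencies for every platform/architecture
# pair from the workstation into the local staging directory (all names made up).
set -e
WORKSTATION="workstation.example.org"
JOB_DIR="$1"   # staging directory for the current job

for pair in axi4mm/vc709 axi4mm/pynq; do
    mkdir -p "$JOB_DIR/$pair"
    rsync -a "$WORKSTATION:/srv/tapasco/jobs/$(basename "$JOB_DIR")/$pair/" "$JOB_DIR/$pair/"
done
```

A postamble would work the same way in the opposite direction, pushing the generated artefacts back to the workstation.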
For executing Tapasco in SLURM mode, Tapasco can be installed on the compute node, e.g. via a SLURM job script like the following:
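The original script is not reproduced in this excerpt; the following is only a rough sketch of what such an installation job script could look like. The resource settings, workspace path, and setup commands are assumptions and may differ from the actual TaPaSCo setup procedure:

```bash
#!/bin/bash
#SBATCH --job-name=tapasco-install   # illustrative resource settings
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=02:00:00

# Hypothetical installation steps; exact commands depend on the TaPaSCo version.
git clone https://github.com/esa-tu-darmstadt/tapasco.git "$HOME/tapasco"
cd "$HOME/tapasco"
./tapasco-init.sh           # create a workspace
source tapasco-setup.sh     # set up the environment
tapasco-build-toolflow      # build the toolflow (see note below)
```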
Note: Building toolflow via |
This pull request extends the SLURM support of tapasco such that remote compute nodes can be used for carrying out HLS and compose jobs.
The required architecture consists of three networked machines:

Host (front end): Runs a tapasco instance that takes the user CLI arguments and collects all files required for the selected job (e.g. kernel source files for HLS jobs or IP cores for compose jobs). These dependencies are copied over the network to a separate node referred to as the Workstation. The artefacts generated by a job (e.g. an IP core for HLS, a bitstream for compose) are copied back to the Host once the job finishes.

Workstation: In the simplest case, a network-attached storage. It is required because, in the general case, we cannot push files directly to the SLURM compute node. Instead, the files are deposited in a known directory on this node, and the SLURM compute node pulls them from there by itself.

SLURM node (back end): The login node for the compute node, with SLURM control tools such as sbatch and squeue installed. The compute node runs its own tapasco instance.

The above setup is configurable through a JSON config file. This PR contains an example file at toolflow/vivado/common/SLURM/ESA.json that describes an ESA-internal compute node. Different configurations can be selected via tapasco CLI options at the Host, for example --slurm ESA.
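The actual contents of ESA.json are not shown in this excerpt. Purely as an illustration of the idea, such a config file could contain entries along these lines; all key names except "PreambleScript" and "PostambleScript" (described above) are hypothetical, as are the host names and paths:

```json
{
  "Workstation": "workstation.example.org",
  "WorkstationDirectory": "/srv/tapasco/jobs",
  "SlurmLoginNode": "slurm-login.example.org",
  "PreambleScript": "/srv/tapasco/scripts/preamble.sh",
  "PostambleScript": "/srv/tapasco/scripts/postamble.sh"
}
```

On the Host, this configuration would then be selected with --slurm ESA, presumably the config file name without its extension.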