Infrastructure for a Toil cluster (http://toil.ucsc-cgl.org) in AWS.
Provisioning a new cluster in AWS is done through pull requests; using PRs provides history, gating, review, and an approval process.
A Toil cluster requires the following templates, some of which are interdependent (an example deployment is sketched after the list):
- toil-cluster-role-and-policy.yaml: creates the policies, role, and instance profile used by EC2 instances
- toil-infra-essentials.yaml: creates an AWS KMS key
- toil-instance.yaml: creates the jumpbox EC2 instance with Toil installed. This instance is configured to run a toil launch-cluster command as a SysV service. The cluster is launched on jumpbox startup and torn down when the jumpbox is stopped or terminated.
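Once a PR is approved and merged, the stacks are rolled out. A minimal sketch of launching them by hand, assuming they are deployed with Sceptre v2 and a config group such as prod (the actual stack paths in this repo may differ):
# Hypothetical stack paths; adjust to this repo's config layout
sceptre launch prod/toil-cluster-role-and-policy.yaml
sceptre launch prod/toil-infra-essentials.yaml
sceptre launch prod/toil-instance.yaml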
Contributions are welcome.
Requirements:
- Install the pre-commit application
- Clone this repo
- Run pre-commit install to install the git hook
As a pre-deployment step we syntactically validate our Sceptre and CloudFormation YAML files with pre-commit. Please install pre-commit. Once installed, the file validations will run automatically on every commit. Alternatively, you can execute the validations manually by running pre-commit run --all-files.
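Putting the requirements together, a minimal setup sketch, assuming pre-commit is installed from PyPI (any supported install method works):
pip install pre-commit           # install the pre-commit application
pre-commit install               # run from the repo root to install the git hook
pre-commit run --all-files       # optionally run every validation by hand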
In the AWS Console, find the jumpbox instance, and ssh to it using the PEM that was specified at stack creation. For example,
ssh -i "toil.pem" [email protected]
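If you prefer the CLI to the console, something like the following will locate the jumpbox's public DNS name; the Name tag value here is hypothetical and depends on what your stack named the instance:
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=toil-jumpbox" "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].PublicDnsName" \
  --output text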
Once you're on the jumpbox, use this command to reach the cluster's leader, passing the zone in which you created the jumpbox and leader along with the cluster name, for example, "rna-seq-reprocessing-toil-cluster-v001":
sudo toil ssh-cluster --zone us-east-1c rna-seq-reprocessing-toil-cluster-v001
The leader requires a little setup. Our typical workflow is to update, install a few utilities, check out the source code, and place the Synapse configuration file.
# Update and Install
apt-get update
apt-get install -y git vim
# Check out the project and switch to the relevant branch and location,
# for example:
mkdir ~/jobs && cd $_
git clone https://github.com/Sage-Bionetworks/amp-workflows.git
cd amp-workflows
git checkout wpoehlmdev
cd amp-rnaseq_reprocessing/amp-rnaseq_reprocess-workflow/
# Do any other necessary setup -- for example, add a .synapseConfig file
mkdir /etc/synapse
vi /etc/synapse/.synapseConfig
# Finally, run the job! In the example from the repository mentioned above,
# we have a job-runner script:
nohup ./run-toil.py jobs/test-main-paired &
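As an alternative to pasting the .synapseConfig into vi by hand, you can copy files onto the leader from the jumpbox with toil rsync-cluster before connecting; a sketch, assuming the config file already exists on the jumpbox and that the /etc/synapse directory has been created on the leader as above:
# Run this from the jumpbox, not the leader
sudo toil rsync-cluster --zone us-east-1c rna-seq-reprocessing-toil-cluster-v001 \
  ~/.synapseConfig :/etc/synapse/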
See the Toil documentation on running workflows generally and CWL in particular; also, see the Toil documentation on running a workflow in AWS.
The repository in the example above uses a custom script to start a job; see that repository's README for more information on how to use it.
There appears to be a Toil bug that arises if a job is interrupted early. If one restarts the job with the same jobstore, there is a warning about bucket versioning.
If you keep getting InsufficientInstanceCapacity errors, here is the workflow.
If you're not in a hurry, just use fewer nodes. But in the common case where
there is some sort of deadline, there are workarounds.
The easiest option can be to try another instance type, provided you can adapt the resource requirements specified in your workflow to fit it. AWS will tell you that the smartest thing to do is to use Spot Fleets for diversity; unfortunately, Toil does not support this. Toil does support listing multiple instance types, but its logic for matching a job to an instance type maps each job to a single type, so it will never try to fit the job onto, or request, an alternative type.
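For reference, the instance types a Toil workflow requests are controlled by its autoscaling flags; a sketch of switching the worker type, with illustrative values (exactly how these flags are passed depends on how run-toil.py invokes toil-cwl-runner):
# Request a different worker type after adapting the workflow's resource requirements
toil-cwl-runner --provisioner aws --batchSystem mesos \
  --nodeTypes m4.10xlarge --maxNodes 20 \
  --jobStore aws:us-east-1:example-jobstore \
  workflow.cwl job-settings.yml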
If you have the permissions, make a capacity reservation for what you need; there is a capacity reservation utility in this repository. If AWS rejects the reservation because capacity is insufficient, gradually dial down the number of nodes you are requesting until a request is fulfilled. If what you can get is acceptable, start the job with it. You can then periodically submit additional capacity reservations to reserve more. If you manage to get another, stop the job and restart it with the capacity from the original fulfilled request plus the additional capacity acquired.
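The capacity reservation utility in this repository presumably wraps calls like the following, shown here with the raw AWS CLI and illustrative values:
# Try to reserve 20 on-demand instances; dial --instance-count down if AWS
# rejects the request for insufficient capacity
aws ec2 create-capacity-reservation \
  --instance-type r4.8xlarge \
  --instance-platform Linux/UNIX \
  --availability-zone us-east-1c \
  --instance-count 20
# Check which reservations have been fulfilled
aws ec2 describe-capacity-reservations \
  --query "CapacityReservations[].[CapacityReservationId,InstanceType,TotalInstanceCount,State]" \
  --output table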