Containers bundle a tool together with its software dependencies. This makes analyses more reproducible and tool installation much easier, and it means the same workflow or analysis can be run across different compute infrastructures in a portable manner. For example, this workflow can be run on HPC (using JAX's Sumner HPC) or in the cloud (using Lifebit's CloudOS platform with AWS and Google Cloud). This is possible thanks to containers and to the workflow manager Nextflow, which has built-in support for container engines such as Docker and Singularity.
- An introduction to containers (as well as Nextflow & CloudOS) can be found here
- Instructions on installing Docker and Singularity can be found here
⚠️ You will need root permissions to install Docker ⚠️
- You can also read this guide (4 min read) for a higher-level overview of Docker containers.
If you're still confused after reading it, don't hesitate to ping @PhilPalmer or @adeslatt.
Ideally, containers should be:
- Docker containers
  - This is because Nextflow can convert Docker containers to Singularity containers, but not the other way around
- Hosted in a Google Container Registry (gcr)
  - This is because, when running on Google Cloud (eg on CloudOS), containers that are not hosted there take too long to fetch and cause the pipeline to fail
Doing both of these things makes containers as portable as possible. However, each has a drawback: Docker requires root or admin access (which is not available on Sumner), and you must have access to a gcr to be able to push containers there.
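Because Docker itself cannot be used on Sumner, one way to try out a Docker image there is to pull it with Singularity, which accepts Docker image references through the `docker://` prefix. The sketch below is only an illustration: the helper names `docker_uri` and `pull_with_singularity` are made up for this example, and it assumes the `singularity` command is already on your `PATH` (on a cluster this may require loading a module first).

```shell
# Turn a Docker image reference into the docker:// URI that Singularity accepts.
docker_uri() {
  printf 'docker://%s\n' "$1"
}

# Pull a Docker image as a Singularity image into the current directory.
# Returns 1 without doing anything if the singularity CLI is not available.
pull_with_singularity() {
  if ! command -v singularity >/dev/null 2>&1; then
    echo "singularity not found on PATH" >&2
    return 1
  fi
  singularity pull "$(docker_uri "$1")"
}

# Example call (image reference taken from the build example on this page):
# pull_with_singularity gcr.io/nextflow-250616/splicing-pipelines-nf:gawk
```

When the pipeline is run through Nextflow you should not normally need to do this by hand; the helper is mainly useful for checking that an image can be fetched at all.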
If you need to modify one of the containers, eg to update or add more software dependencies, build the image from the directory containing its Dockerfile:
docker build -t <registry_user>/<image_name>:<tag> .
Eg:
cd containers/splicing-pipelines-nf/
docker build -t gcr.io/nextflow-250616/splicing-pipelines-nf:gawk .
Then push the image to a registry:
docker push <registry_user>/<image_name>:<tag>
To push to gcr, log in first:
gcloud auth login
docker push gcr.io/<project_id>/<image_name>:<tag>
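The build and push commands above can be wrapped in a small script, roughly as sketched below. This is not part of the pipeline: the function names and the `DRY_RUN` switch are hypothetical, introduced only for this example. It also assumes you have already authenticated, ie run `gcloud auth login` and, if Docker has not yet been set up to use your gcloud credentials, `gcloud auth configure-docker`.

```shell
# Compose a gcr image reference: gcr.io/<project_id>/<image_name>:<tag>
gcr_image_ref() {
  printf 'gcr.io/%s/%s:%s\n' "$1" "$2" "$3"
}

# Run a command, or only print it when DRY_RUN=1
# (DRY_RUN is a hypothetical switch used by this sketch only).
run_cmd() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Build the image from a directory containing a Dockerfile, then push it.
# Assumes gcloud authentication has been done beforehand.
build_and_push() {
  context_dir="$1"
  image_ref="$2"
  run_cmd docker build -t "$image_ref" "$context_dir" || return 1
  run_cmd docker push "$image_ref"
}

# Example call, printing the commands instead of running them:
# DRY_RUN=1 build_and_push containers/splicing-pipelines-nf \
#   "$(gcr_image_ref nextflow-250616 splicing-pipelines-nf gawk)"
```

Running with `DRY_RUN=1` first is a cheap way to check the image reference before anything is built or pushed.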
On Sumner, you may need to remove old Singularity images from the cache dir so that an updated or new image is used:
- The cache dir is a Nextflow feature. Images are saved there so that they do not need to be pulled on every execution, which would be really slow
- You should only need to clear the cache when the containers are updated
- We ended up setting the cache dir with cacheDir = "/projects/anczukow-lab/.singularity_cache/"
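Clearing a single image from the cache could look like the sketch below. It rests on an assumption worth checking against the actual contents of your cache dir: that Nextflow names cached images after the container reference, with `/` and `:` replaced by `-` and a `.img` suffix. The function names are hypothetical and only serve this example.

```shell
# File name a cached image is assumed to have:
# '/' and ':' in the image reference replaced by '-', plus a '.img' suffix.
cached_image_name() {
  printf '%s.img\n' "$(printf '%s' "$1" | sed 's|[/:]|-|g')"
}

# Remove one cached image so that the next run pulls it again.
# Other images in the cache dir are left untouched.
remove_cached_image() {
  cache_dir="${1%/}"   # drop a trailing slash, if any
  image_ref="$2"
  target="$cache_dir/$(cached_image_name "$image_ref")"
  if [ -e "$target" ]; then
    rm -f -- "$target"
    echo "removed $target"
  else
    echo "nothing cached at $target"
  fi
}

# Example call, using the cache dir and image reference from this page:
# remove_cached_image /projects/anczukow-lab/.singularity_cache/ \
#   gcr.io/nextflow-250616/splicing-pipelines-nf:gawk
```

Removing only the image that changed keeps the other cached images in place, so unrelated pipelines do not have to pull them again.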