environment with the indispensable tools (e.g. Snakemake, R, RStudio),
and then other sub-environments, one for each part of the analysis.
Users are encouraged to use their best judgement when it comes to
determining how many parts a project has. But for example, one part
could be labelled 'alignment', and it would contain all the software
required for the alignment; another part could be called 'quality
control'; yet another could be 'statistical analysis', or 'plots'.
Snakemake makes it easy to use the appropriate conda environment for
each rule: check it out
[here](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html).

#### Extra: RStudio integration

* Install the `rstudio` package from conda for the project where you
need it.

* From a directory within the project, run the command `rstudio`. If you
are working on a remote server, make sure that you have connected with
`ssh -X ...` (the -X option enables X11 forwarding, so that the server
can display graphical applications like RStudio on the client).

* Once RStudio is up, create a project from an existing directory and
choose the root directory of your project. This makes sure that
RStudio knows where to find the libraries and the Rprofile.
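
A minimal sketch of the whole procedure, assuming a conda-based project
as described above (the server address and project path are
placeholders):
```
# install RStudio into the project's conda environment
conda install rstudio

# from your local machine, connect with X11 forwarding enabled
ssh -X user@server

# on the server, launch RStudio from within the project
cd /bioinfo/prj/project_name
rstudio
```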

### Using Docker for occam

occam, the supercomputer of the University of Turin, is the main
external cluster used by the bioinfoconda community. (If you use another
service, such as AWS, and want to contribute instructions for your
favorite platform, by all means open a PR or email the maintainer.)

Using occam is surprisingly difficult and many things are not explained
in the official documentation. Here we provide a step-by-step guide for
using occam the Bioinfoconda way. Before starting, however, please make
sure you have a basic understanding of Docker and occam by reading the
respective manuals and tutorials.

It is helpful to classify the files that are involved in the pipeline
into three categories: *input*, *output*, and *intermediate*. To the
*output* category belong all those files which we need but do not have
yet; ideally there should be a single Snakemake rule (or equivalent) to
generate all the output files at once. The *intermediate* files are
those that are automatically generated by Snakemake; we need not worry
about them. And the *input* files are the ancestors of all the files in
the analysis; typically they have been obtained by our wet-lab
colleagues, or downloaded from the internet. Note that *input* does not
refer to the input section of a single rule: it refers to the set of
all files that are not generated by any rule or script. They are the
files that go in the *local/data* directory of the project, in
*/bioinfo/data*, or even inside other projects.
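
For instance, using the example paths that appear in the occam
walkthrough below, the three categories could look like this (all paths
are illustrative):
```
# input: not generated by any rule of this pipeline
/bioinfo/data/ucsc/chromosomes/                  # shared reference data
/bioinfo/prj/project_name/local/data/annotation/ # project-specific raw data
/bioinfo/prj/otherproject/dataset/expression/    # data from another project

# intermediate: generated by Snakemake; we need not worry about these
/bioinfo/prj/project_name/dataset/alignments/

# output: the final target, built by a single rule
/bioinfo/prj/project_name/dataset/analysis/correlation.tsv
```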

Moreover, it is helpful to keep in mind the distinction between: your
local machine, where you write and run your pipelines on an everyday
basis; occam, the supercomputer; and the docker container, an isolated
environment that can run anywhere.

1. Write and test the pipeline on your local machine. Make sure that you
can obtain the result that you want by running just one command. For
example, if you use Snakemake, make sure that the dependencies are
resolved correctly and there is a rule to make everything; if you
don't use Snakemake or similar software, you could write a custom
script that will execute all the steps in the pipeline. The important
thing is that the target file and all its dependencies be created
with just one command, starting from a limited set of *input* files.
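
For example, with Snakemake you can check this with a dry run before
launching the real run (the target is the hypothetical example used
later in this section):
```
# dry run: print the jobs and their shell commands without executing them
snakemake -np dataset/analysis/correlation.tsv

# real run, using 24 cores
snakemake -j 24 dataset/analysis/correlation.tsv
```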

1. Export the environment by running `X-conda export`. (It is good
practice to do this every time you install something with conda, but
it is mandatory before building the docker image.)

1. There is a default *local/docker/Dockerfile* inside each project, but
you will need to do some editing (see the sketch below):
* the yml file path must be changed to match the one you exported in
step 2;
* if you manually created new environment variables, you have to add
them to the Dockerfile;
* you may have to incorporate into the Dockerfile any other manual
changes you made: always think about that.
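
As an illustration, the edited Dockerfile might contain lines like the
following; the base image, the yml file path, and the variable name are
all hypothetical and must be adapted to your project:
```
# hypothetical base image: your project's default Dockerfile may differ
FROM continuumio/miniconda3

# recreate the conda environment exported in step 2;
# adjust the path to the yml file you actually exported
COPY local/env/project_name.yml /tmp/environment.yml
RUN conda env create -f /tmp/environment.yml

# environment variables you created manually must be declared here too
ENV MY_VAR=/bioinfo/prj/project_name/local/share
```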

1. `cd` into the project directory and build the docker image by
running:
```
docker build -t "gitlab.c3s.unito.it:5000/user/project_name" \
    -f local/dockerfiles/Dockerfile .
```
Note that the last dot '.' is part of the command. Replace 'user'
with your username on occam (not on the local machine) and
'project\_name' with an arbitrary name (it helps if it is related to
your project's name, like for instance 'alignment\_v2').
Important: the content of the *dataset* directory is not part of the
docker image, therefore, if some of the *input* files are inside
*dataset*, you'll need to mount them as volumes; the master
Snakefile, in particular, should always be mounted.

1. Test the image locally. There are many things that can be tested, but
a basic thing would be to run
```
docker run -it "gitlab.c3s.unito.it:5000/user/project_name"
```
and explore the docker container to see if everything is where it
should be. You could also try mounting the volumes (see later steps)
and running the snakemake command to create your target, as sketched
below.
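
For instance, assuming that `snakemake` is on the image's PATH and that
the image runs its arguments as a command, a dry run with the *dataset*
directory mounted might look like this:
```
docker run -it \
    -v /bioinfo/prj/project_name/dataset:/bioinfo/prj/project_name/dataset \
    "gitlab.c3s.unito.it:5000/user/project_name" \
    snakemake -np target
```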

1. Create a repository on occam's GitLab (see [occam
HowTo](https://c3s.unito.it/index.php/super-computer/occam-howto)).
It should be called "project\_name" as in the previous step.

1. Push the image to occam registry by running the following two
commands:
```
docker login "gitlab.c3s.unito.it:5000"
docker push "gitlab.c3s.unito.it:5000/user/project_name"
```

1. Log in to the [occam website](https://c3s.unito.it/index.php) and use
the [Resource Booking System](https://c3s.unito.it/booked/Web/?) to
reserve a slot. You will need to choose which node(s) to reserve and
then estimate the time it will take to compute your pipeline using
the chosen resources. Tips:
* If you book 2 light nodes, you don't have 48 cores; you will have
to run 2 separate containers, each with 24 cores.
* The reservation can be updated (or deleted) only before its
official start.
* If you reserve two consecutive time slots for the same machine, the
container will not be interrupted.

1. Log in to occam (see the
[HowTo](https://c3s.unito.it/index.php/super-computer/occam-howto))
and `cd` into */scratch/home/user/*, then create a directory called
*project_name*, as sketched below (the exact name is not important,
but a recognisable one helps).
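
A sketch of this step; the login hostname is a placeholder, check the
HowTo for the real address:
```
# log in to occam (hostname is a placeholder)
ssh user@occam.c3s.unito.it

# create the working directory on the scratch filesystem
cd /scratch/home/user
mkdir project_name
```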

1. Now it's time for the hard part: docker volumes. You'll have to think
of all the *input* files that your target needs. Suppose that, on
your local machine, the target is
*/bioinfo/prj/project_name/dataset/analysis/correlation.tsv*, while
the *input* files are located under four different directories:
* */bioinfo/data/ucsc/chromosomes*,
* */bioinfo/prj/project_name/local/data/annotation*,
* */bioinfo/prj/otherproject/dataset/expression*,
* */bioinfo/prj/project_name/dataset/alignments*.

You may need the last directory if, for instance, you have already
run the alignment rules on your local machine, and now you just need
to compute the correlation, without re-doing the alignment. In this
situation, my recommendation is to proceed as follows.
From occam, `cd` into */scratch/home/user/project_name* and create
four directories:
* `mkdir chromosomes`,
* `mkdir annotation`,
* `mkdir expression`,
* `mkdir -p dataset/alignments`.

Then, use `rsync` (or your favorite equivalent tool) to copy the
files from your local machine to occam:
* `rsync -a user@localmachineIP:/bioinfo/data/ucsc/chromosomes/* chromosomes`
* `rsync -a user@localmachineIP:/bioinfo/prj/project_name/local/data/annotation/* annotation`
* `rsync -a user@localmachineIP:/bioinfo/prj/otherproject/dataset/expression/* expression`
* `rsync -a user@localmachineIP:/bioinfo/prj/project_name/dataset/alignments/* dataset/alignments`

Lastly, copy the master Snakefile (this has to be done even if you
don't mount any volume):
* `rsync -a user@localmachineIP:/bioinfo/prj/project_name/dataset/Snakefile dataset/`

1. Test the image on occam, mounting all the volumes. In a sense, this
step is the opposite of the previous one: there, we **copied** the
directories from the local machine to occam; here, we **mount** these
directories in the docker container, at the same paths that they have
on the local machine. I suggest that you `cd` into */archive/home/user*
and create a directory named after the title of the reservation that
you made through occam's resource booking system, appending the
current date. For instance, `mkdir projectX_correlation_2020-09-13`.
Then, `cd` into the new directory and run:
```
occam-run \
-v /scratch/home/user/project_name/chromosomes:/bioinfo/data/ucsc/chromosomes \
-v /scratch/home/user/project_name/annotation:/bioinfo/prj/project_name/local/data/annotation \
-v /scratch/home/user/project_name/expression:/bioinfo/prj/otherproject/dataset/expression \
-v /scratch/home/user/project_name/dataset:/bioinfo/prj/project_name/dataset \
-t user/project_name \
"snakemake -np target"
```
The above command starts the container and executes the command that
makes the target (here assuming that you use snakemake). Please note
that the quotes ("") are important. The container will run on occam's
node22, which is a management node reserved for testing purposes only,
so we cannot run the whole analysis on this node; that's why we run
snakemake with the `-n` (dry run) option. When the above command
exits, after a while you will find three files named something like
*node22-5250.log*, *node22-5250.err*, and *node22-5250.done*. Inspect
their content to see if everything is OK. In particular, the *.log*
file should contain the output of `snakemake -np target`. If something
is wrong, try to fix the problem and repeat.

1. When the time of your reservation comes, run the same command as
before from the same directory as before, but add the option `-n
nodeXX`. For instance, if you reserved node 17, write
```
occam-run \
-n node17 \
-v /scratch/home/user/project_name/chromosomes:/bioinfo/data/ucsc/chromosomes \
-v /scratch/home/user/project_name/annotation:/bioinfo/prj/project_name/local/data/annotation \
-v /scratch/home/user/project_name/expression:/bioinfo/prj/otherproject/dataset/expression \
-v /scratch/home/user/project_name/dataset:/bioinfo/prj/project_name/dataset \
-t user/project_name \
"snakemake -np target"
```

1. Remember that if you have booked multiple nodes, you will need to run
`occam-run` multiple times, once for each node. It is up to you to
split the targets and mount the volumes appropriately, as sketched
below. GOOD LUCK.
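
For instance, with two nodes you could give each container its own
target; the batch targets below are hypothetical, and the volume
options (abbreviated here to the *dataset* one) should be repeated as
in the previous steps:
```
# on node 17: build the first batch of targets
occam-run \
    -n node17 \
    -v /scratch/home/user/project_name/dataset:/bioinfo/prj/project_name/dataset \
    -t user/project_name \
    "snakemake -p correlation_batch1"

# on node 18: build the second batch, mounting the volumes the same way
occam-run \
    -n node18 \
    -v /scratch/home/user/project_name/dataset:/bioinfo/prj/project_name/dataset \
    -t user/project_name \
    "snakemake -p correlation_batch2"
```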

#### TODO


* Make the names coherent (bioinfo vs bioinfoconda...)

* Config file with default conda packages and default directories

* Document the templates
