diff --git a/README.md b/README.md
index 5296695..246e534 100644
--- a/README.md
+++ b/README.md
@@ -441,9 +441,9 @@ environment with the indispensable
 tools (e.g. Snakemake, R, Rstudio), and then other sub-environments,
 one for each part of the analysis. Users are encouraged to use their
 best judgement when it comes to determining how many parts a project
 has. But for example, one part
-could be labelled 'alignment', and it would contain all the software
+could be labelled 'alignment,' and it would contain all the software
 required for the alignment; another part could be called 'quality
-control'; yet another could be 'statistical analysis', or 'plots'.
+control'; yet another could be 'statistical analysis,' or 'plots.'
 Snakemake makes it easy to use the appropriate conda environment for
 each rule: check it out [here](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html).
@@ -455,11 +455,203 @@ the project
 
 #### Extra: R Studio integration
 
-TODO
-
-### Using Docker
-
-TODO.
+* Install the rstudio package from conda for the project where you
+  need it.
+
+* From a directory within the project, run the command `rstudio`. If
+  you are working on a remote server, make sure that you are
+  connected with `ssh -X ...` (the `-X` option enables X11
+  forwarding, so that graphical applications like RStudio are
+  displayed on the client).
+
+* Once RStudio is up, create a project from an existing directory
+  and choose the root directory of your project. This makes sure
+  that RStudio knows where to find the libraries and the Rprofile.
+
+### Using Docker for occam
+
+occam, the supercomputer of the University of Turin, is the main
+external cluster used by the bioinfoconda community. (If you use
+another service, such as AWS, and you want to contribute the
+instructions for your favorite platform, by all means make a PR or
+email the maintainer.)
+
+Using occam is surprisingly difficult and many things are not
+explained in the official documentation. Here we provide a
+step-by-step guide to using occam the Bioinfoconda way. Before
+starting, however, please make sure you have a basic understanding
+of Docker and occam by reading the respective manuals and tutorials.
+
+It is helpful to classify the files that are involved in the
+pipeline into three categories: *input*, *output* and
+*intermediate*. The *output* category contains all those files which
+we need but do not yet have; ideally there should be a single
+Snakemake rule (or equivalent) to generate all the output files at
+once. The *intermediate* files are those that are automatically
+generated by Snakemake; we need not worry about them. And the
+*input* files are the ancestors of all the files in the analysis;
+typically they have been obtained by our wet-lab colleagues, or
+downloaded from the internet. Note that *input* does not refer to
+the input section of a single rule, but to the set of all files that
+are not generated by any rule or script; they are the files that go
+in the project's *local/data* directory, in */bioinfo/data*, or even
+inside other projects.
+
+Moreover, it is helpful to keep in mind the distinction between
+three environments: your local machine, where you write and run your
+pipelines on an everyday basis; occam, the supercomputer; and the
+docker container, an isolated environment that runs identically
+everywhere.
+
+1. Write and test the pipeline on your local machine. Make sure that
+   you can obtain the result that you want by running just one
+   command. For example, if you use Snakemake, make sure that the
+   dependencies are resolved correctly and there is a rule to make
+   everything; if you don't use Snakemake or similar software, you
+   could write a custom script that executes all the steps in the
+   pipeline. The important thing is that the target file and all its
+   dependencies be created with just one command, starting from a
+   limited set of *input* files.
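+   For instance, with Snakemake the check could look like this (a
+   sketch; `mytarget` stands for your real final file):
+   ```
+   snakemake -n mytarget        # dry run: list every job the target needs
+   snakemake --cores 8 mytarget # the single command that builds everything
+   ```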
+
+1. Export the environment by running `X-conda export`. (It is good
+   practice to do this every time you install something with conda,
+   but it is mandatory before building the docker image.)
+
+1. There is a default *local/dockerfiles/Dockerfile* inside each
+   project, but you will need to do some editing:
+   * the yml file path must be changed to match the one you exported
+     in step 2;
+   * if you manually created new environment variables, you have to
+     add them to the Dockerfile;
+   * any other manual changes you made may also have to be
+     incorporated into the Dockerfile: always think about that.
+
+1. `cd` into the project directory and build the docker image by
+   running:
+   ```
+   docker build \
+       -t "gitlab.c3s.unito.it:5000/user/project_name" \
+       -f local/dockerfiles/Dockerfile .
+   ```
+   Note that the last dot '.' is part of the command. Replace 'user'
+   with your username on occam (not on the local machine) and
+   'project\_name' with an arbitrary name (it helps if it is related
+   to your project's name, for instance 'alignment\_v2').
+   Important: the content of the *dataset* directory is not part of
+   the docker image; therefore, if some of the *input* files are
+   inside *dataset*, you'll need to mount them as volumes. The
+   master Snakefile, in particular, should always be mounted.
+
+1. Test the image locally. There are many things that can be tested,
+   but a basic check is to run
+   ```
+   docker run -it "gitlab.c3s.unito.it:5000/user/project_name"
+   ```
+   and explore the docker container to see if everything is where it
+   should be. You could also try to mount the volumes (see later
+   steps) and run the snakemake command to create your target.
+
+1. Create a repository on occam's GitLab (see [occam
+   HowTo](https://c3s.unito.it/index.php/super-computer/occam-howto)).
+   It should be called "project\_name" as in the previous step.
+
+1. Push the image to occam's registry by running the following two
+   commands:
+   ```
+   docker login "gitlab.c3s.unito.it:5000"
+   docker push "gitlab.c3s.unito.it:5000/user/project_name"
+   ```
+
+1. Log in to the [occam website](https://c3s.unito.it/index.php) and
+   use the [Resource Booking System](https://c3s.unito.it/booked/Web/?)
+   to reserve a slot. You will need to choose which node(s) to
+   reserve and then estimate the time it will take to compute your
+   pipeline using the chosen resources. Tips:
+   * If you book 2 light nodes, you don't get 48 cores on a single
+     machine; you will have to run 2 separate containers, each with
+     24 cores.
+   * The reservation can be updated (or deleted) only before its
+     official start.
+   * If you reserve two consecutive time slots for the same machine,
+     the container will not be interrupted.
+
+1. Log in to occam (see the
+   [HowTo](https://c3s.unito.it/index.php/super-computer/occam-howto))
+   and `cd` into */scratch/home/user/*, then create a directory
+   called *project_name* (it need not be called exactly that, but it
+   helps).
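+   In commands, this step might look like the following (a sketch:
+   the login host is a placeholder, check the HowTo for the real
+   one):
+   ```
+   ssh user@occam-login-node   # placeholder host name
+   cd /scratch/home/user
+   mkdir project_name
+   ```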
+
+1. Now it's time for the hard part: docker volumes. You'll have to
+   think of all the *input* files that your target needs. Suppose
+   that, on your local machine, the target is
+   */bioinfo/prj/project_name/dataset/analysis/correlation.tsv*,
+   while the *input* files are located under four different
+   directories:
+   * */bioinfo/data/ucsc/chromosomes*,
+   * */bioinfo/prj/project_name/local/data/annotation*,
+   * */bioinfo/prj/otherproject/dataset/expression*,
+   * */bioinfo/prj/project_name/dataset/alignments*.
+   You may need the last directory if, for instance, you have
+   already run the alignment rules on your local machine, and now
+   you just need to compute the correlation, without re-doing the
+   alignment. In this situation, our recommendation is to proceed as
+   follows. From occam, `cd` into */scratch/home/user/project_name*
+   and create four directories:
+   * `mkdir chromosomes`,
+   * `mkdir annotation`,
+   * `mkdir expression`,
+   * `mkdir -p dataset/alignments`.
+   Then, use `rsync` (or your favorite equivalent tool) to copy the
+   files from your local machine to occam:
+   * `rsync -a user@localmachineIP:/bioinfo/data/ucsc/chromosomes/* chromosomes`
+   * `rsync -a user@localmachineIP:/bioinfo/prj/project_name/local/data/annotation/* annotation`
+   * `rsync -a user@localmachineIP:/bioinfo/prj/otherproject/dataset/expression/* expression`
+   * `rsync -a user@localmachineIP:/bioinfo/prj/project_name/dataset/alignments/* dataset/alignments`
+   Lastly, copy the master Snakefile (this has to be done even if
+   you don't mount any volume):
+   * `rsync -a user@localmachineIP:/bioinfo/prj/project_name/dataset/Snakefile dataset/`
+
+1. Test the image on occam, mounting all the volumes. In a sense,
+   this step is the opposite of the previous one: there we
+   **copied** the directories from the local machine to occam; now
+   we **mount** those directories in the docker container, at the
+   same paths that they have on the local machine. We suggest
+   `cd`ing into */archive/home/user* and creating a directory named
+   after the title of the reservation that you created through
+   occam's resource booking system, with the current date appended.
+   For instance, `mkdir projectX_correlation_2020-09-13`. Then, `cd`
+   into the new directory and run:
+   ```
+   occam-run \
+       -v /scratch/home/user/project_name/chromosomes:/bioinfo/data/ucsc/chromosomes \
+       -v /scratch/home/user/project_name/annotation:/bioinfo/prj/project_name/local/data/annotation \
+       -v /scratch/home/user/project_name/expression:/bioinfo/prj/otherproject/dataset/expression \
+       -v /scratch/home/user/project_name/dataset:/bioinfo/prj/project_name/dataset \
+       -t user/project_name \
+       "snakemake -np target"
+   ```
+   The above command starts the container and executes the command
+   that makes the target (here assuming that you use Snakemake).
+   Please note that the quotes ("") are important. The container
+   runs on occam's node22, which is a management node reserved for
+   testing purposes only, so we cannot run the whole analysis on
+   this node. That's why we use snakemake with the `-n` (dry run)
+   option. A little while after the command exits, you will find
+   three files named something like *node22-5250.log*,
+   *node22-5250.err*, and *node22-5250.done*. Inspect their content
+   and see if everything is OK. In particular, the *.log* file
+   should contain the output of `snakemake -np target`. If something
+   is wrong, try to fix the problem and repeat.
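+   For example, you might check the results like this (a sketch:
+   the numeric job id, 5250 here, will differ):
+   ```
+   ls node22-5250.*                # .log, .err and .done should all exist
+   less node22-5250.log            # should end with the dry-run summary
+   grep -i error node22-5250.err   # anything to fix before the real run?
+   ```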
+
+1. When the time of your reservation comes, run the same command as
+   before from the same directory, but add the option `-n nodeXX` to
+   select your reserved node, and drop snakemake's `-n` flag, since
+   this time we want to actually build the target rather than do
+   another dry run. For instance, if you reserved node 17, write
+   ```
+   occam-run \
+       -n node17 \
+       -v /scratch/home/user/project_name/chromosomes:/bioinfo/data/ucsc/chromosomes \
+       -v /scratch/home/user/project_name/annotation:/bioinfo/prj/project_name/local/data/annotation \
+       -v /scratch/home/user/project_name/expression:/bioinfo/prj/otherproject/dataset/expression \
+       -v /scratch/home/user/project_name/dataset:/bioinfo/prj/project_name/dataset \
+       -t user/project_name \
+       "snakemake -p target"
+   ```
+   (occam-run's `-n node17` and snakemake's `-n` are unrelated
+   options: the former selects the node, the latter would make
+   snakemake do a dry run.)
+
+1. Remember that if you have booked multiple nodes, you will need to
+   run `occam-run` multiple times, once for each node. It is up to
+   you to split the targets and mount the volumes appropriately.
+   GOOD LUCK.
 
 #### TODO
 
@@ -469,7 +661,6 @@
 
 * Make the names coherent (bioinfo vs bioinfoconda...)
 
-* Improve docker management
-
 * Config file with default conda packages and default directories
 
+* Document the templates