Bioconda #10
My docker image is here: https://hub.docker.com/r/tiagochst/chip-seq/

Dockerfile (outdated):

```dockerfile
COPY environment.yml /
RUN conda env create -f /environment.yml && conda clean -a
RUN conda update -n base conda
```
I had sort of tried to avoid this in the past, as the base docker image already updates. However, I guess we don't know how stale the base container is, so it doesn't hurt 👍
I added that because there was a warning asking to update conda every time. It is not really necessary.
Yeah, the more I think about it, the more I think that it's probably a good idea to have in all nf-core containers.
Dockerfile (outdated):

```dockerfile
RUN conda env create -f /environment.yml && conda clean -a
RUN conda update -n base conda
RUN conda update --all
RUN conda env update -n root --file environment.yml && conda clean -a
```
This is brilliant - I didn't know that you could just use the root environment like this. Much cleaner 👍
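For reference, the root-environment pattern being discussed could be combined into something like the following sketch (the base image and file paths here are assumptions for illustration, not taken from the actual repo):

```dockerfile
# Assumed base image for illustration
FROM continuumio/miniconda:latest

# Copy the pipeline's conda environment definition into the image
COPY environment.yml /

# Update conda itself, install the packages directly into the root
# environment (no named env, so no ENV PATH / source activate step
# is needed afterwards), then clean caches to keep the image small
RUN conda update -n base conda && \
    conda env update -n root --file /environment.yml && \
    conda clean -a
```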
Dockerfile (outdated):

```dockerfile
RUN conda env create -f /environment.yml && conda clean -a
RUN conda update -n base conda
RUN conda update --all
RUN conda env update -n root --file environment.yml && conda clean -a
ENV PATH /opt/conda/envs/nfcore-chipseq-1.4dev/bin:$PATH
```
If we're not using a conda environment, I guess we can probably get rid of this line? This was an alternative to `source activate nfcore-chipseq-1.4dev`, but if we're not using that environment then I guess everything should already be available on the PATH.
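As a quick aside on what that `ENV PATH` line does: prepending a directory to `PATH` makes its executables win name lookup, which is why it stood in for `source activate`. A minimal throwaway shell sketch (the directory and tool name are invented for illustration, not from the pipeline):

```shell
# Create a throwaway "env bin" directory containing a tool
tmpbin=$(mktemp -d)
printf '#!/bin/sh\necho from-env-bin\n' > "$tmpbin/mytool"
chmod +x "$tmpbin/mytool"

# Prepend it to PATH, mirroring the ENV PATH line in the Dockerfile
PATH="$tmpbin:$PATH"

# The tool now resolves without an absolute path or `source activate`
mytool   # prints: from-env-bin
```

With packages installed into the root environment instead, the tools land in a directory that is already on the base image's PATH, so no prepend is needed.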
I'm not very familiar with the conda environment, but if I understand it right we might be able to remove that line. It would be better to check that first. I also have a question about Nextflow: can I specify which process to re-execute (even after it has completed)? I removed the folder of the process in the work directory and it just restarted all processes.
> Can I specify which process to re-execute (even after completed)? I removed the folder of the process in the work directory and it just restarted all processes.
No, it's automatic - if it thinks that downstream processes rely on a work folder that you deleted, then it will rerun them all. It does this because it can't know that the downstream results would be the same otherwise.
If I'm misunderstanding your question, then come ask on gitter where we can chat: https://gitter.im/nf-core/Lobby
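For context, resuming is driven by Nextflow's task cache for the run as a whole rather than by selecting individual processes; a typical invocation looks like the following sketch (the script name is illustrative):

```shell
# First run populates the work/ directory with cached task results
nextflow run main.nf

# -resume reuses cached results for any task whose inputs and script
# are unchanged; deleting a task's work folder invalidates its cache
# entry, so that task and everything downstream re-executes
nextflow run main.nf -resume
```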
Dockerfile (outdated):

```dockerfile
ENV PATH /opt/conda/envs/nfcore-chipseq-1.4dev/bin:$PATH

ENV NGSPLOT_VERSION="2.63"
RUN curl -fsSL https://github.com/shenlab-sinai/ngsplot/archive/${NGSPLOT_VERSION}.tar.gz -o /opt/ngsplot_${NGSPLOT_VERSION}.tar.gz && \
```
Thanks for this! Whilst it works, we have previously discussed that we should try to avoid direct installations into docker images like this. Instead, we hope to package missing tools into bioconda.
The reason for this is that users who aren't using docker and singularity, but only using the bioconda environment will not be able to run the full pipeline. So if possible, for maximum portability, it's best to try to use bioconda for everything.
Your thoughts on this are welcome. Do you think it's possible to add these to bioconda? Would you be willing / able to do so..? 😉
I'm not sure this package is in Bioconda: shenlab-sinai/ngsplot#74
So we had to add it there. It makes sense to port everything to bioconda; I'll check what we can do.
But the ngsplot genomes still need to be installed (they are available on Google Drive). That was one of the problems I had before, and I'm not sure that is possible with Bioconda.
What if we added a script and ran it from docker? Users would then also be able to execute it locally?
> But still, the ngsplot genomes needs to be installed (they are available in google drive).
Ah yeah, I'd forgotten about this. No, they can't (or rather, shouldn't) be added to bioconda. I'd argue that they shouldn't really be in the docker container either. It works, but it makes the container huge and bloated for anyone not using these reference genomes.
I think this falls into the same category as all references - can we not just ask the user to fetch them and specify their location in the same way as alignment genome references?
We could also look into adding them to AWS-iGenomes or something. We have also been discussing an upcoming project to create a bioconda-like resource for large genome references such as this, which would be the best solution (see https://gitter.im/nf-core/Lobby?at=5afd44f02df44c2d0634e430)
Phil
@apeltzer / @chuan-wang - any thoughts or previous discussion points that I've missed here?
Sorry @tiagochst, I missed a bit..
> What if we added a script and run it from docker? And users would be able to execute it locally?
Yeah, I guess.... I would consider this a last resort 😅
Sorry, still on vacation and thus only iPad typing available. I agree that we shouldn't package this in bioconda and should keep it external to the container for the reasons you mentioned.
@apeltzer - you mean the references yeah? And package the tool scripts in bioconda?
It could also be worth bringing this topic up in the bioconda community to see what they have to say.. https://gitter.im/bioconda/Lobby
Yes, just the references. I can ask once I'm back; I've already made a todo.
docs/usage.md (outdated):

```diff
@@ -213,7 +252,7 @@ projects or different sets of reference genomes.
 Note - you can use this to override defaults. For example, we run on UPPMAX but don't want to use the MultiQC
 environment module as is the default. So we specify a config file using `-c` that contains the following:

-```groovy
+```nextflow
 process.$multiqc.module = []
```
Ugh, this example is super outdated and should be removed.
We need to overhaul this whole part of the "standard" documentation really. I'm hoping to write a helper tool to work with this soon (basically using the cookiecutter template and doing a `git diff`)
Some messy git history stuff going on here. I just updated
Hi @tiagochst,

So your pull request generated quite a bit of discussion! 😄 (cf nf-core/cookiecutter#29)

I think that we came to the consensus that it's better to leave the Dockerfiles as they are with the pseudo-environments. For the

Finally, there's the manual installations of the tools in the Dockerfile. Whilst this is great in that it makes the container work with the pipeline, we agree that we would prefer to package these tools in bioconda instead and use these.

Would it be possible for you to revert these changes in the Dockerfile but keep the other things in your pull request (bug fixes, new packages etc)? Then we can merge here and open up a new PR when the bioconda packaging is done.

Thanks for your hard work on this! Let me know if you have comments / questions.

Phil
No problem. I'm just wondering what you will do for installing the genome db for ngsplot. phantompeakqualtools shouldn't be a problem; we just need to keep in mind that run_spp.R has a line missing (
Thanks! Yes, we need to discuss this with the bioconda team I think.
I'm trying to finish the docker file. I rebased some of the master fixes, as they were required to run the workflow.

The major changes are:

- Added `run_spp.R` from github.
- Added `conda-forge::gawk=4.2.1` to `environment.yml`, as gawk was required by phantompeakqualtools.
- Changed `RUN conda env create -f /environment.yml` to `RUN conda env update -n root --file environment.yml` because of the source below: https://docs.datascience.com/en/master/appendix-1/dockerfile-basics-and-best-practices-1.html

My files are big so I could not finish the test yet, but deeptools and phantompeakqualtools are working. ChIPpeakAnno never finishes and I was not able to test ngsplot yet.