Bioconda #10
My docker image is here: https://hub.docker.com/r/tiagochst/chip-seq/

Dockerfile (outdated):

```dockerfile
COPY environment.yml /
RUN conda env create -f /environment.yml && conda clean -a
RUN conda update -n base conda
```
I had sort of tried to avoid this in the past, as the base docker image already updates. However, I guess we don't know how stale the base container is, so it doesn't hurt 👍
I added that because there was a warning asking to update conda every time. It is not really necessary.
Yeah, the more I think about it, the more I think that it's probably a good idea to have in all nf-core containers.
Dockerfile (outdated):

```dockerfile
RUN conda env create -f /environment.yml && conda clean -a
RUN conda update -n base conda
RUN conda update --all
RUN conda env update -n root --file environment.yml && conda clean -a
```
This is brilliant - I didn't know that you could just use the root environment like this. Much cleaner 👍
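For reference, the root-environment pattern being discussed could be combined into something like the following sketch (the base image and file paths here are assumptions for illustration, not taken from the actual repo):

```dockerfile
# Assumed base image for illustration
FROM continuumio/miniconda:latest

# Copy the pipeline's conda environment definition into the image
COPY environment.yml /

# Update conda itself, install the packages directly into the root
# environment (no named env, so no ENV PATH / source activate step
# is needed afterwards), then clean caches to keep the image small
RUN conda update -n base conda && \
    conda env update -n root --file /environment.yml && \
    conda clean -a
```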
Dockerfile (outdated):

```dockerfile
RUN conda env create -f /environment.yml && conda clean -a
RUN conda update -n base conda
RUN conda update --all
RUN conda env update -n root --file environment.yml && conda clean -a
ENV PATH /opt/conda/envs/nfcore-chipseq-1.4dev/bin:$PATH
```
If we're not using a conda environment, I guess we can probably get rid of this line? This was an alternative to `source activate nfcore-chipseq-1.4dev`, but if we're not using that environment then I guess everything should already be available on the PATH.
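As a quick aside on what that `ENV PATH` line does: prepending a directory to `PATH` makes its executables win name lookup, which is why it stood in for `source activate`. A minimal throwaway shell sketch (the directory and tool name are invented for illustration, not from the pipeline):

```shell
# Create a throwaway "env bin" directory containing a tool
tmpbin=$(mktemp -d)
printf '#!/bin/sh\necho from-env-bin\n' > "$tmpbin/mytool"
chmod +x "$tmpbin/mytool"

# Prepend it to PATH, mirroring the ENV PATH line in the Dockerfile
PATH="$tmpbin:$PATH"

# The tool now resolves without an absolute path or `source activate`
mytool   # prints: from-env-bin
```

With packages installed into the root environment instead, the tools land in a directory that is already on the base image's PATH, so no prepend is needed.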
I'm not very familiar with the conda environment, but if I understand it right we might be able to remove that line. It would be better to check that first. I also have a question about Nextflow: can I specify which process to re-execute (even after it has completed)? I removed the folder of the process in the work directory and it just restarted all processes.
> Can I specify which process to re-execute (even after completed)? I removed the folder of the process in the work directory and it just restarted all processes.
No, it's automatic - if it thinks that downstream processes rely on a work folder that you deleted, then it will rerun them all. It does this because it can't know that the downstream results would be the same otherwise.
If I'm misunderstanding your question, then come ask on gitter where we can chat: https://gitter.im/nf-core/Lobby
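For context, resuming is driven by Nextflow's task cache for the run as a whole rather than by selecting individual processes; a typical invocation looks like the following sketch (the script name is illustrative):

```shell
# First run populates the work/ directory with cached task results
nextflow run main.nf

# -resume reuses cached results for any task whose inputs and script
# are unchanged; deleting a task's work folder invalidates its cache
# entry, so that task and everything downstream re-executes
nextflow run main.nf -resume
```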
Dockerfile (outdated):

```dockerfile
ENV PATH /opt/conda/envs/nfcore-chipseq-1.4dev/bin:$PATH

ENV NGSPLOT_VERSION="2.63"
RUN curl -fsSL https://github.com/shenlab-sinai/ngsplot/archive/${NGSPLOT_VERSION}.tar.gz -o /opt/ngsplot_${NGSPLOT_VERSION}.tar.gz && \
```
Thanks for this! Whilst it works, we have previously discussed that we should try to avoid direct installations into docker images like this. Instead, we hope to package missing tools into bioconda.
The reason for this is that users who aren't using docker and singularity, but only using the bioconda environment will not be able to run the full pipeline. So if possible, for maximum portability, it's best to try to use bioconda for everything.
Your thoughts on this are welcome. Do you think it's possible to add these to bioconda? Would you be willing / able to do so..? 😉
I'm not sure this package is in Bioconda: shenlab-sinai/ngsplot#74
So we had to add it there. It makes sense to port everything to bioconda; I'll check what we can do.
But the ngsplot genomes still need to be installed (they are available on Google Drive). That was one of the problems I had before, and I'm not sure that is possible with Bioconda.
What if we added a script and ran it from docker? Users would then also be able to execute it locally?
> But still, the ngsplot genomes needs to be installed (they are available in google drive).
Ah yeah, I'd forgotten about this. No, they can't (or rather, shouldn't) be added to bioconda. I'd argue that they shouldn't really be in the docker container either. It works, but it makes the container huge and bloated for anyone not using these reference genomes.
I think this falls into the same category as all references - can we not just ask the user to fetch them and specify their location in the same way as alignment genome references?
We could also look into adding them to AWS-iGenomes or something. We have also been discussing an upcoming project to create a bioconda-like resource for large genome references such as this, which would be the best solution (see https://gitter.im/nf-core/Lobby?at=5afd44f02df44c2d0634e430)
Phil
@apeltzer / @chuan-wang - any thoughts or previous discussion points that I've missed here?
Sorry @tiagochst, I missed a bit..
> What if we added a script and run it from docker? And users would be able to execute it locally?
Yeah, I guess.... I would consider this a last resort 😅
Sorry, still on vacation and thus only iPad typing available. I agree that we shouldn't package this in bioconda and should keep it external to the container for the reasons you mentioned.
@apeltzer - you mean the references yeah? And package the tool scripts in bioconda?
It could also be worth bringing this topic up in the bioconda community to see what they have to say.. https://gitter.im/bioconda/Lobby
Yes, just the references. I can ask once I'm back; I've already made a todo.
docs/usage.md (outdated):

```diff
@@ -213,7 +252,7 @@ projects or different sets of reference genomes.
 Note - you can use this to override defaults. For example, we run on UPPMAX but don't want to use the MultiQC
 environment module as is the default. So we specify a config file using `-c` that contains the following:

-```groovy
+```nextflow
 process.$multiqc.module = []
```
Ugh, this example is super outdated and should be removed.
We need to overhaul this whole part of the "standard" documentation really. I'm hoping to write a helper tool to work with this soon (basically using the cookiecutter template and doing a `git diff`)
Some messy git history stuff going on here. I just updated
Hi @tiagochst,

So your pull request generated quite a bit of discussion! 😄 (cf nf-core/cookiecutter#29)

I think that we came to the consensus that it's better to leave the Dockerfiles as they are with the pseudo-environments. For the

Finally, there's the manual installations of the tools in the Dockerfile. Whilst this is great in that it makes the container work with the pipeline, we agree that we would prefer to package these tools in bioconda instead and use these.

Would it be possible for you to revert these changes in the Dockerfile but keep the other things in your pull request (bug fixes, new packages etc)? Then we can merge here and open up a new PR when the bioconda packaging is done.

Thanks for your hard work on this! Let me know if you have comments / questions.

Phil
No problem. I'm just wondering what you will do for installing the genome db for ngsplot. phantompeakqualtools shouldn't be a problem; we just need to keep in mind that run_spp.R has a line missing (
Thanks! Yes, we need to discuss this with the bioconda team I think.
I'm trying to finish the docker file. I rebased some of the master fixes, as they were required to run the workflow.

The major changes are:

- Added `run_spp.R` from github.
- Added `conda-forge::gawk=4.2.1` to `environment.yml`, as gawk was required by phantompeakqualtools.
- Changed `RUN conda env create -f /environment.yml` to `RUN conda env update -n root --file environment.yml` because of the source below: https://docs.datascience.com/en/master/appendix-1/dockerfile-basics-and-best-practices-1.html

My files are big so I could not finish the test yet, but deeptools and phantompeakqualtools are working. ChIPpeakAnno never finishes and I was not able to test ngsplot yet.