Support for handling multiple samples #95
…trifuge-class but I'll gradually modify this over time.
Added support for samplesheet via new tool "centrifuge-multi"
@maplesond, thank you very much! I totally agree. For the NT database (or larger ones) this is very convenient, if not a must. Sorry, I have been unable to find any documentation or example about the samplesheet for
Is this the intended format for the samplesheet? Thanks again!
Yes, that sums it up exactly right. Sorry for the lack of documentation!
@maplesond, after some days testing Thank you very much for your work extending Centrifuge in such an essential direction. I hope your PR will be successful!
@khyox, thanks very much. Glad you find it useful.
I've tested it on about 4k genomes and it works very well. It took roughly 100 CPU-hours overall.
Thanks for providing this useful request/feature. We think it would be better to handle this through a wrapper, so that Centrifuge has fewer library dependencies and the output of unclassified reads also goes through the wrapper, though this implementation is not as elegant as yours. I created a branch "multisample" for this version. Could you give it a try? If it works, I'll merge it into the master branch. Thank you!
Hi @mourisl, You're welcome. Thanks for creating such a great tool! We've really found it useful. Your suggested changes sound fine to me. I don't think I'll get a chance to test this before Easter but I'll let you know as soon as possible.
Hello, sorry for not getting back to you earlier and holding up the release! I've just cloned and installed from the multisample branch. I can see code in the wrapper for handling samplesheets but there is nothing in the help message when typing
Thanks! I just added that to the help information of Centrifuge in the "multisample" branch. I will also add the description of the samplesheet format to the website once I merge this to the master branch.
I encountered a problem when trying to index the nt database. I allocated 180GB to this which should be adequate. This also works fine with my custom version. The command line I used was:
The log file is here: centrifuge_mstest.log. My samplesheet is a 5-column tab-separated file. The first column contains 2, to indicate paired-end reads. The next two columns are the gzipped R1 and R2 FASTQ files, the fourth column is the hits file, and the last column is the report file. Any ideas of what may have gone wrong?
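For readers following along, the five-column layout described in this comment can be sketched in Python as below. This is only an illustration under stated assumptions: the helper `write_samplesheet` and all file names are hypothetical, and the sketch writes no header line because the thread does not say whether one is expected.

```python
import csv
import io


def write_samplesheet(rows, handle):
    """Write samplesheet rows as tab-separated lines (no header line)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for row in rows:
        # Columns as described in the thread: read type, R1, R2, hits file, report file.
        if len(row) != 5:
            raise ValueError(f"expected 5 columns, got {len(row)}: {row!r}")
        writer.writerow(row)


# Hypothetical file names, for illustration only.
paired_rows = [
    ("2", "sampleA_R1.fastq.gz", "sampleA_R2.fastq.gz", "sampleA.hits.tsv", "sampleA.report.tsv"),
    ("2", "sampleB_R1.fastq.gz", "sampleB_R2.fastq.gz", "sampleB.hits.tsv", "sampleB.report.tsv"),
]

buffer = io.StringIO()
write_samplesheet(paired_rows, buffer)
print(buffer.getvalue(), end="")
```

Writing through `csv.writer` with a tab delimiter guarantees real tab characters between columns, which avoids the common mistake of pasting spaces into a TSV file.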
The command line and the format of the sample sheet look right. I guess this works on your version of centrifuge-multi. Can you try to run it with a single thread? Thanks.
I think the forced kill might be due to too many decompressing child processes. I just changed the wrapper so that only 1 child process is used. Could you please give the multisample branch another try?
Can you post an example of a single-end read sample sheet, including the first line, please?
For a single-end sample you can use: xxx is some random string; Centrifuge will ignore it in the single-end case. The columns should be separated by tabs, since the format is TSV. Does this help?
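A minimal parsing sketch may make the placeholder column concrete. It assumes, beyond what this thread states, that the first column is 1 for single-end samples (the thread only confirms 2 for paired-end); the `Sample` type and `parse_samplesheet_line` function are hypothetical names used for illustration, not part of Centrifuge.

```python
from typing import NamedTuple, Optional


class Sample(NamedTuple):
    paired: bool
    reads1: str
    reads2: Optional[str]  # None for single-end samples
    hits: str
    report: str


def parse_samplesheet_line(line: str) -> Sample:
    """Parse one tab-separated samplesheet line into a Sample."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated columns, got {len(fields)}")
    read_type, reads1, reads2, hits, report = fields
    if read_type == "2":
        return Sample(True, reads1, reads2, hits, report)
    if read_type == "1":  # ASSUMPTION: 1 marks single-end; not stated in the thread
        # The third column is only a placeholder (e.g. "xxx") and is ignored.
        return Sample(False, reads1, None, hits, report)
    raise ValueError(f"unknown read type: {read_type!r}")


# Hypothetical single-end line; "xxx" fills the unused mate column.
print(parse_samplesheet_line("1\treads.fastq.gz\txxx\thits.tsv\treport.tsv\n"))
```

Keeping the placeholder column means every row has the same five fields, so single-end and paired-end samples can share one sheet.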
@maplesond @mourisl I've been using the sample sheet to speed up processing within our pipeline, but I'm hesitant to use the centrifuge-multi branch to run in parallel because of the number of commits it's missing. Would this result in errors? What would you suggest I do?
Hi @themouldinator, the Centrifuge master branch also supports sample sheets starting from v1.0.4-beta. Or do you mean the master branch failed and you need to use centrifuge-multi? Thanks.
@mourisl My apologies, I'd used apt-get to install Centrifuge, not realising it would install the 1.03 beta rather than 1.04, and so from the 1.03 manual I had thought I needed centrifuge-multi for sample sheets.
When processing samples against the NT database, a large amount of memory (~140-190 GB depending on the sample) and a long runtime (~20-30 minutes for index loading plus mapping) are required.
By loading the index once and then mapping to multiple samples in sequence we can make much more efficient use of RAM and runtime. In order for this to work with gzip compressed files I've added in gzip decompression as well.
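The load-once pattern described above can be sketched roughly as follows. This is not the PR's C++ implementation, only a Python illustration of the idea: `load_index` and `classify` are hypothetical callables supplied by the caller, and `count_reads` assumes plain four-line FASTQ records.

```python
import gzip


def open_reads(path):
    """Open a read file as text, decompressing when it starts with the gzip magic bytes."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_reads(path):
    """Count FASTQ records, assuming exactly four lines per record."""
    with open_reads(path) as handle:
        return sum(1 for _ in handle) // 4


def process_samples(load_index, classify, samples):
    """Load the index a single time, then classify each sample in sequence."""
    index = load_index()  # the expensive step, paid once rather than per sample
    return [classify(index, sample) for sample in samples]
```

With this shape, the index loading cost is amortised over all samples, which is the saving the description refers to; detecting gzip by magic bytes rather than file extension lets compressed and uncompressed inputs share one code path.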
I've had to reorganise some of the original bowtie2 code in order to get this to work correctly. You may also want to think about improving the interface I created for handling the samplesheet. It's definitely worth a thorough review before integrating, and I probably wouldn't recommend taking the changes as is. That said, FWIW we've been using the changes for the past couple of months and have found them to be stable.