
Support for handling multiple samples #95

Open
wants to merge 8 commits into master
Conversation

@maplesond commented Feb 19, 2018

When processing samples against the NT database, a large amount of memory (~140-190 GB depending on the sample) and a long runtime (~20-30 minutes for index loading plus mapping) are required.

By loading the index once and then mapping multiple samples in sequence, we can make much more efficient use of RAM and runtime. To make this work with gzip-compressed files, I've also added gzip decompression.

I've had to reorganise some of the original bowtie2 code to get this to work correctly. You may also want to think about improving the interface I created for handling the samplesheet. It's definitely worth a thorough review before integrating, and I probably wouldn't recommend taking the changes as-is, though FWIW we've been using them for the past couple of months and found them to be stable.

@khyox commented Feb 19, 2018

@maplesond, thank you very much! I totally agree. For the NT database (or larger ones) this is very convenient, if not a must.

Sorry, I have been unable to find any documentation or example about the samplesheet for centrifuge-multi. From your code, I guess that it should have 5 comma-separated columns with the following layout:

input file for mated reads (1st mate), input file for mated reads (2nd mate), input file for unmated reads, Centrifuge output file, Centrifuge report file.

  • If the 1st column is r1 the line will be ignored; I suppose this is for a header line at the beginning of the samplesheet.
  • If the mated reads columns are empty (just the commas), it is taken as a single-end sample.
  • If the unmated reads column is empty, it is taken as a paired-end sample.
  • Otherwise (the first 3 columns all with file names), it is taken as a sample with both paired-end and single-end data.
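
So, if I understand correctly, a samplesheet could look something like this (file names are purely illustrative, and the header labels after r1 are just my guess, since only the first column seems to matter for that line):

r1,r2,u,output,report
sampleA_R1.fq.gz,sampleA_R2.fq.gz,,sampleA_hits.tsv,sampleA_report.tsv
,,sampleB.fq.gz,sampleB_hits.tsv,sampleB_report.tsv

Here the first data line would be a paired-end sample and the second a single-end sample.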

Is this the intended format for the samplesheet? Thanks again!

@maplesond (Author)

Yes, that sums it up exactly right. Sorry for the lack of documentation!

@khyox commented Mar 2, 2018

@maplesond, after a few days of testing centrifuge-multi I have to say it is awesome. Depending on the database size and the number of samples, the speedup is considerable (easily reaching 6-8x). It also provides a much more homogeneous and stable use of computing resources.

Thank you very much for your work extending Centrifuge in such an essential direction. I hope your PR will be successful!

@maplesond (Author)

@khyox, thanks very much. Glad you find it useful.

@apredeus

I've tested it on about 4k genomes and it works very well. It took around 100 CPU-hours overall.

@mourisl (Collaborator) commented Mar 27, 2018

Thanks for providing this useful pull request/feature.

We think it would be better to handle this through the wrapper, so Centrifuge itself can have fewer library dependencies and the output of unclassified reads can also go through the wrapper, though this implementation is not as elegant as yours.
We incorporated your idea into the main program so that centrifuge processes one sample at a time, and the wrapper "centrifuge" now takes the parameter "--sample-sheet tsv_file" to specify the multiple samples.
The format is: the first column specifies the sample type (1: single-end, 2: paired-end), the next two columns specify the read file(s), and these are followed by the classification result output file and the report file. If the sample is single-end (type 1), the third column will be ignored by Centrifuge, so, as in your sample sheet, there are 5 columns.
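
For example (file names and the placeholder in the third column are just illustrative), a tab-separated sheet with one paired-end and one single-end sample would look like:

2    sampleA_1.fq.gz    sampleA_2.fq.gz    sampleA_classification.tsv    sampleA_report.tsv
1    sampleB.fq.gz    NA    sampleB_classification.tsv    sampleB_report.tsv

The NA in the second line is only a placeholder that Centrifuge ignores for single-end samples; the columns themselves must be separated by tabs in the actual file.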

I created a branch "multisample" for this version. Could you give it a try? If it works, then I'll merge it into the master branch.

Thank you!

@maplesond (Author)

Hi @mourisl, You're welcome. Thanks for creating such a great tool! We've really found it useful.

Your suggested changes sound fine to me. I don't think I'll get a chance to test this before Easter but I'll let you know as soon as possible.

@mourisl mentioned this pull request Apr 22, 2018
@maplesond (Author)

Hello, sorry for not getting back to you earlier and holding up the release! I've just cloned and installed from the multisample branch. I can see code in the wrapper for handling samplesheets, but there is nothing in the help message when typing ./centrifuge --help. Was this intended to be a hidden feature? If not, it would be nice to see some indication of how to use it in the help message.

@mourisl (Collaborator) commented Apr 23, 2018

Thanks! I just added that to the help information of Centrifuge in the "multisample" branch. I will also add the description of the samplesheet format to the website once I merge this to the master branch.

@maplesond (Author) commented Apr 25, 2018

I encountered a problem when running against the nt index. I allocated 180GB to this, which should be adequate, and the same run works fine with my custom version. The command line I used was:

/tgac/software/testing/centrifuge/multisample_test/x86_64/bin/centrifuge -x /tgac/references/databases/centrifuge/nt --sample-sheet PAP_20180424/centrifuge/test_samplesheet.csv -p 16 -u 1000000 --no-abundance --startverbose --verbose -t > centrifuge_mstest.log 2>&1

The log file is here: centrifuge_mstest.log

My samplesheet has 5 tab-separated columns. The first column contains 2, to indicate paired-end. The next two columns are the gzipped R1 and R2 FASTQ files, then there is the hits file, and the last column is the report file.
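
For reference, each line is laid out something like this (these paths are placeholders, not the real ones):

2    reads_R1.fastq.gz    reads_R2.fastq.gz    hits.tsv    report.tsv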

Any ideas of what may have gone wrong?

@mourisl (Collaborator) commented Apr 27, 2018

The command line and the format of the sample sheet look right. I guess this works on your version of centrifuge-multi. Can you try running it with a single thread? Thanks.

@mourisl (Collaborator) commented May 14, 2018

I think the force kill might be due to too many decompression child processes. I just changed the wrapper so that only 1 child process is used. Could you please give the multisample branch another try?
Thank you.

@themouldinator

Can you post an example of a single-end read sample sheet, including the first line, please?

@mourisl (Collaborator) commented Nov 29, 2019

For a single-end sample you can use:
1 sample_read.fq.gz xxxx sample_classification.out sample_report.out

xxxx is an arbitrary string; Centrifuge will ignore it in the single-end case. The separator between columns should be a tab in the sheet since the format is TSV. Does this help?
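
In case it is useful, the sheet is then passed to the wrapper in the usual way, e.g. (the index name and thread count here are only examples):

centrifuge -x nt --sample-sheet samples.tsv -p 8

with one line per sample in the 5-column format above.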

@themouldinator

@maplesond @mourisl I've been using the sample sheet as a way to speed up processing time within our pipeline, but I am hesitant to use the centrifuge-multi branch to do this in parallel due to the number of commits it is missing. Would this result in errors? What would you suggest I do?

@mourisl (Collaborator) commented Jul 30, 2020

Hi @themouldinator, the centrifuge master branch also supports --sample-sheet starting from v1.0.4-beta. Or do you mean that the master branch failed and you need to use centrifuge-multi? Thanks.

@themouldinator

@mourisl My apologies, I'd used apt-get to install centrifuge, not realising it would install the 1.0.3 beta rather than 1.0.4, and so, going by the 1.0.3 manual, I had thought I needed centrifuge-multi for the sample sheet.
Would it be possible for you to update the centrifuge apt-get package to 1.0.4?
