
Support for handling multiple samples #95

Open
wants to merge 8 commits into master
Conversation

@maplesond commented Feb 19, 2018

When processing samples against the NT database, a large amount of memory (~140-190 GB depending on the sample) and a long runtime (~20-30 minutes for index loading plus mapping) are required.

By loading the index once and then mapping multiple samples in sequence, we can make much more efficient use of RAM and runtime. To make this work with gzip-compressed files, I've also added gzip decompression.

I've had to reorganise some of the original bowtie2 code to get this to work correctly. You may also want to think about improving the interface I created for handling the samplesheet. It's definitely worth a thorough review before integrating, and I probably wouldn't recommend taking the changes as-is, though FWIW we've been using them for the past couple of months and found them to be stable.

@khyox commented Feb 19, 2018

@maplesond, thank you very much! I totally agree. For the NT database (or larger ones) this is very convenient, if not a must.

Sorry, I have been unable to find any documentation or example about the samplesheet for centrifuge-multi. From your code, I guess that it should have 5 comma-separated columns with the following layout:

input file for mated reads (1st mate), input file for mated reads (2nd mate), input file for unmated reads, Centrifuge output file, Centrifuge report file.

  • If the 1st column is r1 the line will be ignored; I suppose this is for a header line at the beginning of the samplesheet.
  • If the mated reads columns are empty (just the commas), it is taken as a single-end sample.
  • If the unmated reads column is empty, it is taken as a paired-end sample.
  • Otherwise (the first 3 columns all with file names), it is taken as a sample with both paired-end and single-end data.
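
So, if I understand correctly, a samplesheet could look something like this (file names are purely illustrative, and the header labels after r1 are just my guess, since only the first column seems to matter for that line):

r1,r2,u,output,report
sampleA_R1.fq.gz,sampleA_R2.fq.gz,,sampleA_hits.tsv,sampleA_report.tsv
,,sampleB.fq.gz,sampleB_hits.tsv,sampleB_report.tsv

Here the first data line would be a paired-end sample and the second a single-end sample.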

Is this the intended format for the samplesheet? Thanks again!

@maplesond (Author)

Yes, that sums it up exactly right. Sorry for the lack of documentation!

@khyox commented Mar 2, 2018

@maplesond, after a few days of testing centrifuge-multi I have to say it is awesome. Depending on the database size and the number of samples, the speedup is considerable (easily reaching 6-8x). It also provides a much more homogeneous and stable use of computing resources.

Thank you very much for your work extending Centrifuge in such an essential direction. I hope your PR will be successful!

@maplesond (Author)

@khyox, thanks very much. Glad you find it useful.

@apredeus

I've tested it on about 4k genomes and it works very well. It took around 100 CPU-hours overall.

@mourisl (Collaborator) commented Mar 27, 2018

Thanks for providing this useful pull request/feature.

We think it would be better to handle this through the wrapper, so Centrifuge itself can have fewer library dependencies and the output of unclassified reads can also go through the wrapper, though this implementation is not as elegant as yours.
We incorporated your idea into the main program so that centrifuge processes one sample at a time, and the wrapper "centrifuge" now takes the parameter "--sample-sheet tsv_file" to specify the multiple samples.
The format is: the first column specifies the sample type (1: single-end, 2: paired-end), the next two columns specify the read file(s), and these are followed by the classification result output file and the report file. If the sample is single-end (type 1), the third column will be ignored by Centrifuge, so, as in your sample sheet, there are 5 columns.
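
For example (file names and the placeholder in the third column are just illustrative), a tab-separated sheet with one paired-end and one single-end sample would look like:

2    sampleA_1.fq.gz    sampleA_2.fq.gz    sampleA_classification.tsv    sampleA_report.tsv
1    sampleB.fq.gz    NA    sampleB_classification.tsv    sampleB_report.tsv

The NA in the second line is only a placeholder that Centrifuge ignores for single-end samples; the columns themselves must be separated by tabs in the actual file.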

I created a branch "multisample" for this version. Could you give it a try? If it works, then I'll merge it into the master branch.

Thank you!

@maplesond (Author)

Hi @mourisl, You're welcome. Thanks for creating such a great tool! We've really found it useful.

Your suggested changes sound fine to me. I don't think I'll get a chance to test this before Easter but I'll let you know as soon as possible.

@mourisl mentioned this pull request Apr 22, 2018
@maplesond (Author)

Hello, sorry for not getting back to you earlier and holding up the release! I've just cloned and installed from the multisample branch. I can see code in the wrapper for handling samplesheets, but there is nothing in the help message when typing ./centrifuge --help. Was this intended to be a hidden feature? If not, it would be nice to see some indication of how to use it in the help message.

@mourisl (Collaborator) commented Apr 23, 2018

Thanks! I just added that to the help information of Centrifuge in the "multisample" branch. I will also add the description of the samplesheet format to the website once I merge this to the master branch.

@maplesond (Author) commented Apr 25, 2018

I encountered a problem when running against the nt index. I allocated 180GB to this, which should be adequate, and the same run works fine with my custom version. The command line I used was:

/tgac/software/testing/centrifuge/multisample_test/x86_64/bin/centrifuge -x /tgac/references/databases/centrifuge/nt --sample-sheet PAP_20180424/centrifuge/test_samplesheet.csv -p 16 -u 1000000 --no-abundance --startverbose --verbose -t > centrifuge_mstest.log 2>&1

The log file is here: centrifuge_mstest.log

My samplesheet has 5 tab-separated columns. The first column contains 2, to indicate paired-end. The next two columns are the gzipped R1 and R2 FASTQ files, then there is the hits file, and the last column is the report file.
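
For reference, each line is laid out something like this (these paths are placeholders, not the real ones):

2    reads_R1.fastq.gz    reads_R2.fastq.gz    hits.tsv    report.tsv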

Any ideas of what may have gone wrong?

@mourisl (Collaborator) commented Apr 27, 2018

The command line and the format of the sample sheet look right. I guess this works on your version of centrifuge-multi. Can you try running it with a single thread? Thanks.

@mourisl (Collaborator) commented May 14, 2018

I think the force kill might be due to too many decompression child processes. I just changed the wrapper so that only 1 child process is used. Could you please give the multisample branch another try?
Thank you.

@themouldinator

Can you post an example of a single-end read sample sheet, including the first line, please?

@mourisl (Collaborator) commented Nov 29, 2019

For a single-end sample you can use:
1 sample_read.fq.gz xxxx sample_classification.out sample_report.out

xxxx is an arbitrary string; Centrifuge will ignore it in the single-end case. The separator between columns should be a tab in the sheet since the format is TSV. Does this help?
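
In case it is useful, the sheet is then passed to the wrapper in the usual way, e.g. (the index name and thread count here are only examples):

centrifuge -x nt --sample-sheet samples.tsv -p 8

with one line per sample in the 5-column format above.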

@themouldinator

@maplesond @mourisl I've been using the sample sheet as a way to speed up processing time within our pipeline, but I am hesitant to use the centrifuge-multi branch to do this in parallel due to the number of commits it is missing. Would this result in errors? What would you suggest I do?

@mourisl (Collaborator) commented Jul 30, 2020

Hi @themouldinator, the centrifuge master branch also supports --sample-sheet starting from v1.0.4-beta. Or do you mean that the master branch failed and you need to use centrifuge-multi? Thanks.

@themouldinator

@mourisl My apologies, I'd used apt-get to install centrifuge, not realising it would install the 1.0.3 beta rather than 1.0.4, and so, going by the 1.0.3 manual, I had thought I needed centrifuge-multi for the sample sheet.
Would it be possible for you to update the centrifuge apt-get package to 1.0.4?
