Improve runtime of de novo variant discovery with racon #302

mbhall88 · 2022-10-30T23:08:03Z

    > My only comment is regarding runtime. I am seeing increases in runtime of a few minutes per sample.

Would you know the runtime of pandora with DBG denovo and with racon? IDK if a few minutes more is a big or small proportion of the total runtime. Also, did you run with the --debugging-files flag on or off? If it is on, I'd say this would contribute to a big increase of runtime, as we are outputting lots of data to understand exactly what pandora does in each mapping and discovery step, and writing to disk is very slow (this increase in runtime is fine, as --debugging-files should be on only if you want to debug what pandora is doing).

we run racon for every locus? Is that necessary? Could we somehow leverage the previous candidate discovery stuff to only run racon on those loci where we think there might be novel variants?

We could indeed have improvements in this area. We actually run racon for every locus several times, for a max of 10 rounds. We have an early stop mechanism if we see that racon did not improve the sequence, but in the worst case we might need to run 10 rounds. The improvements we did in this area is that neither minimap2 nor racon are called as external process, but is just a C++ function call, i.e. we use them as APIs, so we avoid all the overhead of creating processes, terminating and cleaning them up etc (e.g. see https://github.com/rmcolq/pandora/blob/racon_denovo/src/denovo_discovery/racon.cpp#L38-L49 to how we call racon). I think one of main performance improvements we can do is to use memfd (i.e. create files in RAM) instead of actual files in the filesystem for minimap2 and racon, as writing several files per locus to the disk is the bottleneck of discovery. This is noted as a TODO here. I have much of the code for memfd ready, minimap2 works with it, racon does not though. It always error out here because it needs suffixes in its filepaths to recognise what type of file it is, but memfd files always look like /proc/self/fd/63 or sth like that, e.g. no suffixes. I just need to change it to read fasta as default and I think will work... but got my time on pandora cut short...

The above would be a speed up on running minimap2+racon on a locus. We might also need a filtering on whether we even should run minimap2+racon on a locus, as you said. One way would be to check for drops of coverage, as the DBG denovo does, however this would not detect insertions on the ML sequence. If there are no new variants for a locus, we at least run racon just once and figure out it did not improve the sequence, and we finish the process... but we still run it...

Originally posted by @leoisl in #299 (comment)

The text was updated successfully, but these errors were encountered:

mbhall88 · 2022-10-30T23:09:11Z

Would you know the runtime of pandora with DBG denovo and with racon?

I have some benchmarking violin plots which I will do up and reference here.

IDK if a few minutes more is a big or small proportion of the total runtime. Also, did you run with the --debugging-files flag on or off?

Yes, I do have this on. I'll run without it and see if there's a noticeable difference

mbhall88 · 2022-10-31T00:59:54Z

Okay, seeing much better runtimes without the debugging files. In some examples, it is half the runtime. In others its a smaller amount.

I'll provide a more accurate comparison soon.

leoisl · 2022-10-31T13:12:31Z

denovo racon code now uses memfd, see this commit. I finished this on friday, but did not have time to finish the very last tests and post it here, sorry. I think this might conclude the straightforward optimisations we can do. There are more optimisations, especially the one you mentioned where we look for gaps of coverage, or a similar procedure, to infer if we should even try to find new alleles or not. However, for that I think it would require us to put back quite a good amount of code, or reimplement some sort of samtools pileup on the pandora SAMs. As it seems non-trivial, I am not planning on do it now, and only if we see that pandora discover performance really needs a speed up

leoisl · 2022-10-31T13:13:51Z

As we just want some performance logging now, I am closing this in favour of #303

leoisl closed this as completed Oct 31, 2022

leoisl mentioned this issue Oct 31, 2022

Use racon for de novo variant discovery #299

Merged

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve runtime of de novo variant discovery with racon #302

Improve runtime of de novo variant discovery with racon #302

mbhall88 commented Oct 30, 2022

mbhall88 commented Oct 30, 2022

mbhall88 commented Oct 31, 2022

leoisl commented Oct 31, 2022

leoisl commented Oct 31, 2022

Improve runtime of de novo variant discovery with racon #302

Improve runtime of de novo variant discovery with racon #302

Comments

mbhall88 commented Oct 30, 2022

mbhall88 commented Oct 30, 2022

mbhall88 commented Oct 31, 2022

leoisl commented Oct 31, 2022

leoisl commented Oct 31, 2022