Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve runtime of de novo variant discovery with racon #302

Closed
mbhall88 opened this issue Oct 30, 2022 · 4 comments
Closed

Improve runtime of de novo variant discovery with racon #302

mbhall88 opened this issue Oct 30, 2022 · 4 comments

Comments

@mbhall88
Copy link
Member

    > My only comment is regarding runtime. I am seeing increases in runtime of a few minutes per sample.

Would you know the runtime of pandora with DBG denovo and with racon? IDK if a few minutes more is a big or small proportion of the total runtime. Also, did you run with the --debugging-files flag on or off? If it is on, I'd say this would contribute to a big increase of runtime, as we are outputting lots of data to understand exactly what pandora does in each mapping and discovery step, and writing to disk is very slow (this increase in runtime is fine, as --debugging-files should be on only if you want to debug what pandora is doing).

we run racon for every locus? Is that necessary? Could we somehow leverage the previous candidate discovery stuff to only run racon on those loci where we think there might be novel variants?

We could indeed have improvements in this area. We actually run racon for every locus several times, for a max of 10 rounds. We have an early stop mechanism if we see that racon did not improve the sequence, but in the worst case we might need to run 10 rounds. The improvements we did in this area is that neither minimap2 nor racon are called as external process, but is just a C++ function call, i.e. we use them as APIs, so we avoid all the overhead of creating processes, terminating and cleaning them up etc (e.g. see https://github.com/rmcolq/pandora/blob/racon_denovo/src/denovo_discovery/racon.cpp#L38-L49 to how we call racon). I think one of main performance improvements we can do is to use memfd (i.e. create files in RAM) instead of actual files in the filesystem for minimap2 and racon, as writing several files per locus to the disk is the bottleneck of discovery. This is noted as a TODO here. I have much of the code for memfd ready, minimap2 works with it, racon does not though. It always error out here because it needs suffixes in its filepaths to recognise what type of file it is, but memfd files always look like /proc/self/fd/63 or sth like that, e.g. no suffixes. I just need to change it to read fasta as default and I think will work... but got my time on pandora cut short...

The above would be a speed up on running minimap2+racon on a locus. We might also need a filtering on whether we even should run minimap2+racon on a locus, as you said. One way would be to check for drops of coverage, as the DBG denovo does, however this would not detect insertions on the ML sequence. If there are no new variants for a locus, we at least run racon just once and figure out it did not improve the sequence, and we finish the process... but we still run it...

Originally posted by @leoisl in #299 (comment)

@mbhall88
Copy link
Member Author

Would you know the runtime of pandora with DBG denovo and with racon?

I have some benchmarking violin plots which I will do up and reference here.

IDK if a few minutes more is a big or small proportion of the total runtime. Also, did you run with the --debugging-files flag on or off?

Yes, I do have this on. I'll run without it and see if there's a noticeable difference

@mbhall88
Copy link
Member Author

Okay, seeing much better runtimes without the debugging files. In some examples, it is half the runtime. In others its a smaller amount.

I'll provide a more accurate comparison soon.

@leoisl
Copy link
Collaborator

leoisl commented Oct 31, 2022

denovo racon code now uses memfd, see this commit. I finished this on friday, but did not have time to finish the very last tests and post it here, sorry. I think this might conclude the straightforward optimisations we can do. There are more optimisations, especially the one you mentioned where we look for gaps of coverage, or a similar procedure, to infer if we should even try to find new alleles or not. However, for that I think it would require us to put back quite a good amount of code, or reimplement some sort of samtools pileup on the pandora SAMs. As it seems non-trivial, I am not planning on do it now, and only if we see that pandora discover performance really needs a speed up

@leoisl
Copy link
Collaborator

leoisl commented Oct 31, 2022

As we just want some performance logging now, I am closing this in favour of #303

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants