-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve runtime of de novo variant discovery with racon #302
Comments
I have some benchmarking violin plots which I will do up and reference here.
Yes, I do have this on. I'll run without it and see if there's a noticeable difference |
Okay, seeing much better runtimes without the debugging files. In some examples, it is half the runtime. In others its a smaller amount. I'll provide a more accurate comparison soon. |
|
As we just want some performance logging now, I am closing this in favour of #303 |
Would you know the runtime of
pandora
with DBG denovo and withracon
? IDK if a few minutes more is a big or small proportion of the total runtime. Also, did you run with the--debugging-files
flag on or off? If it is on, I'd say this would contribute to a big increase of runtime, as we are outputting lots of data to understand exactly whatpandora
does in each mapping and discovery step, and writing to disk is very slow (this increase in runtime is fine, as--debugging-files
should be on only if you want to debug whatpandora
is doing).We could indeed have improvements in this area. We actually run
racon
for every locus several times, for a max of 10 rounds. We have an early stop mechanism if we see thatracon
did not improve the sequence, but in the worst case we might need to run 10 rounds. The improvements we did in this area is that neitherminimap2
norracon
are called as external process, but is just a C++ function call, i.e. we use them as APIs, so we avoid all the overhead of creating processes, terminating and cleaning them up etc (e.g. see https://github.com/rmcolq/pandora/blob/racon_denovo/src/denovo_discovery/racon.cpp#L38-L49 to how we callracon
). I think one of main performance improvements we can do is to use memfd (i.e. create files in RAM) instead of actual files in the filesystem forminimap2
andracon
, as writing several files per locus to the disk is the bottleneck of discovery. This is noted as a TODO here. I have much of the code formemfd
ready,minimap2
works with it,racon
does not though. It always error out here because it needs suffixes in its filepaths to recognise what type of file it is, butmemfd
files always look like/proc/self/fd/63
or sth like that, e.g. no suffixes. I just need to change it to readfasta
as default and I think will work... but got my time onpandora
cut short...The above would be a speed up on running
minimap2+racon
on a locus. We might also need a filtering on whether we even should runminimap2+racon
on a locus, as you said. One way would be to check for drops of coverage, as the DBG denovo does, however this would not detect insertions on the ML sequence. If there are no new variants for a locus, we at least runracon
just once and figure out it did not improve the sequence, and we finish the process... but we still run it...Originally posted by @leoisl in #299 (comment)
The text was updated successfully, but these errors were encountered: