Speed up alignment 2-5x(?) by using mm2-plus #33

Open
samuell opened this issue Dec 2, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@samuell
Contributor

samuell commented Dec 2, 2024

Enhancement suggestion

It seems it might be possible to speed up the most resource-intensive part of EMU (the alignment step, done by minimap2) by switching from minimap2 to this new drop-in replacement: https://github.com/at-cg/mm2-plus

They report speedups of around 2-5x depending on the dataset, judging from the graphs. That figure includes spreading the workload across multiple CPU cores, though, which might not make as big a difference in EMU, since EMU can already leverage multiple CPU cores by running multiple minimap2 jobs in parallel, if I understand correctly(?)

Anyways, there is a preprint about the tool here: https://www.biorxiv.org/content/10.1101/2024.11.25.625328v1

Motivation

The resource requirements for EMU are currently somewhat demanding, which might hinder fast response times depending on the number of samples and the available compute power.

In our rough tests we have seen resource requirements in the ballpark of:

  • 0.3-0.4 CPU core seconds per read (average read length 1400 bp)
  • Around 0.5 core hours per 4000-read chunk file output from the instrument
  • Around 20 core hours for a full sample of 40 such chunks (40 x 4000 reads). Wall-clock time can of course be cut by ~10x by scaling out on 10 cores, but that is still pretty demanding.
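As a rough sanity check, the figures in the list above can be tied together with a few lines of arithmetic. This is only a sketch: the helper name `core_hours` is made up for illustration, and the wall-clock estimate assumes ideal scaling across cores, which real runs will not quite reach.

```python
# Back-of-the-envelope cost estimate using the ballpark figures quoted above.
SECONDS_PER_HOUR = 3600

def core_hours(reads, core_seconds_per_read):
    """Convert a per-read cost in core seconds into total core hours."""
    return reads * core_seconds_per_read / SECONDS_PER_HOUR

reads_per_chunk = 4000      # reads per chunk file from the instrument
chunks_per_sample = 40      # example sample size from the list above
cores = 10                  # example scale-out from the list above

for rate in (0.3, 0.4):     # core seconds per read (low and high estimate)
    per_chunk = core_hours(reads_per_chunk, rate)
    per_sample = per_chunk * chunks_per_sample
    wall_clock = per_sample / cores   # assumption: ideal scaling across cores
    print(f"{rate} s/read: {per_chunk:.2f} core h per chunk, "
          f"{per_sample:.1f} core h per sample, "
          f"~{wall_clock:.1f} h on {cores} cores")
```

With these inputs the per-read rate works out to roughly 0.33-0.44 core hours per chunk and 13-18 core hours per 40-chunk sample, so the "0.5 core hours" and "~20 core hours" figures are rounded-up ballparks of the same numbers.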
@samuell changed the title from "Use faster version of minimap2 to speed up pipeline" to "Speed up alignment 2-5x(?) by using mm2-plus" Dec 2, 2024
@kdc10 added the enhancement (New feature or request) label Dec 3, 2024
@kdc10
Member

kdc10 commented Dec 3, 2024

Thanks for letting us know! We will look into it!

@jodjo86
Contributor

jodjo86 commented Dec 4, 2024

Did you use the --mm2-forward-only argument? It forces minimap2 to consider the forward transcript strand only. While not as promising as mm2-fast, it can speed up EMU. The argument is suitable for Iso-seq, Direct RNA-seq and traditional full-length cDNAs.

source: https://github.com/lh3/minimap2?tab=readme-ov-file#map-long-mrnacdna-reads

@samuell
Contributor Author

samuell commented Dec 5, 2024

Thank you for the suggestion @jodjo86 ! We'll have a look at that!

I should also say that after looking closer into mm2-plus, I realize the speedup might not be as great as I first thought, since a big part of it seems to come from utilizing multiple CPU cores, which EMU already does by running multiple minimap2 jobs in parallel. Any remaining speedup then probably comes from the SIMD optimizations, I guess.

I'm still interested in trying that out, but haven't got to it just yet.

Will report back if and when!

@jodjo86
Contributor

jodjo86 commented Dec 6, 2024

As far as I understand, EMU does not run multiple minimap2 jobs in parallel. EMU's --threads argument is passed directly to minimap2. The minimap2 documentation says: "Minimap2 uses at most three threads when indexing target sequences, and uses up to INT+1 threads when mapping." The indexing step with EMU is very fast (<2 sec), so it is advantageous to use as many cores as possible to speed up EMU.

A word of caution: minimap2 is a robust and heavily documented tool, so I would wait to see how mm2-plus develops before implementing it.

I'm just an EMU user but I hope this helps.

source: https://lh3.github.io/minimap2/minimap2.html

@jodjo86
Contributor

jodjo86 commented Dec 6, 2024

I did a small benchmark with the minimap2 step of EMU and a fastq file (16S amplicon ONT) of 61k reads (with EMU_db).

image

example command: minimap2 -ax map-ont -t $THREADS -N 50 -p .9 -u f -K 500000000 EMU_db/species_taxid.fasta $FASTQ -o $SAM
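For anyone wanting to repeat this kind of thread-scaling test, here is a minimal sketch of a timing harness around the example command. It is not the script used for the benchmark above: the function names, the thread counts and the `reads.fastq` and `out_t*.sam` paths are placeholders chosen for illustration. Only the minimap2 flags are taken from the example command.

```python
import shlex
import subprocess
import time

def build_minimap2_cmd(threads, db_fasta, fastq, sam):
    """Return the example minimap2 command as an argument list."""
    return [
        "minimap2", "-ax", "map-ont",
        "-t", str(threads),
        "-N", "50", "-p", ".9", "-u", "f",
        "-K", "500000000",
        db_fasta, fastq, "-o", sam,
    ]

def time_cmd(cmd):
    """Run a command and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

# Dry run: only print the commands that would be timed.
# The FASTQ and SAM paths are hypothetical placeholders.
for threads in (1, 2, 4, 8):
    cmd = build_minimap2_cmd(
        threads, "EMU_db/species_taxid.fasta", "reads.fastq", f"out_t{threads}.sam"
    )
    print(shlex.join(cmd))
```

To actually time a run, call `time_cmd(cmd)` on each command with minimap2 on the PATH and real input files in place. Keeping the command builder separate from the runner makes it easy to swap in another binary for comparison.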
