Skip to content

Latest commit

 

History

History
345 lines (284 loc) · 21.6 KB

README.md

File metadata and controls

345 lines (284 loc) · 21.6 KB

A rust FFI library for minimap2. In development! Feedback appreciated!

https://crates.io/crates/minimap2 https://docs.rs/minimap2/latest/minimap2/ CircleCI codecov

Structure

minimap2-sys is the raw FFI bindings to minimap2. minimap2 is the more opinionated, rusty version.

How to use

Requirements

minimap2 = "0.1.23+minimap2.2.28"

Also see Features

Tested with rustc 1.82.0 and nightly. So probably a good idea to upgrade before running. But let me know if you run into pain points with older versions and will try to fix.

Minimap2 Version Table

minimap2-rs minimap2
0.1.23 2.28
0.1.22 2.28
0.1.21 2.28
0.1.20 2.28
0.1.19 2.28
0.1.18 2.28
0.1.17 2.27
0.1.16 2.26

Usage

Create an Aligner

let mut aligner = Aligner::builder()
    .map_ont()
    .with_threads(8)
    .with_cigar()
    .with_index("ReferenceFile.fasta", None)
    .expect("Unable to build index");

Align a sequence:

let seq: Vec<u8> = b"ACTGACTCACATCGACTACGACTACTAGACACTAGACTATCGACTACTGACATCGA";
let alignment = aligner
    .map(&seq, false, false, None, None, Some(b"My Sequence Name"))
    .expect("Unable to align");

Presets

All minimap2 presets should be available (see functions section):

let aligner = map_ont();
let aligner = asm20();

Note Each preset overwrites different arguments. Using multiple at a time is not technically supported, but will work. Results unknown. So be careful! It's equivalent to running minimap2 -x map_ont -x short ...

Customization

MapOpts and IdxOpts can be customized with Rust's struct pattern, as well as applying mapping settings. Inspired by bevy.

let mut aligner: Aligner<PresetSet> = Aligner::builder().map_ont();
aligner.mapopt.seed = 42;
aligner.mapopt.best_n = 1;
aligner.idxopt.k = 21;
self.mapopt.flag |= MM_F_COPY_COMMENT as i64; // Setting a flag. If you do this frequently, open an [issue](https://github.com/jguhlin/minimap2-rs/issues/new) asking for an ergonomic function!
self.idxopt.flag |= MM_I_HPC as i32;

See full list of options below.

Working Example

Examples Directory

There are two working examples directly in this repo. In both instances below, 64 is the number of threads to allocate.

Channel-based multi-threading

cargo run --example channels -- reference_genome.fasta reads.fasta 64

Rayon-based multi-threading

cargo run --example rayon -- reference_genome.fasta reads.fasta 64

Depending on your needs you can probably do just fine with Rayon. But for very large implementations, interactivity, or limited memory, using channels may be the way to go.

Fakeminimap2

There is a binary called "fakeminimap2" which demonstrates basic usage and multithreading using channels or rayon. You can find it in this repo for an example. It it much more fully featured example, with an output interface, some mouse support, and interaction.

Code Examples

Alignment functions return a Mapping struct. The Alignment struct is only returned when the Aligner is created using .with_cigar().

A very simple example would be:

let mut file = std::fs::File::open(query_file);
let mut reader = BufReader::new(reader);
let mut fasta = Fasta::from_buffer(&mut reader)

for seq in reader {
    let seq = seq.unwrap();
    let alignment: Vec<Mapping> = aligner
        .map(&seq.sequence.unwrap(), false, false, None, None, None)
        .expect("Unable to align");
    println!("{:?}", alignment);
}

There is a map_file function that works on an entire file, but it is not-lazy and thus not suitable for large files. It may be removed in the future or moved to a separate lib.

let mappings: Result<Vec<Mapping>> = aligner.map_file("query.fa", false, false);

Multithreading

Multithreading is supported, for implementation example see fakeminimap2. Minimap2 also supports threading itself, and will use a minimum of 3 cores for building the index. Multithreading for mapping is left to the end-user.

Adjust the number of threads used to build the index:

let mut aligner = Aligner::builder()
    .map_ont()
    .with_index_threads(8);

Experimental Rayon support

This appears to work. See fakeminimap2 for full implementation.

use rayon::prelude::*;

let results = sequences.par_iter().map(|seq| {
    aligner.map(seq.as_bytes(), false, false, None, None, None).unwrap()
}).collect::<Vec<_>>();

Arc cloning the Aligner

Also works. Otherwise directly cloning the aligner will Arc clone the internal index.

Features

The following crate features are available:

  • map-file - Enables the ability to map a file directly to a reference. Enabled by deafult
  • htslib - Provides an interface to minimap2 that returns rust_htslib::Records
  • simde - Enables SIMD Everywhere library in minimap2
  • zlib-ng - Enables the use of zlib-ng for faster compression
  • curl - Enables curl for htslib
  • static - Builds minimap2 as a static library
  • sse2only - Builds minimap2 with only SSE2 support

Map-file is a default feature and enabled unless otherwise specified.

Missing Features

Create an issue if you need any of the following:

Potentially others. Please create an issue!

Building for MUSL

Follow these instructions.

In brief, using bash shell:

docker pull messense/rust-musl-cross:x86_64-musl
alias rust-musl-builder='docker run --rm -it -v "$(pwd)":/home/rust/src messense/rust-musl-cross:x86_64-musl'
rust-musl-builder cargo build --release

Minimap2 is tested on x86_64 and aarch64 (arm64). Other platforms may work, please open an issue if minimap2 compiles but minimap2-rs does not.

Features tested with MUSL

  • htslib - Success
  • simde - Success

Tools using this binding

  • Chopper - Long read trimming and filtering
  • mappy-rs - Drop-in multi-threaded replacement for python's mappy
  • HiFiHLA - HLA star-calling tool for PacBio HiFi data
  • STRdust - Tandem repeat genotyper for long reads
  • oarfish - transcript quantification from long-read RNA-seq data
  • lrge - Long Read-based Genome size Estimation from overlaps

Next things todo

  • Iterator interface for map_file
  • -sys Possible to decouple from pthread?

Citation

Please cite the appropriate minimap2 papers if you use this in your work, as well as this library.

DOI for this library

... coming soon ...

Minimap2 Papers

Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34:3094-3100. [doi:10.1093/bioinformatics/bty191][doi]

and/or:

Li, H. (2021). New strategies to improve minimap2 alignment accuracy. Bioinformatics, 37:4572-4574. [doi:10.1093/bioinformatics/btab705][doi2]

Minimap2 Mapping and Indexing Options

See customization for how to use these.

Mapping Options (MapOpt in rust, alias for mm_mapopt_t)

Field Name Type Description
flag i64 Flags to control mapping behavior (bitwise flags).
seed c_int Random seed for mapping.
sdust_thres c_int Threshold for masking low-complexity regions using SDUST.
max_qlen c_int Maximum query length.
bw c_int Bandwidth for alignment of short reads.
bw_long c_int Bandwidth for alignment of long reads.
max_gap c_int Maximum gap allowed in mapping.
max_gap_ref c_int Maximum gap allowed on the reference.
max_frag_len c_int Maximum fragment length for paired-end reads.
max_chain_skip c_int Maximum number of seeds to skip in chaining.
max_chain_iter c_int Maximum number of chaining iterations.
min_cnt c_int Minimum number of seeds required for a chain.
min_chain_score c_int Minimum score for a chain to be considered.
chain_gap_scale f32 Scaling factor for chain gap penalty.
chain_skip_scale f32 Scaling factor for chain skipping.
rmq_size_cap c_int Size cap for RMQ (Range Minimum Query).
rmq_inner_dist c_int Inner distance for RMQ rescue.
rmq_rescue_size c_int Size threshold for RMQ rescue.
rmq_rescue_ratio f32 Rescue ratio for RMQ.
mask_level f32 Level at which to mask repetitive seeds.
mask_len c_int Length of sequences to mask.
pri_ratio f32 Ratio threshold for primary alignment selection.
best_n c_int Maximum number of best alignments to retain.
alt_drop f32 Score drop ratio for alternative mappings.
a c_int Match score.
b c_int Mismatch penalty.
q c_int Gap open penalty.
e c_int Gap extension penalty.
q2 c_int Gap open penalty for long gaps.
e2 c_int Gap extension penalty for long gaps.
transition c_int Penalty for transitions in spliced alignment.
sc_ambi c_int Score for ambiguous bases.
noncan c_int Allow non-canonical splicing (boolean flag).
junc_bonus c_int Bonus score for junctions.
zdrop c_int Z-drop score for alignment extension stopping.
zdrop_inv c_int Inverse Z-drop score.
end_bonus c_int Bonus score for aligning to the ends of sequences.
min_dp_max c_int Minimum score to consider a DP alignment valid.
min_ksw_len c_int Minimum length for performing Smith-Waterman alignment.
anchor_ext_len c_int Length for anchor extension.
anchor_ext_shift c_int Shift for anchor extension.
max_clip_ratio f32 Maximum allowed clipping ratio.
rank_min_len c_int Minimum length for rank filtering.
rank_frac f32 Fraction for rank filtering.
pe_ori c_int Expected orientation of paired-end reads.
pe_bonus c_int Bonus score for proper paired-end alignment.
mid_occ_frac f32 Fraction for mid-occurrence filtering.
q_occ_frac f32 Fraction for query occurrence filtering.
min_mid_occ i32 Minimum mid-occurrence threshold.
max_mid_occ i32 Maximum mid-occurrence threshold.
mid_occ i32 Mid-occurrence cutoff value.
max_occ i32 Maximum occurrence cutoff value.
max_max_occ i32 Maximum allowed occurrence value.
occ_dist i32 Distribution of occurrences for filtering.
mini_batch_size i64 Size of mini-batches for processing.
max_sw_mat i64 Maximum size of Smith-Waterman matrices.
cap_kalloc i64 Memory allocation cap for kalloc.
split_prefix *const c_char Prefix for splitting output files.

Mapping Flags (MM_F_*)

These can be set with helper functions. Please see the docs for IdxOpt and MapOpt.

Flag Constant Value Description
MM_F_NO_DIAG 1 Skip seed pairs on the same diagonal.
MM_F_NO_DUAL 2 Do not compute reverse complement of seeds.
MM_F_CIGAR 4 Compute CIGAR string.
MM_F_OUT_SAM 8 Output alignments in SAM format.
MM_F_NO_QUAL 16 Do not output base quality in SAM.
MM_F_OUT_CG 32 Output CIGAR in CG format (Compact CIGAR).
MM_F_OUT_CS 64 Output cs tag (difference string) in SAM/PAF.
MM_F_SPLICE 128 Enable spliced alignment (for RNA-seq).
MM_F_SPLICE_FOR 256 Only consider the forward strand for spliced alignment.
MM_F_SPLICE_REV 512 Only consider the reverse strand for spliced alignment.
MM_F_NO_LJOIN 1024 Disable long join for gapped alignment.
MM_F_OUT_CS_LONG 2048 Output cs tag in long format.
MM_F_SR 4096 Perform split read alignment (for short reads).
MM_F_FRAG_MODE 8192 Fragment mode for paired-end reads.
MM_F_NO_PRINT_2ND 16384 Do not output secondary alignments.
MM_F_2_IO_THREADS 32768 Use two I/O threads during mapping.
MM_F_LONG_CIGAR 65536 Use long CIGAR (>65535 operations).
MM_F_INDEPEND_SEG 131072 Map segments independently in multiple mapping.
MM_F_SPLICE_FLANK 262144 Add flanking bases for spliced alignment.
MM_F_SOFTCLIP 524288 Perform soft clipping at ends.
MM_F_FOR_ONLY 1048576 Only map the forward strand of the query.
MM_F_REV_ONLY 2097152 Only map the reverse complement of the query.
MM_F_HEAP_SORT 4194304 Use heap sort for mapping.
MM_F_ALL_CHAINS 8388608 Output all chains (may include suboptimal chains).
MM_F_OUT_MD 16777216 Output MD tag in SAM.
MM_F_COPY_COMMENT 33554432 Copy comment from FASTA/Q to SAM output.
MM_F_EQX 67108864 Use =/X instead of M in CIGAR.
MM_F_PAF_NO_HIT 134217728 Output unmapped reads in PAF format.
MM_F_NO_END_FLT 268435456 Disable end flanking region filtering.
MM_F_HARD_MLEVEL 536870912 Hard mask low-complexity regions.
MM_F_SAM_HIT_ONLY 1073741824 Output only alignments in SAM (no headers).
MM_F_RMQ 2147483648 Use RMQ for read mapping quality estimation.
MM_F_QSTRAND 4294967296 Consider query strand in mapping.
MM_F_NO_INV 8589934592 Disable inversion in alignment.
MM_F_NO_HASH_NAME 17179869184 Do not hash read names (for reproducibility).
MM_F_SPLICE_OLD 34359738368 Use old splice alignment model.
MM_F_SECONDARY_SEQ 68719476736 Output sequence of secondary alignments.
MM_F_OUT_DS 137438953472 Output detailed alignment score (ds tag).

Index Options (IdxOpt in rust, alias for mm_idxopt_t)

Field Name Type Description
k c_short K-mer size (mer length).
w c_short Minimizer window size.
flag c_short Flags to control indexing behavior (bitwise flags).
bucket_bits c_short Number of bits for the size of hash table buckets.
mini_batch_size i64 Size of mini-batches for indexing (number of bases).
batch_size u64 Total batch size for indexing (number of bases).

Indexing Flags (MM_I_*)

These can be set with helper functions. Please see the docs for IdxOpt and MapOpt.

Flag Constant Value Description
MM_I_HPC 1 Use homopolymer-compressed k-mers for indexing.
MM_I_NO_SEQ 2 Do not store sequences in the index.
MM_I_NO_NAME 4 Do not store sequence names in the index.

Changelog

See CHANGELOG.md

Funding

Genomics Aotearoa