Skip to content

lh3/ropebwt3

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Getting Started

# Compile
git clone https://github.com/lh3/ropebwt3
cd ropebwt3
make  # use "make omp=0" if your compiler doesn't suport OpenMP

# Toy examples
echo -e 'AGG\nAGC' | ./ropebwt3 build -LR -
echo TGAACTCTACACAACATATTTTGTCACCAAG | ./ropebwt3 build -Ldo idx.fmd -
echo ACTCTACACAAgATATTTTGTCA | ./ropebwt3 mem -Ll10 idx.fmd -

# Download the prebuilt FM-index for 152 M. tuberculosis genomes
wget -O- https://zenodo.org/records/12803206/files/mtb152.tar.gz?download=1 | tar -zxf -

# Count super-maximal exact matches (no contig positions)
echo ACCTACAACACCGGTGGCTACAACGTGG  | ./ropebwt3 mem -L mtb152.fmd -
# Local alignment
echo ACCTACAACACCGGTaGGCTACAACGTGG | ./ropebwt3 sw -Lm20 mtb152.fmd -
# Retrieve R15311, the 46th genome in the collection. 90=(46-1)*2
./ropebwt3 get mtb152.fmd 90 > R15311.fa

# Download the index of 398 human long-read assemblies (16GB download size)
wget -O human398.fmr.gz https://zenodo.org/records/14511009/files/human398.fmr.gz?download=1
wget -O human398.fmd.ssa.gz https://zenodo.org/records/14511009/files/human398.fmd.ssa.gz?download=1
wget -O human398.fmd.len.gz https://zenodo.org/records/14511009/files/human398.fmd.len.gz?download=1
gzip -d human398.fmr.gz human398.fmd.ssa.gz
./ropebwt3 build -i human398.fmr -do human398.fmd   # convert to a faster format

# Find C4 alleles (the query is on the exon 26 of C4A)
echo CCAGGACCCCTGTCCAGTGTTAGACAGGAGCATGCAG | ./ropebwt3 sw -eN200 -Lm10 human398.fmd -

Table of Contents

Introduction

Ropebwt3 constructs the FM-index of a large DNA sequence set and searches for matches against the FM-index. It is optimized for highly redundant sequence sets such as a pangenome or sequence reads at high coverage. Ropebwt3 can losslessly compress 7.3Tb of common bacterial genomes into a 30GB run-length encoded BWT file and report supermaximal exact matches (SMEMs) or local alignments with mismatches and gaps.

Prebuilt ropebwt3 indices can be downloaded from Zenodo.

Usage

A full ropebwt3 index consists of three files:

  • <base>.fmd: run-length encoded BWT that supports the rank operation. It is generated by the build command. By default, the $i$-th sequence in the input is the $2i$-th sequence in the BWT and its reverse complement is the $(2i+1)$-th sequence. Some commands assume such ordering.

  • <base>.fmd.ssa: sampled suffix array, generated by the ssa command. For now, it is only needed for reporting coordinates in the PAF output of the sw command.

  • <base>.fmd.len.gz: list of sequence names and lengths. It is generated with third-party tools/scripts, for example, with seqtk comp input.fa | cut -f1,2 | gzip. This file is needed for reporting sequence names and lengths in the PAF output.

Finding maximal exact matches

A maximal exact match (MEM) is an exact alignment between the index and a query that cannot be extended in either direction. A super MEM (SMEM) is a MEM that is not contained in any other MEM on the query sequence. You can find the SMEMs with

ropebwt3 mem -t4 -l31 bwt.fmd query.fa > matches.bed

In the output, the first three columns give the query sequence name, start and end of a match and the fourth column gives the number of hits. Option -l specifies the minimum SMEM length. A larger value helps performance. This command does not output positions of SMEMs by default. You can use option -p to get the positions of a subset of SMEMs. In addition, you can use --gap to obtain regions not covered by long SMEMs or --cov to get the total length of regions covered by long SMEMs.

Local alignment

Ropebwt3 implements a revised BWA-SW algorithm to align query sequences against an FM-index:

ropebwt3 sw -t4 -N25 -k11 bwt.fmd query.fa > aln.paf

Option -N effectively sets the bandwidth during alignment. A larger value improves alignment accuracy at the cost of performance. Option -k initiates alignments with an exact match.

Given a complete ropebwt3 index with sampled suffix array and sequence names, the sw command outputs standard PAF but it only outputs one hit per query even if there are multiple equally best hits. The number of hits in BWT is written to the rh tag.

Local alignment is tens of times slower than finding SMEMs. It is not designed for aligning high-throughput sequence reads.

Haplotype diversity with end-to-end alignment

With option -e, the sw command aligns the query sequence from end to end. In this mode, ropebwt3 may output multiple suboptimal end-to-end hits. This provides a way to retrieve similar haplotypes from the index.

The hapdiv command applies this algorithm to 101-mers in a query sequence and outputs 1) query name, 2) query start, 3) query end, 4) number of distinct alleles the 101-mer matches, 5) maximum edit distance observed, 6) number of haplotypes with perfectly matching the 101-mer, 7-11) number of haplotypes with edit distance 1-5 from the 101-mer, and 12) with distance 6 or higher.

Indexing

Ropebwt3 implements two algorithms for BWT construction. Although both algorithms work for general sequences, you need to choose an algorithm based on the input date types for the best performance.

# If not sure, use the general command line
ropebwt3 build -t24 -bo bwt.fmr file1.fa file2.fa filen.fa
# You can also append another file to an existing index
ropebwt3 build -t24 -i bwt-old.fmr -bo bwt-new.fmr filex.fa
# If each file is small, concatenate them together
cat file1.fa file2.fa filen.fa | ropebwt3 build -t24 -m2g -bo bwt.fmr -
# For short reads, use the old ropebwt2 algorithm and optionally apply RCLO (option -r)
ropebwt3 build -r -bo bwt.fmr reads.fq.gz
# use grlBWT, which may be faster but uses working disk space
ropebwt3 fa2line genome1.fa genome2.fa genomen.fa > all.txt
grlbwt-cli all.txt -t 32 -T . -o bwt.grl
grl2plain bwt.rl_bwt bwt.txt
ropebwt3 plain2fmd -o bwt.fmd bwt.txt

These command lines construct a BWT for both strands of the input sequences. You can skip the reverse strand by adding option -R. If you provide multiple files on a build command line, ropebwt3 internally will run build on each input file and then incrementally merge each individual BWT to the final BWT.

After BWT construction, you will probably want to generate sampled suffix array with:

ropebwt3 ssa -o index.fmd.ssa -s8 -t32 index.fmd

This stores one suffix array value per $2^8$ positions. The size of the output file is roughly $64\cdot(n/2^s+m)$, where $n$ is the number of symbols in the BWT and $m$ is the number of sequences. Furthermore, if you want to get the contig names with sw, you need to prepare another file:

cat input*.fa.gz | seqtk comp | cut -f1,2 | gzip > index.fmd.len.gz

If the BWT is built from multiple files, make sure the order in cat is the same as the order used for BWT construction.

Binary BWT file formats

Ropebwt3 uses two binary formats to store run-length encoded BWTs: the ropebwt2 FMR format and the fermi FMD format. The FMR format is dynamic in that you can add new sequences or merge BWTs to an existing FMR file. The same BWT does not necessarily lead to the same FMR. The FMD format is simpler in structure, faster to load, smaller in memory and can be memory-mapped. The two formats can often be used interchangeably in ropebwt3, but it is recommended to use FMR for BWT construction and FMD for sequence search. You can explicitly convert between the two formats with:

ropebwt3 build -i in.fmd -bo out.fmr  # from static to dynamic format
ropebwt3 build -i in.fmr -do out.fmd  # from dynamic to static format

Citation

Ropebwt3 is described in

Li (2024) BWT construction and search at the terabase scale, arXiv:2409.00613

Limitations

  • Ropebwt3 is slow on the "locate" operation.