Skip to content

Constructing the MSBWT

Matt Holt edited this page Jun 19, 2015 · 1 revision

Column-wise Construction

Generally speaking, this construction method follows the approach described by Bauer et al. in "Lightweight BWT construction for very large string collections". This method is a "column-wise" approach that requires all strings (i.e. reads) to be of uniform length. We provide methods that all the BWT to be built in both uncompressed and compressed formats. Currently, we provide command line options for this method when the input strings are in FASTQ format. We also provide API methods for construction from FASTA, BAM, or Python strings.

Recommended parameters:

  • Input format: FASTQ
  • Input type: Illumina; short, uniform length reads
  • Output type: Compressed BWT (Run-length encoded); significantly reduced disk I/O

Usage:

msbwt cffq --uniform --compressed /path/to/output/directory data/fq1.fq data/fq2.fq ...

Merge-based Construction

This construction method follows the algorithm described by Holt and McMillan in "Merging of Multi-String BWTs with Applications". This algorithm performs a divide-and-conquer of the reads by building many small MSBWTs and then repeatedly merging them together with a merge algorithm. Currently, this method build only into the uncompressed BWT format. Currently, we provide command line options for this method when the input strings are in FASTQ format. We also provide API methods for construction from FASTA, BAM, or Python strings.

Recommended parameters:

  • Input format: FASTQ
  • Input type: PacBio; long reads
  • Output type: Uncompressed BWT

Usage:

msbwt cffq /path/to/output/directory data/fq1.fq data/fq2.fq ...