-
Notifications
You must be signed in to change notification settings - Fork 7
Constructing the MSBWT
Generally speaking, this construction method follows the approach described by Bauer et al. in "Lightweight BWT construction for very large string collections". This method is a "column-wise" approach that requires all strings (i.e. reads) to be of uniform length. We provide methods that all the BWT to be built in both uncompressed and compressed formats. Currently, we provide command line options for this method when the input strings are in FASTQ format. We also provide API methods for construction from FASTA, BAM, or Python strings.
Recommended parameters:
- Input format: FASTQ
- Input type: Illumina; short, uniform length reads
- Output type: Compressed BWT (Run-length encoded); significantly reduced disk I/O
Usage:
msbwt cffq --uniform --compressed /path/to/output/directory data/fq1.fq data/fq2.fq ...
This construction method follows the algorithm described by Holt and McMillan in "Merging of Multi-String BWTs with Applications". This algorithm performs a divide-and-conquer of the reads by building many small MSBWTs and then repeatedly merging them together with a merge algorithm. Currently, this method build only into the uncompressed BWT format. Currently, we provide command line options for this method when the input strings are in FASTQ format. We also provide API methods for construction from FASTA, BAM, or Python strings.
Recommended parameters:
- Input format: FASTQ
- Input type: PacBio; long reads
- Output type: Uncompressed BWT
Usage:
msbwt cffq /path/to/output/directory data/fq1.fq data/fq2.fq ...
Holt, J., & McMillan, L. (2014). Merging of multi-string BWTs with applications. Bioinformatics, btu584.
Holt, J., & McMillan, L. (2014, September). Constructing burrows-wheeler transforms of large string collections via merging. In Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (pp. 464-471). ACM.
James Holt - [email protected]