Skip to content

jancrichter/bcf-reader

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

56 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Crates.io Crates.io Crates.io docs.rs GitHub Workflow Status

bcf_reader

This is an attempt to create a small, lightweight, pure-Rust library to allow efficient, cross-platform access to genotype data in BCF files.

Currently, the rust_htslib crate works only on Linux and macOS (not Windows?). The noodles crate is a pure Rust library for many bioinformatic file formats and works across Windows, Linux, and macOS. However, the noodles API for reading genotype data from BCF files can be slow due to its memory allocation patterns. Additionally, both crates have a large number of dependencies, as they provide many features and support a wide range of file formats.

One way to address the memory allocation and dependency issues is to manually parse BCF records according to its specification (https://samtools.github.io/hts-specs/VCFv4.2.pdf) and use iterators whenever possible, especially for the per-sample fields, like GT and AD.

Note: This crate is in its early stages of development.

Usage

use bcf_reader::*;
let mut reader = smart_reader("testdata/test2.bcf");
let header = Header::from_string(&read_header(&mut reader));
// find key for a field in INFO or FORMAT or FILTER
let key = header.get_idx_from_dictionary_str("FORMAT", "GT").unwrap();
// access header dictionary
let d = &header.dict_strings()[&key];
assert_eq!(d["ID"], "GT");
assert_eq!(d["Dictionary"], "FORMAT");
/// get chromosome name
assert_eq!(header.get_chrname(0), "Pf3D7_01_v3");
let fmt_ad_key = header
    .get_idx_from_dictionary_str("FORMAT", "AD")
    .expect("FORMAT/AD not found");
let info_af_key = header
    .get_idx_from_dictionary_str("INFO", "AF")
    .expect("INFO/AF not found");

// this can be and should be reused to reduce allocation
let mut record = Record::default();
while let Ok(_) = record.read(&mut reader) {
    let pos = record.pos();

    // use byte ranges and shared buffer to get allele string values
    let allele_byte_ranges = record.alleles();
    let share_buf = record.buf_shared();
    let ref_rng = &allele_byte_ranges[0];
    let ref_allele_str =
        std::str::from_utf8(&share_buf[ref_rng.start..ref_rng.end]).unwrap();
    let alt1_rng = &allele_byte_ranges[1];
    let alt1_allele_str =
        std::str::from_utf8(&share_buf[alt1_rng.start..alt1_rng.end]).unwrap();
    // ...

    // access FORMAT/GT via iterator
    for nv in record.fmt_gt(&header) {
        let (has_no_ploidy, is_missing, is_phased, allele_idx) = nv.gt_val();
        // ...
    }

    // access FORMAT/AD via iterator
    for nv in record.fmt_field(fmt_ad_key) {
        match nv.int_val() {
            None => {}
            Some(ad) => {
                // ...
            }
        }
        // ...
    }

    // access FILTERS via itertor
    record.filters().for_each(|nv| {
        let filter_key = nv.int_val().unwrap() as usize;
        let dict_string_map = &header.dict_strings()[&filter_key];
        let filter_name = &dict_string_map["ID"];
        // ...
    });

    // access INFO/AF via itertor
    record.info_field_numeric(info_af_key).for_each(|nv| {
        let af = nv.float_val().unwrap();
        // ...
    });
}

Releases

No releases published

Packages

No packages published

Languages

  • Rust 100.0%