Releases: jkbonfield/io_lib

Io_lib 1.15.0

14 Apr 13:14

Version 1.15.0 (14th April 2023)

Version number bumped to reflect the official status of CRAM 3.1.

Updates:

  • Formally accept CRAM 3.1 as an official standard. Warning removed.
    CRAM 3.0 remains the default for best compatibility; use "-V3.1"
    to select CRAM 3.1.

  • Updated to latest htscodecs. This has a significant speed
    improvement in encoding with fqzcomp (enabled in "-X small" profile).

    Tested on a NovaSeq dataset, encoding from BAM to CRAM was 27% faster.
    Decoding a CRAM with fqzcomp is also around 6% faster.

io_lib-1-14-15

06 Dec 11:38

This is a bug fix release.

Updates:

  • Switched to using GitHub actions for CI.

  • Updated htscodecs submodule to the latest version (1.3.0).
    Also fixed function names used within it.

  • Minor code tidyups to remove compiler warnings.

Bug fixes:

  • MacOS testing fix to cope with sed failing on long lines.

  • Fixed bam_aux_skip B array handling of signed values.

  • Improved detection of unsorted mode, which was causing a drastic
    slow-down when encoding unsorted data.

  • Fixed a CRAM reference multi-threading bug when fetching from a
    fasta file.

  • Fixed bam_aux_f decoding of floats.
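
The two bam_aux fixes above concern how BAM auxiliary values are laid out in memory. As a minimal sketch only (this is not io_lib's code, and the helper names b_elem_size, le_u32, b_array_len and le_float are hypothetical), the C below shows how the byte length of a 'B' array can be derived from the BAM layout of a subtype byte, a little-endian 32-bit count and then the elements, treating signed and unsigned subtypes alike, and how an 'f' value can be decoded without alignment assumptions. It assumes a 4-byte IEEE 754 float.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Width in bytes of one 'B' array element, or 0 for an unknown subtype.
 * Signed (lower-case) and unsigned (upper-case) subtypes share a width. */
static size_t b_elem_size(char subtype) {
    switch (subtype) {
    case 'c': case 'C': return 1;
    case 's': case 'S': return 2;
    case 'i': case 'I': case 'f': return 4;
    default: return 0;
    }
}

/* Read an unaligned little-endian 32-bit unsigned value. */
static uint32_t le_u32(const uint8_t *p) {
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* p points at the subtype byte that follows the 'B' type code.
 * Returns the bytes occupied by subtype + count + elements, or 0 if the
 * subtype is unknown or the array would run past the 'avail' bytes. */
static size_t b_array_len(const uint8_t *p, size_t avail) {
    if (avail < 5) return 0;
    size_t esz = b_elem_size((char)p[0]);
    if (esz == 0) return 0;
    uint32_t n = le_u32(p + 1);
    if (n > (avail - 5) / esz) return 0;
    return 5 + (size_t)n * esz;
}

/* Decode an 'f' value: 4 little-endian bytes holding an IEEE 754 float. */
static float le_float(const uint8_t *p) {
    uint32_t bits = le_u32(p);
    float f;
    assert(sizeof f == sizeof bits);
    memcpy(&f, &bits, sizeof f);   /* avoids aliasing and alignment issues */
    return f;
}
```

A caller skipping over a 'B' tag would advance its pointer by the value returned from b_array_len, and treat 0 as a malformed record.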

io_lib-1-14-14

18 Mar 10:01

Version 1.14.14 (17th March 2021)

This is primarily a bug fix release.

Updates:

  • Bumped htscodecs submodule to 1.0. This is mainly security hardening. It also means that, for the first time, Io_lib and HTSlib share the same code for the CRAM codecs.

  • Cram_filter now copes with CRAM 3.1 and 4.0 files.

  • Added Power support (ppc64le) to CI. (Author: Arumugam)

  • Added int64_t as a HashTable key type.

  • Improved configure script handling of lzma and bzip2, which are now on by default.

  • Improved support for hurd_i386 by defining PATH_MAX (with thanks to Michael Crusoe).

Bug fixes:

  • The CRAM_IO_CUSTOM_BUFFERING code is now enabled correctly. This is required for biobambam2 / libmaus2 integration. (Thanks to German Tischler-Hohle)

  • Fixed a recent bug in the cram_open_by_callbacks function used by Biobambam. (#39. Thanks to German Tischler-Hohle)

  • Fixed cram_codec_decoder2encoder handling of CRAM 4 encodings.

  • Fixed an uninitialised memory access added during 1.14.13 (harmless as it was then immediately replaced again, but it triggered valgrind warnings).

  • Fixed configure --disable-custom-buffering.

  • Typo fixes, courtesy of Debian lintian.

Staden io_lib 1.14.13

03 Jul 16:21

Version 1.14.13

This release has a mixture of on-going CRAM 4 work (not compatible with previous CRAM 4) and some more general quality of life improvements for all CRAM versions including speed-ups and better multi-threading.

Note that both CRAM 3.1 and 4.0 are still to be considered unofficial CRAM extensions.

Updates:

  • Scramble can now filter-in or filter-out aux tags during transcoding. This is done using -d and -D options. For example:

    scramble -D OQ,BI,BD in.bam out.cram

    removes the GATK-added OQ, BI and BD aux tags.
    Requested by @jhaezebrouck in issue #24.

  • The Scramble -X options are now implemented using a CRAM_OPT_PROFILE option. This simplifies the scramble code and makes it easier to call from a library. This also fixes a number of bugs in the order of argument parsing.

  • Improved CRAM writing speeds.
    The bam_copy function now only copies the number of used bytes rather than the number of allocated bytes, which can sometimes be substantially smaller. As this was done in the main thread it may have a significant benefit when multi-threading.

  • Added libdeflate support into CRAM too (in addition to the existing support in BAM). This isn't a huge change to CRAM speeds except at high levels (-8 and -9), which are now slower but give a better compression ratio. A modest 2-3% speed gain is visible at low and mid levels, and at -1/-2 to -4 the compression ratio is also improved.

  • CRAM 3.1 compression level -1 is now 25% faster, but 4% larger. This is achieved by a different choice of compression codecs, most notably disabling the name tokeniser for level 1. Use level 2 for something comparable to the old behaviour.

  • Added an io_lib/version.h to make it easier to detect the version being compiled against using IOLIB_VERSION macros.
    Requested by German Tischler in issue #25.

  • Refactored the cram encoding interface used by biobambam.
    Implemented by German Tischler in PR#27.

  • CRAM 4 now uses E_CONST instead of a uni-value version of E_HUFFMAN. Also added offset field to VARINT_SIGNED and VARINT_UNSIGNED which helps for data series that have values from -1 to MAXINT.

  • CRAM 4 container structure has changed so that all values are variable sized integers instead of fixed size.

  • Further improvements with CRAM 4's use of signed values.

    • Ref_seq_id in container and slice headers is now signed.
    • RI (ref ID) data series and NS (mate ref ID) are also now signed as -1 is a valid value.
    • Embedded ref id is now 0 for unused instead of -1.

  • Reversed the use of CRAM 4 delta encoding for the B array. It only helps at the moment for ONT signal data, so it needs more work to make it auto-detect when delta makes sense. (Enabling it globally for CRAM4 B aux tags was accidental.)

  • Htscodecs submodule has gained support for big-endian platforms.
    Other parts of CRAM 4 have big-endian improvements too.
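
The offset field added to VARINT_SIGNED and VARINT_UNSIGNED can be illustrated with a small C sketch. It assumes the stored number is value + offset, so a data series spanning -1 to MAXINT becomes non-negative with an offset of 1. The byte layout here is a generic 7-bit, least-significant-group-first varint chosen for illustration; it is not claimed to be the CRAM 4 wire format, and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write v as 7-bit groups, least significant first, with the top bit of
 * each byte set when more bytes follow. Returns bytes written (max 10). */
static size_t uvarint_put(uint8_t *out, uint64_t v) {
    size_t n = 0;
    while (v >= 0x80) {
        out[n++] = (uint8_t)((v & 0x7f) | 0x80);
        v >>= 7;
    }
    out[n++] = (uint8_t)v;
    return n;
}

/* Read a value written by uvarint_put. Returns bytes consumed, or 0 if
 * the input is truncated or too long. */
static size_t uvarint_get(const uint8_t *in, size_t avail, uint64_t *v) {
    uint64_t r = 0;
    unsigned shift = 0;
    for (size_t n = 0; n < avail && shift < 64; n++, shift += 7) {
        r |= (uint64_t)(in[n] & 0x7f) << shift;
        if (!(in[n] & 0x80)) { *v = r; return n + 1; }
    }
    return 0;
}

/* Store value + offset as an unsigned varint (assumed direction). */
static size_t offset_put(uint8_t *out, int32_t value, int32_t offset) {
    int64_t shifted = (int64_t)value + offset;  /* 64-bit sum cannot overflow */
    assert(shifted >= 0);
    return uvarint_put(out, (uint64_t)shifted);
}

/* Inverse of offset_put. Returns bytes consumed, or 0 on error. */
static size_t offset_get(const uint8_t *in, size_t avail,
                         int32_t offset, int32_t *value) {
    uint64_t u;
    size_t n = uvarint_get(in, avail, &u);
    if (n == 0 || u > (uint64_t)INT32_MAX + 1) return 0;
    *value = (int32_t)((int64_t)u - offset);
    return n;
}
```

With an offset of 1, the common value -1 costs a single zero byte, which is the benefit the offset field is described as providing for such data series.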

Bug fixes:

  • Fixed CRAM MD tag generation when using the "b" feature code. (NB: unused by known CRAM encoders).
    Also see samtools/htslib#1086 for more details.

  • Fixed CRAM quality string when using "q" feature code (unused by encoders?) and in lossy-quality mode (maybe utilised in old Cramtools).
    Also see samtools/htslib#1094 for more details.

  • Fixed some minor memory leaks.

  • "Scramble -X archive -1" enabled lzma, which should only have arrived at level 7 and above. (It compared integer 7 vs ASCII '1'.)

  • Removed minor compilation warning in printf debugging.

  • Fixed a 7-year-old bug in scram_pileup which couldn't cope with soft-clips being followed by hard-clips.
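
The "-X archive -1" bug above is a classic digit-versus-character mix-up. The C sketch below reproduces that class of error in isolation; it is not scramble's actual option-parsing code, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Buggy form: level_char is the option character, so '1' is 49 in ASCII
 * and the comparison against integer 7 is true for every digit. */
static bool use_lzma_buggy(char level_char) {
    return level_char >= 7;
}

/* Fixed form: convert the digit character to its numeric value first. */
static bool use_lzma_fixed(char level_char) {
    assert(level_char >= '0' && level_char <= '9');
    return (level_char - '0') >= 7;
}
```

Comparing against the character '7' instead would work equally well; converting to an integer first simply keeps later arithmetic on the level straightforward.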

Staden io_lib 1.14.12

31 Jan 17:24

This is primarily a change to CRAM, focusing mainly on the unofficial
CRAM 3.1 and 4.0 file formats. Note these newer experimental formats
are INCOMPATIBLE with the 1.14.11 output!

Some changes also affect CRAM 3.0 (current) though. Main updates are:

  • Added compression profiles to scramble: fast, normal (default),
    small and archive. Specify using scramble -X profile-name. These
    change the compression codecs permitted as well as the granularity
    of random access (the "fast" profile uses blocks 1/10th the size
    of those in normal).

  • NM and MD tags are now checked during encode to validate
    auto-generation during decode. If they differ they are stored
    verbatim.

  • CRAM behaves better when many small chromosomes occur in the middle
    of larger ones (as it can switch out of multi-ref mode again).

  • Numerous improvements to CRAM 4.0 compression ratios.

  • Some speed improvements to CRAM 3.1 and 4.0 decoding.

  • Fixes to github issues/bugs #12, #14, #15, #17, #18, #19, #20, #21, #22.

See CHANGES for more details.

Staden io_lib 1.14.11

16 Oct 09:09

Updates:

  • CRAM: http(s) queries now honour redirects.
    The User-Agent header is also set, which is necessary in some
    proxies.

Bug fixes:

  • CRAM: fix to major range query bug introduced in 1.14.10.

  • CRAM: more bug fixing on range queries when multi-threading (EOF
    detection).

  • The test harness now works correctly in bourne shell, without
    using bashisms.

Staden io_lib 1.14.10

27 Sep 11:29
Pre-release

WARNING: some bugs have been found in 1.14.10. Use for evaluation of CRAM 4 only, while we track these down. A 1.14.11 will be available once we've fixed the problems. Apologies for the inconvenience.

Updates

  • BAM: Libdeflate support (https://github.com/ebiggers/libdeflate).
    This library is significantly faster than zlib, so it is a good
    alternative to the Cloudflare and/or Intel libraries.

    Configure using --with-libdeflate=/dir/to/deflate/install

  • CRAM EXPERIMENTAL: Added custom quality and identifier codecs.
    Also added the ability to use libbsc as a general purpose codec.

    These are NOT OFFICIAL and so not enabled by default (version 3.0).
    However as a technology demonstration only, they are available with
    scramble -V3.1 or -V4.0 for evaluation and to promote discussion on
    future CRAM formats. Do not use these on production data.

    Implementations of the codecs and CRAM version 4.0 layout are liable
    to change without prior warning.

  • CRAM: name sorted files now automatically switch to non-ref mode.

Bug fixes

  • CRAM: Considerable fixes to multi-threading.

    • Using more than 1 slice per container with threading now works.
    • Removal of race conditions when using CRAM_OPT_REQUIRED_FIELDS.
    • Combinations of ref and no-ref mode in adjacent containers.
    • Other misc. threading bugs.

  • Corrected end-of-range check in some scenarios.

  • CRAM: bug fix to index creation when a slice contains exactly one
    alignment.

  • SAM: fixed parsing of illegal sequence characters (eg "Z").
    These are now treated as "N" and not "=".

  • BAM/SAM: protect against out of bound CIGAR operations.

  • CRAM: hardening of rANS codec against malicious input.
    Also fixed a very rare frequency renormalisation case.

  • CRAM: fix with range queries used in conjunction with turning off
    sequence retrieval (via CRAM_OPT_REQUIRED_FIELDS).

  • Improved test harness for Windows and some header file problems.

  • Fixed bgzip on big endian systems. (Debian bugs 876839, 876840)
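
Two of the fixes above, the handling of illegal sequence characters and the CIGAR protection, can be sketched in C against the BAM encodings. This is an illustration rather than io_lib's implementation: the function names are hypothetical, folding lower case to upper case is an assumption, and "out of bound" is assumed here to mean operation codes above 8.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <string.h>

/* Map a SAM sequence character to its BAM 4-bit code. Anything outside
 * the "=ACMGRSVTWYHKDBN" alphabet (eg "Z") becomes 15 (N), not 0 (=). */
static int base_to_nibble(char c) {
    static const char codes[] = "=ACMGRSVTWYHKDBN";
    if (c == '\0') return 15;  /* strchr would otherwise match the terminator */
    const char *hit = strchr(codes, toupper((unsigned char)c));
    int code = hit ? (int)(hit - codes) : 15;
    assert(code >= 0 && code <= 15);
    return code;
}

/* Return the operation character for a packed BAM CIGAR element
 * (length << 4 | op), or -1 if the op code is outside "MIDNSHP=X". */
static int cigar_op_char(uint32_t cig) {
    static const char ops[] = "MIDNSHP=X";
    unsigned op = cig & 0xf;
    return op < sizeof ops - 1 ? ops[op] : -1;
}
```

Checking the op code before using it as a table index is what keeps a corrupt or hostile record from reading past the end of the lookup table.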

Staden Io_lib v1.14.9

10 Feb 12:32

Version 1.14.9 (9th February 2017)

Updates:

  • BAM: Added CRC checking. Bizarrely this was absent here and in most
    other BAM implementations too. Pure BAM decode of an uncompressed
    BAM is around 9% slower and compressed BAM to compressed BAM is
    almost identical. The most significant hit is reading uncompressed
    BAM (and doing nothing else) which is 120% slower as CRC dominates.
    Options are available to disable the CRC checking in case this is an
    issue (scramble -!).
  • CRAM: Now supports bgzipped fasta references.
  • CRAM/SAM: Headers are now kept in the same basic type order while
    transcoding. (Eg all @PG before all @SQ, or vice versa, depending on
    input ordering.)
  • CRAM: Compression level 1 is now faster but larger. (The old -1 and
    -2 were too similar.)
  • CRAM: Improved compression efficiency in some files, when switching
    from sorted to unsorted data.
  • CRAM: Various speedups relating to memory handling,
    multi-threaded performance and the rANS codec.
  • CRAM: Block CRC checks are now only done when the block is used,
    speeding up multi-threading and tools that do not decode all blocks
    (eg flagstat).
  • Scramble -g and -G options to generate and reuse bgzip indices when
    reading and writing BAM files.
  • Scramble -q option to omit updating the @PG header records.
  • Experimental cram_filter tool has been added, to rapidly produce
    cram subsets.
  • Migrated code base to git. Use github for primary repository.
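
For context on the BAM CRC checking item above: each BGZF block ends with a little-endian CRC-32 of the uncompressed data followed by its length (ISIZE). The C below is a minimal sketch of that check, not io_lib's implementation. The function names are hypothetical, and the bit-at-a-time CRC is used only to keep the example self-contained; production code would more likely call a table-driven or hardware-assisted routine such as zlib's crc32().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 using the reflected polynomial 0xEDB88320 (as in gzip). */
static uint32_t crc32_bitwise(const uint8_t *data, size_t len) {
    assert(data != NULL || len == 0);
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Read an unaligned little-endian 32-bit unsigned value. */
static uint32_t le_u32(const uint8_t *p) {
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Verify uncompressed block data against the 8-byte trailer:
 * CRC32 (4 bytes) followed by ISIZE (4 bytes), both little-endian. */
static bool bgzf_payload_ok(const uint8_t *data, size_t len,
                            const uint8_t trailer[8]) {
    return le_u32(trailer) == crc32_bitwise(data, len) &&
           le_u32(trailer + 4) == (uint32_t)len;
}
```

Because the checksum has to touch every uncompressed byte, it is plausible that it dominates when a tool reads uncompressed BAM and does little else, as the timings in the notes describe.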

Bug fixes:

  • BAM: Fixed the bin value calculation for placed but unmapped reads.
  • CRAM: Fixed file descriptor leak in refs_load_fai().
  • CRAM: Fixed a crash in MD5 calculation for sequences beyond the
    reference end.
  • CRAM: Bug fixes when encoding malformed @SQ records.
  • CRAM: Fixed a rare renormalisation bug in rANS codec.
  • Fixed tests so make -j worked.
  • Removed ancient, broken and unused popen() code.
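
The bin calculation fix for placed but unmapped reads relates to the BAM indexing scheme. The notes do not say what the incorrect value was, so the C below only sketches the expected behaviour: reg2bin follows the function given in the SAM specification, and a read that consumes no reference bases is treated as having length one. The wrapper name bam_bin and its parameters are hypothetical.

```c
#include <assert.h>

/* Bin number for the zero-based, half-open interval [beg, end),
 * following reg2bin from the SAM specification. */
static int reg2bin(int beg, int end) {
    --end;
    if (beg >> 14 == end >> 14) return ((1 << 15) - 1) / 7 + (beg >> 14);
    if (beg >> 17 == end >> 17) return ((1 << 12) - 1) / 7 + (beg >> 17);
    if (beg >> 20 == end >> 20) return ((1 << 9) - 1) / 7 + (beg >> 20);
    if (beg >> 23 == end >> 23) return ((1 << 6) - 1) / 7 + (beg >> 23);
    if (beg >> 26 == end >> 26) return ((1 << 3) - 1) / 7 + (beg >> 26);
    return 0;
}

/* pos is the zero-based leftmost position; ref_len is the number of
 * reference bases the alignment consumes (0 for an unmapped read that
 * has been placed at its mate's position). */
static int bam_bin(int pos, int ref_len) {
    assert(pos >= 0 && ref_len >= 0);
    return reg2bin(pos, pos + (ref_len > 0 ? ref_len : 1));
}
```

A placed but unmapped read therefore lands in the smallest (16 kb) bin covering its position, the same bin a one-base alignment at that position would use.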