Releases: jkbonfield/io_lib
Io_lib 1.15.0
Version 1.15.0 (14th April 2023)
Version number bumped to reflect the official status of CRAM 3.1.
Updates:
- Formally accept CRAM 3.1 as an official standard. Warning removed.
  For best compatibility CRAM 3.0 is still the default CRAM, but use
  "-V3.1" to specify the version.
- Updated to latest htscodecs. This has a significant speed
  improvement in encoding with fqzcomp (enabled in "-X small" profile).
  Tested on a NovaSeq dataset, encoding from BAM to CRAM was 27% faster.
  Decoding a CRAM with fqzcomp is also around 6% faster.
io_lib-1-14-15
This is a bug fix release.
Updates:
- Switched to using GitHub Actions for CI.
- Updated htscodecs submodule to the latest version (1.3.0).
  Also fixed function names used within it.
- Minor code tidyups to remove compiler warnings.
Bug fixes:
- macOS testing fix to cope with sed failing on long lines.
- Fixed bam_aux_skip B array handling of signed values.
- Improved detection of unsorted mode, which was causing a drastic
  slow-down when encoding unsorted data.
- Fixed a CRAM reference multi-threading bug when fetching from a
  fasta file.
- Fixed bam_aux_f decoding of floats.
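The release notes do not describe the exact defects behind the two aux-tag fixes above, so the following is only a minimal sketch of the BAM layout involved, not io_lib's code. A "B" array is stored as a one-byte subtype, a 32-bit little-endian element count, then the elements; the lower-case subtypes (c, s, i) are the signed variants and must be sized like their unsigned counterparts. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Size in bytes of one element of a BAM 'B' array, or 0 if the subtype
 * is unknown. Lower-case codes are the signed variants. */
size_t b_elem_size(char subtype) {
    switch (subtype) {
    case 'c': case 'C': return 1;   /* int8_t  / uint8_t  */
    case 's': case 'S': return 2;   /* int16_t / uint16_t */
    case 'i': case 'I': return 4;   /* int32_t / uint32_t */
    case 'f':           return 4;   /* 32-bit float       */
    default:            return 0;
    }
}

/* Hypothetical helper: p points at the subtype byte that follows the 'B'
 * type code. Returns the bytes occupied by subtype + count + elements,
 * or 0 if the data is malformed or does not fit in 'avail' bytes. */
size_t b_array_size(const uint8_t *p, size_t avail) {
    if (avail < 5) return 0;
    size_t esz = b_elem_size((char)p[0]);
    if (esz == 0) return 0;
    uint32_t n = (uint32_t)p[1] | ((uint32_t)p[2] << 8) |
                 ((uint32_t)p[3] << 16) | ((uint32_t)p[4] << 24);
    uint64_t total = 5 + (uint64_t)n * esz;
    return total <= avail ? (size_t)total : 0;
}

/* Decode a little-endian IEEE-754 single-precision value without
 * relying on pointer casts or host byte order. */
float le_to_float(const uint8_t *p) {
    uint32_t u = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                 ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}
```

Skipping a tag then amounts to advancing the read pointer by the value returned from b_array_size, treating 0 as an error.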
io_lib-1-14-14
Version 1.14.14 (17th March 2021)
This is primarily a bug fix release.
Updates:
- Bumped htscodecs submodule to 1.0. This is mainly security hardening. It also means that, for the first time, Io_lib and HTSlib share the same code for the CRAM codecs.
- Cram_filter now copes with CRAM 3.1 and 4.0 files.
- Added Power support (ppc64le) to CI. (Author: Arumugam)
- Added int64_t as a HashTable key type.
- Improved configure script handling of lzma and bzip2, which are now on by default.
- Improved support for hurd_i386 by defining PATH_MAX (with thanks to Michael Crusoe).
Bug fixes:
- The CRAM_IO_CUSTOM_BUFFERING code is now enabled correctly. This is required for biobambam2 / libmaus2 integration. (Thanks to German Tischler-Hohle)
- Fixed a recent bug in the cram_open_by_callbacks function used by Biobambam. (#39. Thanks to German Tischler-Hohle)
- Fixed cram_codec_decoder2encoder handling of CRAM 4 encodings.
- Fixed an uninitialised memory access added during 1.14.13 (harmless as it was then immediately replaced again, but it triggered valgrind warnings).
- Fixed configure --disable-custom-buffering.
- Typo fixes, courtesy of Debian lintian.
Staden io_lib 1.14.13
Version 1.14.13
This release has a mixture of on-going CRAM 4 work (not compatible with previous CRAM 4) and some more general quality of life improvements for all CRAM versions including speed-ups and better multi-threading.
Note both CRAM 3.1 and 4.0 are still to be considered unofficial CRAM extensions.
Updates:
- Scramble can now filter-in or filter-out aux tags during transcoding. This is done using -d and -D options. For example:
      scramble -D OQ,BI,BD in.bam out.cram
  removes the GATK added OQ, BI and BD aux tags.
  Requested by @jhaezebrouck in issue #24.
- The Scramble -X options are now implemented using a CRAM_OPT_PROFILE option. This simplifies the scramble code and makes it easier to call from a library. This also fixes a number of bugs in the order of argument parsing.
- Improved CRAM writing speeds.
  The bam_copy function now copies only the used bytes rather than all allocated bytes; the used size can be substantially smaller. As this was done in the main thread it may have a significant benefit when multi-threading.
- Added libdeflate support into CRAM too (in addition to the existing support in BAM). This isn't a huge change to CRAM speeds except at high levels (-8 and -9), which are now slower but give a better compression ratio. A modest 2-3% speed gain is visible at low and mid levels, and at -1/-2 to -4 the compression ratio is also improved.
- CRAM 3.1 compression level -1 is now 25% faster, but 4% larger. This is achieved by a different choice of compression codecs, most notably disabling the name tokeniser for level 1. Use level 2 for something comparable to the old behaviour.
- Added an io_lib/version.h to make it easier to detect the version being compiled against using IOLIB_VERSION macros.
  Requested by German Tischler in issue #25.
- Refactored the cram encoding interface used by biobambam.
  Implemented by German Tischler in PR#27.
- CRAM 4 now uses E_CONST instead of a uni-value version of E_HUFFMAN. Also added an offset field to VARINT_SIGNED and VARINT_UNSIGNED, which helps for data series that have values from -1 to MAXINT.
- CRAM 4 container structure has changed so that all values are variable sized integers instead of fixed size.
- Further improvements with CRAM 4's use of signed values.
  - Ref_seq_id in container and slice headers is now signed.
  - RI (ref ID) data series and NS (mate ref ID) are also now signed as -1 is a valid value.
  - Embedded ref id is now 0 for unused instead of -1.
- Reverted the use of CRAM 4 delta encoding for the B array. It only helps at the moment for ONT signal data, so it needs more work to
  make it auto-detect when delta makes sense. (Enabling it globally for CRAM4 B aux tags was accidental.)
- Htscodecs submodule has gained support for big-endian platforms.
  Other big-endian improvements to parts of CRAM4 too.
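To illustrate why an offset helps a data series that ranges from -1 to MAXINT, here is a minimal sketch using a generic 7-bits-per-byte unsigned varint. This is an assumption made for illustration and is not the CRAM 4 byte layout; the function names are hypothetical. Without an offset, -1 cast to unsigned costs the maximum number of bytes. With an offset of 1 it becomes 0 and fits in a single byte, while every other value only shifts up by one.

```c
#include <stddef.h>
#include <stdint.h>

/* Generic 7-bits-per-byte unsigned varint writer (illustrative only,
 * not the CRAM 4 wire format). Returns the number of bytes written;
 * 'out' must have room for at least 10 bytes. */
size_t put_uvarint(uint8_t *out, uint64_t v) {
    size_t n = 0;
    while (v >= 0x80) {
        out[n++] = (uint8_t)(v | 0x80);  /* low 7 bits plus continuation bit */
        v >>= 7;
    }
    out[n++] = (uint8_t)v;               /* final byte, continuation bit clear */
    return n;
}

/* Unsigned varint with an offset added before encoding. The addition is
 * done in unsigned arithmetic so it cannot overflow a signed type. */
size_t put_uvarint_offset(uint8_t *out, int64_t v, int64_t offset) {
    return put_uvarint(out, (uint64_t)v + (uint64_t)offset);
}
```

A decoder for this scheme would read the unsigned value and subtract the same offset to recover the original.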
Bug fixes:
- Fixed CRAM MD tag generation when using the "b" feature code. (NB: unused by known CRAM encoders).
  Also see samtools/htslib#1086 for more details.
- Fixed CRAM quality string when using "q" feature code (unused by encoders?) and in lossy-quality mode (maybe utilised in old Cramtools).
  Also see samtools/htslib#1094 for more details.
- Fixed some minor memory leaks.
- "Scramble -X archive -1" enabled lzma, which should only have arrived at level 7 and above. (It compared integer 7 vs ASCII '1'.)
- Removed minor compilation warning in printf debugging.
- Fixed a 7 year old bug in scram_pileup which couldn't cope with soft-clips being followed by hard-clips.
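The "-X archive -1" fix above is an instance of a common C pitfall: comparing a digit character with an integer. The sketch below is not the scramble source and the function names are hypothetical; it only shows why the character '1' (ASCII 49) passes a ">= 7" test, and one way to avoid that.

```c
#include <stdbool.h>

/* Buggy form: the argument is still the option character ('0'..'9'),
 * so every digit compares as >= 7 because '0' is ASCII 48. */
bool use_lzma_buggy(char level_char) {
    return level_char >= 7;
}

/* Fixed form: convert the digit to its numeric value before comparing. */
bool use_lzma_fixed(char level_char) {
    int level = level_char - '0';
    return level >= 7;
}
```

With these definitions, use_lzma_buggy('1') is true while use_lzma_fixed('1') is false and use_lzma_fixed('7') is true.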
Staden io_lib 1.14.12
This is primarily a change to CRAM, focusing mainly on the unofficial
CRAM 3.1 and 4.0 file formats. Note these newer experimental formats
are INCOMPATIBLE with the 1.14.11 output!
Some changes also affect CRAM 3.0 (current) though. Main updates are:
- Added compression profiles to scramble: fast, normal (default),
  small and archive. Specify using scramble -X profile-name. These
  change compression codecs permitted as well as the granularity of
  random access ("fast" profile is 1/10th the size per block than
  normal).
- NM and MD tags are now checked during encode to validate
  auto-generation during decode. If they differ they are stored
  verbatim.
- CRAM behaves better when many small chromosomes occur in the middle
  of larger ones (as it can switch out of multi-ref mode again).
- Numerous improvements to CRAM 4.0 compression ratios.
- Some speed improvements to CRAM 3.1 and 4.0 decoding.
- Fixes to GitHub issues/bugs #12, #14, #15, #17, #18, #19, #20, #21, #22.
See CHANGES for more details.
Staden io_lib 1.14.11
Updates:
- CRAM: http(s) queries now honour redirects.
The User-Agent header is also set, which is necessary in some
proxies.
Bug fixes:
- CRAM: fix to major range query bug introduced in 1.14.10.
- CRAM: more bug fixing on range queries when multi-threading (EOF
  detection).
- The test harness now works correctly in Bourne shell, without
  using bashisms.
Staden io_lib 1.14.10
WARNING: some bugs have been found in 1.14.10. Use it for evaluation of CRAM 4 only while we track these down. A 1.14.11 release will be available once we've fixed the problems. Apologies for the inconvenience.
Updates
- BAM: Libdeflate support (https://github.com/ebiggers/libdeflate).
  This library is significantly faster than zlib, so it is a good
  alternative to the Cloudflare and/or Intel libraries.
  Configure using --with-libdeflate=/dir/to/deflate/install
- CRAM EXPERIMENTAL: Added custom quality and identifier codecs.
  Also added the ability to use libbsc as a general purpose codec.
  These are NOT OFFICIAL and so not enabled by default (version 3.0).
  However as a technology demonstration only, they are available with
  scramble -V3.1 or -V4.0 for evaluation and to promote discussion on
  future CRAM formats. Do not use these on production data.
  Implementations of the codecs and CRAM version 4.0 layout are liable
  to change without prior warning.
- CRAM: name sorted files now automatically switch to non-ref mode.
Bug fixes
- CRAM: Considerable fixes to multi-threading.
  - Using more than 1 slice per container with threading now works.
  - Removal of race conditions when using CRAM_OPT_REQUIRED_FIELDS.
  - Combinations of ref and no-ref mode in adjacent containers.
  - Other misc. threading bugs.
- Corrected end-of-range check in some scenarios.
- CRAM: bug fix to index creation when a slice contains exactly one
  alignment.
- SAM: fixed parsing of illegal sequence characters (eg "Z").
  These are now treated as "N" and not "=".
- BAM/SAM: protect against out of bound CIGAR operations.
- CRAM: hardening of rANS codec against malicious input.
  Also fixed a very rare frequency renormalisation case.
- CRAM: fix with range queries used in conjunction with turning off
  sequence retrieval (via CRAM_OPT_REQUIRED_FIELDS).
- Improved test harness for Windows and some header file problems.
- Fixed bgzip on big endian systems. (Debian bugs 876839, 876840)
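The SAM sequence fix above relates to BAM's 4-bit base encoding, in which the codes 0 to 15 stand for "=ACMGRSVTWYHKDBN". One plausible way for an illegal character to end up as "=" is a lookup that falls back to code 0; the notes do not say whether that was the cause here. A minimal sketch of a mapping that falls back to "N" (code 15) instead follows. The function name is hypothetical, and folding lower case to upper case is an assumption of this sketch.

```c
#include <stdint.h>
#include <string.h>

/* BAM 4-bit base codes, indexed 0..15, as given in the SAM/BAM spec. */
static const char NIBBLE_CODES[] = "=ACMGRSVTWYHKDBN";

/* Map one SAM sequence character to its 4-bit code. Anything not in
 * the table (for example 'Z') maps to N (15) rather than 0 ('='). */
uint8_t base_to_nibble(char c) {
    if (c >= 'a' && c <= 'z')
        c = (char)(c - 'a' + 'A');       /* assumed case folding */
    if (c != '\0') {                     /* strchr would match the terminator */
        const char *p = strchr(NIBBLE_CODES, c);
        if (p)
            return (uint8_t)(p - NIBBLE_CODES);
    }
    return 15;
}
```

Two such codes are then packed per byte when the sequence is written in BAM form.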
Staden Io_lib v1.14.9
Version 1.14.9 (9th February 2017)
Updates:
- BAM: Added CRC checking. Bizarrely this was absent here and in most
  other BAM implementations too. Pure BAM decode of an uncompressed
  BAM is around 9% slower and compressed BAM to compressed BAM is
  almost identical. The most significant hit is reading uncompressed
  BAM (and doing nothing else) which is 120% slower as CRC dominates.
  Options are available to disable the CRC checking in case this is an
  issue (scramble -!).
- CRAM: Now supports bgzipped fasta references.
- CRAM/SAM: Headers are now kept in the same basic type order while
  transcoding. (Eg all @PG before all @SQ, or vice versa, depending on
  input ordering.)
- CRAM: Compression level 1 is now faster but larger. (The old -1 and
  -2 were too similar.)
- CRAM: Improved compression efficiency in some files, when switching
  from sorted to unsorted data.
- CRAM: Various speedups relating to memory handling,
  multi-threaded performance and the rANS codec.
- CRAM: Block CRC checks are now only done when the block is used,
  speeding up multi-threading and tools that do not decode all blocks
  (eg flagstat).
- Scramble -g and -G options to generate and reuse bgzip indices when
  reading and writing BAM files.
- Scramble -q option to omit updating the @PG header records.
- Experimental cram_filter tool has been added, to rapidly produce
  cram subsets.
- Migrated code base to git. Use github for primary repository.
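BAM data is held in BGZF blocks, which are gzip members and so end with a little-endian CRC-32 of the uncompressed data. The BAM CRC check described above amounts to recomputing that value and comparing it with the stored one. A minimal sketch follows, using a slow bitwise CRC-32 so that it is self-contained; a real implementation would normally call an optimised routine such as zlib's crc32(). The function names are hypothetical and this is not io_lib's code.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 as used by gzip (reflected polynomial 0xEDB88320). */
uint32_t crc32_bitwise(const uint8_t *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Returns 1 if the CRC of the uncompressed data matches the 4-byte
 * little-endian value stored in the block trailer, else 0. */
int block_crc_ok(const uint8_t *udata, size_t ulen, const uint8_t *trailer) {
    uint32_t stored = (uint32_t)trailer[0] | ((uint32_t)trailer[1] << 8) |
                      ((uint32_t)trailer[2] << 16) | ((uint32_t)trailer[3] << 24);
    return crc32_bitwise(udata, ulen) == stored;
}
```

Because the checksum touches every uncompressed byte, its cost is most visible when little other work is done per block, which is consistent with the timings quoted above.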
Bug fixes:
- BAM: Fixed the bin value calculation for placed but unmapped reads.
- CRAM: Fixed file descriptor leak in refs_load_fai().
- CRAM: Fixed a crash in MD5 calculation for sequences beyond the
  reference end.
- CRAM: Bug fixes when encoding malformed @SQ records.
- CRAM: Fixed a rare renormalisation bug in rANS codec.
- Fixed tests so make -j worked.
- Removed ancient, broken and unused popen() code.
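For context on the bin fix above, the SAM specification defines the BAM bin with a reg2bin function over a zero-based half-open interval, and treats a read that has a position but no alignment length as covering one base. A minimal sketch under those assumptions follows. reg2bin follows the specification; bin_for_read and its parameters are hypothetical and this is not io_lib's implementation.

```c
#include <stdint.h>

/* reg2bin from the SAM specification: bin number for the zero-based
 * half-open interval [beg, end). */
int reg2bin(int64_t beg, int64_t end) {
    --end;
    if (beg >> 14 == end >> 14) return ((1 << 15) - 1) / 7 + (int)(beg >> 14);
    if (beg >> 17 == end >> 17) return ((1 << 12) - 1) / 7 + (int)(beg >> 17);
    if (beg >> 20 == end >> 20) return ((1 << 9) - 1) / 7 + (int)(beg >> 20);
    if (beg >> 23 == end >> 23) return ((1 << 6) - 1) / 7 + (int)(beg >> 23);
    if (beg >> 26 == end >> 26) return ((1 << 3) - 1) / 7 + (int)(beg >> 26);
    return 0;
}

/* Hypothetical wrapper: a placed but unmapped read, or one whose CIGAR
 * consumes no reference bases, is treated as having length 1. */
int bin_for_read(int64_t pos, int64_t ref_len, int unmapped) {
    int64_t end = (unmapped || ref_len < 1) ? pos + 1 : pos + ref_len;
    return reg2bin(pos, end);
}
```

For example, a placed but unmapped read at position 0 falls in the first of the smallest (16 kb) bins.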