Replace extension based compression detection by crates niffler #10

natir · 2023-08-21T13:26:48Z

Hello,

I'm one of the niffler crate developers and I think you might be interested in this crate.

Niffler lets you open gzip, bzip2, lzma (xz), or zstd compressed files transparently just by calling niffler::get_reader. Format detection is based on the magic number at the beginning of the file, not on the extension (no need to trust the file name).

If you're interested, I can write a pull request.
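
For readers unfamiliar with the idea, here is a minimal sketch of magic-number detection using only the standard library. It is not niffler's implementation: the `Format` enum, `sniff`, and `sniff_reader` are hypothetical names chosen for illustration, and only the xz container is recognized (legacy .lzma files are not handled here).

```rust
use std::io::{self, Read};

/// Compression formats recognized by this sketch (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Format {
    Gzip,
    Bzip2,
    Xz,
    Zstd,
    Uncompressed,
}

/// Classify a buffer by its leading magic bytes.
pub fn sniff(header: &[u8]) -> Format {
    const GZIP: &[u8] = &[0x1f, 0x8b];
    const BZIP2: &[u8] = b"BZh";
    const XZ: &[u8] = &[0xfd, b'7', b'z', b'X', b'Z', 0x00];
    const ZSTD: &[u8] = &[0x28, 0xb5, 0x2f, 0xfd];

    if header.starts_with(GZIP) {
        Format::Gzip
    } else if header.starts_with(BZIP2) {
        Format::Bzip2
    } else if header.starts_with(XZ) {
        Format::Xz
    } else if header.starts_with(ZSTD) {
        Format::Zstd
    } else {
        Format::Uncompressed
    }
}

/// Read up to six bytes from `reader`, classify them, and return a reader
/// that replays those bytes before the rest of the stream.
pub fn sniff_reader<R: Read>(mut reader: R) -> io::Result<(Format, impl Read)> {
    let mut buf = [0u8; 6];
    let mut filled = 0;
    while filled < buf.len() {
        match reader.read(&mut buf[filled..]) {
            Ok(0) => break, // end of input before six bytes were available
            Ok(n) => filled += n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    let format = sniff(&buf[..filled]);
    let head = io::Cursor::new(buf[..filled].to_vec());
    Ok((format, head.chain(reader)))
}
```

The file name never enters the decision, so a gzip file named "data.txt" is still classified as gzip.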


nh13 · 2023-08-22T06:16:11Z

I'd welcome a pull request, but before you write one, I think it makes sense to get a thumbs up from @tfenne.

tfenne · 2023-08-22T16:13:22Z

Yes please! I'm curious how things work on the writer side: do you auto-pick the compression based on the extension, or do you require users to specify it?

tfenne · 2023-08-22T16:13:29Z

And thank you!

natir · 2023-08-23T11:07:48Z

We require users to specify (never trust a filename)

nh13 · 2023-08-24T04:56:13Z

"never trust a filename"

I mostly agree, however, you have to trust them at some point (e.g. specifying the type of compression)? Perhaps if no compression type is given, we fall back on the file extension detection? And if compression is given, we check the file extension against the few known ones so they don't mismatch, but continue on if the file extension is unknown?
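
The writer-side policy proposed above can be sketched roughly as follows. This is only an illustration of the rule, not existing code from this project or from niffler; `Format`, `from_extension`, and `resolve_output_format` are hypothetical names, and the list of known extensions is an assumption.

```rust
use std::path::Path;

/// Output formats considered by this sketch (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Format {
    Gzip,
    Bzip2,
    Xz,
    Zstd,
    Uncompressed,
}

/// Map a file extension to a known format; `None` means the extension is unknown.
fn from_extension(path: &Path) -> Option<Format> {
    match path.extension()?.to_str()? {
        "gz" => Some(Format::Gzip),
        "bz2" => Some(Format::Bzip2),
        "xz" => Some(Format::Xz),
        "zst" => Some(Format::Zstd),
        _ => None,
    }
}

/// Decide the output format from an optional explicit request and the file name.
pub fn resolve_output_format(path: &Path, requested: Option<Format>) -> Result<Format, String> {
    match (requested, from_extension(path)) {
        // Nothing requested: fall back on the extension.
        (None, Some(ext)) => Ok(ext),
        (None, None) => Ok(Format::Uncompressed),
        // Requested format contradicts a known extension: refuse.
        (Some(req), Some(ext)) if req != ext => Err(format!(
            "requested {:?} but {} has a {:?} extension",
            req,
            path.display(),
            ext
        )),
        // Requested format agrees with the extension, or the extension is unknown.
        (Some(req), _) => Ok(req),
    }
}
```

With this rule, `resolve_output_format(Path::new("out.vcf"), Some(Format::Zstd))` succeeds because ".vcf" is not a known compression extension, while the same request for "out.vcf.gz" is rejected.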

This would also be a great time to solve how to specify the compression parameters for a wide variety of compression types (see: #9 (comment)). I see in niffler there are 22 levels, which is needed for zstd, but what is level 22 for zlib?

natir · 2023-08-24T10:56:58Z

The choice I've made in several applications is that if the input is compressed in one format, the output is compressed in the same format, leaving the user free to choose the output format via a parameter. As for the compression level, I've chosen to keep the default compression levels (niffler doesn't detect the compression level used).

If users aren't satisfied with this behavior, they can write the uncompressed output to standard output and pipe it to their preferred compression tool with the parameters they have chosen.

After all, this is a library, not an application, so we don't necessarily need to make this choice right now.

As for compression levels: in niffler, if the requested level is too high for the format, we fall back to the maximum level for the chosen format.
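
As a rough illustration of that fallback (not niffler's actual code), the clamping could look like the sketch below. The per-format maxima are assumptions taken from the reference tools (9 for gzip, bzip2, and xz; 22 for zstd), and `Format`, `max_level`, and `effective_level` are hypothetical names.

```rust
/// Formats that accept a compression level (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Format {
    Gzip,
    Bzip2,
    Xz,
    Zstd,
}

/// Highest level assumed for each format in this sketch.
pub fn max_level(format: Format) -> u32 {
    match format {
        Format::Gzip | Format::Bzip2 | Format::Xz => 9,
        Format::Zstd => 22,
    }
}

/// Clamp a requested level to the maximum the format supports.
pub fn effective_level(format: Format, requested: u32) -> u32 {
    requested.min(max_level(format))
}
```

Under these assumptions, asking for level 22 with gzip yields level 9, which answers the "what is level 22 for zlib?" question for this scheme.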

kockan · 2023-08-29T18:56:04Z

Thanks @natir for the explanation and suggestions! I will keep #9 open for the time being, primarily for reference, but I'm happy to close it and replace it with your PR that uses niffler.

My personal opinion on the reader/writer side of things would be as follows:

Reader: auto-detect input file compression via niffler but also check if the extension matches any of the "accepted" extensions for that file format, and at least give a warning otherwise (e.g. filename is "test.vcf.gz" but somehow it's an lzma compressed file)
Writer: the default behavior is to use the same compression format detected from the input, as you've mentioned. Once again, an extension check might be useful, so that we don't detect gzip on a file with a ".zst" extension and then try to write a ".zst.gz" file. Unlikely, but just in case perhaps... As for levels, the current mechanism used by niffler seems reasonable to me.
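
A minimal sketch of both checks, assuming the format has already been detected from the file contents. None of these names exist in niffler or in this project; `Format`, `extension_warning`, and `output_path` are hypothetical. The sketch stays silent when the extension is not a known compression extension.

```rust
use std::path::{Path, PathBuf};

/// Detected compression formats (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Format {
    Gzip,
    Bzip2,
    Xz,
    Zstd,
}

impl Format {
    /// Conventional extension for the format.
    fn extension(self) -> &'static str {
        match self {
            Format::Gzip => "gz",
            Format::Bzip2 => "bz2",
            Format::Xz => "xz",
            Format::Zstd => "zst",
        }
    }

    /// Format implied by an extension, if it is a known compression extension.
    fn from_extension(ext: &str) -> Option<Format> {
        [Format::Gzip, Format::Bzip2, Format::Xz, Format::Zstd]
            .into_iter()
            .find(|f| f.extension() == ext)
    }
}

/// Reader side: return a warning when the extension names a different
/// compression format than the one detected from the contents.
pub fn extension_warning(path: &Path, detected: Format) -> Option<String> {
    let ext = path.extension()?.to_str()?;
    let claimed = Format::from_extension(ext)?;
    if claimed != detected {
        Some(format!(
            "{} has a .{} extension but its contents look like {:?}",
            path.display(),
            ext,
            detected
        ))
    } else {
        None
    }
}

/// Writer side: build an output name whose extension matches `format`,
/// replacing a known compression extension rather than stacking a second one.
pub fn output_path(path: &Path, format: Format) -> PathBuf {
    let has_known_ext = path
        .extension()
        .and_then(|e| e.to_str())
        .and_then(|e| Format::from_extension(e))
        .is_some();
    if has_known_ext {
        path.with_extension(format.extension())
    } else {
        let mut name = path.as_os_str().to_os_string();
        name.push(".");
        name.push(format.extension());
        PathBuf::from(name)
    }
}
```

For the example above, "test.vcf.gz" with xz contents produces a warning, and writing gzip for an input named "out.vcf.zst" gives "out.vcf.gz" rather than "out.vcf.zst.gz".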

Finally, an interesting(?) case I could think of is something like #8, where a VCF.gz could be read as a gzipped file but most downstream tools expect it to be written as BGZF. I might have missed it, but I couldn't see a BGZF module or BGZF support in niffler. Would it make sense to add it (assuming it actually isn't there and I didn't miss it) and have a rule like "if the file format is VCF, even if the original compression is gzip, the writer will default to BGZF", or would that be too much?
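
To make the BGZF question concrete: a BGZF block is a valid gzip member whose header carries an extra subfield tagged "BC", so the two can be told apart from the first bytes. The sketch below is one way that check and the proposed VCF rule might look. It is an assumption-laden illustration, not a niffler feature; `is_bgzf`, `Compression`, and `writer_default` are hypothetical names.

```rust
/// Return true if `header` starts with a gzip member whose extra field
/// contains the BGZF "BC" subfield with a two-byte payload.
pub fn is_bgzf(header: &[u8]) -> bool {
    // The fixed gzip header is 10 bytes; a 2-byte XLEN follows when FEXTRA is set.
    if header.len() < 12 || header[0] != 0x1f || header[1] != 0x8b || header[2] != 8 {
        return false;
    }
    // FLG bit 2 (0x04) is FEXTRA.
    if header[3] & 0x04 == 0 {
        return false;
    }
    let xlen = u16::from_le_bytes([header[10], header[11]]) as usize;
    let extra = match header.get(12..12 + xlen) {
        Some(extra) => extra,
        None => return false, // not enough bytes supplied to inspect the extra field
    };
    // Walk the subfields: SI1, SI2, SLEN (little-endian u16), then SLEN data bytes.
    let mut i = 0;
    while i + 4 <= extra.len() {
        let slen = u16::from_le_bytes([extra[i + 2], extra[i + 3]]) as usize;
        if extra[i] == b'B' && extra[i + 1] == b'C' && slen == 2 {
            return true;
        }
        i += 4 + slen;
    }
    false
}

/// Gzip flavours relevant to the proposed rule (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Compression {
    Gzip,
    Bgzf,
}

/// Proposed rule: VCF output defaults to BGZF even when the input was plain gzip.
pub fn writer_default(is_vcf: bool, detected: Compression) -> Compression {
    if is_vcf {
        Compression::Bgzf
    } else {
        detected
    }
}
```

Since BGZF output is still readable by any gzip reader, defaulting VCF output to BGZF would not break tools that only expect gzip.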

kockan mentioned this issue Aug 21, 2023

Feat: Add zstd support #9

Open

nh13 added the question Further information is requested label Aug 22, 2023
