Replace extension based compression detection by crates niffler #10

Open
natir opened this issue Aug 21, 2023 · 7 comments
Labels
question Further information is requested

Comments

@natir

natir commented Aug 21, 2023

Hello,

I'm one of the niffler crate developers and I think you might be interested in this crate.

Niffler lets you open gzip, bzip2, lzma (xz), or zstd compressed files transparently just by calling niffler::get_reader. Format detection is based on the magic number at the beginning of the file, not the extension (no need to trust the file name).
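
To illustrate what magic-number detection means here, the following is a minimal std-only sketch. It is not niffler's implementation: the `Format` enum and the `sniff` and `sniff_reader` functions are hypothetical names introduced for this example, and only the byte signatures themselves (gzip `1f 8b`, bzip2 `BZh`, xz `fd 37 7a 58 5a 00`, zstd `28 b5 2f fd`) come from the respective format definitions.

```rust
use std::io::{self, Read};

/// Formats recognised by this sketch (hypothetical type, not niffler's).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Format {
    Gzip,
    Bzip2,
    Xz,
    Zstd,
    Uncompressed,
}

/// Classify a buffer by its leading magic bytes, ignoring any file name.
pub fn sniff(header: &[u8]) -> Format {
    const GZIP: &[u8] = &[0x1f, 0x8b];
    const BZIP2: &[u8] = b"BZh";
    const XZ: &[u8] = &[0xfd, 0x37, 0x7a, 0x58, 0x5a, 0x00];
    const ZSTD: &[u8] = &[0x28, 0xb5, 0x2f, 0xfd];

    if header.starts_with(GZIP) {
        Format::Gzip
    } else if header.starts_with(BZIP2) {
        Format::Bzip2
    } else if header.starts_with(XZ) {
        Format::Xz
    } else if header.starts_with(ZSTD) {
        Format::Zstd
    } else {
        // Anything unrecognised is treated as plain data in this sketch.
        Format::Uncompressed
    }
}

/// Read up to six bytes, classify them, and hand back a reader that
/// replays those bytes before continuing with the rest of the stream.
pub fn sniff_reader<R: Read>(mut reader: R) -> io::Result<(Format, impl Read)> {
    let mut header = Vec::with_capacity(6);
    reader.by_ref().take(6).read_to_end(&mut header)?;
    let format = sniff(&header);
    Ok((format, io::Cursor::new(header).chain(reader)))
}
```

A caller would pass any `Read` (a file, stdin, a byte slice) to `sniff_reader` and then wrap the returned reader in the matching decoder.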

If you're interested, I can write a pull request.

@nh13 nh13 added the question Further information is requested label Aug 22, 2023
@nh13
Member

nh13 commented Aug 22, 2023

I'd welcome a pull request, but before you do, I think it makes sense to get a thumbs up from @tfenne.

@tfenne
Member

tfenne commented Aug 22, 2023

Yes please! I'm curious how things work on the writer side. Do you auto-pick compression based on the extension, or do you require users to specify it?

@tfenne
Member

tfenne commented Aug 22, 2023

And thank you!

@natir
Author

natir commented Aug 23, 2023

We require users to specify it (never trust a filename).

@nh13
Member

nh13 commented Aug 24, 2023

> never trust a filename

I mostly agree; however, you have to trust them at some point (e.g. when specifying the type of compression)? Perhaps if no compression type is given, we fall back on file extension detection? And if a compression type is given, we check the file extension against the few known ones so they don't mismatch, but continue on if the file extension is unknown?
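
The fallback rule proposed above could look roughly like the sketch below. Everything in it is hypothetical (the `Format` enum, the extension table, and the `resolve_output_format` name), and treating a path with no known extension and no explicit request as uncompressed is an assumption of this example, not something decided in the thread.

```rust
use std::path::Path;

/// Output formats considered by this sketch (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Format {
    Gzip,
    Bzip2,
    Xz,
    Zstd,
    Uncompressed,
}

/// Map the few well-known extensions to a format; unknown ones give `None`.
fn format_from_extension(path: &Path) -> Option<Format> {
    match path.extension()?.to_str()? {
        "gz" => Some(Format::Gzip),
        "bz2" => Some(Format::Bzip2),
        "xz" => Some(Format::Xz),
        "zst" => Some(Format::Zstd),
        _ => None,
    }
}

/// Decide the output format from an optional explicit request and the path.
pub fn resolve_output_format(
    path: &Path,
    requested: Option<Format>,
) -> Result<Format, String> {
    match (requested, format_from_extension(path)) {
        // Explicit request contradicts a known extension: refuse.
        (Some(req), Some(ext)) if req != ext => Err(format!(
            "{}: requested {:?} but the extension implies {:?}",
            path.display(),
            req,
            ext
        )),
        // Explicit request, extension agrees or is unknown: continue.
        (Some(req), _) => Ok(req),
        // No request: fall back on the extension.
        (None, Some(ext)) => Ok(ext),
        // Assumption of this sketch: nothing to go on means no compression.
        (None, None) => Ok(Format::Uncompressed),
    }
}
```

Returning an error on a mismatch is one possible choice; a warning would fit the same structure by returning the requested format alongside a message.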

This would also be a great time to solve how to specify the compression parameters for a wide variety of compression types (see: #9 (comment)). I see that niffler has 22 levels, which zstd needs, but what does level 22 mean for zlib?

@natir
Author

natir commented Aug 24, 2023

The choice I've made in several applications is that if the input is compressed in one format, the output is compressed in the same format, leaving the user free to choose the output format via a parameter. As for the compression level, I've chosen to keep the default compression levels (niffler doesn't detect the compression level used).

If users aren't satisfied with this behavior, they can write the uncompressed output to standard output and pipe it to their preferred compression tool with the parameters they have chosen.

After all, this is a library, not an application, so we don't necessarily need to make this choice right now.

As for compression levels: in niffler, if the requested level is too high for the chosen format, we fall back to the maximum level for that format.
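
As a rough illustration of that fallback (not niffler's actual code), clamping can be sketched as below. The `Format` enum and both function names are hypothetical, and the per-format maxima are assumptions based on the usual ranges of the underlying codecs (9 for gzip, bzip2, and xz presets; 22 for zstd).

```rust
/// Formats considered by this sketch (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Format {
    Gzip,
    Bzip2,
    Xz,
    Zstd,
}

/// Highest level assumed for each format in this sketch.
pub fn max_level(format: Format) -> u32 {
    match format {
        Format::Gzip | Format::Bzip2 | Format::Xz => 9,
        Format::Zstd => 22,
    }
}

/// Clamp a requested level to the maximum the format supports.
pub fn effective_level(format: Format, requested: u32) -> u32 {
    requested.min(max_level(format))
}
```

Under these assumptions, asking for level 22 with gzip simply yields level 9, while the same request with zstd is passed through unchanged.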

@kockan

kockan commented Aug 29, 2023

Thanks @natir for the explanation and suggestions! I will keep #9 open for the time being, primarily for reference, but I'm happy to close and replace it with your PR that utilizes niffler.

My personal opinion on the reader/writer side of things would be as follows:

  • Reader: auto-detect input file compression via niffler but also check if the extension matches any of the "accepted" extensions for that file format, and at least give a warning otherwise (e.g. filename is "test.vcf.gz" but somehow it's an lzma compressed file)

  • Writer: default behavior is using the same compression format detected from the input, as you've mentioned. Once again, an extension check might be useful so that we don't detect gzip behind a ".zst" extension and still try to write a ".zst.gz" file. Unlikely, but just in case perhaps... As for levels, the current mechanism used by niffler seems reasonable to me.
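
The reader-side check from the first bullet could be sketched as follows. This assumes the format has already been detected from the file content; the `Format` enum, the table of accepted extensions, and the `extension_warning` name are all hypothetical and would need adjusting to whatever the project actually accepts.

```rust
use std::path::Path;

/// Detected compression formats (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Format {
    Gzip,
    Bzip2,
    Xz,
    Zstd,
}

/// Extensions this sketch accepts for each format (an assumed table).
fn accepted_extensions(format: Format) -> &'static [&'static str] {
    match format {
        Format::Gzip => &["gz", "bgz"],
        Format::Bzip2 => &["bz2"],
        Format::Xz => &["xz", "lzma"],
        Format::Zstd => &["zst"],
    }
}

/// Return a warning when the extension is not one accepted for the
/// format detected from the content, and `None` when they agree.
pub fn extension_warning(path: &Path, detected: Format) -> Option<String> {
    let ext = path.extension().and_then(|e| e.to_str()).unwrap_or("");
    if accepted_extensions(detected).contains(&ext) {
        None
    } else {
        Some(format!(
            "{}: content looks like {:?} but the extension is {:?}",
            path.display(),
            detected,
            ext
        ))
    }
}
```

With this table, "test.vcf.gz" holding xz data produces a warning, and so does gzip data behind a ".zst" extension, which is the case the writer bullet worries about.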

Finally, an interesting(?) case I could think of is something like #8, where a VCF.gz could be read as a gzipped file but most downstream tools expect it to be written as BGZF. I might have missed it, but I couldn't see a BGZF module/support in niffler. Would it make sense to add it (assuming it actually isn't there and I didn't miss it) and have a rule like "if file format is VCF, even if original compression is gzip, writer will default to bgzf", or would that be too much?
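
For reference, telling BGZF apart from plain gzip is possible from the header alone, because a BGZF block is a gzip member whose extra field carries a "BC" subfield of length 2. The sketch below shows one way to test for that; `is_bgzf` is a hypothetical name, and nothing here reflects existing niffler behavior.

```rust
/// Return true if `header` starts with a gzip member whose extra field
/// contains the BGZF "BC" subfield (hypothetical helper).
pub fn is_bgzf(header: &[u8]) -> bool {
    // Fixed gzip header: magic 1f 8b, CM = 8 (deflate), FEXTRA flag (bit 2) set,
    // followed by the two-byte XLEN at offsets 10 and 11.
    if header.len() < 12
        || header[0] != 0x1f
        || header[1] != 0x8b
        || header[2] != 8
        || (header[3] & 0x04) == 0
    {
        return false;
    }
    let xlen = u16::from_le_bytes([header[10], header[11]]) as usize;
    let extra = match header.get(12..12 + xlen) {
        Some(extra) => extra,
        // Not enough bytes were supplied to inspect the extra field.
        None => return false,
    };
    // Walk the subfields: SI1, SI2, SLEN (little-endian u16), then SLEN bytes.
    let mut pos = 0;
    while pos + 4 <= extra.len() {
        let slen = u16::from_le_bytes([extra[pos + 2], extra[pos + 3]]) as usize;
        if extra[pos] == b'B' && extra[pos + 1] == b'C' && slen == 2 {
            return true;
        }
        pos += 4 + slen;
    }
    false
}
```

A reader that sees `true` here could report BGZF rather than gzip, which would let a writer preserve it or apply a rule such as the VCF default suggested above.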


4 participants