Replace extension based compression detection by crates niffler #10
Comments
I'd welcome a pull request, but before you open one, I think getting a thumbs up from @tfenne makes sense.
Yes please! I'm curious how things work on the writer side. Do you auto-pick compression based on the extension, or do you require users to specify it?
And thank you!
We require users to specify it (never trust a filename).
I mostly agree, but you have to trust users at some point (e.g. when they specify the type of compression). Perhaps if no compression type is given, we fall back on file extension detection? And if a compression type is given, we check the file extension against the few known ones so they don't mismatch, but continue if the file extension is unknown? This would also be a great time to solve how to specify compression parameters for a wide variety of compression types (see: #9 (comment)). I see that niffler has 22 levels, which zstd needs, but what is level 22 for zlib?
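The fallback policy proposed above could be sketched roughly as follows. This is only an illustration of the idea, not code from this project or from niffler: the `Compression` enum, the `from_extension` and `resolve` helpers, and the extension table are all hypothetical names and assumptions.

```rust
use std::path::Path;

// Hypothetical set of supported codecs (an assumption for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Compression {
    Gzip,
    Bzip2,
    Xz,
    Zstd,
}

// Map the few well-known extensions to a codec; anything else is unknown.
fn from_extension(path: &Path) -> Option<Compression> {
    match path.extension()?.to_str()? {
        "gz" => Some(Compression::Gzip),
        "bz2" => Some(Compression::Bzip2),
        "xz" => Some(Compression::Xz),
        "zst" => Some(Compression::Zstd),
        _ => None,
    }
}

// Resolve the writer's compression:
// - explicit choice and known extension that disagree: error
// - explicit choice and unknown extension: trust the explicit choice
// - no explicit choice: fall back on the extension (None = uncompressed)
fn resolve(path: &Path, requested: Option<Compression>) -> Result<Option<Compression>, String> {
    match (requested, from_extension(path)) {
        (Some(req), Some(ext)) if req != ext => Err(format!(
            "requested {:?} but {} looks like {:?}",
            req,
            path.display(),
            ext
        )),
        (Some(req), _) => Ok(Some(req)),
        (None, ext) => Ok(ext),
    }
}
```

With this shape, `resolve(Path::new("out.vcf.gz"), None)` picks gzip from the extension, while an explicit zstd request for `out.gz` is rejected as a mismatch.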
The choice I've made in several applications is that if the input is compressed in one format, the output is compressed in the same format by default, while the user remains free to choose the output format via a parameter. As for the compression level, I've chosen to keep the default compression levels (niffler doesn't detect the compression level used). If users aren't satisfied with this behavior, they can write the uncompressed output to standard output and pipe it to their preferred compression tool with the parameters they have chosen. After all, this is a library, not an application, so we don't necessarily need to make this choice right now. About compression levels: in niffler, if the requested level is too high for the format, we fall back to the maximum level for the chosen format.
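The two rules described above (keep the input format unless the user overrides it, and cap a too-high level at the format's maximum) might look like the sketch below. It does not use niffler's API; the `Format` enum, the function names, and the per-format maxima are assumptions for illustration and should be checked against each codec.

```rust
// Hypothetical format enum, not niffler's own type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Format {
    Gzip,
    Bzip2,
    Xz,
    Zstd,
}

// Assumed maximum level per format (illustrative values only).
fn max_level(format: Format) -> u32 {
    match format {
        Format::Gzip | Format::Bzip2 | Format::Xz => 9,
        Format::Zstd => 22,
    }
}

// A level that is too high for the format falls back to that format's maximum.
fn clamp_level(format: Format, requested: u32) -> u32 {
    requested.min(max_level(format))
}

// The output uses the user's choice if there is one, else the detected input format.
fn output_format(detected_input: Option<Format>, user_choice: Option<Format>) -> Option<Format> {
    user_choice.or(detected_input)
}
```

Under these assumptions, asking for level 22 with gzip simply yields level 9, which is one possible answer to the "level 22 for zlib" question.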
Thanks @natir for the explanation and suggestions! I will keep #9 open for the time being, primarily for reference, but I'm happy to close it and replace it with your PR that uses niffler. My personal opinion on the reader/writer side of things would be as follows:
Finally, an interesting(?) case I could think of is something like #8, where a VCF.gz could be read as a gzipped file but most downstream tools expect it to be written as BGZF. I might have missed it, but I couldn't see a BGZF module or BGZF support in niffler. Would it make sense to add it (assuming it actually isn't there and I didn't miss it) and have a rule like "if the file format is VCF, the writer defaults to BGZF even if the original compression is gzip", or would that be too much?
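For reference, a BGZF block is a valid gzip member whose header carries a `BC` extra subfield, so the two can be told apart from the first bytes. The sketch below shows one way to check for that subfield and to express the rule suggested above. The `is_bgzf` and `writer_default` names and the `Gz` enum are hypothetical, and nothing here is taken from niffler.

```rust
// Hypothetical distinction between plain gzip and BGZF for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Gz {
    Plain,
    Bgzf,
}

// Returns true if `header` starts a BGZF block: gzip magic, deflate method,
// FLG with only FEXTRA set, and a "BC" extra subfield of length 2.
fn is_bgzf(header: &[u8]) -> bool {
    if header.len() < 12 || header[..4] != [0x1f, 0x8b, 0x08, 0x04] {
        return false;
    }
    // XLEN is a little-endian u16 at offset 10; the extra field follows it.
    let xlen = u16::from_le_bytes([header[10], header[11]]) as usize;
    let extra = match header.get(12..12 + xlen) {
        Some(extra) => extra,
        None => return false,
    };
    // Walk the subfields: SI1, SI2, SLEN (u16 LE), then SLEN bytes of data.
    let mut i = 0;
    while i + 4 <= extra.len() {
        let slen = u16::from_le_bytes([extra[i + 2], extra[i + 3]]) as usize;
        if extra[i] == b'B' && extra[i + 1] == b'C' && slen == 2 {
            return true;
        }
        i += 4 + slen;
    }
    false
}

// The suggested rule: VCF output defaults to BGZF even if the input was plain gzip.
fn writer_default(is_vcf: bool, input: Gz) -> Gz {
    if is_vcf {
        Gz::Bgzf
    } else {
        input
    }
}
```

Whether such a format-specific default belongs in the library or in the calling application is the open question raised in the comment.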
Hello,
I'm one of the niffler crate developers and I think you might be interested in this crate.
Niffler lets you open gzip, bzip2, lzma (xz), or zstd compressed files transparently just by calling `niffler::get_reader`. Format detection is based on the magic number at the beginning of the file, not the extension (no need to trust the file name). If you're interested, I can write a pull request.
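To make the idea of magic-number detection concrete, here is a minimal standard-library sketch. It is not niffler's implementation: the `Format` enum and the `sniff` and `sniff_reader` functions are hypothetical names. The byte signatures are the standard ones for gzip, bzip2, xz, and zstd.

```rust
use std::io::{self, Cursor, Read};

// Hypothetical result type for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Format {
    Gzip,
    Bzip2,
    Xz,
    Zstd,
    Unknown,
}

// Compare the first bytes of the stream against known magic numbers.
fn sniff(header: &[u8]) -> Format {
    if header.starts_with(&[0x1f, 0x8b]) {
        Format::Gzip
    } else if header.starts_with(b"BZh") {
        Format::Bzip2
    } else if header.starts_with(&[0xfd, 0x37, 0x7a, 0x58, 0x5a, 0x00]) {
        Format::Xz
    } else if header.starts_with(&[0x28, 0xb5, 0x2f, 0xfd]) {
        Format::Zstd
    } else {
        Format::Unknown
    }
}

// Peek at up to 6 bytes, then hand back a reader that replays them,
// so the caller can still decode the stream from its first byte.
fn sniff_reader<R: Read>(mut reader: R) -> io::Result<(Format, impl Read)> {
    let mut head = Vec::with_capacity(6);
    reader.by_ref().take(6).read_to_end(&mut head)?;
    let format = sniff(&head);
    Ok((format, Cursor::new(head).chain(reader)))
}
```

Because only the stream's own bytes are inspected, this works the same for a misnamed file or for data arriving on standard input.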