Feature request: zsttool #11

Open
fiddyschmitt opened this issue Jan 4, 2022 · 6 comments
Labels
laboratory ideas: ideas and discussions for future related projects

Comments

@fiddyschmitt

Hi Roberto,

Going out on a limb here, but do you think you can make a tool to index Zstandard files?

@circulosmeos
Owner

circulosmeos commented Jan 15, 2022

After a quick review, I find that the Zstandard format is also "indexable".
Whether I find the time to implement this is a different matter 🙂
I would probably write a quick implementation in a scripting language first, to test the possibilities...

@fiddyschmitt
Author

Thanks for looking into it!
Understood about finding time :)

@mxmlnkn

mxmlnkn commented Jul 18, 2022

@fiddyschmitt t2sz might be something for you. It compresses to zstd in such a way that the result can be seeked easily, e.g., with ratarmount, indexed_zstd, and libzstd-seek.

@fiddyschmitt
Author

Awesome, thanks @mxmlnkn. That's really interesting. Do you know if t2sz can be used to create an index for an existing zst file (without having to create a new zst file)?

@mxmlnkn

mxmlnkn commented Jul 19, 2022

Unfortunately not.

I'm pretty sure that the last time I looked at the file formats, I found that it would be nearly impossible to do. Like gzip, a zstd file is a sequence of streams (called frames in zstd), each made up of blocks. I think the same is true for xz and lz4. Blocks are somewhat seekable, but decoding one requires a back-reference window, i.e., the last x bytes of output from the preceding blocks. Streams, in contrast, are completely independent. This is why t2sz creates multiple streams instead of the single stream per file that the standard zstd compressor creates by default. But while the back-reference window in gzip is limited to 32 KiB, it can be as large as 2 GiB for zstd and xz, if I remember correctly (lz4 is limited to 64 KiB). This makes indexing near-impossible because you would have to save up to 2 GiB per checkpoint.
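
To illustrate why independent streams make natural seek points, here is a minimal sketch that walks a zstd file and lists where each frame starts, without decompressing anything. It relies only on the frame and block headers as described in RFC 8878; the function names `frame_header_size` and `list_frames` are made up for this example and are not part of t2sz or any library. It does no validation beyond the magic numbers, so treat it as an exploration aid rather than a robust parser.

```python
import struct

ZSTD_MAGIC = 0xFD2FB528        # stored little-endian as 28 B5 2F FD
SKIPPABLE_MAGIC = 0x184D2A50   # 0x184D2A50 .. 0x184D2A5F
SKIPPABLE_MASK = 0xFFFFFFF0


def frame_header_size(descriptor: int) -> int:
    """Bytes occupied by the frame header after the 4-byte magic number."""
    fcs_flag = descriptor >> 6
    single_segment = (descriptor >> 5) & 1
    dict_id_flag = descriptor & 3
    size = 1                                   # Frame_Header_Descriptor
    size += 0 if single_segment else 1         # Window_Descriptor
    size += (0, 1, 2, 4)[dict_id_flag]         # Dictionary_ID
    size += (1 if single_segment else 0, 2, 4, 8)[fcs_flag]  # Frame_Content_Size
    return size


def list_frames(data: bytes):
    """Return a list of (offset, kind) tuples, one per frame in `data`."""
    frames = []
    pos = 0
    while pos < len(data):
        (magic,) = struct.unpack_from("<I", data, pos)
        if magic & SKIPPABLE_MASK == SKIPPABLE_MAGIC:
            # Skippable frame: magic, 4-byte size, then `size` bytes of user data.
            (size,) = struct.unpack_from("<I", data, pos + 4)
            frames.append((pos, "skippable"))
            pos += 8 + size
            continue
        if magic != ZSTD_MAGIC:
            raise ValueError(f"unexpected magic {magic:#010x} at offset {pos}")
        frames.append((pos, "zstd"))
        descriptor = data[pos + 4]
        has_checksum = (descriptor >> 2) & 1
        pos += 4 + frame_header_size(descriptor)
        while True:
            if pos + 3 > len(data):
                raise ValueError("truncated block header")
            header = int.from_bytes(data[pos:pos + 3], "little")
            last_block = header & 1
            block_type = (header >> 1) & 3     # 0 raw, 1 RLE, 2 compressed
            block_size = header >> 3
            if block_type == 3:
                raise ValueError("reserved block type")
            # An RLE block stores a single byte, whatever Block_Size says.
            pos += 3 + (1 if block_type == 1 else block_size)
            if last_block:
                break
        pos += 4 if has_checksum else 0        # optional content checksum
    return frames
```

For a file written by the standard compressor this typically reports a single `zstd` frame at offset 0, which is exactly the problem: there is only one independent entry point. A file with many frames has one entry point per frame, although mapping those to uncompressed offsets still requires knowing each frame's content size or decompressing once.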

Maybe an index implementation could check how large the back-reference windows actually need to be. If they are quite small, an index could still be created. I doubt that this is the case for many zstd compression levels, but that is only speculation. One mitigating factor, as with gzip, could be uncompressed blocks inside the archive. If they are large enough, a checkpoint could be created there, because the uncompressed data would serve as the back-reference window for all compressed blocks that follow.
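
As a rough sketch of such a check, the declared window size can be read straight from the frame header (layout per RFC 8878) and used to estimate how much window data an index would have to store. Note that the header only gives the declared maximum; the distance that back-references actually reach may be smaller, and finding that out would require decoding. The names `zstd_window_size` and `index_overhead` are hypothetical helpers for this illustration, and the estimate assumes windows are stored uncompressed.

```python
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"


def zstd_window_size(frame: bytes) -> int:
    """Declared back-reference window size, in bytes, of a zstd frame."""
    if frame[:4] != ZSTD_MAGIC:
        raise ValueError("not a zstd frame")
    descriptor = frame[4]
    fcs_flag = descriptor >> 6
    single_segment = (descriptor >> 5) & 1
    dict_id_size = (0, 1, 2, 4)[descriptor & 3]
    if not single_segment:
        # Window_Descriptor: 5-bit exponent, 3-bit mantissa.
        window_descriptor = frame[5]
        base = 1 << (10 + (window_descriptor >> 3))
        return base + (base // 8) * (window_descriptor & 7)
    # Single-segment frames have no Window_Descriptor; the window is the
    # whole content, whose size follows the optional Dictionary_ID.
    fcs_size = (1, 2, 4, 8)[fcs_flag]
    start = 5 + dict_id_size
    value = int.from_bytes(frame[start:start + fcs_size], "little")
    return value + 256 if fcs_size == 2 else value


def index_overhead(uncompressed_size: int, spacing: int, window_size: int) -> int:
    """Upper bound on window bytes stored for one checkpoint every `spacing` bytes."""
    return (uncompressed_size // spacing) * window_size


GIB, MIB, KIB = 2**30, 2**20, 2**10
# 10 GiB of data, one checkpoint every 10 MiB:
print(index_overhead(10 * GIB, 10 * MIB, 32 * KIB) // MIB, "MiB with a 32 KiB window")
print(index_overhead(10 * GIB, 10 * MIB, 8 * MIB) // GIB, "GiB with an 8 MiB window")
```

With gzip's 32 KiB window the stored windows add up to 32 MiB in this example, while an 8 MiB zstd window pushes the same index to 8 GiB, close to the size of the data itself.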

@fiddyschmitt
Author

Fascinating, thanks!

@circulosmeos added the "laboratory ideas" label (ideas and discussions for future related projects) on Mar 15, 2023

3 participants