Feature request: zsttool #11

fiddyschmitt · 2022-01-04T08:48:56Z

Hi Roberto,

Going out on a limb here, but do you think you can make a tool to index Zstandard files?

circulosmeos · 2022-01-15T21:55:06Z

After a quick review, I find that the Zstandard format is also "indexable".
Whether I find the time to implement this is another matter 🙂
I would probably write a quick implementation in a scripting language first, to test the possibilities...

fiddyschmitt · 2022-01-16T00:52:20Z

Thanks for looking into it!
Understood about finding the time :)

mxmlnkn · 2022-07-18T20:33:15Z

@fiddyschmitt t2sz might be something for you. It compresses to zstd in such a way that the result can be seeked easily, e.g., with ratarmount, indexed_zstd, and libzstd-seek.

fiddyschmitt · 2022-07-19T05:29:05Z

Awesome, thanks @mxmlnkn. That's really interesting. Do you know if t2sz can be used to create an index for an existing zst file (without having to create a new zst file)?

mxmlnkn · 2022-07-19T07:20:32Z

Unfortunately, not.

I'm pretty sure that the last time I looked at the file formats, I concluded that it would be nearly impossible. Similar to gzip, a zstd file is a sequence of streams (which the zstd format calls frames), each consisting of blocks. This is btw also true for xz and lz4, I think. Blocks are somewhat seekable, but decoding one requires a back-reference window, i.e., the last x bytes of output produced by decoding the preceding blocks. Streams, in contrast, are completely independent of each other. This is why t2sz creates multiple streams instead of the single stream per file that the standard zstd compressor creates by default. But while the back-reference window in gzip is limited to 32 KiB, it can be as large as 2 GiB for zstd, xz, and lz4, if I remember correctly. This makes indexing near-impossible because you would have to save up to 2 GiB per checkpoint.
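
To illustrate why independent streams are the easy case: their boundaries can be found without decompressing anything, because every block header states how many bytes to skip. Below is a minimal sketch, assuming the frame layout from the Zstandard format specification (RFC 8878), where streams are called frames. The names `header_size` and `list_frames` are made up for this example, and truncated input is not handled gracefully.

```python
import struct

ZSTD_MAGIC = 0xFD2FB528        # stored little-endian as 28 B5 2F FD
SKIPPABLE_MAGIC = 0x184D2A50   # 0x184D2A50 .. 0x184D2A5F
SKIPPABLE_MASK = 0xFFFFFFF0


def header_size(descriptor: int) -> int:
    """Bytes occupied by a frame header whose first byte is `descriptor`."""
    fcs_flag = descriptor >> 6
    single_segment = (descriptor >> 5) & 1
    dict_id_flag = descriptor & 3
    size = 1                                      # Frame_Header_Descriptor
    size += 0 if single_segment else 1            # Window_Descriptor
    size += (0, 1, 2, 4)[dict_id_flag]            # Dictionary_ID
    size += (single_segment, 2, 4, 8)[fcs_flag]   # Frame_Content_Size
    return size


def list_frames(data: bytes) -> list[tuple[int, int]]:
    """Return (offset, length) for every zstd frame in `data`.

    Only headers are read; no block is decompressed.
    Skippable frames are stepped over and not reported.
    """
    frames = []
    pos = 0
    while pos < len(data):
        start = pos
        (magic,) = struct.unpack_from("<I", data, pos)
        pos += 4
        if (magic & SKIPPABLE_MASK) == SKIPPABLE_MAGIC:
            (user_size,) = struct.unpack_from("<I", data, pos)
            pos += 4 + user_size
            continue
        if magic != ZSTD_MAGIC:
            raise ValueError(f"no zstd frame at offset {start}")
        descriptor = data[pos]
        has_checksum = (descriptor >> 2) & 1
        pos += header_size(descriptor)
        while True:
            # 3-byte block header: bit 0 = last block, bits 1-2 = type,
            # bits 3-23 = block size.
            block_header = int.from_bytes(data[pos:pos + 3], "little")
            pos += 3
            block_type = (block_header >> 1) & 3
            block_size = block_header >> 3
            if block_type == 3:
                raise ValueError("reserved block type")
            # An RLE block (type 1) stores a single byte, whatever its size.
            pos += 1 if block_type == 1 else block_size
            if block_header & 1:
                break
        pos += 4 * has_checksum
        if pos > len(data):
            raise ValueError("truncated frame")
        frames.append((start, pos - start))
    return frames
```

For a file written by the standard compressor this typically returns a single entry, which is exactly the problem. For a multi-frame file, each offset is a point where decoding can start with no history at all.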

Maybe an index implementation could check how large the back-reference windows actually need to be. If they are quite small, an index could still be created. I doubt that there are many zstd compression levels for which this is possible, but that is only speculation. One mitigating factor, similar to gzip, could be uncompressed blocks inside the archive. If they are large enough, a checkpoint could be created there, because the uncompressed chunks would serve as the back-reference window for all compressed blocks that follow.
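
A rough sketch of the first half of that check, assuming the frame header layout from the Zstandard format specification (RFC 8878). The header only declares an upper bound on the back-reference distance; the data may use less, which only a full decode would reveal. The function names and the checkpoint spacing are hypothetical, chosen for illustration.

```python
ZSTD_MAGIC_BYTES = b"\x28\xb5\x2f\xfd"


def declared_window_size(frame: bytes) -> int:
    """Window size in bytes that a zstd frame header declares."""
    if frame[:4] != ZSTD_MAGIC_BYTES:
        raise ValueError("not a zstd frame")
    descriptor = frame[4]
    fcs_flag = descriptor >> 6
    single_segment = (descriptor >> 5) & 1
    dict_id_size = (0, 1, 2, 4)[descriptor & 3]
    if not single_segment:
        # Window_Descriptor: 5-bit exponent, 3-bit mantissa.
        window_descriptor = frame[5]
        exponent = window_descriptor >> 3
        mantissa = window_descriptor & 7
        base = 1 << (10 + exponent)
        return base + (base // 8) * mantissa
    # Single-segment frames have no Window_Descriptor; the window is
    # the whole content, whose size follows the optional Dictionary_ID.
    fcs_size = (1, 2, 4, 8)[fcs_flag]
    pos = 5 + dict_id_size
    value = int.from_bytes(frame[pos:pos + fcs_size], "little")
    # The 2-byte encoding stores the size minus 256.
    return value + 256 if fcs_size == 2 else value


def worst_case_index_bytes(window_size: int, total_size: int,
                           spacing: int) -> int:
    """Upper bound on window data stored for checkpoints every `spacing`
    uncompressed bytes in a stream of `total_size` uncompressed bytes."""
    return sum(min(window_size, offset)
               for offset in range(spacing, total_size, spacing))
```

With a 32 KiB window, as in gzip, a checkpoint every few MiB costs very little. With a declared window in the GiB range, each checkpoint may need to store as much data as lies between two checkpoints, so the index stops being worthwhile.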

fiddyschmitt · 2022-07-19T09:09:59Z

Fascinating, thanks!

circulosmeos added the "laboratory ideas" label (ideas and discussions for future related projects) on Mar 15, 2023
