Skip to content

Allow custom differ and patcher format implementations #11

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
nyurik opened this issue Apr 10, 2024 · 1 comment
Open

Allow custom differ and patcher format implementations #11

nyurik opened this issue Apr 10, 2024 · 1 comment

Comments

@nyurik
Copy link
Contributor

nyurik commented Apr 10, 2024

bsdiff format uses two somewhat unrelated concepts that fit many, but not all use cases: diffing algorithm and patch storage format.

There are multiple diffing algorithms which could be fine-tuned depending on the data being diffed, resulting in multiple implementations. This is usually the complicated part of the library, requiring a lot of testing and profiling.

Patch storage is far simpler, but may be optimized with different compressions (e.g. brotli), or removal of the magic BSDIFF40 prefix (e.g. in case when there are millions of small diffs being stored in the same sqlite file). Patch storage format is not affected by how the patch was generated.

I would like to discuss how qbsdiff can allow user-implemented patch storage. This way I could choose to have my own compression without the magic prefix, and store the patch in my own envelope format, while qbsdiff would pass me the raw patch data or take patch data and apply it. What do you think?

@hucsmn
Copy link
Owner

hucsmn commented Apr 12, 2024

Yes, making another envelope format to support customizable patch compression algorithm is reasonable.

But, I personally tend to just provide a fixed set of available compression algorithms in configuration, rather than user-implemented compression algorithm callbacks.

According to Colin Percival's paper of bsdiff, the constructed patch file size is highly relative to which compression algorithm is chosen.

The patch file is then constructed of three parts: First, a control file containing ADD and INSERT instructions; second, a ‘difference’ file, containing the bytewise differences of the approximate matches; and third, an ‘extra’ file, containing the bytes which were not part of an approximate match.
...
While these three files together are slightly larger than the original target file, the control and difference files are highly compressible; in particular, bzip2 tends to perform remarkably well (probably due to the highly structured nature of these two files).

Theoretically, these three parts of bsdiff patch file could be compressed using different algoritms due to their different data characteristics:

  • control instructions part (ctrls): repeated i64 triples;
  • differential data part (delta): delta to the original file data, very likely to be sparse data or of repeated patterns;
  • extra data part (extra): copies of the original file data.

Maybe some experiments on the state-of-art algorithms should be taken to update Colin Percival's conclusion in 20 years ago.

PS: There was another piece of art attempting to improve bsdiff algorithm with a brand-new patch format, in which multiple compression formats (brotli, zstd, etc.) were supported. (I have not investigated on it yet).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants