-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(handlers): add support for unix-compatible (aka v7) tar files. #655
Conversation
e2a524b
to
7bdb24b
Compare
c84b912
to
a26f644
Compare
a26f644
to
83c2e37
Compare
Made an attempt at using only the new version without the uStar handler, and after looking into |
83c2e37
to
2bd8648
Compare
Numbers are stored in octal, but there is a GNU extension, that does base-256 encoding to store bigger than 8G numbers, base-256 is harder to cover (there are prefix bytes), and also might not worth it: the uStar version already knows about this encoding, the now introduced does not need it for old archives. Note, that this new format appears only for bigger tar files > 8GB. So we might not need to merge the implementations in the end. |
2bd8648
to
d362e6a
Compare
Tested with samples with tar content that I have here (~3GB) and did not observe regressions. This patch has very limited performance impact. These are the stats from running executions with a 2.4GB real world tar sample: performance results on main:
performance results on this branch:
|
d362e6a
to
d012e0a
Compare
d012e0a
to
eaa3812
Compare
I have some questions and a general improvement idea, otherwise I am happy with the changes here. |
Another interesting sample could be some "garbage" data, filled with octal bytes which does not get picked up by other handlers, but triggers match on v7 all the time and then aborts on the checksum calculations. |
Should be easy to build samples like this, will check when I have some time. |
dd62347
to
5488d50
Compare
I created two 75MB samples with a tar header followed by random bytes from
Invalid checksum (654-tar-handler)
Invalid checksum (main)
valid checksum (654-tar-handler):
valid checksum (main):
Summary
So a clear performance impact (152%) on samples with invalid checksum. Smaller performance impact (10%) for samples with valid checksum. |
To me it's okay to keep the code this way. Performance impact under specific conditions in exchange of support for a new format is a good trade off. Asking approval from initial reviewer now. |
v7 tar headers do not have the 'ustar' magic that we match on for modern tar files. In order to match on those v7 archive, we build a regular expression that matches on mode, uid, gid, mtime, size files given their properties: - fixed size (e.g. 8 for mode) - optionally prepended by whitespaces - suffixed by null bytes (null terminated) - ASCII encoded octal digits (0x30 to 0x37) In order to build a pattern that can be handled by hyperscan without using a notation such as '[\w]{1,99}' for path name (see below for detailed explanation), we rely on utility function to build a regular expression to match all possible combination using the or (|) operator. Note: hyperscan will yield a "Pattern is too large" exception when trying to use '{1,99}' notation. Even though we found out that using '.*' works, it would have an important performance impact on pattern matching. That's why we decided to go with the OR operator approach with combination. See https://intel.github.io/hyperscan/dev-reference/compilation.html for more information about this.
The unix v7 old-style tar handler's pattern is not strict enough to prevent false positives, so checking the checksum might prevent these false matches. The header chksum is an octal representation of the sum of header bytes as (unsigned) integers (the chksum field is calculated with 8 spaces), followed by a null and a space (there are tar files with these bytes reversed). Multiple header checksums are calculated, as the old header is much shorter, than the newer headers. Wikipedia also mentions some historic implementations using signed sums. The potential match is discarded if the header checksum is not one of the calculated checksums.
5488d50
to
0166d74
Compare
v7 tar headers do not have the 'ustar' magic that we match on for modern tar files. In order to match on those v7 archive, we build a regular expression that matches on mode, uid, gid, mtime, size files given their properties:
In order to build a pattern that can be handled by hyperscan without using a notation such as
\w{1,99}
for path name (see ¹ for detailed explanation), we rely on utility function to build a regular expression to match all possible combination using the or (|
) operator.Note: hyperscan will yield a
Pattern is too large
exception when trying to use{1,99}
notation. Even though we found out that using.*
works, it would have an important performance impact on pattern matching. That's why we decided to go with the OR operator approach with combination.Although it may look simple, this change is significant and will require:
potentially re-merging the tar handlers (somewhat optional, and needs to be tested) - see ²The current solution is the product of @qkaiser and @e3krisztian collaboration, so neither of us should be the final approver.
See #654
¹ https://intel.github.io/hyperscan/dev-reference/compilation.html
² the v7 matching rule probably supersedes the one defined for ustar so we can probably leave it out