Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(handlers): add support for unix-compatible (aka v7) tar files. #655

Merged
merged 2 commits into from
Oct 23, 2023

Conversation

qkaiser
Copy link
Contributor

@qkaiser qkaiser commented Sep 25, 2023

v7 tar headers do not have the 'ustar' magic that we match on for modern tar files. In order to match on those v7 archive, we build a regular expression that matches on mode, uid, gid, mtime, size files given their properties:

  • fixed size (e.g. 8 for mode)
  • optionally prepended by whitespaces
  • suffixed by null bytes (null terminated)
  • ASCII encoded digits (0x30 to 0x39)

In order to build a pattern that can be handled by hyperscan without using a notation such as \w{1,99} for path name (see ¹ for detailed explanation), we rely on utility function to build a regular expression to match all possible combination using the or (|) operator.

Note: hyperscan will yield a Pattern is too large exception when trying to use {1,99} notation. Even though we found out that using .* works, it would have an important performance impact on pattern matching. That's why we decided to go with the OR operator approach with combination.


Although it may look simple, this change is significant and will require:

  • testing with real fws (performance and regression)
  • reviewing by someone else (@martonilles might be the best candidate, as @vlaci will be away)
  • potentially re-merging the tar handlers (somewhat optional, and needs to be tested) - see ²

The current solution is the product of @qkaiser and @e3krisztian collaboration, so neither of us should be the final approver.

See #654

¹ https://intel.github.io/hyperscan/dev-reference/compilation.html
² the v7 matching rule probably supersedes the one defined for ustar so we can probably leave it out

@qkaiser qkaiser self-assigned this Sep 25, 2023
@qkaiser qkaiser added bug Something isn't working format:archive labels Sep 25, 2023
@qkaiser qkaiser requested a review from e3krisztian September 25, 2023 20:57
@qkaiser qkaiser force-pushed the 654-tar-handler branch 3 times, most recently from c84b912 to a26f644 Compare September 28, 2023 12:00
@e3krisztian
Copy link
Contributor

Made an attempt at using only the new version without the uStar handler, and after looking into typeflag related problems it looks working for the test inputs and also on a non-trivial sample, I think it could be possible to merge again the implementations.

@e3krisztian
Copy link
Contributor

Numbers are stored in octal, but there is a GNU extension, that does base-256 encoding to store bigger than 8G numbers, base-256 is harder to cover (there are prefix bytes), and also might not worth it: the uStar version already knows about this encoding, the now introduced does not need it for old archives.

Note, that this new format appears only for bigger tar files > 8GB.

So we might not need to merge the implementations in the end.

@qkaiser
Copy link
Contributor Author

qkaiser commented Oct 3, 2023

Tested with samples with tar content that I have here (~3GB) and did not observe regressions. This patch has very limited performance impact. These are the stats from running executions with a 2.4GB real world tar sample:

performance results on main:

Performance counter stats for 'poetry run unblob -v -e /tmp/out -f -k -d 1 /home/quentin/Downloads/cvd-host_package.tar':

         13.130,53 msec task-clock                       #    1,002 CPUs utilized             
               238      context-switches                 #   18,126 /sec                      
                13      cpu-migrations                   #    0,990 /sec                      
            43.256      page-faults                      #    3,294 K/sec                     
    53.669.223.451      cycles                           #    4,087 GHz                       
   142.150.660.523      instructions                     #    2,65  insn per cycle            
     3.396.889.993      branches                         #  258,702 M/sec                     
        34.637.512      branch-misses                    #    1,02% of all branches           

      13,109785030 seconds time elapsed

      10,849108000 seconds user
       2,282617000 seconds sys

performance results on this branch:

 Performance counter stats for 'poetry run unblob -v -e /tmp/out -f -k -d 1 /home/quentin/Downloads/cvd-host_package.tar':

         13.170,81 msec task-clock                       #    1,000 CPUs utilized             
               628      context-switches                 #   47,681 /sec                      
                17      cpu-migrations                   #    1,291 /sec                      
            41.616      page-faults                      #    3,160 K/sec                     
    53.611.922.822      cycles                           #    4,071 GHz                       
   142.090.165.881      instructions                     #    2,65  insn per cycle            
     3.365.689.495      branches                         #  255,542 M/sec                     
        34.373.818      branch-misses                    #    1,02% of all branches           

      13,168817128 seconds time elapsed

      10,874326000 seconds user
       2,297380000 seconds sys

@qkaiser qkaiser requested review from vlaci and removed request for e3krisztian October 9, 2023 08:28
unblob/handlers/archive/tar.py Outdated Show resolved Hide resolved
unblob/handlers/archive/tar.py Outdated Show resolved Hide resolved
unblob/handlers/archive/tar.py Outdated Show resolved Hide resolved
@vlaci
Copy link
Contributor

vlaci commented Oct 9, 2023

I have some questions and a general improvement idea, otherwise I am happy with the changes here.

@vlaci
Copy link
Contributor

vlaci commented Oct 9, 2023

Tested with samples with tar content that I have here (~3GB) and did not observe regressions. This patch has very limited performance impact. These are the stats from running executions with a 2.4GB real world tar sample:

[...]

Another interesting sample could be some "garbage" data, filled with octal bytes which does not get picked up by other handlers, but triggers match on v7 all the time and then aborts on the checksum calculations.

@qkaiser
Copy link
Contributor Author

qkaiser commented Oct 9, 2023

Tested with samples with tar content that I have here (~3GB) and did not observe regressions. This patch has very limited performance impact. These are the stats from running executions with a 2.4GB real world tar sample:
[...]

Another interesting sample could be some "garbage" data, filled with octal bytes which does not get picked up by other handlers, but triggers match on v7 all the time and then aborts on the checksum calculations.

Should be easy to build samples like this, will check when I have some time.

@e3krisztian e3krisztian force-pushed the 654-tar-handler branch 3 times, most recently from dd62347 to 5488d50 Compare October 9, 2023 13:37
@qkaiser
Copy link
Contributor Author

qkaiser commented Oct 11, 2023

I created two 75MB samples with a tar header followed by random bytes from /dev/urandom:

  • first sample has an invalid checksum
  • second sample has a valid checksum

Invalid checksum (654-tar-handler)

Performance counter stats for 'poetry run unblob -v -e /tmp/out -f -d 1 -k /tmp/tar_invalid.bin' (10 runs):

        139.327,15 msec task-clock:u                     #    9,269 CPUs utilized               ( +-  9,58% )
                 0      context-switches:u               #    0,000 /sec                      
                 0      cpu-migrations:u                 #    0,000 /sec                      
           470.571      page-faults:u                    #    6,207 K/sec                       ( +-  9,57% )
   361.014.537.121      cycles:u                         #    4,762 GHz                         ( +-  9,58% )
   414.359.307.124      instructions:u                   #    2,10  insn per cycle              ( +-  9,57% )
    73.635.561.486      branches:u                       #  971,237 M/sec                       ( +-  9,57% )
     1.556.295.394      branch-misses:u                  #    3,84% of all branches             ( +-  9,60% )

            15,031 +- 0,244 seconds time elapsed  ( +-  1,62% )

Invalid checksum (main)

Performance counter stats for 'poetry run unblob -v -e /tmp/out -f -d 1 -k /tmp/tar_invalid.bin' (10 runs):

         59.823,84 msec task-clock:u                     #   10,031 CPUs utilized               ( +-  9,55% )
                 0      context-switches:u               #    0,000 /sec                      
                 0      cpu-migrations:u                 #    0,000 /sec                      
           457.648      page-faults:u                    #   13,883 K/sec                       ( +-  9,57% )
   226.910.378.653      cycles:u                         #    6,884 GHz                         ( +-  9,57% )
   394.810.857.655      instructions:u                   #    3,16  insn per cycle              ( +-  9,58% )
    70.929.772.008      branches:u                       #    2,152 G/sec                       ( +-  9,58% )
     1.485.695.361      branch-misses:u                  #    3,81% of all branches             ( +-  9,58% )

             5,964 +- 0,110 seconds time elapsed  ( +-  1,84% )

valid checksum (654-tar-handler):

Performance counter stats for 'poetry run unblob -v -e /tmp/out -f -d 1 -k /tmp/tar_randombody.bin' (10 runs):

         64.096,36 msec task-clock:u                     #   10,034 CPUs utilized               ( +-  9,65% )
                 0      context-switches:u               #    0,000 /sec                      
                 0      cpu-migrations:u                 #    0,000 /sec                      
           470.814      page-faults:u                    #   13,464 K/sec                       ( +-  9,57% )
   236.536.039.198      cycles:u                         #    6,765 GHz                         ( +-  9,60% )
   413.304.132.300      instructions:u                   #    3,18  insn per cycle              ( +-  9,57% )
    73.431.199.316      branches:u                       #    2,100 G/sec                       ( +-  9,57% )
     1.513.254.304      branch-misses:u                  #    3,75% of all branches             ( +-  9,58% )

            6,3877 +- 0,0941 seconds time elapsed  ( +-  1,47% )

valid checksum (main):

Performance counter stats for 'poetry run unblob -v -e /tmp/out -f -d 1 -k /tmp/tar_randombody.bin' (10 runs):

         57.794,17 msec task-clock:u                     #   10,035 CPUs utilized               ( +-  9,51% )
                 0      context-switches:u               #    0,000 /sec                      
                 0      cpu-migrations:u                 #    0,000 /sec                      
           454.125      page-faults:u                    #   14,146 K/sec                       ( +-  9,57% )
   224.755.433.959      cycles:u                         #    7,001 GHz                         ( +-  9,56% )
   394.615.926.845      instructions:u                   #    3,18  insn per cycle              ( +-  9,57% )
    70.878.218.909      branches:u                       #    2,208 G/sec                       ( +-  9,57% )
     1.485.271.655      branch-misses:u                  #    3,81% of all branches             ( +-  9,57% )

            5,7590 +- 0,0461 seconds time elapsed  ( +-  0,80% )

Summary

branch sample time elapsed
654-tar-handler invalid checksum 15,031 +- 0,244 seconds time elapsed ( +- 1,62% )
main invalid checksum 5,964 +- 0,110 seconds time elapsed ( +- 1,84% )
654-tar-handler valid checksum 6,3877 +- 0,0941 seconds time elapsed ( +- 1,47% )
main valid checksum 5,7590 +- 0,0461 seconds time elapsed ( +- 0,80% )

So a clear performance impact (152%) on samples with invalid checksum. Smaller performance impact (10%) for samples with valid checksum.

@qkaiser
Copy link
Contributor Author

qkaiser commented Oct 11, 2023

To me it's okay to keep the code this way. Performance impact under specific conditions in exchange of support for a new format is a good trade off.

Asking approval from initial reviewer now.

@qkaiser qkaiser requested a review from vlaci October 11, 2023 10:45
qkaiser and others added 2 commits October 23, 2023 10:53
v7 tar headers do not have the 'ustar' magic that we match on for modern
tar files. In order to match on those v7 archive, we build a regular
expression that matches on mode, uid, gid, mtime, size files given their
properties:

- fixed size (e.g. 8 for mode)
- optionally prepended by whitespaces
- suffixed by null bytes (null terminated)
- ASCII encoded octal digits (0x30 to 0x37)

In order to build a pattern that can be handled by hyperscan without
using a notation such as '[\w]{1,99}' for path name (see below for
detailed explanation), we rely on utility function to build a regular
expression to match all possible combination using the or (|) operator.

Note: hyperscan will yield a "Pattern is too large" exception when
trying to use '{1,99}' notation. Even though we found out that using
'.*' works, it would have an important performance impact on pattern
matching. That's why we decided to go with the OR operator approach with
combination.

See https://intel.github.io/hyperscan/dev-reference/compilation.html for
more information about this.
The unix v7 old-style tar handler's pattern is not strict enough to
prevent false positives, so checking the checksum might prevent these
false matches.

The header chksum is an octal representation of the sum of header bytes
as (unsigned) integers (the chksum field is calculated with 8 spaces),
followed by a null and a space (there are tar files with these bytes
reversed).

Multiple header checksums are calculated, as the old header is much
shorter, than the newer headers.
Wikipedia also mentions some historic implementations using signed sums.
The potential match is discarded if the header checksum is not one of
the calculated checksums.
@qkaiser qkaiser merged commit ed3e303 into main Oct 23, 2023
12 checks passed
@qkaiser qkaiser deleted the 654-tar-handler branch October 23, 2023 09:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working format:archive
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants