tar2tsv

#!/usr/bin/env python2.7
r"""
Output one-file-per-line from a tar file, in the format

Filename \t FileContentsAsJSONString

JSON-encoded strings are used since they never contain tabs nor newlines
(they're escaped), and they sometimes are fairly greppable.

Why?  Because scanning one big file is often much faster than scanning through
lots of little files as you would unpack a tarball -- e.g. "grep" vs "grep -r".
Too often, people distribute data as a tarball of lots of little files.  If you
unpack it, this is good for random access, but bad for large-scale scans.  The
latter is more common in large-scale data analysis.

(Alternative approach: The tarfile format allows for sequential access.
In python, see the 'tarfile' standard library.)

Example:

  /d/nyt_ldc/data/2006 % for f in *.tgz; {tar xf $f; tar2tsv $f > ${f:t}.tsv}

  (1) Searching the unpacked directories: 
      ... the speed-up is after lots of disk scanning. With large datasets, you
          don't get that speedup because they can't be cached.
  /d/nyt_ldc/data/2006 % time grep -r hope */ > /dev/null
  grep -r hope */ > /dev/null  1.73s user 1.84s system 47% cpu 7.468 total
  /d/nyt_ldc/data/2006 % time grep -r hope */ > /dev/null
  grep -r hope */ > /dev/null  1.67s user 1.15s system 99% cpu 2.817 total
  /d/nyt_ldc/data/2006 % time grep -r hope */ > /dev/null
  grep -r hope */ > /dev/null  1.67s user 1.12s system 99% cpu 2.796 total


  (2) Searching the TSV-JSON files:
  /d/nyt_ldc/data/2006 % time grep  hope *.tsv > /dev/null 
  grep hope *.tsv > /dev/null  1.02s user 0.30s system 99% cpu 1.320 total
  /d/nyt_ldc/data/2006 % time grep hope *.tsv > /dev/null
  grep hope *.tsv > /dev/null  1.02s user 0.30s system 99% cpu 1.321 total

"""

import sys,tarfile
#import json
import ujson as json
filenames = sys.argv[1:]
for tar_filename in filenames:
  tf = tarfile.open(tar_filename)
  for tarinfo in tf:
    if tarinfo.isfile():
      fp = tf.extractfile(tarinfo)
      contents = fp.read()
      print "%s\t%s" % (tarinfo.name, json.dumps(contents))