Skip to content
stefan-schroedl edited this page Oct 13, 2014 · 5 revisions

Description and Features

Project Tabulatoris a set of Unix command line tools for quick and convenient batch processing of tabular text files (a.k.a., tab-delimited, csv, or flat file format). We assume rows constitute records, split into a fixed number of columns; the first line signifies a header specifying the column names.

  • Command line arguments can refer to column names.
  • Delimiter characters are detected.
  • Compressed files are recognized and treated appropriately.
  • All scripts can be run either with file arguments or on stdin, to support pipe constructs.
  • For each script, the -h and -m switches provide more detailed help and list possible options.
  • Most (but not all) scripts translate the given command line options into a sequence of Unix shell commands (awk, cut, join and sort).

tblmap

Per-line computation (corresponding to the Map operation within the MapReduce scheme). For example, create new columns based on functions of other columns; remove or reorder columns; filter rows based on a column expression.

Examples:

  • Print columns rep and sales for lines where column region has value us:

tblmap −s’region=="us"’ −k’rep,sales’

  • Compute ratio of columns sales and clients, output as single column:

tblmap −k ’’ −c’sales_per_client=sales/client’

  • Cut out of columns whose names contains the string internal:

tblmap −irk’internal’

  • Shuffle a file randomly by generating a random number per line:

cat file | tblmap −c"myrand=rand()" | tblcmd "sort −k myrand −t’,’" | tblmap −irk"myrand"

tblred

Compute ("reduce") aggregations (sum, average, etc) over groups of rows identified by key column(s).

Example: computes for each line the proportion of column sales to the total for all lines with the same value of column region.

tblred -k'region' 'sales_ratio=sales/sum(sales)' sales.txt

tbldesc

Represent a snapshot of the type of data in a file. Summarize columns (e.g., proportion of character/numeric content, min/mean/median/max, missing values). A target column can be specified, in which case the correlation and predictive strength of all other columns are calculated (feature selection). The columns can be ordered by any of the output characteristics.

Example: Suppose file contains the following lines:

         name,sex,house_nr,height,shoe_size
         arthur,m,42,181,11.5
         berta,f,101,163,8.5
         chris,m,1333,175,10
         don,m,77,185,12.5
         elisa,f,204,166,7

Then tbldesc −k shoe_size file prints:

  COL NAME            | TYPE        NA  CHAR   INT     MEAN      MED      MIN      MAX      STD |   MAR R2ADJ   CORR  PVAL  VALUES 
=======================================================================================================================================
  5   shoe_size       | num        0.0   0.0  40.0      9.9       10        7     12.5     2.22 | 100.0 100.0  100.0   0.0  5 uniq:
                      |                                                                         |                           '8.5'x1=>8.5,
                      |                                                                         |                           '7'x1=>7,
                      |                                                                         |                           '12.5'x1=>12.
                      |                                                                         |                           5,
                      |                                                                         |                           '10'x1=>10,
                      |                                                                         |                           '11.5'x1=>11.
                      |                                                                         |                           5
_______________________________________________________________________________________________________________________________________
  4   height          | int        0.0   0.0 100.0      174      175      163      185     9.43 | 100.0  83.8   93.7   1.9  5 uniq:
                      |                                                                         |                           '163'x1=>8.5,
                      |                                                                         |                           '166'x1=>7,
                      |                                                                         |                           '181'x1=>11.5
                      |                                                                         |                           ,
                      |                                                                         |                           '185'x1=>12.5
                      |                                                                         |                           , '175'x1=>10 
_______________________________________________________________________________________________________________________________________
  2   sex             | char       0.0 100.0   0.0                                              | 100.0  71.0          4.6  2 uniq:
                      |                                                                         |                           'm'x3=>11.3,
                      |                                                                         |                           'f'x2=>7.75
_______________________________________________________________________________________________________________________________________
  1   name            | char       0.0 100.0   0.0                                              | 100.0   0.0        100.0  5 uniq:
                      |                                                                         |                           'don'x1=>12.5
                      |                                                                         |                           ,
                      |                                                                         |                           'elisa'x1=>7,
                      |                                                                         |                           'berta'x1=>8.
                      |                                                                         |                           5,
                      |                                                                         |                           'arthur'x1=>1
                      |                                                                         |                           1.5,
                      |                                                                         |                           'chris'x1=>10
_______________________________________________________________________________________________________________________________________
  3   house_nr        | int        0.0   0.0 100.0      351      101       42     1333      552 | 100.0 −32.7   −6.8  91.4  5 uniq:
                      |                                                                         |                           '42'x1=>11.5,
                      |                                                                         |                           '204'x1=>7,
                      |                                                                         |                           '77'x1=>12.5,
                      |                                                                         |                           '101'x1=>8.5,
                      |                                                                         |                           '1333'x1=>10
 _______________________________________________________________________________________________________________________________________

tbljoin

Join files (in the relational algebra sense), taking care of necessary column matching and pre-sorting before performing unix join. Options for inner, left, right, and outer joins are available.

Examples:

Suppose file1 is

         name,street,house
         zorro,desert road,5
         john,main st,2
         arthur,pan−galactic bypass,42
         arthur,main st,15

and file2 is

         name,street,phone
         john,main st,654−321
         arthur,main st,121−212
         john,round cir,123−456

Then tbljoin file1 file2 gives

         name,street,house,phone
         arthur,main st,15,121−212
         john,main st,2,654−321

tbljoin −r file1 file2 gives

         house,name,street,phone
         42,arthur,main st,121−212
         2,john,main st,654−321
         NA,john,round cir,123−456

and tbljoin −lrn’?’ file1 file2 gives

         name,street,house,phone
         arthur,main st,15,121−212
         arthur,pan−galactic bypass,42,?
         john,main st,2,654−321
         john,round cir,?,123−456
         zorro,desert road,5,?

tblcmd

Run any command (e.g., sort, uniq) on the body of file, keep the header. Column name and delimiter substitution is supported on command line arguments.

Example:

Sort a csv file with header by column user_name:

tblcmd ’sort −t"?delim" −k ?user_name’

tblsplit

Split a file into several subfiles based on the value of a key column.

Example: Suppose file is

        continent,country
        americas,us
        americas,mx
        europe,de
        europe,fr

Then tblsplit −rk’continent’ file generates two files:

file.select.continent=americas:

        country
        us
        mx

and file.select.continent=europe:

        country
        de
        fr

tbluniq

Identify or remove duplicate (or isomorphic) columns.

A column y is dependent on x, written x−>y, if there is a function f with f(x)=y for all x. x is isomorphic to y, if x−>y and y−>x.
The script tries to find dependencies and isomorphisms. The dependency graph can be printed out.

Example:

Suppose file is

        key1,key2,key3,key4,key5,key6
        1,1,1,1,1,1
        1,2,4,2,1,2
        3,2,4,3,1,3
        4,4,3,4,1,4
        5,5,5,5,1,5

Then tbluniq −v file gives

        duplicates:
        {key4,key6}
        key1,key2,key3,key4,key5
        1,1,1,1,1
        1,2,4,2,1
        3,2,4,3,1
        4,4,3,4,1
        5,5,5,5,1

and tbluniq −xv file gives

        isomorphic:
        {key2,key3}
        {key4,key6}
        dependencies:
        {key4,key6} −> key1
        {key4,key6} −> {key2,key3}
        key1 −> key5
        {key2,key3} −> key5
        key1,key2,key4,key5
        1,1,1,1
        1,2,2,1
        3,2,3,1
        4,4,4,1
        5,5,5,1

tblcat

Concatenate files without repeating header.

tblhist

Histogram of column values. Also supports "ascii plotting".

tbltranspose

Transpose rows and columns.

tbltex

Format csv data into LaTex format.

Clone this wiki locally