-
Notifications
You must be signed in to change notification settings - Fork 7
Home
Project Tabulator
is a set of Unix command line tools for quick and convenient batch processing of tabular text files (a.k.a., tab-delimited, csv, or flat file format). We assume rows constitute records, split into a fixed number of columns; the first line signifies a header specifying the column names.
- Command line arguments can refer to column names.
- Delimiter characters are detected.
- Compressed files are recognized and treated appropriately.
- All scripts can be run either with file arguments or on stdin, to support pipe constructs.
- For each script, the
-h
and-m
switches provide more detailed help and list possible options. - Most (but not all) scripts translate the given command line options into a sequence of Unix shell commands (
awk
,cut
,join
andsort
).
Per-line computation (corresponding to the Map
operation within the MapReduce
scheme). For example, create new columns based on functions of other columns; remove or reorder columns; filter rows based on a column expression.
Examples:
- Print columns
rep
andsales
for lines where columnregion
has valueus
:
tblmap −s’region=="us"’ −k’rep,sales’
- Compute ratio of columns
sales
andclients
, output as single column:
tblmap −k ’’ −c’sales_per_client=sales/client’
- Cut out of columns whose names contains the string
internal
:
tblmap −irk’internal’
- Shuffle a file randomly by generating a random number per line:
cat file | tblmap −c"myrand=rand()" | tblcmd "sort −k myrand −t’,’" | tblmap −irk"myrand"
Compute ("reduce"
) aggregations (sum, average, etc) over groups of rows identified by key column(s).
Example: computes for each line the proportion of column sales
to the total for all lines with the same value
of column region
.
tblred -k'region' 'sales_ratio=sales/sum(sales)' sales.txt
Represent a snapshot of the type of data in a file. Summarize columns (e.g., proportion of character/numeric content, min/mean/median/max, missing values). A target column can be specified, in which case the correlation and predictive strength of all other columns are calculated (feature selection). The columns can be ordered by any of the output characteristics.
Example: Suppose file
contains the following lines:
name,sex,house_nr,height,shoe_size arthur,m,42,181,11.5 berta,f,101,163,8.5 chris,m,1333,175,10 don,m,77,185,12.5 elisa,f,204,166,7
Then tbldesc −k
shoe_size file prints:
COL NAME | TYPE NA CHAR INT MEAN MED MIN MAX STD | MAR R2ADJ CORR PVAL VALUES ======================================================================================================================================= 5 shoe_size | num 0.0 0.0 40.0 9.9 10 7 12.5 2.22 | 100.0 100.0 100.0 0.0 5 uniq: | | '8.5'x1=>8.5, | | '7'x1=>7, | | '12.5'x1=>12. | | 5, | | '10'x1=>10, | | '11.5'x1=>11. | | 5 _______________________________________________________________________________________________________________________________________ 4 height | int 0.0 0.0 100.0 174 175 163 185 9.43 | 100.0 83.8 93.7 1.9 5 uniq: | | '163'x1=>8.5, | | '166'x1=>7, | | '181'x1=>11.5 | | , | | '185'x1=>12.5 | | , '175'x1=>10 _______________________________________________________________________________________________________________________________________ 2 sex | char 0.0 100.0 0.0 | 100.0 71.0 4.6 2 uniq: | | 'm'x3=>11.3, | | 'f'x2=>7.75 _______________________________________________________________________________________________________________________________________ 1 name | char 0.0 100.0 0.0 | 100.0 0.0 100.0 5 uniq: | | 'don'x1=>12.5 | | , | | 'elisa'x1=>7, | | 'berta'x1=>8. | | 5, | | 'arthur'x1=>1 | | 1.5, | | 'chris'x1=>10 _______________________________________________________________________________________________________________________________________ 3 house_nr | int 0.0 0.0 100.0 351 101 42 1333 552 | 100.0 −32.7 −6.8 91.4 5 uniq: | | '42'x1=>11.5, | | '204'x1=>7, | | '77'x1=>12.5, | | '101'x1=>8.5, | | '1333'x1=>10 _______________________________________________________________________________________________________________________________________
Join files (in the relational algebra sense), taking care of necessary column matching and pre-sorting before performing unix join
. Options for inner, left, right, and outer joins are available.
Examples:
Suppose file1
is
name,street,house zorro,desert road,5 john,main st,2 arthur,pan−galactic bypass,42 arthur,main st,15
and file2
is
name,street,phone john,main st,654−321 arthur,main st,121−212 john,round cir,123−456
Then tbljoin file1 file2
gives
name,street,house,phone arthur,main st,15,121−212 john,main st,2,654−321
tbljoin −r file1 file2
gives
house,name,street,phone 42,arthur,main st,121−212 2,john,main st,654−321 NA,john,round cir,123−456
and tbljoin −lrn’?’ file1 file2
gives
name,street,house,phone arthur,main st,15,121−212 arthur,pan−galactic bypass,42,? john,main st,2,654−321 john,round cir,?,123−456 zorro,desert road,5,?
Interface for unix sort
command.
Example:
Suppose file
is
string_col,int_col,float_col b,0,1.43634633463252 ab,123346646,13.4E04 d,123346646,13.4E04 aa,2323232,-1.2E-3
Then tblsort -k'float_col/g,string_col/r' file
is equivalent to a sort key specification sort -k3g -k1r
; it generates:
string_col,int_col,float_col aa,2323232,-1.2E-3 b,0,1.43634633463252 d,123346646,13.4E04 ab,123346646,13.4E04
Format data to make it better human-readable. Calls unix column
command; additionally, numerical columns are reformatted and aligned.
Example:
Suppose file
is
string_col,int_col,float_col shrt,0,1.43634633463252 looooong striiiiiiiiiing,123346646,13.4E04 medium,2323232,-1.2E-3
Then tblcolumn file
generates:
string_col int_col___ float_col_ shrt 0 1.43635 looooong striiiiiiiiiing 123346646 134000. medium 2323232 -0.0012
Shortcut script: call tblcolumn
, and page through its output.
Run any command (e.g., sort
, uniq
) on the body of file, keep the header. Column name and delimiter substitution is supported on command line arguments.
Example:
Sort a csv file with header by column user_name
:
tblcmd ’sort −t"?delim" −k ?user_name’
Split a file into several subfiles based on the value of a key column.
Example: Suppose file
is
continent,country americas,us americas,mx europe,de europe,fr
Then tblsplit −rk’continent’ file
generates two files:
file.select.continent=americas:
country us mx
and file.select.continent=europe:
country de fr
Identify or remove duplicate (or isomorphic) columns.
A column y
is dependent on x
, written x−>y
, if there is a function f
with f(x)=y
for all x
.
x
is isomorphic to y
, if x−>y
and y−>x
.
The script tries to find dependencies and isomorphisms. The dependency graph can be printed out.
Example:
Suppose file
is
key1,key2,key3,key4,key5,key6 1,1,1,1,1,1 1,2,4,2,1,2 3,2,4,3,1,3 4,4,3,4,1,4 5,5,5,5,1,5
Then tbluniq −v file
gives
duplicates: {key4,key6} key1,key2,key3,key4,key5 1,1,1,1,1 1,2,4,2,1 3,2,4,3,1 4,4,3,4,1 5,5,5,5,1
and tbluniq −xv file
gives
isomorphic: {key2,key3} {key4,key6} dependencies: {key4,key6} −> key1 {key4,key6} −> {key2,key3} key1 −> key5 {key2,key3} −> key5 key1,key2,key4,key5 1,1,1,1 1,2,2,1 3,2,3,1 4,4,4,1 5,5,5,1
Concatenate files without repeating header.
Histogram of column values. Also supports "ascii plotting".
Transpose rows and columns.
Format csv data into LaTex format.