When working with tabular datasets, I often use ubiquitous command line tools like head, cut, and grep to get a basic idea of the data:
$ head -n5 nyc-taxi.csv \
| cut -f1-3 -d,
"vendor_id","pickup_at","dropoff_at"
"1",2019-01-01 00:46:40.000000,2019-01-01 00:53:20.000000
"1",2019-01-01 00:59:47.000000,2019-01-01 01:18:59.000000
"2",2018-12-21 13:48:30.000000,2018-12-21 13:52:40.000000
"2",2018-11-28 15:52:25.000000,2018-11-28 15:55:45.000000
However, CSV files are a poor choice for datasets in general, and it's much more common to see a format like Parquet used for anything except lowest-common-denominator data distribution.
This package provides coreutils-like binaries for interacting with Parquet files. The example above becomes:
$ dcat "'nyc-taxi.parquet'" \
| dcut -f vendor_id -f pickup_at -f dropoff_at \
| dhead -n5 \
| deval
"vendor_id","pickup_at","dropoff_at"
"1",2019-01-01 00:46:40.000000,2019-01-01 00:53:20.000000
"1",2019-01-01 00:59:47.000000,2019-01-01 01:18:59.000000
"2",2018-12-21 13:48:30.000000,2018-12-21 13:52:40.000000
"2",2018-11-28 15:52:25.000000,2018-11-28 15:55:45.000000
"2",2018-11-28 15:56:57.000000,2018-11-28 15:58:33.000000
The syntax of the d (for dataset) commands does differ from the coreutils commands, but hopefully it's similar enough that the two pipelines are roughly equivalent.
All of these commands accept the --help option to get a more detailed description of the individual command-line switches. I've specified all arguments as --flag/-f-style options, but the most common arguments can also be specified as positional arguments.
Unlike regular coreutils pipelines, all d pipelines must start with a dcat (this might be relaxed in the future, but for now it's easier to have a guaranteed terminus on each end of the pipeline).
dcat accepts one parameter: the dataset to query. This must be given in a form that DuckDB understands when pasted after a FROM in a SQL command. That means you must include single quotes if the SQL command given to DuckDB would require them.
Usage (note that the single quotes in the table name are passed through):
$ dcat "'nyc-taxi.parquet'"
$ dcat "parquet_scan('nyc-taxi.parquet')"
The dcut command is used to specify the columns to include in the output. If no dcut command is given, then all columns are included. Columns can be passed using --field (-f), similar to cut, or as positional arguments.
$ dcut vendor_id pickup_at dropoff_at
$ dcut -f vendor_id -f pickup_at -f dropoff_at
The dhead command limits the number of rows returned by the query. Note that if the rows aren't sorted and multiple files are being read, then the selection of rows is indeterminate. Positional arguments can be used instead of --lines (-n).
$ dhead -n5
$ dhead 5
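To make the selection deterministic when reading several files, you can sort first with dsort (described next). As a sketch, taking the five earliest pickups from a year's worth of files might look like:
$ dcat "'nyc-taxi-data/2018/*/data.parquet'" | dsort pickup_at | dhead -n5 | deval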
The dsort command sorts the results of the query by one or more fields. Use --reverse (-r) to reverse the sort direction. Fields can be specified using --field (-f) or as positional arguments.
Reversing the sort direction applies to all fields, similar to adding DESC after every column in an ORDER BY clause.
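For example (my paraphrase of the behaviour described above), a reversed two-field sort corresponds roughly to DESC on both columns:
$ dsort -r pickup_at trip_distance   # roughly ORDER BY pickup_at DESC, trip_distance DESC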
If dsort is used alongside dhead, note that the sorting takes place before the results are truncated, no matter which order dsort and dhead appear in the pipeline.
$ dsort -f vendor_id -f pickup_at
$ dsort -r vendor_id pickup_at
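Because of that, these two pipelines should return the same five longest trips, even though dsort and dhead are swapped:
$ dcat "'nyc-taxi.parquet'" | dsort -r trip_distance | dhead -n5 | deval
$ dcat "'nyc-taxi.parquet'" | dhead -n5 | dsort -r trip_distance | deval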
The dgrep command is the least like its coreutils equivalent, but that's because it's much more powerful. Instead of searching through an entire row, dgrep searches individual fields and allows all SQL predicates to be used.
Specify the field to search using --field (-f) and the value to find using --value (-v). Alternatively, these may be given as positional arguments in the order field, value.
The default predicate is LIKE, but this can be changed using the --predicate (-p) option. By default, parameters are treated as being of unknown types, and DuckDB is responsible for working out any typecasts. To force the value to be interpreted as text or an integer, use --text (-t) or --integer (-i). This will be reworked soon to allow for a wider range of types. In particular, TIMESTAMP types aren't yet supported.
$ dgrep -f vendor_id -v 1
$ dgrep vendor_id 1
$ dgrep -i tip_amount 0
$ dgrep -i tip_amount -p '>' 100
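Putting that together, on my reading of the description above, the last filter corresponds to a WHERE tip_amount > 100 condition in the generated query:
$ dcat "'nyc-taxi.parquet'" | dgrep -i tip_amount -p '>' 100 | deval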
If you've tried running any of the commands above individually, you'll have noticed that they don't print any data. Instead, they write a JSON-encoded query plan that gets consumed by deval. A pipeline does nothing unless deval is invoked.
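If you're curious what that intermediate plan looks like, you can pretty-print it. This assumes, per the description above, that the plan is a single JSON document written to standard output; python -m json.tool is just one convenient formatter:
$ dcat "'nyc-taxi.parquet'" | dcut -f vendor_id | python -m json.tool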
By default, deval will write in CSV format to standard output. To write to a file instead, use the --output (-o) option. To write Parquet, add the --parquet (-p) option.
$ dcat "'nyc-taxi.parquet'" | dhead | deval
$ dcat "'nyc-taxi.parquet'" | dhead | deval -o output.csv
$ dcat "'nyc-taxi.parquet'" | dhead | deval -o output.parquet -p
We can, for example, use the NYC taxi dataset (I'm using the Parquet version kindly hosted at s3://ursa-labs-taxi-data) to find people who used taxi vendor "1" and didn't tip, then show the top few furthest journeys:
$ dcat "'nyc-taxi-data/2018/*/data.parquet'" \
| dcut vendor_id pickup_at dropoff_at tip_amount trip_distance \
| dgrep vendor_id 1 \
| dgrep -i tip_amount 0 \
| dsort -r trip_distance \
| dhead \
| deval
The build system is Meson:
$ meson setup builddir
$ meson compile -C builddir
$ meson install -C builddir
Prerequisites: arrow, duckdb, and boost (specifically boost::program_options).
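After installation, a quick sanity check is to ask one of the binaries for its --help output (as mentioned above, all of the commands support it):
$ dcat --help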
There are a few things I plan to improve:
- Improve the parameter detection of dcat so that it can infer certain formats and doesn't just pass a raw string that's dumbly interpreted by DuckDB.
- Improve type detection in dgrep. Types should be inferred at query time rather than specified in dgrep. Predicates can be specified in dgrep, but a placeholder predicate should be passed through so a sensible default can be inferred at query time.
- Add support for large partitioned datasets. DuckDB supports predicate pushdown over Parquet... in Python and R. Weirdly, there isn't an equivalent to the parquet_scan table function for partitioned datasets, or even a C++ API to do it.
- Clean up the build directory and write some tests.