
Tools for working with parquet, impala, and hive. This fork allows setting the output parquet file size when merging files.


nandrzej/herringbone

 
 


Herringbone

Herringbone is a suite of tools for working with parquet files on hdfs, and with impala and hive.

The available commands are:

flatten: transform a directory of parquet files with a nested structure into a directory of parquet files with a flat schema that can be loaded into impala or hive (neither of which supports nested schemas). Default output directory is /path/to/input/directory-flat.

$ herringbone flatten -i /path/to/input/directory [-o /path/to/non/default/output/directory]

load: load a directory of parquet files (which must have a flat schema) into impala or hive (defaulting to impala). Use the --nocompute-stats option for faster loading into impala (but probably slower querying later on!)

$ herringbone load [--hive] [-u] [--nocompute-stats] -d db_name -t table -p /path/to/parquet/directory
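Because load expects a flat schema, a common sequence is to run flatten first and then point load at the flattened output. The following is a minimal sketch of that sequence, not part of herringbone itself: the run and flatten_then_load helpers and the DRY_RUN variable are hypothetical names introduced here, and only flags shown in this README are used. It relies on the documented default output directory (the input path with -flat appended).

```shell
# Hypothetical wrapper; only the flags shown in this README are used.
# DRY_RUN defaults to 1, so commands are printed rather than executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

flatten_then_load() {
  input=${1%/}   # strip one trailing slash so the suffix lands on the directory name
  db=$2
  table=$3
  # flatten writes to "<input>-flat" by default; load then reads from there.
  run herringbone flatten -i "$input" &&
    run herringbone load -d "$db" -t "$table" -p "${input}-flat"
}

flatten_then_load /path/to/input/directory db_name table
```

Setting DRY_RUN=0 in the environment would execute the commands instead of printing them, assuming herringbone is on your path.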

tsv: transform a directory of parquet files into a directory of tsv files (which you can concat properly later with hadoop fs -getmerge /path/to/tsvs). Default output directory is /path/to/input/directory-tsv.

$ herringbone tsv -i /path/to/input/directory [-o /path/to/non/default/output/directory]
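To end up with a single local tsv file, the tsv command can be followed by the getmerge step mentioned above. Note that hadoop fs -getmerge takes a local destination file as its second argument. The sketch below only prints the two commands; the tsv_export_plan helper and the merged.tsv file name are hypothetical, and paths are assumed to contain no whitespace.

```shell
# Hypothetical helper: print the commands that export parquet to one local tsv file.
# Assumes the default output directory "<input>-tsv" described in this README.
tsv_export_plan() {
  input=${1%/}      # strip one trailing slash before appending the suffix
  local_file=$2     # local destination required by `hadoop fs -getmerge`
  printf 'herringbone tsv -i %s\n' "$input"
  printf 'hadoop fs -getmerge %s-tsv %s\n' "$input" "$local_file"
}

tsv_export_plan /path/to/input/directory merged.tsv
```

The printed plan can be reviewed first and then piped to sh on a machine where both herringbone and hadoop are available.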

compact: transform a directory of parquet files into a directory of fewer, larger parquet files. Default output directory is /path/to/input/directory-compact.

$ herringbone compact -i /path/to/input/directory [-o /path/to/non/default/output/directory]

See herringbone COMMAND --help for more information on a specific command.

Building

You'll need thrift 0.9.1 on your path.

$ git clone https://github.com/stripe/herringbone.git
$ cd herringbone
$ mvn package
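Since the build depends on a specific thrift version, it can help to check what is on your path before running mvn package. This is a rough sketch under one assumption: that thrift -version prints a line of the form "Thrift version X.Y.Z". The is_expected_thrift helper is a hypothetical name introduced here.

```shell
# Assumption: `thrift -version` prints a line of the form "Thrift version X.Y.Z".
is_expected_thrift() {
  [ "$1" = "Thrift version 0.9.1" ]
}

if ! command -v thrift >/dev/null 2>&1; then
  echo "thrift not found on PATH"
elif is_expected_thrift "$(thrift -version 2>/dev/null)"; then
  echo "thrift 0.9.1 found"
else
  echo "thrift found, but not version 0.9.1"
fi
```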

Authors
