hdfsutils is a Ruby gem that provides utilities (ls, find, and eventually others) for HDFS (Hadoop Distributed File System).
This gem uses the webhdfs interface, which provides fast, compatible, remote access to files and directories stored in HDFS.
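
For orientation, webhdfs is an HTTP REST interface: each file system operation is a request to a URL under `/webhdfs/v1` with an `op` query parameter. The sketch below only builds the URL for a directory listing and makes no network call. The helper name `liststatus_url` and the sample host, path, and username are illustrative assumptions, not part of this gem.

```ruby
require 'uri'

# Hypothetical helper (not part of hdfsutils): build the webhdfs REST URL
# that lists the contents of an HDFS directory.
def liststatus_url(host, port, path, username)
  query = URI.encode_www_form('op' => 'LISTSTATUS', 'user.name' => username)
  URI::HTTP.build(host: host, port: port, path: "/webhdfs/v1#{path}", query: query)
end

# Sample values only; substitute the address of a real webhdfs service.
url = liststatus_url('localhost', 50070, '/user/alice', 'alice')
puts url
```

An HTTP GET on a URL of this shape returns a JSON description of the directory entries.
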
Settings are read from the following sources, listed from lowest to highest precedence:
- Defaults in this repository.
- Standard Hadoop configuration files.
- Environment variables.
- Command-line options.
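
One way to picture this precedence order is as a series of hash merges in which each later source overwrites the earlier ones. This is a minimal sketch under that assumption, not the gem's actual code. The method name `resolve_settings` and all sample values are hypothetical.

```ruby
# Hypothetical sketch: each source is a Hash of settings, and sources
# that come later in the list win when the same key appears twice.
def resolve_settings(defaults, hadoop_conf, environment, command_line)
  [defaults, hadoop_conf, environment, command_line].reduce({}) do |merged, layer|
    # Drop nil values so an unset source never masks a lower-priority value.
    merged.merge(layer.reject { |_key, value| value.nil? })
  end
end

# Made-up values for illustration only.
defaults     = { host: 'localhost', port: 50070, log_level: 'fatal' }
hadoop_conf  = { host: 'namenode.example.com' }
environment  = { port: 14000, log_level: nil }
command_line = { log_level: 'debug' }

settings = resolve_settings(defaults, hadoop_conf, environment, command_line)
puts settings.inspect
```

In this example the host comes from the Hadoop configuration, the port from the environment, and the log level from the command line.
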
The following environment variables may be used to configure the utilities.
Variable | Description | Default |
---|---|---|
HDFS_HOST | The hostname of the webhdfs server. | localhost |
HDFS_PORT | The TCP port number of the webhdfs service. | 50070 |
HDFS_USERNAME | The username used to access HDFS. | The value of the HADOOP_USER_NAME or USER shell environment variable. |
HDFS_URI | The location of the webhdfs service: [webhdfs://]hostname[:port] | webhdfs://localhost:50070 |
HDFS_DOAS | HTTP doas username to use with webhdfs. | none |
HDFS_PROXYHOST | HTTP proxy host to use with webhdfs. | none |
HDFS_PROXYPORT | HTTP proxy port to use with webhdfs. | none |
HADOOP_CONF_DIR | The directory that contains Hadoop configuration files. | /etc/hadoop |
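
As an illustration of how two of these variables might be interpreted, the sketch below parses the `[webhdfs://]hostname[:port]` form of HDFS_URI and picks a username. It is not the gem's implementation. The helper names are hypothetical, and checking HADOOP_USER_NAME before USER is an assumption that the table above does not confirm.

```ruby
DEFAULT_PORT = 50070

# Hypothetical helper: split "[webhdfs://]hostname[:port]" into host and port,
# falling back to the default port when none is given.
def parse_hdfs_uri(text)
  match = %r{\A(?:webhdfs://)?([^:/]+)(?::(\d+))?\z}.match(text.to_s)
  raise ArgumentError, "invalid HDFS URI: #{text.inspect}" unless match
  { host: match[1], port: (match[2] || DEFAULT_PORT).to_i }
end

# Hypothetical helper: choose the HDFS username from an environment Hash.
# Assumption: HADOOP_USER_NAME is consulted before USER.
def username_from_env(env = ENV)
  env['HDFS_USERNAME'] || env['HADOOP_USER_NAME'] || env['USER']
end

location = parse_hdfs_uri(ENV.fetch('HDFS_URI', 'webhdfs://localhost:50070'))
puts "host=#{location[:host]} port=#{location[:port]} user=#{username_from_env}"
```
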
All of the utilities take the following options, which override the environment variables when specified.
Option | Description | Default |
---|---|---|
--hdfsuri=[webhdfs://]hostname[:port] | The location of the webhdfs service. | webhdfs://localhost:50070 |
--log-level=[debug\|info\|warn\|error\|fatal] | Logging level. When debug is specified, failures will generate a stack trace. | fatal |
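
The two common options could be parsed with Ruby's standard `optparse` library roughly as follows. This is a sketch under that assumption and does not reproduce the gem's own options.rb files. The method name `parse_common_options` is hypothetical, and the defaults are copied from the table above.

```ruby
require 'optparse'

LOG_LEVELS = %w[debug info warn error fatal].freeze

# Hypothetical helper: parse the options shared by all utilities and
# return them together with the remaining (non-option) arguments.
def parse_common_options(argv)
  options = { hdfsuri: 'webhdfs://localhost:50070', log_level: 'fatal' }
  parser = OptionParser.new do |opts|
    opts.on('--hdfsuri=URI', 'Location of the webhdfs service') do |uri|
      options[:hdfsuri] = uri
    end
    # Passing LOG_LEVELS makes OptionParser reject any other value.
    opts.on('--log-level=LEVEL', LOG_LEVELS, 'Logging level') do |level|
      options[:log_level] = level
    end
  end
  remaining = parser.parse(argv)
  [options, remaining]
end

options, paths = parse_common_options(['--log-level=debug', '/user/alice'])
puts options.inspect
puts paths.inspect
```

An unsupported level such as `--log-level=verbose` raises `OptionParser::InvalidArgument` in this sketch.
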
Altiscale has just started developing hdfsutils. We're focusing on delivering a specific use case for one of our customers, but intend to build a much more complete set of utilities. Contributions are welcome.
To add new functionality to an existing utility, you'll probably want to edit the utility's options.rb file and the utility implementation.
To develop a completely new utility, find, copy, and modify the template code. The template code files, as of the time this documentation was written, are:
```
$ find . -path '*template*'
./bin/hdtemplate
./lib/hdfsutils/utils/hdtemplate
./lib/hdfsutils/utils/hdtemplate/implementation.rb
./lib/hdfsutils/utils/hdtemplate/options.rb
./lib/hdfsutils/utils/hdtemplate/template.rb
./spec/utils/hdtemplate_spec.rb
```
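
The copy step can be planned with a short dry-run script like the one below, which prints `cp` commands without touching any files. The naming scheme (replacing `template` with the new utility's short name, here `mv`) is an assumption inferred from the file list above, and `copy_plan` is a hypothetical helper.

```ruby
# Template files taken from the listing above (the directory entry is omitted).
TEMPLATE_FILES = %w[
  bin/hdtemplate
  lib/hdfsutils/utils/hdtemplate/implementation.rb
  lib/hdfsutils/utils/hdtemplate/options.rb
  lib/hdfsutils/utils/hdtemplate/template.rb
  spec/utils/hdtemplate_spec.rb
].freeze

# Hypothetical helper: pair each template file with the path a new utility
# would use, assuming "template" is simply replaced by the new short name.
def copy_plan(new_name)
  TEMPLATE_FILES.map { |source| [source, source.gsub('template', new_name)] }
end

# Dry run: print the copy commands instead of executing them.
copy_plan('mv').each { |source, target| puts "cp #{source} #{target}" }
```

After copying, the class and module names inside the new files (for example `HdfsUtils::Template`) still need to be renamed by hand.
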
The code in all pull requests must pass the rubocop and rspec tests. New functionality should be submitted with corresponding rspec unit tests. The best way to run rubocop and rspec is to use rvm, bundler, and rake. Assuming that rvm is already installed with bundler in the default gemset, run rake as follows:
```
$ rvm use @hdfsutils-devel --create
ruby-2.0.0-p353 - #gemset created /Users/chaiken/.rvm/gems/ruby-2.0.0-p353@hdfsutils-devel
ruby-2.0.0-p353 - #generating hdfsutils-devel wrappers..........
Using /Users/chaiken/.rvm/gems/ruby-2.0.0-p353 with gemset hdfsutils-devel
$ bundle install
Fetching gem metadata from https://rubygems.org/............
Fetching version metadata from https://rubygems.org/..
Resolving dependencies...
<installs the development dependencies in hdfsutils.gemspec>
Bundle complete! <D> Gemfile dependencies, <G> gems now installed.
Use `bundle show [gemname]` to see where a bundled gem is installed.
$ rake
Running RuboCop...
Inspecting <F> files
..............................
<F> files inspected, no offenses detected
<path>/ruby <path>/rspec --pattern spec/\*\*\{,/\*/\*\*\}/\*_spec.rb
HdfsUtils::Ls
<ls utility tests>
HdfsUtils::Template
<template utility tests>
Finished in <N> seconds (files took <M> seconds to load)
<X> examples, <F> failures
```
- support HADOOP_USER_NAME shell environment variable
- hdmv implementation
- reuse webhdfs connections, if possible
- unix, si and iec filesize units and human-readable option
- help formatting fix
- improvements based on Altiscale customer feedback
Original Release
- David Chaiken
- Max Ziff
- HeeSoo Kim
Apache License Version 2.0 (See LICENSE.txt)