HadoopLink provides a framework for delegating the work of a map-reduce job to Mathematica kernels running on your Hadoop cluster and a suite of tools for working with your Hadoop cluster from a Mathematica notebook.
Wherever possible, HadoopLink provides an analogue to Mathematica's filesystem interaction functions for use with the Hadoop filesystem API. These functions are compatible with HDFS, the local filesystem, Amazon S3, and any other system that can be accessed through the Hadoop filesystem API.
Evaluate FileNameJoin[{$UserBaseDirectory, "Applications"}] in Mathematica and unpack the HadoopLink release archive to the indicated directory.
Building HadoopLink requires:
- Apache Ant
- Mathematica (version 7 or higher)
- Wolfram Workbench (for building the documentation notebooks)
- Hadoop version 0.20, patched to include the typed bytes binary data support for Hadoop Streaming.
HadoopLink was developed against the Cloudera Distribution for Hadoop, version 3.
The following properties affect the HadoopLink ant tasks:
- mathematica.dir: The path to your local Mathematica installation. It can be found by evaluating the $InstallationDirectory symbol in Mathematica.
- workbench.dir: Optional. The path to your Wolfram Workbench installation. If omitted, skip.docbuild will be set.
- skip.docbuild: Optional. Set this property to skip building the documentation.
Build the HadoopLink distribution by running:
ant -Dmathematica.dir=$MATHEMATICA_PATH -Dworkbench.dir=$WORKBENCH_PATH build
using appropriate values for your system.
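The way the three properties interact can be sketched as a small shell wrapper. This is only an illustration, not part of HadoopLink: the function name hadooplink_build and the DRY_RUN variable are hypothetical, and the value `true` for skip.docbuild is an assumption (the property list only says that the property needs to be set). MATHEMATICA_PATH and WORKBENCH_PATH are the same placeholders used in the command above.

```shell
# Hypothetical helper (not shipped with HadoopLink): assemble the ant
# invocation from MATHEMATICA_PATH and, if present, WORKBENCH_PATH.
hadooplink_build() {
    # mathematica.dir is the only required property.
    if [ -z "${MATHEMATICA_PATH:-}" ]; then
        echo "MATHEMATICA_PATH is not set" >&2
        return 1
    fi

    # Collect the -D options as positional parameters so that paths
    # containing spaces survive intact.
    set -- "-Dmathematica.dir=${MATHEMATICA_PATH}"

    if [ -n "${WORKBENCH_PATH:-}" ]; then
        # Workbench is available: the documentation notebooks get built.
        set -- "$@" "-Dworkbench.dir=${WORKBENCH_PATH}"
    else
        # No Workbench: skip the documentation build explicitly.
        # The value "true" is an assumption; any value should set the property.
        set -- "$@" "-Dskip.docbuild=true"
    fi

    # DRY_RUN=1 (a hypothetical switch) prints the command instead of running it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo ant "$@" build
    else
        ant "$@" build
    fi
}
```

With the function defined, running `DRY_RUN=1 hadooplink_build` shows the command that would be executed. Passing skip.docbuild explicitly is redundant when workbench.dir is omitted, since the build sets it in that case, but it makes the intent visible in build logs.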
There are a number of areas in which HadoopLink could be improved.
- Make sequence file export from Mathematica break writes up into chunks to avoid Java out of heap errors.
- Make sequence file import compatible with all Writable subclasses in org.apache.hadoop.io.
- Improve error handling in DFS interaction functions.
- Switch error messages from using Throw to Message.
- Add support for shipping package dependencies along with map-reduce jobs.
- Use MemoryConstrained in map-reduce tasks.
- Rewrite the reJar function in Java for better performance.
- Put record queues between Java map/reduce calls and Mathematica to reduce J/Link overhead.