HadoopLink

HadoopLink provides a framework for delegating the work of a map-reduce job to Mathematica kernels running on your Hadoop cluster, along with a suite of tools for working with that cluster from a Mathematica notebook.

Features

Distributed Filesystem Interaction

Wherever possible, HadoopLink provides an analogue to Mathematica's filesystem interaction functions for use with the Hadoop filesystem API. These functions are compatible with HDFS, the local filesystem, Amazon S3, and any other system that can be accessed through the Hadoop filesystem API.

How to install HadoopLink

Evaluate FileNameJoin[{$UserBaseDirectory, "Applications"}] in Mathematica, then unpack the HadoopLink release archive into the directory it returns.

How to build HadoopLink

Building HadoopLink requires:

Apache Ant
Mathematica (version 7 or higher)
Wolfram Workbench (for building the documentation notebooks)
Hadoop version 0.20, patched to include the typed bytes binary data support for Hadoop Streaming.

HadoopLink was developed against the Cloudera Distribution for Hadoop, version 3.

The following properties affect the HadoopLink Ant tasks:

mathematica.dir: The path to your local Mathematica installation. Can be found by evaluating the $InstallationDirectory symbol in Mathematica.
workbench.dir: Optional. The path to your Wolfram Workbench installation. If omitted, skip.docbuild will be set.
skip.docbuild: Optional. Set this property to skip building the documentation.

Build the HadoopLink distribution by running:

ant -Dmathematica.dir=$MATHEMATICA_PATH -Dworkbench.dir=$WORKBENCH_PATH build

using appropriate values for your system.
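
As a rough sketch of how the properties above combine, the shell function below assembles the ant command line. It always passes mathematica.dir, passes workbench.dir when a Workbench path is given, and otherwise sets skip.docbuild explicitly. The function name hadooplink_build, the ANT override variable, and the value true for skip.docbuild are assumptions made for illustration and are not part of HadoopLink. Setting skip.docbuild by hand should be redundant when workbench.dir is omitted, since the build sets it in that case.

```shell
# Hypothetical helper (not shipped with HadoopLink).
# Usage: hadooplink_build MATHEMATICA_DIR [WORKBENCH_DIR]
#   hadooplink_build "$MATHEMATICA_PATH"                    # skip the documentation
#   hadooplink_build "$MATHEMATICA_PATH" "$WORKBENCH_PATH"  # build the documentation too
hadooplink_build() {
  mathematica_dir=$1
  workbench_dir=${2:-}

  # mathematica.dir is required; refuse to run without it.
  if [ -z "$mathematica_dir" ]; then
    echo "usage: hadooplink_build MATHEMATICA_DIR [WORKBENCH_DIR]" >&2
    return 2
  fi

  # Rebuild the positional parameters as the list of ant arguments,
  # so paths containing spaces stay intact.
  set -- "-Dmathematica.dir=$mathematica_dir"
  if [ -n "$workbench_dir" ]; then
    set -- "$@" "-Dworkbench.dir=$workbench_dir"
  else
    # Assumption: any value sets the property; "true" is used for readability.
    set -- "$@" "-Dskip.docbuild=true"
  fi

  # ANT may be overridden (for example ANT=echo) to preview the command.
  "${ANT:-ant}" "$@" build
}
```

Running with ANT=echo prints the arguments instead of starting a build, which is a cheap way to check the property values before a long run.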

To do

There are a number of areas in which HadoopLink could be improved.

Make sequence file export from Mathematica break writes up into chunks to avoid Java out-of-memory (heap space) errors.
Make sequence file import compatible with all Writable subclasses in org.apache.hadoop.io.
Improve error handling in DFS interaction functions.
Switch error reporting from Throw to Message.
Add support for shipping package dependencies along with map-reduce jobs.
Use MemoryConstrained in map-reduce tasks.
Rewrite the reJar function in Java for better performance.
Put record queues between Java map/reduce calls and Mathematica to reduce J/Link overhead.
