Awesome Hadoop

A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources. Inspired by Awesome PHP, Awesome Python and Awesome Sysadmin.

Awesome Hadoop
Resources
Other Awesome Lists

Hadoop

Apache Hadoop - Apache Hadoop
Apache Tez
SpatialHadoop - SpatialHadoop is a MapReduce extension to Apache Hadoop designed specially to work with spatial data.
GIS Tools for Hadoop - Big Data Spatial Analytics for the Hadoop Framework
Elasticsearch Hadoop - Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig.
dumbo - Python module that allows you to easily write and run Hadoop programs.
hadoopy - Python MapReduce library written in Cython.
mrjob - mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs.
pydoop - Pydoop is a package that provides a Python API for Hadoop.
hdfs-du - HDFS-DU is an interactive visualization of the Hadoop distributed file system.
White Elephant - Hadoop log aggregator and dashboard
Kiji Project
Genie - Genie provides RESTful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple Hadoop resources and perform job submissions across them.
Kylin - Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.
Crunch - Go-based toolkit for ETL and feature extraction on Hadoop

YARN

Apache Slider - Apache Slider is a project in incubation at the Apache Software Foundation with the goal of making it possible and easy to deploy existing applications onto a YARN cluster.
Apache Twill - Apache Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their application logic.
mpich2-yarn - Running MPICH2 on YARN

NoSQL

Next-generation databases, mostly addressing some of these points: being non-relational, distributed, open source and horizontally scalable.

Apache HBase - Apache HBase
Apache Phoenix - A SQL skin over HBase
happybase - A developer-friendly Python library to interact with Apache HBase.
Hannibal - Hannibal is a tool to help monitor and maintain HBase clusters that are configured for manual splitting.
Haeinsa - Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase
hindex - Secondary Index for HBase
Apache Accumulo - The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.
OpenTSDB - The Scalable Time Series Database
Apache Cassandra

SQL on Hadoop

Apache Hive
Hive Plugins
UDF
- http://nexr.github.io/hive-udf/
- https://github.com/edwardcapriolo/hive_cassandra_udfs
- https://github.com/livingsocial/HiveSwarm
- https://github.com/ThinkBigAnalytics/Hive-Extensions-from-Think-Big-Analytics
- https://github.com/karthkk/udfs
- https://github.com/kevinweil/elephant-bird - Twitter
- https://github.com/lovelysystems/ls-hive
- https://github.com/stewi2/hive-udfs
- https://github.com/klout/brickhouse
- https://github.com/markgrover/hive-translate (PostgreSQL translate())
- https://github.com/deanwampler/HiveUDFs
- https://github.com/myui/hivemall (Machine Learning UDF/UDAF/UDTF)
- https://github.com/edwardcapriolo/hive-geoip (GeoIP UDF)
Storage Handler
SerDe
Libraries and tools
- https://github.com/forward/rbhive
- https://github.com/synctree/activerecord-hive-adapter
- https://github.com/hrp/sequel-hive-adapter
- https://github.com/forward/node-hive
- https://github.com/recruitcojp/WebHive
- shib - WebUI for query engines: Hive and Presto
- clive - Clojure library for interacting with Hive via Thrift
- http://www.phphiveadmin.net/
- https://github.com/anjuke/hwi
- https://code.google.com/a/apache-extras.org/p/hipy/
- https://github.com/dmorel/Thrift-API-HiveClient2 (Perl - HiveServer2)
- PyHive - Python interface to Hive and Presto
- https://github.com/recruitcojp/OdbcHive
- Hive-Sharp
- HiveRunner - An open source unit test framework for Hadoop Hive queries based on JUnit4
- Beetest - A super simple utility for testing Apache Hive scripts locally for non-Java developers.
- Hive_test - Unit test framework for Hive and hive-service
Cloudera Impala
Presto
Apache Tajo
Apache Drill

Workflow, Lifecycle and Governance

Apache Oozie - Apache Oozie
Azkaban
Apache Falcon - Data management and processing platform

Data Ingestion and Integration

Apache Flume - Apache Flume
Flume Plugins
Flume MongoDB Sink
Flume HornetQ Channel
Flume MessagePack Source
Flume RabbitMQ source and sink
Flume UDP Source
Stratio Ingestion - Custom sinks: Cassandra, MongoDB, Stratio Streaming and JDBC
Flume Custom Serializers
Real-time analytics in Apache Flume
.NET FlumeNG Clients
Suro - Netflix's distributed Data Pipeline
Apache Sqoop - Apache Sqoop
Apache Kafka - Apache Kafka

DSL

Apache Pig - Apache Pig
Apache DataFu - A collection of libraries for working with large-scale data in Hadoop
varaha - Machine learning and natural language processing with Apache Pig
packetpig - Open Source Big Data Security Analytics
akela - Mozilla's utility library for Hadoop, HBase, Pig, etc.
seqpig - Simple and scalable scripting for large sequencing data sets (e.g., bioinformatics) in Hadoop
Lipstick - Pig workflow visualization tool. Introducing Lipstick on A(pache) Pig
PigPen - PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it.

Libraries and Tools

Kite Software Development Kit - A set of libraries, tools, examples, and documentation
gohadoop - Native Go clients for Apache Hadoop YARN.
Hue - A Web interface for analyzing data with Apache Hadoop.
Zeppelin
Jumbune - Jumbune is an open-source product built for analyzing Hadoop clusters and MapReduce jobs.
Apache Thrift
Apache Avro - Apache Avro is a data serialization system.
Elephant Bird - Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.
Spring for Apache Hadoop

Realtime Data Processing

Apache Storm
Apache Samza

Distributed Computing and Programming

Apache Spark
Apache Crunch
Cascading - Cascading is the proven application development platform for building data applications on Hadoop.
Apache Flink - Apache Flink is a platform for efficient, distributed, general-purpose data processing.

Packaging, Provisioning and Monitoring

Apache Bigtop - Apache Bigtop: Packaging and tests of the Apache Hadoop ecosystem
Apache Ambari - Apache Ambari
Ganglia Monitoring System
ankush - A big data cluster management tool that creates and manages clusters of different technologies.
Apache ZooKeeper - Apache ZooKeeper
Apache Curator - ZooKeeper client wrapper and rich ZooKeeper framework
Buildoop - Hadoop Ecosystem Builder
Deploop - The Hadoop Deploy System

Search

Elasticsearch
Apache Solr
SenseiDB - Open-source, distributed, realtime, semi-structured database

Benchmark

Big Data Benchmark
HiBench
Big-Bench
hive-benchmarks
hive-testbench - Testbench for experimenting with Apache Hive at any data scale.

Machine learning and Big Data analytics

Apache Mahout
Cloudera Oryx - The Oryx open source project provides simple, real-time large-scale machine learning / predictive analytics infrastructure.
MLlib - MLlib is Apache Spark's scalable machine learning library.
R - R is a free software environment for statistical computing and graphics.
RHive - RHive is an R extension facilitating distributed computing via Apache Hive.
RHadoop

Misc.

Resources

Various resources, such as books, websites and articles.

Websites

Useful websites and articles

Hadoop Weekly
The Hadoop Ecosystem Table
Hadoop 1.x vs 2
Apache Hadoop YARN: Yet Another Resource Negotiator
Introducing Apache Hadoop YARN
Apache Hadoop YARN - Background and an Overview
Apache Hadoop YARN - Concepts and Applications
Apache Hadoop YARN - ResourceManager
Apache Hadoop YARN - NodeManager
Migrating to MapReduce 2 on YARN (For Users)
Migrating to MapReduce 2 on YARN (For Operators)
Hadoop and Big Data: Use Cases at Salesforce.com
All you wanted to know about Hadoop, but were too afraid to ask: genealogy of elephants.
What is Bigtop, and Why Should You Care?
Hadoop - Distributions and Commercial Support
Ganglia configuration for a small Hadoop cluster and some troubleshooting
Hadoop illuminated - Open Source Hadoop Book
NoSQL Database
10 Best Practices for Apache Hive
Hadoop Operations at Scale

Presentations

Hadoop 24/7
An example Apache Hadoop Yarn upgrade
Apache Hadoop In Theory And Practice
Hadoop Operations at LinkedIn
Hadoop Performance at LinkedIn
Docker based Hadoop provisioning

Books

Hadoop: The Definitive Guide
Hadoop Operations
Apache Hadoop YARN
HBase: The Definitive Guide
Programming Pig
Programming Hive
Hadoop in Practice, Second Edition
Hadoop in Action, Second Edition

Other Awesome Lists

Other amazingly awesome lists can be found in the awesome-awesomeness list.
