---
title: What is Druid?
layout: html_page
sectionid: druid
---

<div class="container">
  <div class="page-header">
    <h1>Druid is...</h1>
    <h5 class="easter-egg">An Existential Journey</h5>
  </div>

  <div class="row">
    <div class="col-md-4">
      <ul class="nav nav-pills nav-stacked">
	<li><a href="#realrealtime"><i class="icon-chevron-right"> </i>What does Real&#178;time mean?</a></li>
	<li><a href="#where"><i class="icon-chevron-right"> </i>Where did Druid come from?</a></li>
	<li><a href="#used"><i class="icon-chevron-right"> </i>What is Druid used for?</a></li>
	<li><a href="#whois"><i class="icon-chevron-right"> </i>Who is using Druid?</a></li>
      </ul>
    </div>

    <div class="col-md-8">
      <h1 id="whatis">What is Druid?</h1>
      
      <div class="text-indent">
	<h3><p>Druid is open source infrastructure for real&#178;time exploratory analytics that supports fast ad-hoc queries on large-scale data sets.</p></h3>
	
	<h2 id="realtime">Real-time Ingestion</h2>
	
	<p class="text-indent-2"><strong>Real-time data.</strong> Typical analytics databases ingest data in batches. Ingesting one event at a time is often accompanied by transactional locks and other overhead that slow down the ingestion rate. Druid&#8217;s real-time nodes employ lock-free ingestion of append-only data sets to allow for simultaneous ingestion and querying of 10,000+ events per second. Simply put, the latency between when an event happens and when it is visible is limited only by how quickly the event can be delivered to Druid.</p>
	
	<h2 id="scalable">Scalable</h2>
	
	<p class="text-indent-2"><strong>In-memory or on-disk.</strong> Druid leverages the memory mapping capabilities of modern operating systems so that only relevant data is loaded into memory while the rest lives on disk. If your performance requirements dictate that the data must be in memory, you can configure each node to accept only as much data as fits in its available memory, and all of it will be in memory. If you are OK with having only the working set in memory, each node can hold more data than fits in memory on a given machine, and the requisite data will be paged into memory on demand.</p>
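	<p class="text-indent-2">As a rough illustration of the memory mapping idea (not Druid&#8217;s actual storage code), the sketch below uses Python&#8217;s standard <code>mmap</code> module. The file name, the fixed 8-byte integer layout, and the <code>read_value</code> helper are all assumptions made for the example.</p>

```python
import mmap
import os
import struct
import tempfile

# Write a hypothetical column of 1,000 little-endian 64-bit integers to disk.
# The layout and file name are illustrative assumptions, not Druid's format.
values = list(range(1000))
path = os.path.join(tempfile.mkdtemp(), "column.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<1000q", *values))


def read_value(path, position):
    """Read one 8-byte integer from the file through a read-only memory map."""
    with open(path, "rb") as f:
        # Mapping the file does not read it eagerly; the operating system
        # loads pages on demand when the mapped bytes are accessed.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            offset = position * 8
            return struct.unpack_from("<q", mapped, offset)[0]


value = read_value(path, 742)
```

	<p class="text-indent-2">Here the operating system, not the application, decides which parts of the file stay resident in memory.</p>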
	
	<p class="text-indent-2"><strong>Highly Available.</strong> Scaling up or down, replicating nodes, or recovering from failure typically impacts availability and performance. Druid uses a distributed architecture that allows replication at the segment level, relieving the load on &#8220;hot segments.&#8221; Because of replication, Druid also supports rolling deployments and restarts. Scale up or down just by adding or removing nodes: no data has to be re-processed or re-indexed, just re-replicated.</p>
	
	<h2 id="hri">Real-time Queries</h2>
	
	<p class="text-indent-2"><strong>Ad hoc, multi-dimensional filtering.</strong> Druid maintains bitmap indexes compressed using <a href="http://ricerca.mat.uniroma3.it/users/colanton/concise.html">CONCISE</a> to determine which data it has to look at before it ever starts scanning. This significantly speeds up ad hoc filtered queries, and even allows for fast OR queries, which are traditionally slow. All this comes <a href="http://metamarkets.com/2012/druid-bitmap-compression/">without a significant impact on data footprint</a>.</p>
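	<p class="text-indent-2">A minimal sketch of how a bitmap index can answer filters, assuming plain Python integers as uncompressed bitmaps (CONCISE compression is not shown). The rows, field names, and helper functions are hypothetical and are not part of Druid.</p>

```python
from collections import defaultdict


def build_bitmap_index(rows, dimension):
    """Map each distinct value of `dimension` to an int used as a bitmap.

    Bit i of the bitmap is set when row i holds that value.
    """
    index = defaultdict(int)
    for row_id, row in enumerate(rows):
        index[row[dimension]] |= 1 << row_id
    return index


def matching_row_ids(bitmap):
    """Return the ids of the rows whose bits are set in `bitmap`."""
    ids = []
    row_id = 0
    while bitmap:
        if bitmap & 1:
            ids.append(row_id)
        bitmap >>= 1
        row_id += 1
    return ids


# Hypothetical event rows; the field names are illustrative only.
rows = [
    {"country": "US", "device": "mobile"},
    {"country": "DE", "device": "desktop"},
    {"country": "US", "device": "desktop"},
    {"country": "FR", "device": "mobile"},
]

country = build_bitmap_index(rows, "country")
device = build_bitmap_index(rows, "device")

# country = 'US' OR device = 'mobile' is a single bitwise OR of two bitmaps.
or_rows = matching_row_ids(country["US"] | device["mobile"])

# country = 'US' AND device = 'desktop' is a single bitwise AND.
and_rows = matching_row_ids(country["US"] & device["desktop"])
```

	<p class="text-indent-2">The set of matching rows is known from the bitmaps alone, before any row data is read.</p>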
	
	<p class="text-indent-2"><strong>Column-oriented for speed.</strong> Data is laid out in columns so that scans are limited to the specific data being searched. <a href="/blog/2011/05/20/druid-part-deux.html">Compression decreases overall data footprint.</a></p>
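	<p class="text-indent-2">To illustrate why a columnar layout limits scans, here is a small sketch that pivots rows into per-column lists. The event fields and the <code>to_columns</code> helper are assumptions for the example and do not reflect Druid&#8217;s segment format.</p>

```python
# Hypothetical events in row-oriented form; field names are illustrative.
rows = [
    {"timestamp": 1, "publisher": "a.com", "impressions": 10},
    {"timestamp": 2, "publisher": "b.com", "impressions": 25},
    {"timestamp": 3, "publisher": "a.com", "impressions": 7},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Summing one metric reads a single column; the timestamp and publisher
# columns are never touched by this query.
total_impressions = sum(columns["impressions"])
```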
      </div>
      
      <hr id="realrealtime">
      <h1>What does Real&#178;time mean?</h1>
      
      <div class="text-indent">
	<p>Real&#178;time reflects the fact that Druid encompasses both of the common definitions of real-time in the data processing space.</p>      
	<p>&#8220;Real-time queries&#8221; refers to responsive or interactive queries. That is, you have your data and want to be able to ask questions of it quickly.</p>
	<p>&#8220;Real-time ingestion&#8221; refers to ingesting data and making it available for querying in real time. That is, minimizing the latency between when an event occurs and when it is reflected in your query results.</p>
      </div>
      
      <hr id="where">
      <h1>Where did Druid come from?</h1>
      
      <div class="text-indent">
	<p>Druid was created out of necessity by Metamarkets, a company focused on providing real-time interactive insight to the RTB (real-time bidding) AdTech space with a full stack analytics service. Metamarkets required a system that could ingest data in real time and provide ad hoc N-dimensional drill-down while still delivering sub-second responses. As a hosted service, Metamarkets also required zero-downtime deployments, fault tolerance and self-healing properties.</p>
	<p>Metamarkets is fully committed to the AdTech use case. However, it was felt that Druid had more general applicability to other spaces and that it was in Metamarkets&#8217; best interests not to limit Druid&#8217;s development solely to its own use cases, so Druid was <a href="/blog/2012/10/24/introducing-druid.html">opened up</a>.</p>
      </div>
      
      <hr id="used">
      <h1>What is Druid used for?</h1>
      
      <div class="text-indent">
	<p>Druid is purpose-built infrastructure for exploring very large quantities of data as it is ingested into the system. It is currently used for dashboarding of ad impression streams and operational monitoring of systems. If you have a dataset that is too large for your current infrastructure, your data has a timestamp associated with every event, and you want to filter the data arbitrarily with your queries, then Druid can probably provide value for your use case as well.</p>
	
	<h2 id="immediate_insight_to_large_quantities_of_data">Immediate insight to large quantities of data:</h2>
	<p class="text-indent-2">The low latency between when data is ingested into Druid and when that data is reflected in queries allows users to understand what is going on &#8220;right now&#8221; instead of a few minutes ago.</p>
	
	<h2 id="deep_exploratory_drilldown">Deep, exploratory drill-down:</h2>
	<p class="text-indent-2">Users value the ability to filter on many arbitrary dimensions without hurting performance, scalability or cost viability (compute infrastructure). Ad hoc drill-down allows users to query both &#8220;fresh&#8221;, just-ingested data and historical data all at once.</p>
      </div>
      
      <hr id="whois">
      <h1>Who is using Druid?</h1>
      
      <div class="text-indent">
	<h2 id="metamarkets">Metamarkets</h2>
	<p class="text-indent-2">Druid is the primary data store for Metamarkets’ full stack visual analytics service for the RTB (real-time bidding) space. Ingesting over 30 billion events per day, Metamarkets is able to provide insight to its customers using complex ad hoc queries at a 95th percentile query time of around 1 second.</p>
	
	<h2 id="netflix">Netflix</h2>
	<p class="text-indent-2">Netflix engineers use Druid to aggregate multiple data streams, ingesting up to two terabytes per hour, with the ability to query data as it&#8217;s being ingested. They use Druid to pinpoint anomalies within their infrastructure, endpoint activity and content flow.</p>
	
	<h2 id="madvertise">Madvertise</h2>
	<p class="text-indent-2">Madvertise uses Druid for real-time drill-down reporting. Madvertise is also contributing back to the community by creating and maintaining a Ruby client library for interacting with Druid, available at <a href="http://github.com/madvertise/ruby-druid">http://github.com/madvertise/ruby-druid</a>.</p>
      </div>
    </div>
  </div>
</div>