tribbloid edited this page Dec 21, 2014 · 10 revisions

SpookyStuff

... is a scalable query engine for web scraping, data mashup, and acceptance QA. The goal is to allow the Web to be queried and ETL'ed like a relational database.

SpookyStuff is the fastest big data collection engine in history, with a speed record of 330,404 dynamic pages queried per hour on 300 cores.

Powered by

  • Apache Spark
  • Selenium
    • GhostDriver/PhantomJS
  • JSoup
  • Apache Tika
  • (built by) Apache Maven
    • Scala/ScalaTest plugins
  • (deployed by) Ansible
  • The current implementation is influenced by Spark SQL and the Mahout Spark bindings.

