    Repositories list

    • spruce

      Public
      Enrichment pipeline for CUR reports that adds energy and carbon data, so you can report and reduce the impact of your cloud usage.
      Java
      2581 Updated Aug 20, 2025
    • carbonara

      Public archive
      Enrichment pipeline for CUR / FOCUS reports that adds energy and carbon data, so you can report and reduce the impact of your cloud usage.
      Java
      0500 Updated Jul 18, 2025
    • benchmark

      Public
      StormCrawler topology to evaluate the performance of different backends and configurations
      Shell
      0000 Updated Jul 1, 2025
    • Resources for the DigitalPebble website
      SCSS
      0000 Updated Jun 23, 2025
    • Resources for running StormCrawler with Docker services
      Dockerfile
      31000 Updated Nov 10, 2024
    • crawlurlfrontier

      Public archive
      Crawl config used to test URL Frontier on a large scale and produce WARCs for CommonCrawl.
      FLUX
      0100 Updated May 16, 2024
    • storm

      Public
      Mirror of Apache Storm
      Java
      4.1k000 Updated Apr 10, 2024
    • Wraps the charset detection logic from StormCrawler as a Tika module
      Java
      1000 Updated Feb 2, 2024
    • tika

      Public
      The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).
      Java
      843000 Updated Jan 25, 2024
    • docs

      Public
      Documentation for Docker Official Images in docker-library
      Shell
      2.2k000 Updated Jan 16, 2024
    • Ansible playbook for deploying a Storm cluster
      1700 Updated Dec 7, 2023
    • nutch

      Public
      Apache Nutch is an extensible and scalable web crawler
      Java
      1.3k100 Updated Nov 8, 2023
    • URLFrontier client written in Rust (mostly as a way of learning Rust)
      Rust
      0100 Updated Dec 5, 2022
    • Java
      1000 Updated Apr 6, 2022
    • A Text Classification API in Java originally developed by DigitalPebble Ltd. The API is independent from the ML implementations used and can be used as a front end to various ML algorithms. libSVM and liblinear are currently embedded.
      Java
      224810 Updated Sep 24, 2021
    • Crawl configurations for benchmarking / testing StormCrawler
      Shell
      51000 Updated Sep 19, 2019
    • behemoth

      Public archive
      Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.
      Java
      59283121 Updated Apr 25, 2018
    • A set of reusable Java components that implement functionality common to any web crawler
      Java
      85400 Updated Apr 4, 2017
    • sc-warc

      Public
      WARC resources for StormCrawler
      1230 Updated Oct 20, 2016
    • tescobank

      Public archive
      Setup for crawling tescobank with SC
      Java
      2400 Updated Sep 23, 2015
    • Use cases for DigitalPebble's TextClassification API
      Java
      31000 Updated Sep 1, 2015
    • behemoth-commoncrawl

      Public archive
      Support for old (pre 2013) CommonCrawl dataset in Behemoth
      Java
      0400 Updated Apr 20, 2015
    • tika-cc

      Public
      Resources for generating a corpus of docs from CC for Tika
      Shell
      0000 Updated Nov 28, 2014
    • Resources for comparison between 1.8 and 2.x of Apache Nutch
      Java
      0400 Updated Jun 4, 2014
    • ElasticSearch module for Behemoth
      Java
      0100 Updated Feb 12, 2014
    • Module for classifying Behemoth documents with a model from our Text Classification API
      Java
      0100 Updated Nov 22, 2012
    • GATE Processing Resource wrapping DigitalPebble's TextClassification API
      Java
      3511 Updated Jul 12, 2012
    • ngrams-api

      Public archive
      Java API for querying an N-Grams corpus. Uses Lucene for searching and indexing from the Google Web-1T format
      Java
      2500 Updated Apr 27, 2012