DigitalPebble Ltd

28 repositories

spruce
Public
Enrichment pipeline for CUR reports that adds energy and carbon data, allowing you to report and reduce the impact of your cloud usage.
aws cloud sustainability apache-spark climate carbon-emissions greenops greensoftware
Java
•
Apache License 2.0
•2•5•8•1•Updated Aug 20, 2025
carbonara
Public archive
Enrichment pipeline for CUR / FOCUS reports that adds energy and carbon data, allowing you to report and reduce the impact of your cloud usage.
aws cloud sustainability climate focus carbon-emissions apachespark greenops greensoftware
Java
•
Apache License 2.0
•0•5•0•0•Updated Jul 18, 2025
benchmark
Public
StormCrawler topology to evaluate the performance of different backends and configurations
elasticsearch benchmark opensearch stormcrawler
Shell
•0•0•0•0•Updated Jul 1, 2025
digitalpebble.github.io
Public
Resources for the DigitalPebble website
SCSS
•0•0•0•0•Updated Jun 23, 2025
stormcrawler-docker
Public
Resources for running StormCrawler with Docker services
docker apache-storm stormcrawler
Dockerfile
•
Apache License 2.0
•3•10•0•0•Updated Nov 10, 2024
crawlurlfrontier
Public archive
Crawl config used to test URL Frontier on a large scale and produce WARCs for CommonCrawl.
FLUX
•0•1•0•0•Updated May 16, 2024
storm
Public
Mirror of Apache Storm
Java
•
Apache License 2.0
•4.1k•0•0•0•Updated Apr 10, 2024
tika-detector-stormcrawler
Public
Wraps the charset detection logic from StormCrawler as a Tika module
Java
•
Apache License 2.0
•1•0•0•0•Updated Feb 2, 2024
tika
Public
The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).
Java
•
Apache License 2.0
•843•0•0•0•Updated Jan 25, 2024
docs
Public
Documentation for Docker Official Images in docker-library
Shell
•
MIT License
•2.2k•0•0•0•Updated Jan 16, 2024
ansible-storm
Public
Ansible playbook for deploying a Storm cluster
storm playbook stormcrawler ansible
1•7•0•0•Updated Dec 7, 2023
nutch
Public
Apache Nutch is an extensible and scalable web crawler
Java
•
Apache License 2.0
•1.3k•1•0•0•Updated Nov 8, 2023
urlfrontier-client
Public
URLFrontier client written in Rust (mostly as a way of learning Rust)
rust grpc webcrawler url-frontier urlfrontier
Rust
•
Apache License 2.0
•0•1•0•0•Updated Dec 5, 2022
crawler4j-frontier-battle
Public
Java
•1•0•0•0•Updated Apr 6, 2022
TextClassification
Public
A Text Classification API in Java originally developed by DigitalPebble Ltd. The API is independent from the ML implementations used and can be used as a front end to various ML algorithms. libSVM and liblinear are currently embedded.
Java
•
Apache License 2.0
•22•48•1•0•Updated Sep 24, 2021
stormcrawlerfight
Public
Crawl configurations for benchmarking / testing StormCrawler
Shell
•
Apache License 2.0
•5•10•0•0•Updated Sep 19, 2019
behemoth
Public archive
Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.
nlp mapreduce java hadoop
Java
•
Other
•59•283•12•1•Updated Apr 25, 2018
crawler-commons
Public
A set of reusable Java components that implement functionality common to any web crawler
Java
•
Apache License 2.0
•85•4•0•0•Updated Apr 4, 2017
sc-warc
Public
WARC resources for StormCrawler
1•2•3•0•Updated Oct 20, 2016
tescobank
Public archive
Setup for crawling tescobank with SC
Java
•
Apache License 2.0
•2•4•0•0•Updated Sep 23, 2015
textclassification-examples
Public
Use cases for DigitalPebble's TextClassification API
Java
•
Apache License 2.0
•3•10•0•0•Updated Sep 1, 2015
behemoth-commoncrawl
Public archive
Support for old (pre 2013) CommonCrawl dataset in Behemoth
Java
•0•4•0•0•Updated Apr 20, 2015
tika-cc
Public
Resources for generating a corpus of docs from CC for Tika
Shell
•0•0•0•0•Updated Nov 28, 2014
NutchFight
Public
Resources for comparing versions 1.8 and 2.x of Apache Nutch
Java
•
Apache License 2.0
•0•4•0•0•Updated Jun 4, 2014
behemoth-elasticsearch
Public archive
ElasticSearch module for Behemoth
Java
•0•1•0•0•Updated Feb 12, 2014
behemoth-textclassification
Public archive
Module for classifying Behemoth documents with a model from our Text Classification API
Java
•0•1•0•0•Updated Nov 22, 2012
TextClassificationPlugin
Public archive
GATE Processing Resource wrapping DigitalPebble's TextClassification API
Java
•3•5•1•1•Updated Jul 12, 2012
ngrams-api
Public archive
Java API for querying an N-Grams corpus. Uses Lucene for searching and indexing from the Google Web-1T format
Java
•
Other
•2•5•0•0•Updated Apr 27, 2012