This repository contains a set of stream processing applications taken from the literature and from existing repositories (e.g., here), which have been cleaned up. The applications can be run in a uniform manner, and each run collects throughput and latency statistics in different ways.
Below we list the applications and their availability in different Stream Processing Engines and Libraries. We consider Apache Storm, Apache Flink and WindFlow (link):
Application | Acronym | Apache Storm | Apache Flink | WindFlow |
---|---|---|---|---|
FraudDetection | FD | Yes | Yes | Yes |
SpikeDetection | SD | Yes | Yes | Yes |
TrafficMonitoring | TM | Yes | Yes | Yes |
WordCount | WC | Yes | Yes | Yes |
Yahoo! Streaming Benchmark | YSB | Yes | Yes | Yes |
LinearRoad | LR | Yes | Yes | Yes |
VoipStream | VS | Yes | Yes | Yes |
SentimentAnalysis | SA | No | No | Yes |
LogProcessing | LP | No | No | Yes |
MachineOutlier | MO | No | No | Yes |
ReinforcementLearner | RL | No | No | Yes |
This repository also contains small datasets used to run the applications, except for LinearRoad and VoipStream. For these two applications, datasets can be generated as described in 1 and 2. Once generated, please copy the dataset files into the `Datasets/LR` and `Datasets/VS` folders, respectively. The datasets are used by all versions of the same application in all the supported frameworks. For the Yahoo! Streaming Benchmark (YSB) and ReinforcementLearner (RL), no dataset is required by the present implementation (synthetic data are continuously generated by the Sources).
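The copy step for the LinearRoad and VoipStream datasets can be scripted. The following is a minimal sketch, not part of this repository: the helper name `install_dataset` and its arguments are hypothetical, and it assumes the generated files sit directly inside a single directory. Only the `Datasets/LR` and `Datasets/VS` target folders come from this README.

```python
import shutil
from pathlib import Path

# Target folders named in this README; everything else below is hypothetical.
DATASET_DIRS = {"LR": Path("Datasets") / "LR", "VS": Path("Datasets") / "VS"}


def install_dataset(generated_dir, repo_root, app):
    """Copy every regular file in generated_dir into <repo_root>/Datasets/<app>.

    Returns the list of destination paths, sorted by file name.
    """
    if app not in DATASET_DIRS:
        raise ValueError(f"unknown application acronym: {app!r}")
    source = Path(generated_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"no such directory: {source}")

    target = Path(repo_root) / DATASET_DIRS[app]
    target.mkdir(parents=True, exist_ok=True)

    copied = []
    for item in sorted(source.iterdir()):
        if item.is_file():  # subdirectories are skipped, not copied recursively
            shutil.copy2(item, target / item.name)
            copied.append(target / item.name)
    return copied
```

For example, calling `install_dataset("/path/to/generated/lr", ".", "LR")` from the repository root would place the generated LinearRoad files under `Datasets/LR`; the source path here is a placeholder.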
This repository has not been fully cleaned up, and some code is duplicated. The reason is that each application, for each framework, is designed to be a separate standalone project. Refer to the README file within each subfolder (application/framework) for further information about how to run each application and for the required dependencies.
This repository uses the applications that we have recently added to a larger benchmark suite of streaming applications called DSPBench, available on GitHub at the following link. If our applications prove useful for your research, we kindly ask you to give credit to our effort by citing the following paper:
@article{DSPBench,
author={Bordin, Maycon Viana and Griebler, Dalvan and Mencagli, Gabriele and Geyer, Cláudio F. R. and Fernandes, Luiz Gustavo L.},
journal={IEEE Access},
title={DSPBench: A Suite of Benchmark Applications for Distributed Data Stream Processing Systems},
year={2020},
volume={8},
number={},
pages={222900-222917},
doi={10.1109/ACCESS.2020.3043948}
}
The main developer and maintainer of this repository is Gabriele Mencagli. Other authors of the source code are Alessandra Fais, Andrea Cardaci and Cosimo Agati.