This repository contains a benchmark suite for evaluating the Interval Join operation in both the Flink and WindFlow implementations.
In order to run the WindFlow implementation in this project, the following dependencies are needed:
- a C++ compiler with full support for C++17, such as GCC (GNU Compiler Collection) version >= 8 or Clang version >= 5
- WindFlow library version >= 4.0.0 and all the required dependencies
- FastFlow library version >= 3.0 and all the required dependencies
- RapidJSON parser for C++. On Ubuntu you can install it by running the following command:
sudo apt-get install -y rapidjson-dev
The whole suite can be run on a local machine; further information can be found in the official documentation and in the internal README file.
In order to run the Flink implementation in this project, the following dependencies are needed:
- Apache Flink version >= 1.16.0
- Java JDK version >= 11
- Maven version >= 3.9.6
You can run all the test cases simply by executing the run_benchmarks.sh script located in the /scripts folder. The script cycles through all parameters specified in the config file and generates throughput and latency charts by invoking the draw_charts.py Python tool.
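The parameter sweep performed by the script can be sketched in Python as below; note that the parameter names and values here are assumptions for illustration, not the actual config keys used by run_benchmarks.sh:

```python
from itertools import product

# Illustrative sketch of a parameter sweep similar in spirit to
# run_benchmarks.sh; the parameter names and values are assumptions.
def sweep(config):
    """Yield one dict per combination of the configured parameter lists."""
    keys = sorted(config)
    for values in product(*(config[k] for k in keys)):
        yield dict(zip(keys, values))

# e.g. sweep({"source_degree": [1, 2], "batch_size": [0, 32]})
# yields 4 parameter combinations.
```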
To install the draw_charts.py dependencies you need pip, the package installer for Python, on your system. Then set up a virtual environment using the venv package and run the following command inside the /scripts folder:
pip install -r requirements.txt
The draw_charts.py script generates various charts from the benchmark results. It supports latency, throughput, per-batch, per-source, and comparison charts.
python draw_charts.py <chart_type> [additional arguments]
chart_type (str): The type of chart to draw. Valid options are:
- lt: Latency chart
- th: Throughput chart
- all: Both latency and throughput charts
- src: Average performance per source chart
- batch: Average performance per batch chart
- comparison: Comparison chart between all 3 execution modes (key parallelism, data parallelism, and Flink modes)

For the comparison chart type:
- res_dir (str): Path to the results directory where the images will be saved.
- kp_dir (str): Path to the key parallelism mode directory.
- dp_dir (str): Path to the data parallelism mode directory.
- fl_dir (str): Path to the Flink tests directory.
- img_name (str): Name of the image file to generate.

For other chart types:
- tests_path (str): Path to the tests folders.
To generate a comparison chart, you can use the following command:
python draw_charts.py comparison /path/to/results /path/to/kp_dir /path/to/dp_dir /path/to/fl_dir image_name
To generate a latency chart, you can use the following command:
python draw_charts.py lt /path/to/tests
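The positional-argument handling described above can be sketched as follows; this is a hypothetical illustration, not the actual draw_charts.py code:

```python
# Hypothetical sketch of the chart-type dispatch described above;
# this is NOT the actual draw_charts.py implementation.
CHART_TYPES = {"lt", "th", "all", "src", "batch", "comparison"}

def parse_args(argv):
    """Validate the chart type and map positional arguments to names."""
    if not argv or argv[0] not in CHART_TYPES:
        raise ValueError(f"chart_type must be one of {sorted(CHART_TYPES)}")
    chart_type, rest = argv[0], argv[1:]
    if chart_type == "comparison":
        names = ("res_dir", "kp_dir", "dp_dir", "fl_dir", "img_name")
        if len(rest) != len(names):
            raise ValueError("comparison needs: res_dir kp_dir dp_dir fl_dir img_name")
        return chart_type, dict(zip(names, rest))
    if len(rest) != 1:
        raise ValueError(f"{chart_type} needs exactly one argument: tests_path")
    return chart_type, {"tests_path": rest[0]}
```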
.
└── results/
└── {framework_type}/
└── {workload_type}/
└── [{parallelism_mode}]/
└── [{synthetic_keys_number}]/
└── [{batching_size}]/
├── {source_degree}/
│ ├── {1_test_1}/
│ │ ├── ...
│ │ ├── latency.pdf
│ │ └── throughput.pdf
│ ├── {2_test_2}/
│ ├── {3_test_4}/
│ ├── {4_test_6}/
│ └── avg_source.pdf
└── avg_batch.pdf
To generate the synthetic datasets you can compile and run the C++ tool located in the /gen_dataset folder. Further instructions for using the tool can be found in its README file.
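The real generator is the C++ tool above; purely to illustrate what a synthetic interval-join input can look like, here is a hedged Python sketch (the field layout and distributions are assumptions, not the tool's actual format):

```python
import random

# Purely illustrative: the actual generator is the C++ tool in /gen_dataset.
# The (key, timestamp, value) layout and distributions are assumptions.
def gen_synthetic(num_tuples, num_keys, seed=42):
    """Generate tuples with keys in [0, num_keys) and increasing timestamps."""
    rng = random.Random(seed)
    ts = 0
    stream = []
    for _ in range(num_tuples):
        ts += rng.randint(1, 10)   # strictly increasing event timestamps
        stream.append((rng.randrange(num_keys), ts, rng.random()))
    return stream
```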
To run a benchmark of either implementation with real datasets, you need to download the datasets.tar.gz archive and extract it into the root of this repository by running the tar -zvxf datasets.tar.gz command. Alternatively, you can run the ./download_datasets.sh script located in the /scripts folder.
Credit for the datasets goes to the AllianceDB project.