Benchmarking Suite - Interval Join performance evaluation

This repository contains a benchmark suite for evaluation of Interval Join operation respectfully for Flink and Windflow implementation.

WindFlow Dependencies

In order to run the Flink implementation in this project, the following dependencies are needed:

a C++ compiler with full support for C++17 like GCC (GNU Compiler Collection) version >= 8 or Clang 5 and later
WindFlow library version >= 4.0.0 and all the required dependencies
FastFlow library version >= 3.0 and all the required dependencies
RapidJSON parser for C++. You can install the package on Ubuntu by running sudo apt-get install -y rapidjson-dev command.

Flink Dependencies

Whole suite can be run on local machine, further information can be found in the official documentation and in internal README file.

In order to run the Flink implementation in this project, the following dependencies are needed:

Apache Flink version >= 1.16.0
Java JDK version >= 1.11
Maven version >= 3.9.6

Run Benchmark script

You can generate whole test cases simply by running the run_benchmarks.sh located in the /scripts folder. The provided script cycles through all parameters specified in the config file and generates throughput and latency charts by running the draw_charts.py Python tool.

In order to install the draw_charts.py dependencies you need to have on your system pip: the package installer for Python. After that, simply set up a virtual environment by using the venv package and run the following command inside the /scripts folder:

pip install -r requirements.txt

Draw Charts Script

The draw_charts.py script is used to generate various charts based on the benchmark results. It supports generating latency, throughput, per batch, per source, and comparison charts.

Usage

python draw_charts.py <chart_type> [additional arguments]

Parameters

chart_type (str): The type of chart to draw. Valid options are:
- lt: Latency chart
- th: Throughput chart
- all: Both latency and throughput charts
- src: Average performance per source chart
- batch: Average performance per batch chart
- comparison: Comparison chart between all 3 execution modes (key parallelism, data parallelism, and Flink modes)

Additional Arguments

For comparison chart type:
- res_dir (str): Path to the results directory where the images will be saved.
- kp_dir (str): Path to the key parallelism mode directory.
- dp_dir (str): Path to the data parallelism mode directory.
- fl_dir (str): Path to the Flink tests directory.
- img_name (str): Name of the image file to generate.
For other chart types:
- tests_path (str): Path to the tests folders.

Example

To generate a comparison chart, you can use the following command:

python draw_charts.py comparison /path/to/results /path/to/kp_dir /path/to/dp_dir /path/to/fl_dir image_name

To generate a latency chart, you can use the following command:

python draw_charts.py lt /path/to/tests

Results folder structure example

.
└── results/
    └── {framework_type}/
        └── {workload_type}/
            └── [{parallelism_mode}]/
                └── [{synthetic_keys_number}]/
                    └── [{batching_size}]/
                        ├── {source_degree}/
                        │   ├── {1_test_1}/
                        │   │   ├── ...
                        │   │   ├── latency.pdf
                        │   │   └── throughput.pdf
                        │   ├── {2_test_2}/
                        │   ├── {3_test_4}/
                        │   ├── {4_test_6}/
                        │   └── avg_source.pdf
                        └── avg_batch.pdf

Synthetic Datesets

In order to generate the synthetic datasets you can compile and run the C++ tool located in /gen_dataset folder. Further instructions for using the tool can be found in the inner README file.

Real World Datasets

In order to run a benchmark, of each implementaion, with real datasets you need to download the datasets.tar.gz archive and unzip it by calling tar -zvxf datasets.tar.gz command into root of this repository. Otherwise you can run the ./download_datasets.sh script located in /scripts folder.

Credits for datasets goes to AllianceDB project.

Name		Name	Last commit message	Last commit date
Latest commit History 102 Commits
comparison_flink		comparison_flink
comparison_wf		comparison_wf
flink		flink
gen_dataset		gen_dataset
imgs		imgs
scripts		scripts
windflow		windflow
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Benchmarking Suite - Interval Join performance evaluation

WindFlow Dependencies

Flink Dependencies

Run Benchmark script

Draw Charts Script

Usage

Parameters

Additional Arguments

Example

Results folder structure example

Synthetic Datesets

Real World Datasets

Chart examples

About

Uh oh!

Uh oh!

Contributors 2

Uh oh!

Languages

License

DropB1t/IntervalJoinBenchmarks

Folders and files

Latest commit

History

Repository files navigation

Benchmarking Suite - Interval Join performance evaluation

WindFlow Dependencies

Flink Dependencies

Run Benchmark script

Draw Charts Script

Usage

Parameters

Additional Arguments

Example

Results folder structure example

Synthetic Datesets

Real World Datasets

Chart examples

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Uh oh!

Contributors 2

Uh oh!

Languages