This is the repo for the Analysis and Design of Information Systems project (2023-2024), which compares two time-series databases, InfluxDB and TimescaleDB, on the devops use case. It contains instructions to set up the environment, scripts that automate the execution of commands in order to avoid repetitive work, and the results of our research in PNG form.
Important notice: since this repo automates the bulk execution of the commands provided by TSBS, readers should first go through the TSBS README before proceeding here, for a better understanding of the commands and the environment we set up.
The following instructions set up the project on Ubuntu 16.04 machines. If that's not the case for you, links to the respective installation guides for your platform are provided below.
- TimescaleDB (v2.4.0) (single node version)
- InfluxDB v1 OSS (v1.8.10)
- PostgreSQL 13
- Golang (v1.21.6)
Every folder contained here has to be placed in the root directory of the tsbs repo that we'll clone (described later in the steps).
PNG files of the metrics compared in this project. These were created using the metrics_visualizer.py script, which reads data from influx.py and timescale.py (all of them located under Visualizer_Scripts) and visualizes them. In addition to these, influx_storage.py and timescale_storage.py are used to create InfluxDB_storage.png and TimescaleDB_storage.png respectively.
PDF files containing the final report and the data we based our results on.
```shell
# Update your system
sudo apt-get update
sudo apt-get upgrade

# Install the wget download utility
sudo apt-get install wget

# Download and install InfluxDB (v1.8.10)
wget https://dl.influxdata.com/influxdb/releases/influxdb_1.8.10_amd64.deb
sudo dpkg -i influxdb_1.8.10_amd64.deb
sudo service influxdb start

# Download and install PostgreSQL 13
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
echo "deb http://apt-archive.postgresql.org/pub/repos/apt/ $(lsb_release -cs)-pgdg main" | sudo tee /etc/apt/sources.list.d/pgdg.list
sudo apt-get update
sudo apt-get install postgresql-13
sudo -u postgres psql -c "SELECT version();" # Verify installation

# Download and install TimescaleDB
sudo add-apt-repository ppa:timescale/timescaledb-ppa
sudo apt-get update
sudo apt-get install timescaledb-2-postgresql-13
sudo timescaledb-tune
sudo service postgresql restart

# Download and install Go (v1.21.6)
wget https://golang.org/dl/go1.21.6.linux-amd64.tar.gz
sudo tar -C /usr/local -xzf go1.21.6.linux-amd64.tar.gz
# Go binaries, source files etc. are located in /usr/local/go

# Set these environment variables in ~/.bashrc
echo 'export PATH=$PATH:/usr/local/go/bin' >> ~/.bashrc
echo 'export GOPATH=$HOME/go' >> ~/.bashrc
echo 'export PATH=$PATH:$GOPATH/bin' >> ~/.bashrc
echo 'export GOROOT=/usr/local/go' >> ~/.bashrc
source ~/.bashrc
```
Your Go workspace should now be set up: the TSBS repo that we will work in will be cloned under $GOPATH/src, in the place where any other Go project would normally live.
```shell
# Create a dir named 'src' under $GOPATH,
# where we will store our repo
mkdir $GOPATH/src
cd $GOPATH/src

# Clone the TSBS repo and build it in the src we created
git clone https://github.com/timescale/tsbs.git
cd tsbs
make
```
We used TSBS for the data generation and loading. More specifically:
- Create 6 databases, 3 InfluxDB ones and 3 TimescaleDB ones, named small, medium and big (one of each size per DBMS).
- In the root directory of the tsbs repo we previously cloned, create a directory named datasets, in which the datasets will be stored.
- Right after, run generate_data.sh, which creates the 6 datasets using the tsbs_generate_data command, in gzipped form, and stores them under the directory we previously created.
```shell
mkdir datasets
# uncomment the next line if you don't have
# execution permission on the script
# chmod +x generate_data.sh
./generate_data.sh
```
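The contents of generate_data.sh are not reproduced in this README. As a rough sketch of what one of its entries might look like, here is a tsbs_generate_data call with the flag names taken from the TSBS documentation; the seed, scale and time range below are placeholders, not the values used in this project. It is wrapped in a function so it can be defined (and syntax-checked) without the TSBS binaries installed:

```shell
# Hypothetical sketch of one generate_data.sh entry; seed, scale and the
# time range are placeholders, not this project's actual parameters.
generate_influx_big() {
    tsbs_generate_data --use-case="devops" --seed=123 --scale=1000 \
        --timestamp-start="2023-01-01T00:00:00Z" \
        --timestamp-end="2023-01-04T00:00:00Z" \
        --log-interval="10s" --format="influx" \
        | gzip > datasets/influx_big.gz
}
```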
As for the data loading: since loading everything in bulk could cause problems on our machines due to storage limitations, there are no scripts for this step; run the commands one by one as described in the source documentation. A few examples:
```shell
# Load the "timescale_big" dataset to the "big" timescale db using 4 workers
cat ./datasets/timescale_big.gz | gunzip | tsbs_load_timescaledb --host="localhost" --port=5432 --pass="12345678" --user="postgres" --workers=4 --do-create-db=false --do-abort-on-exist=false --db-name="big"

# Load the "influx_big" dataset to the "big" influx db using 8 workers
cat ./datasets/influx_big.gz | gunzip | tsbs_load_influx --workers=8 --do-create-db=false --do-abort-on-exist=false --db-name="big"
```
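Note that the load commands pass --do-create-db=false, so the six databases must already exist. A minimal sketch of how they could be pre-created, assuming a default local setup (the InfluxDB v1 `influx` CLI and `psql`; enabling the timescaledb extension manually may be redundant depending on the loader version):

```shell
# Hypothetical helper: pre-create the small/medium/big databases in both DBMSs.
create_benchmark_dbs() {
    for db in small medium big; do
        # InfluxDB v1: create the database through the influx CLI
        influx -execute "CREATE DATABASE ${db}"
        # TimescaleDB: create the database, then enable the extension in it
        sudo -u postgres psql -c "CREATE DATABASE ${db};"
        sudo -u postgres psql -d "${db}" -c "CREATE EXTENSION IF NOT EXISTS timescaledb;"
    done
}
```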
Regarding the measurement of the storage each DB occupies, we execute the following:
- For Timescale databases:
```shell
# Print the space used by the medium timescale database
sudo -su postgres psql
\c medium
SELECT pg_size_pretty(pg_database_size(current_database()));
```
- For Influx databases:
```shell
# Print the space used by the big influx database
du -sh /var/lib/influxdb/data/big
```
For this step, we execute each script inside the generate_queries folder. Results are stored in gzipped form inside the queries folder. For example, in order to create and then view the zipped files in the queries/influx_big folder:

```shell
cd generate_queries
./influx_big.sh
cd ..
ls queries/influx_big
```

This will create 15 zipped files, 1 for each query type, each containing 1000 queries, to be used for the influx big database. For more info on the flags used, you can take a look at the corresponding section of the TSBS README mentioned earlier.
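For reference, here is a sketch of the kind of tsbs_generate_queries call such a script wraps for a single query type. The flag names come from the TSBS documentation, but the seed, scale, time range and output path below are placeholders, not this project's exact parameters:

```shell
# Hypothetical sketch of one query-generation call; seed, scale, time range
# and output path are placeholders for illustration only.
generate_one_query_type() {
    tsbs_generate_queries --use-case="devops" --seed=123 --scale=1000 \
        --timestamp-start="2023-01-01T00:00:00Z" \
        --timestamp-end="2023-01-04T00:00:01Z" \
        --queries=1000 --query-type="single-groupby-1-1-1" \
        --format="influx" \
        | gzip > queries/influx_big/single-groupby-1-1-1.gz
}
```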
At this point, we have generated and stored the queries that will be executed, so the next step is execution. Following the same logic as before, we move to the run_queries folder, which contains 2 folders, run_influx and run_timescale. Since the logic is similar, let's say that we enter the run_influx folder. There we find scripts named run_influx_small/medium/big_1/2/4, denoting the database that the queries will be run against and the number of workers to be used. For example, if we want to query the influx big database using 4 workers, we follow this flow:

```shell
cd run_queries
cd run_influx
./run_influx_big4.sh
```
The steps mentioned (data loading, query generation, query execution) produce metrics that we store in influx.py and timescale.py respectively. Each of these files contains a small description of the format of its data. The metrics_visualizer.py script then uses these data in order to visualize them, so that results can be easily inferred.
It's also worth mentioning that each execution of the queries is wrapped with the time command (/usr/bin/time). In this way we can get more useful metrics than those provided by the suite.
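As an example of what that wrapping can look like, here is a sketch assuming GNU time's -v flag and the query file name used earlier; it is not necessarily the exact invocation inside the run scripts:

```shell
# Hypothetical sketch: run one query file under /usr/bin/time to collect
# extra metrics (peak memory, CPU usage) beyond what TSBS itself reports.
run_influx_big_one_type() {
    cat queries/influx_big/single-groupby-1-1-1.gz | gunzip \
        | /usr/bin/time -v tsbs_run_queries_influx --workers=4 --db-name="big"
}
```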