Skip to content

This is a repo for the "Analsysis and Design of Information Systems" Project (2023-2024), which compares 2 timeseries databases, InfluxDB and TimescaleDB.

Notifications You must be signed in to change notification settings

LavredisG/NTUA-Analysis-and-Design-of-Information-Systems

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

85 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

NTUA-Analysis-and-Design-of-Information-Systems

This is a repo for the Analysis and Design of Information Systems Project (2023-2024), which compares 2 timeseries databases, InfluxDB and TimescaleDB on the devops use case. It contains instructions to set the environment up, scripts to automate the execution of commands in order to avoid repetitive work and the results of our research in PNG form.

Important notice: Since this repo uses the commands provided by TSBS in order to automate their bulk execution, readers should first read that README, before proceeding here for a better understanding of the commands and the environment we set.

The following instructions set up the project in Ubuntu 16.04 machines. If that's not the case for you, links for the respective guides are provided below to support your machine.

Repo Structure

scripts:

Every folder contained here has to be placed in the tsbs root directory of the repo that we'll clone (described later in the steps).

Metrics_Comparison:

PNG files of the metrics compared in this project. These were created using the metrics_visualizer.py script, which reads data from the influx.py and timescale.py - all of them located under Visualizer_Scripts - to visualize them. In addition to these, influx_storage.py and timescale_storage.py are used to create InfluxDB_storage.png and TimescaleDB_storage.png respectively.

Reports:

PDF files containing the final report and the data we based our results on.

1. Installation and Setup

Download and Install tools and databases

# Update your system
sudo apt-get update
sudo apt-get upgrade

# Install wget package manager
sudo apt-get install wget

# Download and install InfluxDB (v1.8.10)
wget https://dl.influxdata.com/influxdb/releases/influxdb_1.8.10_amd64.deb
sudo dpkg -i influxdb_1.8.10_amd64.deb
sudo service influxdb start

# Download and install Postgres 13
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
echo "deb http://apt-archive.postgresql.org/pub/repos/apt/ $(lsb_release -cs)-pgdg main" | sudo tee /etc/apt/sources.list.d/pgdg.list
sudo apt-get update
sudo apt-get install postgresql-13
sudo -u postgres psql -c "SELECT version();" # Verify installation

# Download and install TimescaleDB
sudo add-apt-repository ppa:timescale/timescaledb-ppa
sudo apt-get update
sudo apt-get install timescaledb-2-postgresql-13
sudo timescaledb-tune
sudo service postgresql restart

#Download and install Go (v.1.21.6)
wget https://golang.org/dl/go1.21.6.linux-amd64.tar.gz
sudo tar -C /usr/local -xzf go1.21.6.linux-amd64.tar.gz

Set environment variables for Go workspace

# Go binaries, source files etc are located in /usr/local/go
# Set these environment variables in ~/.bashrc
echo 'export PATH=$PATH:/usr/local/go/bin' >> ~/.bashrc
echo 'export GOPATH=$HOME/go' >> ~/.bashrc
echo 'export PATH=$PATH:$GOPATH/bin' >> ~/.bashrc
echo 'export GOROOT=/usr/local/go' >> ~/.bashrc
source ~/.bashrc

Your Go workspace should now be set up and look like this: (the repo that we will work in (TSBS) will be cloned where project1 is, under src)

Go workspace

Clone TSBS repo

# Create a dir named 'src' under $GOPATH
# where we will store our repo
mkdir $GOPATH/src
cd $GOPATH/src

# Clone the TSBS repo and build it in the src we created
git clone https://github.com/timescale/tsbs.git
cd tsbs
make

2. Data generation and loading to the DBs

We used TSBS for the data generation and loading. More specifically:

  • Create 6 databases, 3 Influx DBs and 3 Timescale DBs, named small, medium and big, 1 for each DB.
  • In the tsbs repo we previously cloned and under the root directory, create a directory named datasets, in which datasets will be stored.
  • Right after, we run the generate_data.sh, which creates the 6 datasets using the tsbs_generate_data command, in a zipped form, and stores them under the directory we previously created.
mkdir datasets
# uncomment the next line if you don't have
# execution permission on the script 
# chmod +x generate_data.sh
./generate_data.sh

As for the data loading, since it's a process that could cause problems to our machines due to storage limitations if done on bulk, there are no scripts, so you have to follow them as is from the source documentation. We provide a few of them for examples:

# Load the "timescale_big" dataset to the "big" timescale db using 4 workers
cat ./datasets/timescale_big.gz | gunzip | tsbs_load_timescaledb --host="localhost" --port=5432 --pass="12345678" --user="postgres" --workers=4 --do-create-db=false --do-abort-on-exist=false --db-name="big"

# Load the influx_big dataset to the "big" influx db using 8 workers
cat ./datasets/influx_big.gz | gunzip | tsbs_load_influx --workers=8 --do-create-db=false --do-abort-on-exist=false --db-name="big"

Regarding the measurement of the storage each DB occupies, we execute the following:

  • For Timescale databases:
# Print the space used by medium timescale database
sudo -su postgres psql
\c medium
SELECT pg_size_pretty(pg_database_size(current_database()));
  • For Influx databases:
# Print the space used by big influx database
  du -sh /var/lib/infludb/data/big

3. Query generation

For this step, we execute each script inside the generate_queries folder. Results are stored in a zipped form inside the queries folder. For example, in order to create and then view the zipped files in the queries/influx_big folder:

cd generate_queries
./influx_big.sh
cd ..
ls queries/influx_big

This will create 15 zipped files, 1 for each query type, containing 1000 queries each, to be used for the influx big database. For more info on the flags used, you can take a look at the corresponding section of TSBS as mentioned earlier.

4. Measurement and visualization of relevant performance metrics

At this point, we have generated and stored the queries that will be executed. So the next step is execution. Following the same logic as before, we move to the run_queries folder, where we are faced with 2 folders run_influx and run_timescale. Since the logic is similar, let's say that we get in the influx folder. We are then faced with scripts named run_influx_small/medium/big_1/2/4, denoting the base that the queries will be run against and the number of workers to be used. For example, if we want to query the influx big database using 4 workers, we follow this flow:

cd run_queries
cd run_influx
./run_influx_big4.sh

The steps mentioned (data loading, query generation, query execution) produce metrics that we store in the influx.py and timescale.py respectively. Each of these files contains a small description about the format of its data. The metrics_visualiser.py then uses these data in order to visualize them, so that results can be easily inferred.

It's also worth mentioning that each execution of the queries, is following the time command (/usr/bin/time). In this way we can get more useful metrics than those provided by the suite.

About

This is a repo for the "Analsysis and Design of Information Systems" Project (2023-2024), which compares 2 timeseries databases, InfluxDB and TimescaleDB.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •