Benchmarks

Please note that those benchmark informations are very old.

Please note that Ruby version was removed. It was slower, and Kubernetes community prefer tools written in Go.

Postgresql was faster than MySQL, so it was only supported in a dropped Ruby version. Go version uses Postgres.

Please not that recently (May 2018) influxDB was removed as well, Postgres is faster even when using it as a time-series database.

Here are historical results of benchmarks Go & Ruby tools:

gha2db tool benchmarks

We're trying 3 versions:

Ruby on Postgres.
Ruby on MySQL.
Go on Postgres.

On 2 data sets:

All kubernetes GHA events (orgs kubernetes, kubernetes-incubator, kubernetes-client, kubernetes-csi) on single month 2017-07-01 - 2017-08-01.
All GHA events (no org/repo filter) on 2 days 2017-08-01 - 2017-08-03 (Tue and Wed).

Columns:

Benchmark - benchmark name.
Events - number of GHA events created. Note that for All we have 2.5M events in just 2 days, while for only Kubernetes we have 66K events in a month.
Real time - time it took to compute.
User time - time it took to compute on all CPUs (so this is the time it would take on single CPU machine).
Parallelism - this is the ratio of User time to Real time - parallelism factor.
Range - length of data processed.

And final run for Kubernetes for all 2015-08-06 - 2017-08-26 using Go version of gha2db:

time PG_PASS='...' PG_DB='test' ./gha2db 2015-08-06 0 2017-08-26 0 'kubernetes,kubernetes-incubator,kubernetes-client,kubernetes-csi'

Outputs 1200426 GHA events in:

real  112m6.604s --> 6726s
user  1718m21.020s --> 103101s
sys 99m43.964s

Results Table:

Benchmark	Events	Real time	User time	Parallelism	Range
K8s Go / Psql	65851	5m5.3s	81m44.1s	16.06x	1 month
K8s Ruby / Psql	65851	63m26.817s	68m25.120s	1.078x	1 month
K8s Ruby / MySQL	65851	66m13.291s	69m45.604s	1.053x	1 month
All Go / Psql	2550663	6m4.652s	37m10.932s	6.12x	2 days
All Ruby / Psql	2550663	45m16.238s	50m19.916s	1.118x	2 days
All Ruby / MySQL	2550663	46m55.949s	40m43.796s	0.868x	2 days
Full K8s Go / Psql	1200426	1h52m6.6s	26h38m21s	15.33x	~2 years

Results

When processing only Kubernetes events, we still need to download, decompress, parse all JSON and select only those with specific org.

This is lightning fast in Go, while terribly slow in Ruby.

Ruby is not really multi-threaded (Ruby MRI), it uses GIL, and essentially it is just single threaded.

We can see max parallelism ratio about 1.11x which mean that even with 48 CPU cores, current Ruby implementation can make use of 1.11 cores.

Links:

Ruby threads not really use multiple CPUs
For true concurrency having more then 2 cores or 2 processors is required - but it may not work if implementation is single-threaded (such as the MRI).:
Ruby threads in parallel
Ruby interpreter GIL
JRuby:

JRuby has the significant architectural advantage to be able to leverage JVM threads without being constrained by a global interpreter lock (similarly to Rubinius), therefore achieving full parallelism within a process, which Ruby MRI cannot achieve despite leveraging OS threads.

Seems like only JRuby implementation has real MT processing:

Remember that only in JRuby threads are truly parallel (other interpreters implement GIL).

So Go will kill Ruby all the time! It is about 10x - 15x faster than Ruby on average.

One word: Go version can import all GitHub archives data (not discarding anything) for all Kubernetes orgs/repos, from the beginning on GitHub 2015-08-06 in about 2 hours!

We can also see that MySQL is very slightly slower that Postgres (but this is just for inserting data, without indexes defined yet). MySQL is a lot slower on metrics/queries - but this is not checked in this benchmark.

db2influx/gha2db_sync tools benchmarks

To benchmark db2influx we're just running a command that regenerates all metrics.

This is done automatically by sync/sync.rb tool that just call db2influx/db2influx.rb for all time range (when GHA2DB_RESETIDB env is set).

This also updates Postgres DB (since last run) using gha2db/gha2db.rb, so we need to run it once, to make sure no updates are needed.

And then we can start benchmark in another run (it will only recompute all InfluxDB time-series then).

Go version: time GHA2DB_RESETIDB=1 PG_PASS='...' IDB_PASS='...' PG_DB=test IDB_DB=test ./sync_go.sh

real  1m28.336s
user  1m54.352s
sys 1m11.860s

Ruby version: time GHA2DB_PSQL=1 GHA2DB_RESETIDB=1 PG_PASS='...' IDB_PASS='...' PG_DB=test IDB_DB=test ./sync_ruby.sh

real  1m46.342s
user  0m51.488s
sys 0m27.408s

Benchmark	Real time	User time	Parallelism	Range
All InfluxDB / Go	1m28.336s	1m54.352s	1.29x	~2 years
All InfluxDB / Ruby	1m46.342s	0m51.488s	0.484x	~2 years

We can see that this task is dominated by query Postgres time, so it is less important which programming language is used to do queries.

We can see that Ruby is very slightly slower.

Plase note that this regenerates ALL InfluxDB data for 2 years and is < 2 minutes.

Generating index takes < 3 minutes.

In a typical case it will only add new time-series since last run + eventually process single new GHA hour (since last run). It usually takes less than minute in both languages.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BENCHMARK.md

BENCHMARK.md

Benchmarks

gha2db tool benchmarks

Results

db2influx/gha2db_sync tools benchmarks

Files

BENCHMARK.md

Latest commit

History

BENCHMARK.md

File metadata and controls

Benchmarks

gha2db tool benchmarks

Results

db2influx/gha2db_sync tools benchmarks