Last week, I tested the 0.9.x branch with the new bulk fetching feature and had high hopes of performance improvement. Our use case includes having multiple backends 'far away' from the frontend (80ms ping latency), so reducing the number of http calls between the frontend and the backends would greatly help us.
The speedup was definitely there, but somehow not as impressive as others reported. Probably because we try to use aggregated metrics instead of wildcards in our most frequent queries, so the number of http requests was already pretty low.
So I set out to find a way to improve the performance, by trying to fetch all targets in a single http request instead of doing one http call per target (per remote backend).
I've come up with a patch which can be found here: datacratic#2
I've written a bit of background and context in that pull request, but here's an overview of how it works:
When a render request comes in:
- Have a function extract all the pathExpressions (metric names/patterns) from the targets.
- Fetch all pathExpressions from all backends and store the results in a hash table keyed per backend and per metric name + timerange for easy lookups: `cache[remoteBackendName][originalTargetWithWildcards-startTime-endTime] = [series_from_remote_backend]`
- Store that hash table in the requestContext (which has the targets list, startTime, endTime, etc.).
- Continue processing as before (via the recursive parser, evaluateTarget).
- In the fetchData method, do a hash table lookup for the requested data.
- If the data is there, or if there is no data but a prefetch call was made for that pathExpr, skip the remote fetch and use the data from the cache.
- If there is a cache miss, do a regular remote fetch. This case is pretty rare; it happens when a function needs data outside the timerange of the original query, for example timeShift, movingAverage, etc.
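The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual patch: names like `Backend`, `bulk_fetch`, `extract_path_expressions`, and the dict-based `requestContext` are stand-ins for the real graphite-web internals.

```python
def extract_path_expressions(targets):
    # The real code walks the parsed target expressions; for this sketch
    # we simply treat each target string as one pathExpression.
    return list(targets)

def cache_key(path_expr, start_time, end_time):
    # Key shape from the overview: originalTarget-startTime-endTime.
    return "%s-%s-%s" % (path_expr, start_time, end_time)

def prefetch(request_context, backends):
    """One bulk call per backend, covering every pathExpression at once."""
    path_exprs = extract_path_expressions(request_context["targets"])
    cache = {}
    for backend in backends:
        series_by_expr = backend.bulk_fetch(
            path_exprs, request_context["startTime"], request_context["endTime"])
        cache[backend.name] = {
            cache_key(expr, request_context["startTime"],
                      request_context["endTime"]): series
            for expr, series in series_by_expr.items()
        }
    # Stored in the requestContext so fetchData can find it later.
    request_context["prefetched"] = cache

def fetch_data(request_context, backend, path_expr, start_time, end_time):
    """Serve from the prefetch cache; on a miss (e.g. a function asking for
    data outside the original timerange) fall back to a regular fetch."""
    key = cache_key(path_expr, start_time, end_time)
    cached = request_context.get("prefetched", {}).get(backend.name, {})
    if key in cached:
        return cached[key]
    return backend.fetch(path_expr, start_time, end_time)

class FakeBackend:
    """Stand-in remote backend for demonstration only."""
    def __init__(self, name, data):
        self.name, self.data = name, data
        self.bulk_calls = self.single_calls = 0
    def bulk_fetch(self, exprs, start, end):
        self.bulk_calls += 1
        return {e: self.data.get(e, []) for e in exprs}
    def fetch(self, expr, start, end):
        self.single_calls += 1
        return self.data.get(expr, [])
```

With this sketch, a render of two targets against one backend costs a single bulk HTTP call, and only an out-of-range request (the timeShift case) triggers an extra per-target fetch.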
I've done some testing and it seems to be faster in all our use cases (low-latency backends and high-latency backends). You can see the results of those rudimentary tests at the bottom of the pull request.
Unfortunately, the patch builds on another patch that we use to fetch data from backends in parallel, so I couldn't do a proper pull request for review.
Nonetheless, I thought you guys might be interested in seeing this since the speedup is quite significant, and I'd also appreciate your input on the more hacky parts of the patch.
Thanks.