Last week, I tested the 0.9.x branch with the new bulk fetching feature and had high hopes of performance improvement. Our use case includes having multiple backends 'far away' from the frontend (80ms ping latency), so reducing the number of http calls between the frontend and the backends would greatly help us.
The speedup was definitely there, but somehow not as impressive as others reported. Probably because we try to use aggregated metrics instead of wildcards in our most frequent queries, so the number of http requests was already pretty low.
So I set out to find a way to improve the performance, by trying to fetch all targets in a single http request instead of doing one http call per target (per remote backend).
I've come up with a patch which can be found here: datacratic#2
I've written a bit of background and context in that pull request, but here's an overview of how it works:
When a render request comes in:
- Have a function extract all the pathExpressions (metric names/patterns) from the targets.
- Fetch all pathExpressions from all backends and store the results in a hash table keyed per backend and per metric name + timerange for easy lookups: `cache[remoteBackendName][originalTargetWithWildcards-startTime-endTime] = [series_from_remote_backend]`
- Store that hash table in the requestContext (which has the targets list, startTime, endTime, etc.).
- Continue processing as before (via the recursive parser, evaluateTarget).
- In the fetchData method, do a hash table lookup for the requested data.
- If the data is there, or if there is no data but a prefetch call was made for that pathExpr, skip the remote fetch and use the data from the cache.
- If there is a cache miss, do a regular remote fetch. This case is pretty rare; it happens when a function needs data outside the timerange of the original query, for example timeShift, movingAverage, etc.
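The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual patch: names like `Backend`, `bulk_fetch`, `extract_path_expressions`, and the dict-based `requestContext` are stand-ins for the real graphite-web internals.

```python
def extract_path_expressions(targets):
    # The real code walks the parsed target expressions; for this sketch
    # we simply treat each target string as one pathExpression.
    return list(targets)

def cache_key(path_expr, start_time, end_time):
    # Key shape from the overview: originalTarget-startTime-endTime.
    return "%s-%s-%s" % (path_expr, start_time, end_time)

def prefetch(request_context, backends):
    """One bulk call per backend, covering every pathExpression at once."""
    path_exprs = extract_path_expressions(request_context["targets"])
    cache = {}
    for backend in backends:
        series_by_expr = backend.bulk_fetch(
            path_exprs, request_context["startTime"], request_context["endTime"])
        cache[backend.name] = {
            cache_key(expr, request_context["startTime"],
                      request_context["endTime"]): series
            for expr, series in series_by_expr.items()
        }
    # Stored in the requestContext so fetchData can find it later.
    request_context["prefetched"] = cache

def fetch_data(request_context, backend, path_expr, start_time, end_time):
    """Serve from the prefetch cache; on a miss (e.g. a function asking for
    data outside the original timerange) fall back to a regular fetch."""
    key = cache_key(path_expr, start_time, end_time)
    cached = request_context.get("prefetched", {}).get(backend.name, {})
    if key in cached:
        return cached[key]
    return backend.fetch(path_expr, start_time, end_time)

class FakeBackend:
    """Stand-in remote backend for demonstration only."""
    def __init__(self, name, data):
        self.name, self.data = name, data
        self.bulk_calls = self.single_calls = 0
    def bulk_fetch(self, exprs, start, end):
        self.bulk_calls += 1
        return {e: self.data.get(e, []) for e in exprs}
    def fetch(self, expr, start, end):
        self.single_calls += 1
        return self.data.get(expr, [])
```

With this sketch, a render of two targets against one backend costs a single bulk HTTP call, and only an out-of-range request (the timeShift case) triggers an extra per-target fetch.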
I've done some testing and it seems to be faster in all our use cases (low-latency backends and high-latency backends). You can see the results of those rudimentary tests at the bottom of the pull request.
Unfortunately, the patch builds on another patch that we use to fetch data from backends in parallel, so I couldn't do a proper pull request for review.
Nonetheless, I thought you guys might be interested in seeing this since the speedup is quite significant, and I'd also appreciate your input on the more hacky parts of the patch.
Thanks.