RONDB-854: Metrics updater for RDRS2 #637
Merged
To implement request statistics we use the prometheus-cpp library. Calling this library on every request is not a good idea, however, since it would kill performance.
To handle this, prometheus-cpp makes it possible to report histograms instead of reporting every individual response time. This implementation reports 61 entries in the histogram plus 3 for error codes.
This means that request counts can be obtained by summing all of those histogram counters.
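A minimal sketch of the idea, not the code in this PR: each request only bumps a local atomic counter, and the updater later reads a snapshot to hand over to prometheus-cpp. The class and member names (`LocalRequestStats`, `record_success`, `record_error`, `snapshot`, `total_requests`) are hypothetical, and the sketch assumes the 3 error-code counters are included in the request total. No prometheus-cpp calls are shown.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sizes taken from the description above: 61 response time entries + 3 error codes.
constexpr std::size_t kLatencyBuckets = 61;
constexpr std::size_t kErrorSlots = 3;
constexpr std::size_t kSlots = kLatencyBuckets + kErrorSlots;

// Hypothetical local accumulator; names are illustrative only.
class LocalRequestStats {
 public:
  // Hot path: a single relaxed atomic increment per successful request.
  void record_success(std::size_t bucket) {
    assert(bucket < kLatencyBuckets);
    slots_[bucket].fetch_add(1, std::memory_order_relaxed);
  }

  // Hot path for failed requests: error counters live after the latency buckets.
  void record_error(std::size_t error_slot) {
    assert(error_slot < kErrorSlots);
    slots_[kLatencyBuckets + error_slot].fetch_add(1, std::memory_order_relaxed);
  }

  // Called by the updater, not per request: copy out all counters.
  std::array<std::uint64_t, kSlots> snapshot() const {
    std::array<std::uint64_t, kSlots> out{};
    for (std::size_t i = 0; i < kSlots; ++i) {
      out[i] = slots_[i].load(std::memory_order_relaxed);
    }
    return out;
  }

  // Request count = sum of all histogram counters (assumes errors count as requests).
  std::uint64_t total_requests() const {
    std::uint64_t total = 0;
    for (const auto& slot : slots_) {
      total += slot.load(std::memory_order_relaxed);
    }
    return total;
  }

 private:
  std::array<std::atomic<std::uint64_t>, kSlots> slots_{};
};
```

With this shape the per-request cost is one atomic add, and no separate request counter is needed because `total_requests()` derives it from the buckets.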
In addition, we keep a separate counter for the number of primary key lookups that RDRS2 performs against RonDB.
Ping and health also have separate counters, with no response time handling.
Since the Prometheus endpoint will likely be scraped every 10 seconds, we report 323 values every 10 seconds. This should also ensure that we don't overload the memory of the Prometheus server. Reporting each response time would create hundreds of thousands of rows in Prometheus, which the server is unlikely to handle well.
The histogram uses fixed increments for short response times; for long response times the bucket boundaries increase logarithmically instead. This gives good accuracy for the common, short response times while still providing some level of accuracy for long response times.
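As an illustration of that bucket layout, here is a rough sketch that builds upper bounds with fixed steps first and geometrically growing steps afterwards. The function names, the microsecond unit, and the doubling factor are all assumptions chosen for the example; the actual boundaries used in this PR are not stated here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: `linear_count` bounds spaced `step_us` apart,
// followed by `log_count` bounds that each double the previous one.
std::vector<std::uint64_t> make_bucket_bounds(std::uint64_t step_us,
                                              std::size_t linear_count,
                                              std::size_t log_count) {
  assert(step_us > 0 && linear_count > 0);
  std::vector<std::uint64_t> bounds;
  bounds.reserve(linear_count + log_count);
  for (std::size_t i = 1; i <= linear_count; ++i) {
    bounds.push_back(step_us * i);  // fixed increments for short times
  }
  std::uint64_t bound = bounds.back();
  for (std::size_t i = 0; i < log_count; ++i) {
    bound *= 2;  // assumed growth factor for long times
    bounds.push_back(bound);
  }
  return bounds;
}

// Index of the first bucket whose upper bound is >= latency_us.
// Values above the last bound map to bounds.size(), an overflow bucket.
std::size_t bucket_index(const std::vector<std::uint64_t>& bounds,
                         std::uint64_t latency_us) {
  auto it = std::lower_bound(bounds.begin(), bounds.end(), latency_us);
  return static_cast<std::size_t>(it - bounds.begin());
}
```

For example, `make_bucket_bounds(100, 4, 3)` yields the bounds 100, 200, 300, 400, 800, 1600, 3200, so short response times are resolved to 100 µs while the long tail is covered with only a few buckets.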