-
Notifications
You must be signed in to change notification settings - Fork 586
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce Endpoint#seconds_{reading_messages,awaiting_semaphore,processing_messages} #10266
base: master
Are you sure you want to change the base?
Conversation
5d1b07f
to
92a4f45
Compare
ef2a7d3
to
3ebbfc6
Compare
|
||
[no_user_modify, no_storage] double seconds_processing_messages { | ||
get; | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It works! 👍
< HTTP/1.1 200 OK
< Server: Icinga/v2.14.0-375-gecea52568
< Content-Type: application/json
< Content-Length: 591
<
{
"results": [
{
"attrs": {
"seconds_processing_messages": 3.1292e-05,
"seconds_reading_messages": 0.000155709
},
"joins": {},
"meta": {},
"name": "dummy",
"type": "Endpoint"
},
{
"attrs": {
"seconds_processing_messages": 0,
"seconds_reading_messages": 0
},
"joins": {},
"meta": {},
"name": "ws-aklimov7777777.local",
"type": "Endpoint"
}
]
}
3ebbfc6
to
f845407
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also works! 👍
lib/methods/clusterzonechecktask.cpp
Outdated
new PerfdataValue("sum_bytes_received_per_second", bytesReceivedPerSecond) | ||
new PerfdataValue("sum_bytes_received_per_second", bytesReceivedPerSecond), | ||
new PerfdataValue("sum_seconds_reading_messages", secondsReadingMessages), | ||
new PerfdataValue("sum_seconds_processing_messages", secondsProcessingMessages) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{
"counter": false,
"crit": null,
"label": "sum_seconds_reading_messages",
"max": null,
"min": null,
"type": "PerfdataValue",
"unit": "",
"value": 0.000787875,
"warn": null
},
{
"counter": false,
"crit": null,
"label": "sum_seconds_processing_messages",
"max": null,
"min": null,
"type": "PerfdataValue",
"unit": "",
"value": 3.45e-05,
"warn": null
}
lib/methods/icingachecktask.cpp
Outdated
@@ -148,6 +152,8 @@ void IcingaCheckTask::ScriptFunc(const Checkable::Ptr& checkable, const CheckRes | |||
perfdata->Add(new PerfdataValue("sum_messages_received_per_second", messagesReceivedPerSecond)); | |||
perfdata->Add(new PerfdataValue("sum_bytes_sent_per_second", bytesSentPerSecond)); | |||
perfdata->Add(new PerfdataValue("sum_bytes_received_per_second", bytesReceivedPerSecond)); | |||
perfdata->Add(new PerfdataValue("sum_seconds_reading_messages", secondsReadingMessages)); | |||
perfdata->Add(new PerfdataValue("sum_seconds_processing_messages", secondsProcessingMessages)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{
"counter": false,
"crit": null,
"label": "sum_seconds_reading_messages",
"max": null,
"min": null,
"type": "PerfdataValue",
"unit": "",
"value": 0.000264792,
"warn": null
},
{
"counter": false,
"crit": null,
"label": "sum_seconds_processing_messages",
"max": null,
"min": null,
"type": "PerfdataValue",
"unit": "",
"value": 5.2666e-05,
"warn": null
}
get; | ||
}; | ||
|
||
[no_user_modify, no_storage] double seconds_processing_messages { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Colleagues, addition we could also (just for the API) record seconds_processing_messages PER message. I.e. there should be also another attribute returning a dict like {"event::CheckResult":42.0,... saying that this endpoint e.g. spent 42 seconds handling (already read and decoded) event::CheckResult messages since program start.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Yes, there's also Log HTTP/RPC message processing stats #10141, but it's debug log unless a singe message takes excessively long. So, unless combined with Logger#object_filter: restrict Logger to messages referring to specific objects #9844, it's pretty useless in a large env.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Log HTTP/RPC message processing stats #10141 (comment) legitimately mentions that we already have logs which warn if the semaphore wait delays message processing noticeably. I.e you have more busy connections than semaphore slots.
- But if you have enough slots and your messages themselves take long to process (Give PerfdataWriter a 👷♂️WorkQueue like e.g GraphiteWriter🖋️ #10267?), the suggested aggregation tells you.
20460e6
to
055c6b6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also rebase, the Actions are pretty outdated.
lib/base/benchmark.hpp
Outdated
/** | ||
* @return The total accumulated time in seconds | ||
*/ | ||
template<class T> | ||
operator T() const noexcept | ||
{ | ||
return std::chrono::duration<T>(Clock::duration(m_Sum.load(std::memory_order_relaxed))).count(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The total accumulated time in seconds
Isn't this templated because T
can be used to specify the unit, i.e. it could be something other than seconds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lib/base/benchmark.hpp
Outdated
/** | ||
* Benchmark result. | ||
* | ||
* @ingroup base | ||
*/ | ||
class Benchmark |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's quite a generic name for a pretty specific type, in the sense that this is more or less an atomic duration, not really a benchmark (I mean that could also be single-threaded without any need for atomic types).
And this class really doesn't do much, I'd just consider doing that directly where it's currently used, i.e. there:
icinga2/lib/remote/endpoint.cpp
Lines 140 to 153 in 055c6b6
double Endpoint::GetSecondsReadingMessages() const | |
{ | |
return m_InputReadTime; | |
} | |
double Endpoint::GetSecondsAwaitingSemaphore() const | |
{ | |
return m_InputSemaphoreTime; | |
} | |
double Endpoint::GetSecondsProcessingMessages() const | |
{ | |
return m_InputProcessTime; | |
} |
icinga2/lib/remote/endpoint.hpp
Lines 48 to 53 in 055c6b6
void AddInputTimes(const R& readTime, const S& semaphoreTime, const P& processTime) | |
{ | |
m_InputReadTime += readTime; | |
m_InputSemaphoreTime += semaphoreTime; | |
m_InputProcessTime += processTime; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's quite a generic name for a pretty specific type, in the sense that this is more or less an atomic duration, not really a benchmark
I don't think so, but I'm open for naming suggestions.
(I mean that could also be single-threaded without any need for atomic types).
Actually, Benchmark instances could be queried during update, so yes, atomics are needed.
And this class really doesn't do much, I'd just consider doing that directly where it's currently used, i.e. there:
icinga2/lib/remote/endpoint.cpp
Lines 140 to 153 in 055c6b6
double Endpoint::GetSecondsReadingMessages() const { return m_InputReadTime; } double Endpoint::GetSecondsAwaitingSemaphore() const { return m_InputSemaphoreTime; } double Endpoint::GetSecondsProcessingMessages() const { return m_InputProcessTime; } icinga2/lib/remote/endpoint.hpp
Lines 48 to 53 in 055c6b6
void AddInputTimes(const R& readTime, const S& semaphoreTime, const P& processTime) { m_InputReadTime += readTime; m_InputSemaphoreTime += semaphoreTime; m_InputProcessTime += processTime; }
This would duplicate triplicate the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I mean that could also be single-threaded without any need for atomic types).
Actually, Benchmark instances could be queried during update, so yes, atomics are needed.
I was referring to naming it benchmark, not how it's used in this PR. I mean that class name would also fit for a helper for actual benchmarking, that might not need any thread-safety.
That's quite a generic name for a pretty specific type, in the sense that this is more or less an atomic duration, not really a benchmark
I don't think so, but I'm open for naming suggestions.
Wouldn't taking that suggestion literally (i.e. AtomicDuration
) describe what this type does? It's only member is storing a duration, using atomics so that one thread can add, another can read.
This would
duplicatetriplicate the code.
What would be duplicated? These two lines:
icinga2/lib/base/benchmark.cpp
Line 18 in 531991f
m_Sum.fetch_add(elapsedTime.count(), std::memory_order_relaxed); |
icinga2/lib/base/benchmark.hpp
Line 29 in 531991f
return std::chrono::duration<T>(Clock::duration(m_Sum.load(std::memory_order_relaxed))).count(); |
The first one is quite trivial, so I wouldn't even bother. The second one is a bit more annoying, I'd say it's right at the edge: it's still simple enough that I'd say it would be okay to have it there three times (avoiding the naming question 😅), but also complex enough that I'd say it's not excessive having it in a class (in that case, the first one should of course remain in that class).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but also complex enough that I'd say it's not excessive having it in a class
Then I'd just keep the class as such.
I don't think so, but I'm open for naming suggestions.
Wouldn't taking that suggestion literally (i.e.
AtomicDuration
) describe what this type does?
Why would one accumulate durations? What for? For benchmarking!
If you disagree with the class name, please suggest one I SHALL take literally.
Or to say it with #10313 (comment): I have no time to discuss a simple word.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't taking that suggestion literally (i.e.
AtomicDuration
) describe what this type does?Why would one accumulate durations? What for? For benchmarking!
No, not really. This PR uses it for metrics. Would Metric
be a good class name? No, for similar reasons like Benchmark
.
Also, for a benchmark, you don't want to accumulate durations like this class does. Especially if you want to benchmark short-running code, you want to run it many times and measure the total time to reduce how much of an effect the measurement itself has on the result.
If you disagree with the class name, please suggest one I SHALL take literally.
I already made a suggestion with AtomicDuration
?
67ddc24
to
531991f
Compare
lib/remote/jsonrpcconnection.cpp
Outdated
if (m_Endpoint) { | ||
readStarted = Benchmark::Clock::now(); | ||
} | ||
|
||
try { | ||
jsonString = JsonRpc::ReadMessage(m_Stream, yc, m_Endpoint ? -1 : 1024 * 1024); | ||
} catch (const std::exception& ex) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does seconds_reading_messages
actually provides some useful information? This value mixes both idle time waiting for the network and busy time spent on TLS operations. Both are fundamentally different, so what can one deduce from the sum? If the value is high, it's fine if it's due to the network read waiting for the peer, it wouldn't be fine, if it was actually due to the TLS load, but you can't tell the difference. If it's low, you only know it's spending the time elsewhere, but you need another metric like seconds_processing_messages
for that anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But you know that if it's high, it's very likely due to net lag. And at least we separate that stuff from the other two metrics. Due to the latter it makes sense to have this third metric, so that all metrics add up.
I mean, if that is our biggest problem here, I've done pretty good work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a difference between quickly adding some metrics to debug a specific problem and considering it for merging it into an actual release.
But you know that if it's high, it's very likely due to net lag.
How do you actually tell that? A connection with only a few messages will look just the same (think a random agent connection for example).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look.
Before we can handle a message, we have to read it, right? And if reading by itself is slow, everything is slow. But w/o this third metric you never know that it's net lag (or TLS).
But if this is so important to you, I'm open to suggestions. Idk how to separate net lag and TLS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And if reading by itself is slow, everything is slow.
No, this metric would also report an agent connection as slow where there's a few check messages every 5 minutes and a few heartbeat messages.
But w/o this third metric you never know that it's net lag (or TLS).
By net lag you mean the underlying network only provides a slower connection than what would be necessary to transfer the amount of data written into it? The current metric wouldn't tell you that, but you could figure that out by looking at the sizes of the send and receive queues of the TCP sockets on both ends of the connection.
But if this is so important to you, I'm open to suggestions. Idk how to separate net lag and TLS.
I don't know either. But I don't really see how one would interpret the information given by that metric in it's current state to get something useful out of it. And if we can't tell what it's useful for, omitting it would be an option for me as well.
531991f
to
eb9c69c
Compare
eb9c69c
to
167e8c0
Compare
167e8c0
to
70cee1c
Compare
70cee1c
to
124c2e8
Compare
lib/remote/jsonrpcconnection.cpp
Outdated
// Only once we receive at least one byte, we know there must be a message to read. | ||
if (!m_Stream->in_avail()) { | ||
m_Stream->async_fill(yc); | ||
} | ||
|
||
// Only then we can start measuring the time it takes to read it. | ||
if (m_Endpoint) { | ||
readStarted = AtomicDuration::Clock::now(); | ||
} | ||
|
||
jsonString = JsonRpc::ReadMessage(m_Stream, yc, m_Endpoint ? -1 : 1024 * 1024); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this metric would also report an agent connection as slow where there's a few check messages every 5 minutes and a few heartbeat messages.
🎉
I've filtered out those 5 minutes!
124c2e8
to
e6814c8
Compare
e6814c8
to
f461342
Compare
…ssing_messages} for the API
f461342
to
c67f8dd
Compare
ref/NC/820479
Tests