
Introduce Endpoint#seconds_{reading_messages,awaiting_semaphore,processing_messages} #10266

Open
wants to merge 3 commits into
base: master

Conversation

@Al2Klimov Al2Klimov self-assigned this Dec 10, 2024
@cla-bot cla-bot bot added the cla/signed label Dec 10, 2024
@Al2Klimov Al2Klimov changed the title Introduce Benchmark* Introduce Benchmark Dec 10, 2024
@Al2Klimov Al2Klimov force-pushed the json-rpc-read-process-metrics branch 2 times, most recently from 5d1b07f to 92a4f45 Compare December 10, 2024 14:50
@Al2Klimov Al2Klimov changed the title Introduce Benchmark Benchmark message reading/processing time per endpoint Dec 10, 2024
@Al2Klimov Al2Klimov force-pushed the json-rpc-read-process-metrics branch from ef2a7d3 to 3ebbfc6 Compare December 10, 2024 15:34

[no_user_modify, no_storage] double seconds_processing_messages {
get;
};
Member Author

It works! 👍

< HTTP/1.1 200 OK
< Server: Icinga/v2.14.0-375-gecea52568
< Content-Type: application/json
< Content-Length: 591
<
{
    "results": [
        {
            "attrs": {
                "seconds_processing_messages": 3.1292e-05,
                "seconds_reading_messages": 0.000155709
            },
            "joins": {},
            "meta": {},
            "name": "dummy",
            "type": "Endpoint"
        },
        {
            "attrs": {
                "seconds_processing_messages": 0,
                "seconds_reading_messages": 0
            },
            "joins": {},
            "meta": {},
            "name": "ws-aklimov7777777.local",
            "type": "Endpoint"
        }
    ]
}

@Al2Klimov Al2Klimov force-pushed the json-rpc-read-process-metrics branch from 3ebbfc6 to f845407 Compare December 10, 2024 16:01
Member Author

@Al2Klimov Al2Klimov left a comment

Also works! 👍

-new PerfdataValue("sum_bytes_received_per_second", bytesReceivedPerSecond)
+new PerfdataValue("sum_bytes_received_per_second", bytesReceivedPerSecond),
+new PerfdataValue("sum_seconds_reading_messages", secondsReadingMessages),
+new PerfdataValue("sum_seconds_processing_messages", secondsProcessingMessages)
Member Author

                        {
                            "counter": false,
                            "crit": null,
                            "label": "sum_seconds_reading_messages",
                            "max": null,
                            "min": null,
                            "type": "PerfdataValue",
                            "unit": "",
                            "value": 0.000787875,
                            "warn": null
                        },
                        {
                            "counter": false,
                            "crit": null,
                            "label": "sum_seconds_processing_messages",
                            "max": null,
                            "min": null,
                            "type": "PerfdataValue",
                            "unit": "",
                            "value": 3.45e-05,
                            "warn": null
                        }

@@ -148,6 +152,8 @@ void IcingaCheckTask::ScriptFunc(const Checkable::Ptr& checkable, const CheckRes
 perfdata->Add(new PerfdataValue("sum_messages_received_per_second", messagesReceivedPerSecond));
 perfdata->Add(new PerfdataValue("sum_bytes_sent_per_second", bytesSentPerSecond));
 perfdata->Add(new PerfdataValue("sum_bytes_received_per_second", bytesReceivedPerSecond));
+perfdata->Add(new PerfdataValue("sum_seconds_reading_messages", secondsReadingMessages));
+perfdata->Add(new PerfdataValue("sum_seconds_processing_messages", secondsProcessingMessages));
Member Author

                        {
                            "counter": false,
                            "crit": null,
                            "label": "sum_seconds_reading_messages",
                            "max": null,
                            "min": null,
                            "type": "PerfdataValue",
                            "unit": "",
                            "value": 0.000264792,
                            "warn": null
                        },
                        {
                            "counter": false,
                            "crit": null,
                            "label": "sum_seconds_processing_messages",
                            "max": null,
                            "min": null,
                            "type": "PerfdataValue",
                            "unit": "",
                            "value": 5.2666e-05,
                            "warn": null
                        }

@Al2Klimov Al2Klimov removed their assignment Dec 10, 2024
@Al2Klimov Al2Klimov marked this pull request as ready for review December 10, 2024 16:02
@Al2Klimov Al2Klimov added enhancement New feature or request area/distributed Distributed monitoring (master, satellites, clients) area/api REST API area/checks Check execution and results labels Dec 10, 2024
get;
};

[no_user_modify, no_storage] double seconds_processing_messages {
Member Author

Colleagues, in addition we could also (just for the API) record seconds_processing_messages PER message type. I.e. there could also be another attribute returning a dict like {"event::CheckResult": 42.0, ...} saying that this endpoint e.g. spent 42 seconds handling (already read and decoded) event::CheckResult messages since program start.


@Al2Klimov Al2Klimov force-pushed the json-rpc-read-process-metrics branch 2 times, most recently from 20460e6 to 055c6b6 Compare December 11, 2024 09:41
@Al2Klimov Al2Klimov changed the title Benchmark message reading/processing time per endpoint Introduce Endpoint#seconds_{reading_messages,awaiting_semaphore,processing_messages} Dec 11, 2024
@Al2Klimov Al2Klimov requested a review from yhabteab December 16, 2024 11:33
@yhabteab yhabteab added this to the 2.15.0 milestone Jan 13, 2025
@Al2Klimov Al2Klimov removed this from the 2.15.0 milestone Jan 21, 2025
Contributor

@julianbrost julianbrost left a comment

Please also rebase, the Actions are pretty outdated.

Comment on lines 24 to 30
/**
* @return The total accumulated time in seconds
*/
template<class T>
operator T() const noexcept
{
return std::chrono::duration<T>(Clock::duration(m_Sum.load(std::memory_order_relaxed))).count();
}
Contributor

The total accumulated time in seconds

Isn't this templated because T can be used to specify the unit, i.e. it could be something other than seconds?


Comment on lines 11 to 16
/**
* Benchmark result.
*
* @ingroup base
*/
class Benchmark
Contributor

That's quite a generic name for a pretty specific type, in the sense that this is more or less an atomic duration, not really a benchmark (I mean that could also be single-threaded without any need for atomic types).

And this class really doesn't do much, I'd just consider doing that directly where it's currently used, i.e. there:

double Endpoint::GetSecondsReadingMessages() const
{
return m_InputReadTime;
}
double Endpoint::GetSecondsAwaitingSemaphore() const
{
return m_InputSemaphoreTime;
}
double Endpoint::GetSecondsProcessingMessages() const
{
return m_InputProcessTime;
}

void AddInputTimes(const R& readTime, const S& semaphoreTime, const P& processTime)
{
m_InputReadTime += readTime;
m_InputSemaphoreTime += semaphoreTime;
m_InputProcessTime += processTime;
}

Member Author

That's quite a generic name for a pretty specific type, in the sense that this is more or less an atomic duration, not really a benchmark

I don't think so, but I'm open for naming suggestions.

(I mean that could also be single-threaded without any need for atomic types).

Actually, Benchmark instances could be queried during update, so yes, atomics are needed.

And this class really doesn't do much, I'd just consider doing that directly where it's currently used, i.e. there:

double Endpoint::GetSecondsReadingMessages() const
{
return m_InputReadTime;
}
double Endpoint::GetSecondsAwaitingSemaphore() const
{
return m_InputSemaphoreTime;
}
double Endpoint::GetSecondsProcessingMessages() const
{
return m_InputProcessTime;
}

void AddInputTimes(const R& readTime, const S& semaphoreTime, const P& processTime)
{
m_InputReadTime += readTime;
m_InputSemaphoreTime += semaphoreTime;
m_InputProcessTime += processTime;
}

This would triplicate the code.

Contributor

(I mean that could also be single-threaded without any need for atomic types).

Actually, Benchmark instances could be queried during update, so yes, atomics are needed.

I was referring to naming it benchmark, not how it's used in this PR. I mean that class name would also fit for a helper for actual benchmarking, that might not need any thread-safety.

That's quite a generic name for a pretty specific type, in the sense that this is more or less an atomic duration, not really a benchmark

I don't think so, but I'm open for naming suggestions.

Wouldn't taking that suggestion literally (i.e. AtomicDuration) describe what this type does? Its only member stores a duration, using atomics so that one thread can add while another reads.

This would triplicate the code.

What would be duplicated? These two lines:

m_Sum.fetch_add(elapsedTime.count(), std::memory_order_relaxed);

return std::chrono::duration<T>(Clock::duration(m_Sum.load(std::memory_order_relaxed))).count();

The first one is quite trivial, so I wouldn't even bother. The second one is a bit more annoying, I'd say it's right at the edge: it's still simple enough that I'd say it would be okay to have it there three times (avoiding the naming question 😅), but also complex enough that I'd say it's not excessive having it in a class (in that case, the first one should of course remain in that class).

Member Author

but also complex enough that I'd say it's not excessive having it in a class

Then I'd just keep the class as such.

I don't think so, but I'm open for naming suggestions.

Wouldn't taking that suggestion literally (i.e. AtomicDuration) describe what this type does?

Why would one accumulate durations? What for? For benchmarking!

If you disagree with the class name, please suggest one I SHALL take literally.

Or to say it with #10313 (comment): I have no time to discuss a simple word.

Contributor

Wouldn't taking that suggestion literally (i.e. AtomicDuration) describe what this type does?

Why would one accumulate durations? What for? For benchmarking!

No, not really. This PR uses it for metrics. Would Metric be a good class name? No, for similar reasons like Benchmark.

Also, for a benchmark, you don't want to accumulate durations like this class does. Especially if you want to benchmark short-running code, you want to run it many times and measure the total time to reduce how much of an effect the measurement itself has on the result.

If you disagree with the class name, please suggest one I SHALL take literally.

I already made a suggestion with AtomicDuration?

@Al2Klimov Al2Klimov force-pushed the json-rpc-read-process-metrics branch 2 times, most recently from 67ddc24 to 531991f Compare March 10, 2025 10:20
Comment on lines 77 to 88
if (m_Endpoint) {
readStarted = Benchmark::Clock::now();
}

try {
jsonString = JsonRpc::ReadMessage(m_Stream, yc, m_Endpoint ? -1 : 1024 * 1024);
} catch (const std::exception& ex) {
Contributor

Does seconds_reading_messages actually provide some useful information? This value mixes both idle time waiting for the network and busy time spent on TLS operations. Both are fundamentally different, so what can one deduce from the sum? If the value is high, it's fine if it's due to the network read waiting for the peer; it wouldn't be fine if it was actually due to the TLS load, but you can't tell the difference. If it's low, you only know it's spending the time elsewhere, but you need another metric like seconds_processing_messages for that anyway.

Member Author

But you know that if it's high, it's very likely due to net lag. And at least we separate that stuff from the other two metrics. Due to the latter it makes sense to have this third metric, so that all metrics add up.

I mean, if that is our biggest problem here, I've done pretty good work.

Contributor

There's a difference between quickly adding some metrics to debug a specific problem and considering it for merging it into an actual release.

But you know that if it's high, it's very likely due to net lag.

How do you actually tell that? A connection with only a few messages will look just the same (think a random agent connection for example).

Member Author

Look.

Before we can handle a message, we have to read it, right? And if reading by itself is slow, everything is slow. But w/o this third metric you never know that it's net lag (or TLS).

But if this is so important to you, I'm open to suggestions. Idk how to separate net lag and TLS.

Contributor

And if reading by itself is slow, everything is slow.

No, this metric would also report an agent connection as slow where there are a few check messages every 5 minutes and a few heartbeat messages.

But w/o this third metric you never know that it's net lag (or TLS).

By net lag you mean the underlying network only provides a slower connection than what would be necessary to transfer the amount of data written into it? The current metric wouldn't tell you that, but you could figure that out by looking at the sizes of the send and receive queues of the TCP sockets on both ends of the connection.

But if this is so important to you, I'm open to suggestions. Idk how to separate net lag and TLS.

I don't know either. But I don't really see how one would interpret the information given by that metric in its current state to get something useful out of it. And if we can't tell what it's useful for, omitting it would be an option for me as well.

@Al2Klimov Al2Klimov force-pushed the json-rpc-read-process-metrics branch from 531991f to eb9c69c Compare March 10, 2025 16:13
@Al2Klimov Al2Klimov removed the request for review from yhabteab March 11, 2025 09:40
@Al2Klimov Al2Klimov self-assigned this Mar 11, 2025
@Al2Klimov Al2Klimov force-pushed the json-rpc-read-process-metrics branch from eb9c69c to 167e8c0 Compare March 11, 2025 10:39
@Al2Klimov Al2Klimov removed their assignment Mar 11, 2025
@Al2Klimov Al2Klimov force-pushed the json-rpc-read-process-metrics branch from 167e8c0 to 70cee1c Compare March 11, 2025 15:45
@Al2Klimov Al2Klimov self-assigned this Mar 17, 2025
@Al2Klimov Al2Klimov force-pushed the json-rpc-read-process-metrics branch from 70cee1c to 124c2e8 Compare March 17, 2025 09:26
Comment on lines 77 to 87
// Only once we receive at least one byte, we know there must be a message to read.
if (!m_Stream->in_avail()) {
m_Stream->async_fill(yc);
}

// Only then we can start measuring the time it takes to read it.
if (m_Endpoint) {
readStarted = AtomicDuration::Clock::now();
}

jsonString = JsonRpc::ReadMessage(m_Stream, yc, m_Endpoint ? -1 : 1024 * 1024);
Member Author

this metric would also report an agent connection as slow where there's a few check messages every 5 minutes and a few heartbeat messages.

🎉

I've filtered out those 5 minutes!

@Al2Klimov Al2Klimov removed their assignment Mar 17, 2025
@Al2Klimov Al2Klimov force-pushed the json-rpc-read-process-metrics branch from 124c2e8 to e6814c8 Compare March 17, 2025 10:27
@Al2Klimov Al2Klimov requested a review from julianbrost March 20, 2025 16:01
@Al2Klimov Al2Klimov force-pushed the json-rpc-read-process-metrics branch from e6814c8 to f461342 Compare March 20, 2025 16:21
@Al2Klimov Al2Klimov force-pushed the json-rpc-read-process-metrics branch from f461342 to c67f8dd Compare March 21, 2025 09:06
Labels
area/api REST API area/checks Check execution and results area/distributed Distributed monitoring (master, satellites, clients) cla/signed enhancement New feature or request ref/NC