Introduce Endpoint#seconds_{reading_messages,awaiting_semaphore,processing_messages} #10266

Al2Klimov · 2024-12-10T12:42:51Z

ref/NC/820479

Tests

lib/base/benchmark.hpp

Al2Klimov · 2024-12-10T15:35:44Z

lib/remote/endpoint.ti

+
+	[no_user_modify, no_storage] double seconds_processing_messages {
+		get;
+	};


It works! 👍

< HTTP/1.1 200 OK < Server: Icinga/v2.14.0-375-gecea52568 < Content-Type: application/json < Content-Length: 591 < { "results": [ { "attrs": { "seconds_processing_messages": 3.1292e-05, "seconds_reading_messages": 0.000155709 }, "joins": {}, "meta": {}, "name": "dummy", "type": "Endpoint" }, { "attrs": { "seconds_processing_messages": 0, "seconds_reading_messages": 0 }, "joins": {}, "meta": {}, "name": "ws-aklimov7777777.local", "type": "Endpoint" } ] }

Al2Klimov

Also works! 👍

Al2Klimov · 2024-12-10T16:01:49Z

lib/methods/clusterzonechecktask.cpp

-			new PerfdataValue("sum_bytes_received_per_second", bytesReceivedPerSecond)
+			new PerfdataValue("sum_bytes_received_per_second", bytesReceivedPerSecond),
+			new PerfdataValue("sum_seconds_reading_messages", secondsReadingMessages),
+			new PerfdataValue("sum_seconds_processing_messages", secondsProcessingMessages)


{ "counter": false, "crit": null, "label": "sum_seconds_reading_messages", "max": null, "min": null, "type": "PerfdataValue", "unit": "", "value": 0.000787875, "warn": null }, { "counter": false, "crit": null, "label": "sum_seconds_processing_messages", "max": null, "min": null, "type": "PerfdataValue", "unit": "", "value": 3.45e-05, "warn": null }

Al2Klimov · 2024-12-10T16:02:23Z

lib/methods/icingachecktask.cpp

@@ -148,6 +152,8 @@ void IcingaCheckTask::ScriptFunc(const Checkable::Ptr& checkable, const CheckRes
 	perfdata->Add(new PerfdataValue("sum_messages_received_per_second", messagesReceivedPerSecond));
 	perfdata->Add(new PerfdataValue("sum_bytes_sent_per_second", bytesSentPerSecond));
 	perfdata->Add(new PerfdataValue("sum_bytes_received_per_second", bytesReceivedPerSecond));
+	perfdata->Add(new PerfdataValue("sum_seconds_reading_messages", secondsReadingMessages));
+	perfdata->Add(new PerfdataValue("sum_seconds_processing_messages", secondsProcessingMessages));


{ "counter": false, "crit": null, "label": "sum_seconds_reading_messages", "max": null, "min": null, "type": "PerfdataValue", "unit": "", "value": 0.000264792, "warn": null }, { "counter": false, "crit": null, "label": "sum_seconds_processing_messages", "max": null, "min": null, "type": "PerfdataValue", "unit": "", "value": 5.2666e-05, "warn": null }

Al2Klimov · 2024-12-10T16:25:41Z

lib/remote/endpoint.ti

+		get;
+	};
+
+	[no_user_modify, no_storage] double seconds_processing_messages {


Colleagues, addition we could also (just for the API) record seconds_processing_messages PER message. I.e. there should be also another attribute returning a dict like {"event::CheckResult":42.0,... saying that this endpoint e.g. spent 42 seconds handling (already read and decoded) event::CheckResult messages since program start.

Yes, there's also Log HTTP/RPC message processing stats #10141, but it's debug log unless a singe message takes excessively long. So, unless combined with Logger#object_filter: restrict Logger to messages referring to specific objects #9844, it's pretty useless in a large env.

Log HTTP/RPC message processing stats #10141 (comment) legitimately mentions that we already have logs which warn if the semaphore wait delays message processing noticeably. I.e you have more busy connections than semaphore slots.

But if you have enough slots and your messages themselves take long to process (Give PerfdataWriter a 👷‍♂️WorkQueue like e.g GraphiteWriter🖋️ #10267?), the suggested aggregation tells you.

julianbrost

Please also rebase, the Actions are pretty outdated.

lib/base/benchmark.cpp

julianbrost · 2025-03-07T14:02:41Z

lib/base/benchmark.hpp

+	/**
+	 * @return The total accumulated time in seconds
+	 */
+	template<class T>
+	operator T() const noexcept
+	{
+		return std::chrono::duration<T>(Clock::duration(m_Sum.load(std::memory_order_relaxed))).count();
+	}


The total accumulated time in seconds

Isn't this templated because T can be used to specify the unit, i.e. it could be something other than seconds?

No, T is just an arithmetic type, or a class emulating an arithmetic type, representing the number of ticks.

julianbrost · 2025-03-07T14:17:45Z

lib/base/benchmark.hpp

+/**
+ * Benchmark result.
+ *
+ * @ingroup base
+ */
+class Benchmark


That's quite a generic name for a pretty specific type, in the sense that this is more or less an atomic duration, not really a benchmark (I mean that could also be single-threaded without any need for atomic types).

And this class really doesn't do much, I'd just consider doing that directly where it's currently used, i.e. there:

icinga2/lib/remote/endpoint.cpp

Lines 140 to 153 in 055c6b6

double Endpoint::GetSecondsReadingMessages() const

{

return m_InputReadTime;

}

double Endpoint::GetSecondsAwaitingSemaphore() const

{

return m_InputSemaphoreTime;

}

double Endpoint::GetSecondsProcessingMessages() const

{

return m_InputProcessTime;

}

icinga2/lib/remote/endpoint.hpp

Lines 48 to 53 in 055c6b6

void AddInputTimes(const R& readTime, const S& semaphoreTime, const P& processTime)

{

m_InputReadTime += readTime;

m_InputSemaphoreTime += semaphoreTime;

m_InputProcessTime += processTime;

}

That's quite a generic name for a pretty specific type, in the sense that this is more or less an atomic duration, not really a benchmark

I don't think so, but I'm open for naming suggestions.

(I mean that could also be single-threaded without any need for atomic types).

Actually, Benchmark instances could be queried during update, so yes, atomics are needed.

And this class really doesn't do much, I'd just consider doing that directly where it's currently used, i.e. there:

icinga2/lib/remote/endpoint.cpp

Lines 140 to 153 in 055c6b6

double Endpoint::GetSecondsReadingMessages() const

{

return m_InputReadTime;

}

double Endpoint::GetSecondsAwaitingSemaphore() const

{

return m_InputSemaphoreTime;

}

double Endpoint::GetSecondsProcessingMessages() const

{

return m_InputProcessTime;

}

icinga2/lib/remote/endpoint.hpp

Lines 48 to 53 in 055c6b6

void AddInputTimes(const R& readTime, const S& semaphoreTime, const P& processTime)

{

m_InputReadTime += readTime;

m_InputSemaphoreTime += semaphoreTime;

m_InputProcessTime += processTime;

}

This would ~~duplicate~~ triplicate the code.

(I mean that could also be single-threaded without any need for atomic types).

Actually, Benchmark instances could be queried during update, so yes, atomics are needed.

I was referring to naming it benchmark, not how it's used in this PR. I mean that class name would also fit for a helper for actual benchmarking, that might not need any thread-safety.

That's quite a generic name for a pretty specific type, in the sense that this is more or less an atomic duration, not really a benchmark

I don't think so, but I'm open for naming suggestions.

Wouldn't taking that suggestion literally (i.e. AtomicDuration) describe what this type does? It's only member is storing a duration, using atomics so that one thread can add, another can read.

This would ~~duplicate~~ triplicate the code.

What would be duplicated? These two lines:

icinga2/lib/base/benchmark.cpp

Line 18 in 531991f

m_Sum.fetch_add(elapsedTime.count(), std::memory_order_relaxed);

icinga2/lib/base/benchmark.hpp

Line 29 in 531991f

return std::chrono::duration<T>(Clock::duration(m_Sum.load(std::memory_order_relaxed))).count();

The first one is quite trivial, so I wouldn't even bother. The second one is a bit more annoying, I'd say it's right at the edge: it's still simple enough that I'd say it would be okay to have it there three times (avoiding the naming question 😅), but also complex enough that I'd say it's not excessive having it in a class (in that case, the first one should of course remain in that class).

but also complex enough that I'd say it's not excessive having it in a class

Then I'd just keep the class as such.

I don't think so, but I'm open for naming suggestions.

Wouldn't taking that suggestion literally (i.e. AtomicDuration) describe what this type does?

Why would one accumulate durations? What for? For benchmarking!

If you disagree with the class name, please suggest one I SHALL take literally.

Or to say it with #10313 (comment): I have no time to discuss a simple word.

Wouldn't taking that suggestion literally (i.e. AtomicDuration) describe what this type does?

Why would one accumulate durations? What for? For benchmarking!

No, not really. This PR uses it for metrics. Would Metric be a good class name? No, for similar reasons like Benchmark.

Also, for a benchmark, you don't want to accumulate durations like this class does. Especially if you want to benchmark short-running code, you want to run it many times and measure the total time to reduce how much of an effect the measurement itself has on the result.

If you disagree with the class name, please suggest one I SHALL take literally.

I already made a suggestion with AtomicDuration?

julianbrost · 2025-03-10T15:39:36Z

lib/remote/jsonrpcconnection.cpp

+		if (m_Endpoint) {
+			readStarted = Benchmark::Clock::now();
+		}
+
 		try {
 			jsonString = JsonRpc::ReadMessage(m_Stream, yc, m_Endpoint ? -1 : 1024 * 1024);
 		} catch (const std::exception& ex) {


Does seconds_reading_messages actually provides some useful information? This value mixes both idle time waiting for the network and busy time spent on TLS operations. Both are fundamentally different, so what can one deduce from the sum? If the value is high, it's fine if it's due to the network read waiting for the peer, it wouldn't be fine, if it was actually due to the TLS load, but you can't tell the difference. If it's low, you only know it's spending the time elsewhere, but you need another metric like seconds_processing_messages for that anyway.

But you know that if it's high, it's very likely due to net lag. And at least we separate that stuff from the other two metrics. Due to the latter it makes sense to have this third metric, so that all metrics add up.

I mean, if that is our biggest problem here, I've done pretty good work.

There's a difference between quickly adding some metrics to debug a specific problem and considering it for merging it into an actual release.

But you know that if it's high, it's very likely due to net lag.

How do you actually tell that? A connection with only a few messages will look just the same (think a random agent connection for example).

Look.

Before we can handle a message, we have to read it, right? And if reading by itself is slow, everything is slow. But w/o this third metric you never know that it's net lag (or TLS).

But if this is so important to you, I'm open to suggestions. Idk how to separate net lag and TLS.

And if reading by itself is slow, everything is slow.

No, this metric would also report an agent connection as slow where there's a few check messages every 5 minutes and a few heartbeat messages.

But w/o this third metric you never know that it's net lag (or TLS).

By net lag you mean the underlying network only provides a slower connection than what would be necessary to transfer the amount of data written into it? The current metric wouldn't tell you that, but you could figure that out by looking at the sizes of the send and receive queues of the TCP sockets on both ends of the connection.

But if this is so important to you, I'm open to suggestions. Idk how to separate net lag and TLS.

I don't know either. But I don't really see how one would interpret the information given by that metric in it's current state to get something useful out of it. And if we can't tell what it's useful for, omitting it would be an option for me as well.

test/base-benchmark.cpp

Al2Klimov · 2025-03-17T09:27:50Z

lib/remote/jsonrpcconnection.cpp

+			// Only once we receive at least one byte, we know there must be a message to read.
+			if (!m_Stream->in_avail()) {
+				m_Stream->async_fill(yc);
+			}
+
+			// Only then we can start measuring the time it takes to read it.
+			if (m_Endpoint) {
+				readStarted = AtomicDuration::Clock::now();
+			}
+
 			jsonString = JsonRpc::ReadMessage(m_Stream, yc, m_Endpoint ? -1 : 1024 * 1024);


this metric would also report an agent connection as slow where there's a few check messages every 5 minutes and a few heartbeat messages.

🎉

I've filtered out those 5 minutes!

…ssing_messages} for the API

Al2Klimov self-assigned this Dec 10, 2024

cla-bot bot added the cla/signed label Dec 10, 2024

Al2Klimov changed the title Introduce Benchmark* Introduce Benchmark Dec 10, 2024

Al2Klimov force-pushed the json-rpc-read-process-metrics branch 2 times, most recently from 5d1b07f to 92a4f45 Compare December 10, 2024 14:50

Al2Klimov changed the title ~~Introduce Benchmark~~ Benchmark message reading/processing time per endpoint Dec 10, 2024

Al2Klimov commented Dec 10, 2024

View reviewed changes

lib/base/benchmark.hpp Outdated Show resolved Hide resolved

Al2Klimov force-pushed the json-rpc-read-process-metrics branch from ef2a7d3 to 3ebbfc6 Compare December 10, 2024 15:34

Al2Klimov commented Dec 10, 2024

View reviewed changes

Al2Klimov force-pushed the json-rpc-read-process-metrics branch from 3ebbfc6 to f845407 Compare December 10, 2024 16:01

Al2Klimov commented Dec 10, 2024

View reviewed changes

Al2Klimov removed their assignment Dec 10, 2024

Al2Klimov marked this pull request as ready for review December 10, 2024 16:02

Al2Klimov added enhancement New feature or request area/distributed Distributed monitoring (master, satellites, clients) area/api REST API area/checks Check execution and results labels Dec 10, 2024

Al2Klimov commented Dec 10, 2024

View reviewed changes

Al2Klimov force-pushed the json-rpc-read-process-metrics branch 2 times, most recently from 20460e6 to 055c6b6 Compare December 11, 2024 09:41

Al2Klimov changed the title ~~Benchmark message reading/processing time per endpoint~~ Introduce Endpoint#seconds_{reading_messages,awaiting_semaphore,processing_messages} Dec 11, 2024

Al2Klimov added the ref/NC label Dec 11, 2024

This was referenced Dec 11, 2024

JSON-RPC outgoing messages queue metric #10269

Open

Log HTTP/RPC message processing stats #10141

Merged

Al2Klimov requested a review from yhabteab December 16, 2024 11:33

yhabteab added this to the 2.15.0 milestone Jan 13, 2025

Al2Klimov removed this from the 2.15.0 milestone Jan 21, 2025

julianbrost requested changes Mar 7, 2025

View reviewed changes

Al2Klimov force-pushed the json-rpc-read-process-metrics branch 2 times, most recently from 67ddc24 to 531991f Compare March 10, 2025 10:20

julianbrost reviewed Mar 10, 2025

View reviewed changes

Al2Klimov force-pushed the json-rpc-read-process-metrics branch from 531991f to eb9c69c Compare March 10, 2025 16:13

Al2Klimov removed the request for review from yhabteab March 11, 2025 09:40

Al2Klimov self-assigned this Mar 11, 2025

Introduce AtomicDuration

055149b

Al2Klimov force-pushed the json-rpc-read-process-metrics branch from eb9c69c to 167e8c0 Compare March 11, 2025 10:39

Al2Klimov removed their assignment Mar 11, 2025

Al2Klimov force-pushed the json-rpc-read-process-metrics branch from 167e8c0 to 70cee1c Compare March 11, 2025 15:45

Al2Klimov self-assigned this Mar 17, 2025

Al2Klimov force-pushed the json-rpc-read-process-metrics branch from 70cee1c to 124c2e8 Compare March 17, 2025 09:26

Al2Klimov commented Mar 17, 2025

View reviewed changes

Al2Klimov removed their assignment Mar 17, 2025

Benchmark message reading/waiting/processing time per endpoint

b31919c

Al2Klimov force-pushed the json-rpc-read-process-metrics branch from 124c2e8 to e6814c8 Compare March 17, 2025 10:27

Al2Klimov requested a review from julianbrost March 20, 2025 16:01

Al2Klimov force-pushed the json-rpc-read-process-metrics branch from e6814c8 to f461342 Compare March 20, 2025 16:21

Introduce Endpoint#seconds_{reading_messages,awaiting_semaphore,proce…

c67f8dd

…ssing_messages} for the API

Al2Klimov force-pushed the json-rpc-read-process-metrics branch from f461342 to c67f8dd Compare March 21, 2025 09:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduce Endpoint#seconds_{reading_messages,awaiting_semaphore,processing_messages} #10266

Introduce Endpoint#seconds_{reading_messages,awaiting_semaphore,processing_messages} #10266

Al2Klimov commented Dec 10, 2024 •

edited

Loading

Al2Klimov Dec 10, 2024

Al2Klimov left a comment

Al2Klimov Dec 10, 2024

Al2Klimov Dec 10, 2024

Al2Klimov Dec 10, 2024

Al2Klimov Dec 10, 2024

Al2Klimov Dec 12, 2024 •

edited

Loading

julianbrost left a comment

julianbrost Mar 7, 2025

Al2Klimov Mar 10, 2025

julianbrost Mar 7, 2025

Al2Klimov Mar 10, 2025

julianbrost Mar 10, 2025

Al2Klimov Mar 10, 2025

julianbrost Mar 10, 2025

julianbrost Mar 10, 2025

Al2Klimov Mar 10, 2025

julianbrost Mar 10, 2025

Al2Klimov Mar 11, 2025

julianbrost Mar 11, 2025

Al2Klimov Mar 17, 2025

	double Endpoint::GetSecondsReadingMessages() const
	{
	return m_InputReadTime;
	}

	double Endpoint::GetSecondsAwaitingSemaphore() const
	{
	return m_InputSemaphoreTime;
	}

	double Endpoint::GetSecondsProcessingMessages() const
	{
	return m_InputProcessTime;
	}

	void AddInputTimes(const R& readTime, const S& semaphoreTime, const P& processTime)
	{
	m_InputReadTime += readTime;
	m_InputSemaphoreTime += semaphoreTime;
	m_InputProcessTime += processTime;
	}

Introduce Endpoint#seconds_{reading_messages,awaiting_semaphore,processing_messages} #10266

Are you sure you want to change the base?

Introduce Endpoint#seconds_{reading_messages,awaiting_semaphore,processing_messages} #10266

Conversation

Al2Klimov commented Dec 10, 2024 • edited Loading

Tests

Choose a reason for hiding this comment

Al2Klimov left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Al2Klimov Dec 12, 2024 • edited Loading

Choose a reason for hiding this comment

julianbrost left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

🎉

Al2Klimov commented Dec 10, 2024 •

edited

Loading

Al2Klimov Dec 12, 2024 •

edited

Loading