
Add condition to flush based on time #4

Merged
merged 2 commits into cancecen:main on Aug 30, 2024

Conversation


@fvallenilla fvallenilla commented Aug 22, 2024

Currently the Atlas high performance mode is enabled for all apps. This mode enables batching of metrics to reduce the number of events sent to spectatord.

This high performance mode has a negative impact on low RPS apps, because they can end up emitting IPC metrics every 2-4 minutes rather than every minute. That's a needless regression relative to SBN's behavior.

This commit updates the buffering logic to ensure we emit metrics if some configurable amount of time has passed (flush_interval).

High RPS apps should not be impacted by this change since the idea is to set this interval at ~60 seconds, and high RPS apps are already flushing more often than that.
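For readers skimming the diff hunks below, the new condition boils down to something like this standalone sketch (the ShouldFlush helper is hypothetical and for illustration only; the actual change keeps the check inline in the sender lambda):

#include <chrono>
#include <cstdint>
#include <string>

// Flush when the buffer is full OR when flush_interval has elapsed since the
// last flush, so low RPS apps still emit metrics roughly once per interval.
bool ShouldFlush(const std::string& buffer, uint32_t bytes_to_buffer,
                 std::chrono::steady_clock::time_point last_flush_time,
                 std::chrono::seconds flush_interval) {
  const auto now = std::chrono::steady_clock::now();
  return buffer.length() >= bytes_to_buffer ||
         (now - last_flush_time) >= flush_interval;
}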

How was this tested?

  1. Ran unit tests with docker build --memory=4g -t spectator-cpp:latest .; docker run -it spectator-cpp:latest
  2. Imported into proxyd with 55s flush interval, ran it locally, and saw that making IPC requests every 75s led to the associated metrics being flushed each time.
  3. Ran a canary against meshtestsbn and did not see any degradation: https://chap.prod.netflix.net/runs/22cc3c20-63e7-11ef-b5f0-fd267578bb8a

@fvallenilla fvallenilla marked this pull request as ready for review August 27, 2024 00:33

// Wait for 3 seconds, increment, and the counter should not be flushed (3s is less than the 5s flush interval)
std::this_thread::sleep_for(std::chrono::seconds(3));
c.Increment();
Author

Incrementing after the sleeps is needed, since the increment itself is what triggers the logic in setup_unix_domain that determines whether a flush is needed.

@KBaichoo KBaichoo left a comment

Nice work!

@@ -9,6 +9,7 @@ struct Config {
std::string endpoint;
std::unordered_map<std::string, std::string> common_tags;
uint32_t bytes_to_buffer;
std::chrono::seconds flush_interval;

const

@@ -34,6 +35,8 @@ class SpectatordPublisher {
asio::local::datagram_protocol::socket local_socket_;
std::string buffer_;
uint32_t bytes_to_buffer_;
std::chrono::steady_clock::time_point last_flush_time_;
std::chrono::seconds flush_interval_;

const

@@ -34,6 +35,8 @@ class SpectatordPublisher {
asio::local::datagram_protocol::socket local_socket_;
std::string buffer_;
uint32_t bytes_to_buffer_;
std::chrono::steady_clock::time_point last_flush_time_;
std::chrono::seconds flush_interval_;

I think we should have the flush interval be millis just for additional granularity.
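A rough sketch of that suggestion (illustrative only, not this PR's code): storing the interval as milliseconds keeps existing seconds-based call sites working, since std::chrono converts losslessly in that direction.

#include <chrono>

struct Config {
  // Hypothetical millisecond-granularity interval; std::chrono::seconds
  // converts to std::chrono::milliseconds implicitly (the conversion is
  // lossless), so callers passing std::chrono::seconds(5) keep compiling
  // unchanged.
  std::chrono::milliseconds flush_interval{std::chrono::seconds(5)};
};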

@@ -55,11 +59,16 @@ void SpectatordPublisher::setup_unix_domain(absl::string_view path) {
std::string local_path{path};
sender_ = [local_path, this](std::string_view msg) {
buffer_.append(msg);
if (buffer_.length() >= bytes_to_buffer_) {
auto now = std::chrono::steady_clock::now();

const

@@ -55,11 +59,16 @@ void SpectatordPublisher::setup_unix_domain(absl::string_view path) {
std::string local_path{path};
sender_ = [local_path, this](std::string_view msg) {
buffer_.append(msg);
if (buffer_.length() >= bytes_to_buffer_) {
auto now = std::chrono::steady_clock::now();
bool should_flush = buffer_.length() >= bytes_to_buffer_ ||

const

if (buffer_.length() >= bytes_to_buffer_) {
auto now = std::chrono::steady_clock::now();
bool should_flush = buffer_.length() >= bytes_to_buffer_ ||
std::chrono::duration_cast<std::chrono::seconds>(now - last_flush_time_) >= flush_interval_;

No need to do duration cast
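A minimal sketch of why the cast can be dropped (names are illustrative): comparing the raw time difference against a std::chrono::seconds interval already works through the durations' common type.

#include <chrono>

// No duration_cast needed: operator>= converts both operands to their common
// duration type before comparing.
bool interval_elapsed(std::chrono::steady_clock::time_point last_flush_time,
                      std::chrono::seconds flush_interval) {
  const auto now = std::chrono::steady_clock::now();
  return (now - last_flush_time) >= flush_interval;
}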

logger->info("Unix Server started on path {}", path);

// Set buffer size to a large value so that flushing is based on time
SpectatordPublisher publisher{fmt::format("unix:{}", path), 10000, std::chrono::seconds(5)};

Not a blocker here, but ideally we'd have some sort of test time system instead of a real clock.
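One possible shape for such a test time system (hypothetical, not part of this PR): let the publisher take a time-source callback that defaults to steady_clock::now, so tests can advance a fake clock instead of sleeping for real.

#include <chrono>
#include <functional>

using TimePoint = std::chrono::steady_clock::time_point;
using TimeSource = std::function<TimePoint()>;  // injected into the publisher

// Fake clock for tests: advance() replaces std::this_thread::sleep_for.
struct FakeClock {
  TimePoint now{};
  TimePoint operator()() const { return now; }
  void advance(std::chrono::seconds d) { now += d; }
};

// Usage sketch: construct the publisher with std::ref(fake_clock) as its
// TimeSource, then call fake_clock.advance(std::chrono::seconds(6)) to cross
// the flush interval deterministically.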

@cancecen
Owner

The change overall looks good to me - however I still don't quite get why we need it, based on my previous tests.

Here's some quick math too: Envoy flushes every 5 seconds, and we send the metrics here every 60 seconds no matter what the size. That means in the worst case we'd have 12 batches of whole Envoy metrics. The current batch limit is 60KB, so for this to help in every flush interval we'd need to be sending less than 5KB of metric lines. Last time I checked we were already above this.
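Spelling out that arithmetic (numbers taken from the comment above, not from this PR's configuration):

// Worst case: one time-based flush collects every Envoy stats-sink flush in between.
constexpr int envoy_flush_period_s = 5;                                        // Envoy sink flush period
constexpr int flush_interval_s = 60;                                           // proposed time-based flush
constexpr int batches_per_interval = flush_interval_s / envoy_flush_period_s;  // 12
constexpr int batch_limit_bytes = 60 * 1024;                                   // current 60KB batch limit
constexpr int break_even_bytes = batch_limit_bytes / batches_per_interval;     // ~5KB per Envoy flush
// Above ~5KB of metric lines per Envoy flush, the size threshold fires before
// the 60s timer would, so the time-based flush wouldn't change anything.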

@fvallenilla
Author

@cancecen This is related to the special handling that happens for IPC metrics. They're not flushed with everything else:
https://github.netflix.net/corp/cpie-proxyd/blob/7ee083971176597883d87fc5fb52fce1307880c6/proxyd/netflix/extensions/stat_sinks/atlas/sink.cc#L130-L134

Note the difference between IPC and HTTP metrics:
https://go.prod.netflix.net/SMRnQm

@fvallenilla
Author

Is there something else that could cause that sort of un-evenness?

@cancecen
Owner

cancecen commented Aug 30, 2024

> @cancecen This is related to the special handling that happens for IPC metrics. They're not flushed with everything else: https://github.netflix.net/corp/cpie-proxyd/blob/7ee083971176597883d87fc5fb52fce1307880c6/proxyd/netflix/extensions/stat_sinks/atlas/sink.cc#L130-L134
>
> Note the difference between IPC and HTTP metrics: https://go.prod.netflix.net/SMRnQm

https://github.netflix.net/corp/cpie-proxyd/blob/7ee083971176597883d87fc5fb52fce1307880c6/proxyd/netflix/extensions/stat_sinks/atlas/sink.cc#L169

Yeah, but they get sent out to the buffer immediately.

@cancecen
Owner

> Is there something else that could cause that sort of un-evenness?

I don't think so. Did you observe evenness when you tested this with your change? I can't remember if you showed the fixed graph in your demo, sorry.

@cancecen cancecen merged commit 170401a into cancecen:main Aug 30, 2024
1 check passed
@fvallenilla
Author

For posterity, here is what things look like after the change.

  • Metrics don't go to 0 for minutes at a time anymore, which was the main issue to address.
  • The shape of IPC metrics more closely matches HTTP metrics, but they are still slightly spikier.
    [screenshot of the metrics after the change]
