
Add condition to flush based on time #4

Merged
merged 2 commits into cancecen:main on Aug 30, 2024

Conversation


@fvallenilla fvallenilla commented Aug 22, 2024

Currently the Atlas high performance mode is enabled for all apps. This mode enables batching of metrics to reduce the number of events sent to spectatord.

This high performance mode has a negative impact on low RPS apps, because they can end up emitting IPC metrics every 2-4 minutes rather than every minute. That's a needless regression relative to SBN's behavior.

This commit updates the buffering logic to ensure we emit metrics if some configurable amount of time has passed (flush_interval).

High RPS apps should not be impacted by this change since the idea is to set this interval at ~60 seconds, and high RPS apps are already flushing more often than that.
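For readers skimming the diff hunks below, the new condition boils down to something like this standalone sketch (the ShouldFlush helper is hypothetical and for illustration only; the actual change keeps the check inline in the sender lambda):

#include <chrono>
#include <cstdint>
#include <string>

// Flush when the buffer is full OR when flush_interval has elapsed since the
// last flush, so low RPS apps still emit metrics roughly once per interval.
bool ShouldFlush(const std::string& buffer, uint32_t bytes_to_buffer,
                 std::chrono::steady_clock::time_point last_flush_time,
                 std::chrono::seconds flush_interval) {
  const auto now = std::chrono::steady_clock::now();
  return buffer.length() >= bytes_to_buffer ||
         (now - last_flush_time) >= flush_interval;
}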

How was this tested?

  1. Ran unit tests with docker build --memory=4g -t spectator-cpp:latest .; docker run -it spectator-cpp:latest
  2. Imported into proxyd with 55s flush interval, ran it locally, and saw that making IPC requests every 75s led to the associated metrics being flushed each time.
  3. Ran a canary against meshtestsbn and did not see any degradation: https://chap.prod.netflix.net/runs/22cc3c20-63e7-11ef-b5f0-fd267578bb8a

@fvallenilla fvallenilla marked this pull request as ready for review August 27, 2024 00:33

// Wait for 3 seconds, increment, and the counter should not be flushed (3s is less than the 5s flush interval)
std::this_thread::sleep_for(std::chrono::seconds(3));
c.Increment();
Author

Incrementing after the sleeps is needed, since the increment itself is what triggers the logic in setup_unix_domain that determines whether a flush is needed.

@KBaichoo KBaichoo left a comment

Nice work!

@@ -9,6 +9,7 @@ struct Config {
std::string endpoint;
std::unordered_map<std::string, std::string> common_tags;
uint32_t bytes_to_buffer;
std::chrono::seconds flush_interval;

const

@@ -34,6 +35,8 @@ class SpectatordPublisher {
asio::local::datagram_protocol::socket local_socket_;
std::string buffer_;
uint32_t bytes_to_buffer_;
std::chrono::steady_clock::time_point last_flush_time_;
std::chrono::seconds flush_interval_;

const

@@ -34,6 +35,8 @@ class SpectatordPublisher {
asio::local::datagram_protocol::socket local_socket_;
std::string buffer_;
uint32_t bytes_to_buffer_;
std::chrono::steady_clock::time_point last_flush_time_;
std::chrono::seconds flush_interval_;

I think we should have the flush interval be millis just for additional granularity.
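A rough sketch of that suggestion (illustrative only, not this PR's code): storing the interval as milliseconds keeps existing seconds-based call sites working, since std::chrono converts losslessly in that direction.

#include <chrono>

struct Config {
  // Hypothetical millisecond-granularity interval; std::chrono::seconds
  // converts to std::chrono::milliseconds implicitly (the conversion is
  // lossless), so callers passing std::chrono::seconds(5) keep compiling
  // unchanged.
  std::chrono::milliseconds flush_interval{std::chrono::seconds(5)};
};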

@@ -55,11 +59,16 @@ void SpectatordPublisher::setup_unix_domain(absl::string_view path) {
std::string local_path{path};
sender_ = [local_path, this](std::string_view msg) {
buffer_.append(msg);
if (buffer_.length() >= bytes_to_buffer_) {
auto now = std::chrono::steady_clock::now();

const

@@ -55,11 +59,16 @@ void SpectatordPublisher::setup_unix_domain(absl::string_view path) {
std::string local_path{path};
sender_ = [local_path, this](std::string_view msg) {
buffer_.append(msg);
if (buffer_.length() >= bytes_to_buffer_) {
auto now = std::chrono::steady_clock::now();
bool should_flush = buffer_.length() >= bytes_to_buffer_ ||

const

if (buffer_.length() >= bytes_to_buffer_) {
auto now = std::chrono::steady_clock::now();
bool should_flush = buffer_.length() >= bytes_to_buffer_ ||
std::chrono::duration_cast<std::chrono::seconds>(now - last_flush_time_) >= flush_interval_;

No need to do duration cast
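A minimal sketch of why the cast can be dropped (names are illustrative): comparing the raw time difference against a std::chrono::seconds interval already works through the durations' common type.

#include <chrono>

// No duration_cast needed: operator>= converts both operands to their common
// duration type before comparing.
bool interval_elapsed(std::chrono::steady_clock::time_point last_flush_time,
                      std::chrono::seconds flush_interval) {
  const auto now = std::chrono::steady_clock::now();
  return (now - last_flush_time) >= flush_interval;
}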

logger->info("Unix Server started on path {}", path);

// Set buffer size to a large value so that flushing is based on time
SpectatordPublisher publisher{fmt::format("unix:{}", path), 10000, std::chrono::seconds(5)};

Not a blocker here, but ideally we'd have some sort of test time system instead of a real clock.
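One possible shape for such a test time system (hypothetical, not part of this PR): let the publisher take a time-source callback that defaults to steady_clock::now, so tests can advance a fake clock instead of sleeping for real.

#include <chrono>
#include <functional>

using TimePoint = std::chrono::steady_clock::time_point;
using TimeSource = std::function<TimePoint()>;  // injected into the publisher

// Fake clock for tests: advance() replaces std::this_thread::sleep_for.
struct FakeClock {
  TimePoint now{};
  TimePoint operator()() const { return now; }
  void advance(std::chrono::seconds d) { now += d; }
};

// Usage sketch: construct the publisher with std::ref(fake_clock) as its
// TimeSource, then call fake_clock.advance(std::chrono::seconds(6)) to cross
// the flush interval deterministically.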

@cancecen
Owner

The change overall looks good to me - however I still don't quite get why we need it, based on my previous tests.

Here's some quick math too: Envoy flushes every 5 seconds, and we send the metrics here every 60 seconds no matter what the size. That means in the worst case we'd have 12 batches of whole Envoy metrics. The current batch limit is 60KB, so for this to help in every flush interval we'd need to be sending less than 5KB of metric lines. Last time I checked we were already above this.
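Spelling out that arithmetic (numbers taken from the comment above, not from this PR's configuration):

// Worst case: one time-based flush collects every Envoy stats-sink flush in between.
constexpr int envoy_flush_period_s = 5;                                        // Envoy sink flush period
constexpr int flush_interval_s = 60;                                           // proposed time-based flush
constexpr int batches_per_interval = flush_interval_s / envoy_flush_period_s;  // 12
constexpr int batch_limit_bytes = 60 * 1024;                                   // current 60KB batch limit
constexpr int break_even_bytes = batch_limit_bytes / batches_per_interval;     // ~5KB per Envoy flush
// Above ~5KB of metric lines per Envoy flush, the size threshold fires before
// the 60s timer would, so the time-based flush wouldn't change anything.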

@fvallenilla
Author

@cancecen This is related to the special handling that happens for IPC metrics. They're not flushed with everything else:
https://github.netflix.net/corp/cpie-proxyd/blob/7ee083971176597883d87fc5fb52fce1307880c6/proxyd/netflix/extensions/stat_sinks/atlas/sink.cc#L130-L134

Note the difference between IPC and HTTP metrics:
https://go.prod.netflix.net/SMRnQm

@fvallenilla
Author

Is there something else that could cause that sort of un-evenness?

@cancecen
Owner

cancecen commented Aug 30, 2024

> @cancecen This is related to the special handling that happens for IPC metrics. They're not flushed with everything else: https://github.netflix.net/corp/cpie-proxyd/blob/7ee083971176597883d87fc5fb52fce1307880c6/proxyd/netflix/extensions/stat_sinks/atlas/sink.cc#L130-L134
>
> Note the difference between IPC and HTTP metrics: https://go.prod.netflix.net/SMRnQm

https://github.netflix.net/corp/cpie-proxyd/blob/7ee083971176597883d87fc5fb52fce1307880c6/proxyd/netflix/extensions/stat_sinks/atlas/sink.cc#L169

Yeah, but they get sent out to the buffer immediately.

@cancecen
Owner

> Is there something else that could cause that sort of un-evenness?

I don't think so. Did you observe evenness when you tested this with your change? I can't remember if you showed the fixed graph in your demo, sorry.

@cancecen cancecen merged commit 170401a into cancecen:main Aug 30, 2024
1 check passed
@fvallenilla
Author

For posterity, here is what things look like after the change.

  • Metrics don't go to 0 for minutes at a time anymore, which was the main issue to address.
  • The shape of IPC metrics more closely matches HTTP metrics, but they are still slightly spikier.
    [screenshot of the metrics after the change]
