net: Add bytes sent/received metrics #2629

Open

StephanDollberg wants to merge 1 commit into master from stephan/net-tx-rx-metrics

Conversation

@StephanDollberg (Contributor)

Adds two basic metrics that track bytes sent and received by the
networking stack.

This can give better per-application insight than host-wide metrics
(for example via node_exporter) can.

Further, it can serve as a comparison point against host-based metrics
for possible write amplification (though this is less likely on the
network side).

The patch follows a similar pattern to how the memory- and disk-based
metrics work.

Note that one might be inclined to group the metrics by interface or
source IP, similar to how we have mountpoints and/or groups on the
disk side.

While this sounds useful at first, in practice it would be less so.
On the major cloud providers, and similarly in self-hosted
environments, there is often only a single interface/source IP, and
routing happens at a later point (switches, routers, etc.). Further,
adding this separation would make the implementation more expensive in
either compute or memory.

A possible improvement would be to add tagging to the API so that
users could be explicit about attribution. However, this is left as a
future improvement and this patch keeps it simple for now.

@StephanDollberg force-pushed the stephan/net-tx-rx-metrics branch from 975e5ac to 70ccf9c (January 23, 2025 12:24)
}
void increment_bytes_received(uint64_t bytes) {
net::bytes_received += bytes;
}
Member

This must be in the same translation unit as it's used in; statistics must compile to a single ADD instruction.

@@ -193,6 +201,7 @@ public:
: _conn(std::move(conn)) {}
using data_sink_impl::put;
virtual future<> put(packet p) override {
internal::increment_bytes_sent(p.len());
Member

A better place is the network interface class.

@StephanDollberg (Contributor Author), Jan 27, 2025

You mean to measure the bytes or to store the counter? I have put the counter into a separate class now as most of the classes in the native stack are templates so it's not immediately easy to put static members there.

Member

You mean to measure the bytes or to store the counter?

Both.

I have put the counter into a separate class now as most of the classes in the native stack are templates so it's not immediately easy to put static members there.

It's okay to make the counters namespace-scope.
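
(For illustration only, a minimal sketch of the namespace-scope variant, mirroring the increment helpers shown in the diff above; the exact namespace, names, and placement are assumptions, not the final patch:)

// Sketch: namespace-scope, per-shard counters kept in the same translation
// unit as the fast path, so each increment stays a plain add on a
// thread-local slot.
#include <cstdint>

namespace seastar::net {

thread_local uint64_t bytes_sent = 0;
thread_local uint64_t bytes_received = 0;

namespace internal {

inline void increment_bytes_sent(uint64_t bytes) noexcept {
    bytes_sent += bytes;
}

inline void increment_bytes_received(uint64_t bytes) noexcept {
    bytes_received += bytes;
}

} // namespace internal
} // namespace seastar::net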

@avikivity (Member)

For write amplification, you can compare TCP bytes sent vs interface bytes sent (not sure such a counter exists), or just count packets and multiply by the typical header size.

It would be more interesting to make these counters per scheduling group. As is, the difference between this and the host counters is too trivial.

@StephanDollberg force-pushed the stephan/net-tx-rx-metrics branch from 70ccf9c to 0ab3604 (January 27, 2025 09:05)
@StephanDollberg (Contributor Author)

It would be more interesting to make these counters per scheduling group. As is the difference between this and the host counters is too trivial.

This sounds interesting, yes, but I think it goes a bit against the "single ADD instruction" idea, as the streams are not directly linked to any scheduling group, so we'd always have to look up the current group (maybe via the scheduling_group_specific stuff)?

@travisdowns (Contributor), Jan 27, 2025

Just last night I was having a very sweet dream 😴 that seastar had network in/out metrics, which were in my top 5 metrics along with aio write/write-bytes, reactor utilization and tasks processed, helping me crush performance problems almost as soon as they materialize.

I no longer had to beg end users to separately ship me host-level metrics, I never had to argue with them that something else on the host was slamming the network interface, I never had to guess what shard was doing all the network IO, I just had net metrics there along with all my other seastar metrics, the same labels, aggregation rules, ?foo=bar query capabilities on my metrics endpoint, I never had to struggle with weird k8s setups with interface names apparently generated like strong passwords, nor with per-cgroup statistics which aren't rolled up into the host-level metrics [1].

Then I woke up 😞 .


[1] Admittedly, this doesn't happen for bytes as far as I'm aware, but it does occur for other metrics like "connections": node_exporter doesn't (and won't) include all connections in its metrics, since the /proc files it queries only include connections for the root cgroup, and not for the cgroups where the action is happening, so your "host level" metrics report very few connections, even if you have 1 million hidden in a pod. So this would be a great case for a seastar-level connection metric.

@StephanDollberg force-pushed the stephan/net-tx-rx-metrics branch from 0ab3604 to c6c9ae7 (January 28, 2025 09:40)
@StephanDollberg (Contributor Author)

but I think it goes a bit against the "single ADD instruction"

Looking at it, it seems to take about 2-3 instructions.

@avikivity (Member)

but I think it goes a bit against the "single ADD instruction"

Looking at it, it seems to take about 2-3 instructions.

That's reasonable. The idea is that those instructions don't block the rest of the instruction flow since nothing depends on them.

@avikivity (Member)

Just last night I was having a very sweet dream 😴 that seastar had network in/out metrics [...] Then I woke up 😞.

It is good to have dreams to inspire you and as an escape from dreadful reality.

[1] Admittedly, this doesn't happen for bytes as far as I'm aware, but does occur for other metrics like "connections" [...] So this would be a great case for a seastar-level connection metric.

We found that connection counts are better when associated with a scope. For example, client-server connection counts vs node-node connection counts.

@avikivity (Member)

It would be more interesting to make these counters per scheduling group. As is the difference between this and the host counters is too trivial.

btw, this would be useful for network scheduling. Looking at AWS i7ie instances that have 0.5 Gbps (50 MB/s) per vcpu and 6 Gbps (600 MB/s) per disk, the bottleneck is no longer in either the CPU or disk, but the network. So soon we'll have to apply scheduling group shares to networking.

I'm not sure we'll do it at user level - perhaps we can offload it to Linux traffic control. But having the metrics is useful.

@travisdowns (Contributor)

We found that connection counts are better when associated with a scope. For example, client-server connection counts vs node-node connection counts.

Indeed, we have application-level connection metrics as well. Still, the host metrics have proven useful there as a second "totally inclusive check", and because there are some states where the connection is still known by the kernel but not by the application, or vice-versa (e.g., TCP keepalive has detected that the connection has dropped but the application hasn't noticed because it hasn't sent anything). Annoyingly, connection-related metrics, including stuff we can't do at the application layer like some connection memory use, aren't available in host metrics because they're per-cgroup and not rolled up into parent cgroups, unlike most other metrics.

@travisdowns (Contributor)

Looking at AWS i7ie instances that have 0.5 Gbps (50 MB/s) per vcpu and 6 Gbps (600 MB/s) per disk, the bottleneck is no longer in either the CPU or disk, but the network.

Yeah, I guess this is the spiritual successor to i3en, but they've dropped the "n", so bandwidth is halved, and for a mixed R/W load I think disk throughput is up even more than advertised, because these 2nd and 3rd generation Nitro SSDs look like they are almost totally full-duplex compared to the old ones, which weren't.

OTOH even with older instance types we could be network limited (usually network egress) in cases where workloads hit in cache so bytes out over the network didn't necessarily come from disk at all.

@avikivity (Member)

Looking at AWS i7ie instances that have 0.5 Gbps (50 MB/s) per vcpu and 6 Gbps (600 MB/s) per disk, the bottleneck is no longer in either the CPU or disk, but the network.

Yeah, I guess this is the spiritual successor to i3en, but they've dropped the "n",

Ah, I thought they had to drop the "n" because the second "i" ate its budget.

so bandwidth is halved, and for a mixed R/W load I think disk throughput is up even more than advertised, because these 2nd and 3rd generation Nitro SSDs look like they are almost totally full-duplex compared to the old ones, which weren't.

OTOH even with older instance types we could be network limited (usually network egress) in cases where workloads hit in cache so bytes out over the network didn't necessarily come from disk at all.

Sure, you're moving bytes around while we're managing (usually small) cells, so you're more likely to hit the network limit first.

This strengthens the case for network scheduling.

@avikivity (Member)

We found that connection counts are better when associated with a scope. For example, client-server connection counts vs node-node connection counts.

Indeed, we have application-level connection metrics as well. [...] Annoyingly, connection-related metrics [...] aren't available in host metrics because they're per-cgroup and not rolled up into parent cgroups, unlike most other metrics.

Makes sense.

Did you report the Linux accounting problem upstream?

@StephanDollberg force-pushed the stephan/net-tx-rx-metrics branch from c6c9ae7 to eb61401 (January 29, 2025 09:17)
@StephanDollberg (Contributor Author)

I have pushed an update now that tracks the metrics per scheduling group.
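
(A rough sketch of the per-scheduling-group shape, not the exact patch code; it assumes a dense per-group index is available, e.g. via seastar's internal scheduling_group_index() helper or an equivalent accessor:)

#include <array>
#include <cstdint>
#include <seastar/core/scheduling.hh>

// Sketch: one counter slot per scheduling group, per shard.
thread_local std::array<uint64_t, seastar::max_scheduling_groups()> bytes_sent = {};

inline void increment_bytes_sent(uint64_t bytes) noexcept {
    // Attribute the bytes to whichever group is currently running.
    auto sg = seastar::current_scheduling_group();
    bytes_sent[seastar::internal::scheduling_group_index(sg)] += bytes;
}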

@travisdowns (Contributor)

Did you report the Linux accounting problem upstream?

I feel like it is "by design" for the way networking namespaces work in cgroups, so it's very unlikely to be fixed (I guess networking namespaces are not even really in a hierarchy). I did go to report it for node_exporter (which I think needs to do the hard work of aggregating these values for all namespaces), but it was already reported and closed as won't-fix. However, in that latter case there were sort of two issues mixed together: whether node_exporter should report "per pod/cgroup" values (definitely no), or whether it should aggregate across all cgroups for values that aren't rolled up to the top level.

So maybe it's worth another shot on the NE side.

uint64_t bytes_received() const {
return _bytes_received;
};
};
Member

Minor, but such pure data classes are better as structs with exposed data members and inline initializers for all data members. Less noisy.
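
(For example, something along these lines; the name is made up and the fields just mirror the accessors above:)

// Plain aggregate with exposed members and inline initializers,
// instead of a class with getters.
struct network_stats {
    uint64_t bytes_sent = 0;
    uint64_t bytes_received = 0;
};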

@travisdowns (Contributor), Jan 29, 2025

We did investigate the struct option on our end, but it added 1 instruction to the hot path, and as we knew Avi-instruction-counter would be reviewing this we went with the separated approach.

The extra instruction is probably only on x86, because at 16 bytes the struct can't be indexed with indexed addressing (which maxes out at * 8), so it needs a separate shl to do the indexing.

Member

How can struct/class possibly matter here? The layout is the same.

Contributor

Sorry @avikivity, I didn't look carefully enough; I thought you were referring to the separate read/write TLS variables, which are not bundled into a struct, not this one, which is used to expose the metrics.

Contributor Author

changed

#else
static inline thread_local std::array<unsigned, max_scheduling_groups()> bytes_sent = {};
static inline thread_local std::array<unsigned, max_scheduling_groups()> bytes_received = {};
#endif
Member

Doesn't the inline option work for both modes?

Contributor

Doesn't the inline option work for both modes?

Well, this is a question I also had, but "no" according to this (where inline was removed from thread_locals in shared lib builds):

#1494

However, there it was just presented as fact that this doesn't work, but is that a bug in the compilers or an implementation limitation of thread_local? Even across shared libs, non-thread_local globals will correctly resolve down to a single symbol across the process at runtime, via weak symbol magic (the same thing that makes static members inside templates work), but I guess that doesn't extend to thread_local, even though I think the C++ standard requires it.

Member

If static inline doesn't work, why not use static non-inline instead of splitting the implementation?

Contributor

Same reason as #1494 does it (presumably): the static inline is more efficient since it allows direct access to the thread local from other TUs. If you use non-inline, other TUs will need to indirect through an expensive general-purpose get_tls_blah method.
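
(A sketch of the two declaration patterns being compared; the variable names are illustrative:)

#include <cstdint>

// Option A: inline thread_local defined in the header. Every TU that
// includes it sees the definition and can address the TLS slot directly.
inline thread_local uint64_t bytes_sent_inline = 0;

// Option B: non-inline. The header only declares the variable...
extern thread_local uint64_t bytes_sent_extern;
// ...and exactly one .cc file defines it:
//     thread_local uint64_t bytes_sent_extern = 0;
// Per the discussion above, other TUs then reach it through the generic
// TLS access path, which is the cost being avoided here.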

@travisdowns (Contributor), Jan 29, 2025

FTR we already went through the same confusion internally and concluded that the split approach in #1494 is the right thing per the above: I can dig up the godbolt links from our internal discussion if you want.

@StephanDollberg (Contributor Author), Jan 31, 2025

So to confirm, you think for now we should do:

inline uint64_t& bytes_sent(unsigned sg_id) {
    static thread_local std::array<uint64_t, max_scheduling_groups()> bytes_sent;
    return bytes_sent[sg_id];
}

?

Contributor

Just to clarify, I think we agreed above that static-thread-local-in-a-function isn't any better than a global in terms of (a) removing ifdefs or (b) performance. It behaves identically to an "inline global" in those respects.

So you want that for scoping reasons?

Member

I just want the #ifdef gone and not to compromise on static library performance.

I think that the problem with global inline only happens when it is visible to both the shared library and the main executable, which isn't the case. So I think we can go back to global inline.

Contributor

I just want the #ifdef gone and not to compromise on static library performance.

As discussed above, the function-local static provides no benefit in that case. It also suffers from the "duplicate copies in .so" bug (as evidenced by #1494, which removed exactly that pattern) if shared across TUs; and if there is no need to share across TUs, we can just put the global in the .cc file, no problem.

So I think we can go back to global inline.

Yes, agree for this case (not in public includes).

Contributor Author

I have pushed "global inline" now. Let me know if I have misunderstood what you wanted.

@@ -960,6 +960,9 @@ reactor::task_queue::register_stats() {
}, sm::description("Total amount in milliseconds we were in violation of the task quota"),
{group_label}),
});

register_net_metrics_for_scheduling_group(new_metrics, _id, group_label);

Member

This is the only call, so new groups won't have metrics.

Member

Wait, that's completely wrong.

Contributor Author

Sorry I don't follow. This gets called whenever a new group is added (or renamed). It works effectively the same way as scheduling groups get their metrics right now.
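
(For context, a sketch of roughly what such a per-group registration helper could look like; the signature, group name, and body are guesses based on the call site above, not the patch itself:)

#include <array>
#include <cstdint>
#include <seastar/core/metrics.hh>
#include <seastar/core/scheduling.hh>

// Illustrative redeclaration of the per-group, per-shard counters from the diff above.
inline thread_local std::array<uint64_t, seastar::max_scheduling_groups()> bytes_sent = {};
inline thread_local std::array<uint64_t, seastar::max_scheduling_groups()> bytes_received = {};

// Hypothetical shape of the helper called from task_queue::register_stats().
// sg_id indexes the per-group counter arrays; group_label carries the group
// name so these series line up with the other per-group metrics.
void register_net_metrics_for_scheduling_group(seastar::metrics::metric_groups& new_metrics,
                                               unsigned sg_id,
                                               const seastar::metrics::label_instance& group_label) {
    namespace sm = seastar::metrics;
    new_metrics.add_group("network", {
        sm::make_counter("bytes_sent", [sg_id] { return bytes_sent[sg_id]; },
                sm::description("Bytes sent by the networking stack"), {group_label}),
        sm::make_counter("bytes_received", [sg_id] { return bytes_received[sg_id]; },
                sm::description("Bytes received by the networking stack"), {group_label}),
    });
}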

Member

"that's completely wrong" referred to my previous comment

@travisdowns (Contributor)

This strengthens the case for network scheduling.

+1

Adds two basic metrics that track bytes sent and received by the
networking stack.

This can give better per-application insight than host-wide metrics
(for example via node_exporter) can.

Further, it can serve as a comparison point against host-based metrics
for possible write amplification (though this is less likely on the
network side).

The patch follows a similar pattern to how the memory- and disk-based
metrics work.

Note that one might be inclined to group the metrics by interface or
source IP, similar to how we have mountpoints and/or groups on the
disk side.

While this sounds useful at first, in practice it would be less so.
On the major cloud providers, and similarly in self-hosted
environments, there is often only a single interface/source IP, and
routing happens at a later point (switches, routers, etc.). Further,
adding this separation would make the implementation more expensive in
either compute or memory.

We link the metrics to the CPU scheduling group, which gives a more
detailed picture of where the network usage is coming from.

Further, it sets up for future network scheduling at the group level.
@StephanDollberg force-pushed the stephan/net-tx-rx-metrics branch from ed98fa8 to 882c0a9 (January 31, 2025 18:36)