net: Add bytes sent/received metrics #2629

Open

StephanDollberg wants to merge 1 commit into master from stephan/net-tx-rx-metrics

Conversation

@StephanDollberg (Contributor)

Adds two basic metrics that track bytes sent and received by the
networking stack.

This can give better per-application insight than host-wide metrics
(for example via node_exporter) can.

Further, it can serve as a comparison point against host-based metrics
for possible write amplification (though this is less likely on the
network side).

The patch follows a similar pattern to how the memory- and disk-based
metrics work.

Note that one might be inclined to group the metrics by interface or
source IP, similar to how we have mountpoints and/or groups on the
disk side.

While this sounds useful at first, in practice it would be less so.
On the major cloud providers, and similarly in self-hosted
environments, there is often only a single interface/source IP, and
routing happens at a later point (switches, routers, etc.). Further,
adding this separation would make the implementation more expensive in
either compute or memory.

A possible improvement would be to add tagging to the API so that
users could be explicit about attribution. However, this is left as a
future improvement and this patch keeps it simple for now.

@StephanDollberg force-pushed the stephan/net-tx-rx-metrics branch from 975e5ac to 70ccf9c (January 23, 2025 12:24)
}
void increment_bytes_received(uint64_t bytes) {
net::bytes_received += bytes;
}
Member

This must be in the same translation unit as it's used in; statistics must compile to a single ADD instruction.

@@ -193,6 +201,7 @@ public:
: _conn(std::move(conn)) {}
using data_sink_impl::put;
virtual future<> put(packet p) override {
internal::increment_bytes_sent(p.len());
Member

A better place is the network interface class.

@StephanDollberg (Contributor Author), Jan 27, 2025

You mean to measure the bytes or to store the counter? I have put the counter into a separate class now as most of the classes in the native stack are templates so it's not immediately easy to put static members there.

Member

You mean to measure the bytes or to store the counter?

Both.

I have put the counter into a separate class now as most of the classes in the native stack are templates so it's not immediately easy to put static members there.

It's okay to make the counters namespace-scope.
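
(For illustration only, a minimal sketch of the namespace-scope variant, mirroring the increment helpers shown in the diff above; the exact namespace, names, and placement are assumptions, not the final patch:)

// Sketch: namespace-scope, per-shard counters kept in the same translation
// unit as the fast path, so each increment stays a plain add on a
// thread-local slot.
#include <cstdint>

namespace seastar::net {

thread_local uint64_t bytes_sent = 0;
thread_local uint64_t bytes_received = 0;

namespace internal {

inline void increment_bytes_sent(uint64_t bytes) noexcept {
    bytes_sent += bytes;
}

inline void increment_bytes_received(uint64_t bytes) noexcept {
    bytes_received += bytes;
}

} // namespace internal
} // namespace seastar::net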

@avikivity (Member)

For write amplification, you can compare TCP bytes sent vs interface bytes sent (not sure such a counter exists), or just count packets and multiply by the typical header size.

It would be more interesting to make these counters per scheduling group. As is, the difference between this and the host counters is too trivial.

@StephanDollberg force-pushed the stephan/net-tx-rx-metrics branch from 70ccf9c to 0ab3604 (January 27, 2025 09:05)
@StephanDollberg (Contributor Author)

It would be more interesting to make these counters per scheduling group. As is the difference between this and the host counters is too trivial.

This sounds interesting, yes, but I think it goes a bit against the "single ADD instruction" idea, as the streams are not directly linked to any scheduling group, so we'd always have to look up the current group (maybe via the scheduling_group_specific stuff)?

@travisdowns (Contributor), Jan 27, 2025

Just last night I was having a very sweet dream 😴 that seastar had network in/out metrics, which were in my top 5 metrics along with aio write/write-bytes, reactor utilization and tasks processed, helping me crush performance problems almost as soon as they materialize.

I no longer had to beg end users to separately ship me host-level metrics, I never had to argue with them that something else on the host was slamming the network interface, I never had to guess what shard was doing all the network IO, I just had net metrics there along with all my other seastar metrics, the same labels, aggregation rules, ?foo=bar query capabilities on my metrics endpoint, I never had to struggle with weird k8s setups with interface names apparently generated like strong passwords, nor with per-cgroup statistics which aren't rolled up into the host-level metrics [1].

Then I woke up 😞 .


[1] Admittedly, this doesn't happen for bytes as far as I'm aware, but it does occur for other metrics like "connections": node_exporter doesn't (and won't) include all connections in its metrics, since the /proc files it queries only include connections for the root cgroup, and not for the cgroups where the action is happening, so your "host level" metrics report very few connections, even if you have 1 million hidden in a pod. So this would be a great case for a seastar-level connection metric.

@StephanDollberg force-pushed the stephan/net-tx-rx-metrics branch from 0ab3604 to c6c9ae7 (January 28, 2025 09:40)
@StephanDollberg (Contributor Author)

but I think it goes a bit against the "single ADD instruction"

Looking at it, it seems to take about 2-3 instructions.

@avikivity (Member)

but I think it goes a bit against the "single ADD instruction"

Looking at it, it seems to take about 2-3 instructions.

That's reasonable. The idea is that those instructions don't block the rest of the instruction flow since nothing depends on them.

@avikivity (Member)

Just last night I was having a very sweet dream 😴 that seastar had network in/out metrics [...] Then I woke up 😞.

It is good to have dreams to inspire you and as an escape from dreadful reality.

[1] Admittedly, this doesn't happen for bytes as far as I'm aware, but does occur for other metrics like "connections" [...] So this would be a great case for a seastar-level connection metric.

We found that connection counts are better when associated with a scope. For example, client-server connection counts vs node-node connection counts.

@avikivity (Member)

It would be more interesting to make these counters per scheduling group. As is the difference between this and the host counters is too trivial.

btw, this would be useful for network scheduling. Looking at AWS i7ie instances that have 0.5 Gbps (50 MB/s) per vcpu and 6 Gbps (600 MB/s) per disk, the bottleneck is no longer in either the CPU or disk, but the network. So soon we'll have to apply scheduling group shares to networking.

I'm not sure we'll do it at user level - perhaps we can offload it to Linux traffic control. But having the metrics is useful.

@travisdowns (Contributor)

We found that connection counts are better when associated with a scope. For example, client-server connection counts vs node-node connection counts.

Indeed, we have application-level connection metrics as well. Still, the host metrics have proven useful there as a second "totally inclusive check", and because there are some states where the connection is still known by the kernel but not by the application, or vice-versa (e.g., TCP keepalive has detected that the connection has dropped but the application hasn't noticed because it hasn't sent anything). Annoyingly, connection-related metrics, including stuff we can't do at the application layer like some connection memory use, aren't available in host metrics because they're per-cgroup and not rolled up into parent cgroups, unlike most other metrics.

@travisdowns (Contributor)

Looking at AWS i7ie instances that have 0.5 Gbps (50 MB/s) per vcpu and 6 Gbps (600 MB/s) per disk, the bottleneck is no longer in either the CPU or disk, but the network.

Yeah, I guess this is the spiritual successor to i3en, but they've dropped the "n", so bandwidth is halved, and for a mixed R/W load I think disk throughput is up even more than advertised, because these 2nd and 3rd generation Nitro SSDs look like they are almost totally full-duplex compared to the old ones, which weren't.

OTOH even with older instance types we could be network limited (usually network egress) in cases where workloads hit in cache so bytes out over the network didn't necessarily come from disk at all.

@avikivity (Member)

Looking at AWS i7ie instances that have 0.5 Gbps (50 MB/s) per vcpu and 6 Gbps (600 MB/s) per disk, the bottleneck is no longer in either the CPU or disk, but the network.

Yeah, I guess this is the spiritual successor to i3en, but they've dropped the "n",

Ah, I thought they had to drop the "n" because the second "i" ate its budget.

so bandwidth is halved, and for a mixed R/W load I think disk throughput is up even more than advertised, because these 2nd and 3rd generation Nitro SSDs look like they are almost totally full-duplex compared to the old ones, which weren't.

OTOH even with older instance types we could be network limited (usually network egress) in cases where workloads hit in cache so bytes out over the network didn't necessarily come from disk at all.

Sure, you're moving bytes around while we're managing (usually small) cells, so you're more likely to hit the network limit first.

This strengthens the case for network scheduling.

@avikivity (Member)

We found that connection counts are better when associated with a scope. For example, client-server connection counts vs node-node connection counts.

Indeed, we have application-level connection metrics as well. [...] Annoyingly, connection-related metrics [...] aren't available in host metrics because they're per-cgroup and not rolled up into parent cgroups, unlike most other metrics.

Makes sense.

Did you report the Linux accounting problem upstream?

@StephanDollberg force-pushed the stephan/net-tx-rx-metrics branch from c6c9ae7 to eb61401 (January 29, 2025 09:17)
@StephanDollberg (Contributor Author)

I have pushed an update now that tracks the metrics per scheduling group.
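
(A rough sketch of the per-scheduling-group shape, not the exact patch code; it assumes a dense per-group index is available, e.g. via seastar's internal scheduling_group_index() helper or an equivalent accessor:)

#include <array>
#include <cstdint>
#include <seastar/core/scheduling.hh>

// Sketch: one counter slot per scheduling group, per shard.
thread_local std::array<uint64_t, seastar::max_scheduling_groups()> bytes_sent = {};

inline void increment_bytes_sent(uint64_t bytes) noexcept {
    // Attribute the bytes to whichever group is currently running.
    auto sg = seastar::current_scheduling_group();
    bytes_sent[seastar::internal::scheduling_group_index(sg)] += bytes;
}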

@travisdowns (Contributor)

Did you report the Linux accounting problem upstream?

I feel like it is "by design" for the way networking namespaces work in cgroups, so it's very unlikely to be fixed (I guess networking namespaces are not even really in a hierarchy). I did go to report it for node_exporter (which I think needs to do the hard work of aggregating these values for all namespaces), but it was already reported and closed as won't-fix. However, in that latter case there were sort of two issues mixed together: whether node_exporter should report "per pod/cgroup" values (definitely no), or whether it should aggregate across all cgroups for values that aren't rolled up to the top level.

So maybe it's worth another shot on the NE side.

uint64_t bytes_received() const {
return _bytes_received;
};
};
Member

Minor, but such pure data classes are better as structs with exposed data members and inline initializers for all data members. Less noisy.
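
(For example, something along these lines; the name is made up and the fields just mirror the accessors above:)

// Plain aggregate with exposed members and inline initializers,
// instead of a class with getters.
struct network_stats {
    uint64_t bytes_sent = 0;
    uint64_t bytes_received = 0;
};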

@travisdowns (Contributor), Jan 29, 2025

We did investigate the struct option on our end, but it added 1 instruction to the hot path, and as we knew Avi-instruction-counter would be reviewing this we went with the separated approach.

The extra instruction is probably only on x86, because at 16 bytes the struct can't be indexed with indexed addressing (which maxes out at * 8), so it needs a separate shl to do the indexing.

Member

How can struct/class possibly matter here? The layout is the same.

Contributor

Sorry @avikivity, I didn't look carefully enough; I thought you were referring to the separate read/write TLS variables, which are not bundled into a struct, not this one, which is used to expose the metrics.

Contributor Author

changed

#else
static inline thread_local std::array<unsigned, max_scheduling_groups()> bytes_sent = {};
static inline thread_local std::array<unsigned, max_scheduling_groups()> bytes_received = {};
#endif
Member

Doesn't the inline option work for both modes?

Contributor

Doesn't the inline option work for both modes?

Well, this is a question I also had, but "no" according to this (where inline was removed from thread_locals in shared lib builds):

#1494

However, there it was just presented as fact that this doesn't work, but is that a bug in the compilers or an implementation limitation of thread_local? Even across shared libs, non-thread_local globals will correctly resolve down to a single symbol across the process at runtime, via weak symbol magic (the same thing that makes static members inside templates work), but I guess that doesn't extend to thread_local, even though I think the C++ standard requires it.

Member

If static inline doesn't work, why not use static non-inline instead of splitting the implementation?

Contributor

Same reason as #1494 does it (presumably): the static inline is more efficient since it allows direct access to the thread local from other TUs. If you use non-inline, other TUs will need to indirect through an expensive general-purpose get_tls_blah method.
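
(A sketch of the two declaration patterns being compared; the variable names are illustrative:)

#include <cstdint>

// Option A: inline thread_local defined in the header. Every TU that
// includes it sees the definition and can address the TLS slot directly.
inline thread_local uint64_t bytes_sent_inline = 0;

// Option B: non-inline. The header only declares the variable...
extern thread_local uint64_t bytes_sent_extern;
// ...and exactly one .cc file defines it:
//     thread_local uint64_t bytes_sent_extern = 0;
// Per the discussion above, other TUs then reach it through the generic
// TLS access path, which is the cost being avoided here.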

@travisdowns (Contributor), Jan 29, 2025

FTR we already went through the same confusion internally and concluded that the split approach in #1494 is the right thing per the above: I can dig up the godbolt links from our internal discussion if you want.

@StephanDollberg (Contributor Author), Jan 31, 2025

So to confirm, you think for now we should do:

inline uint64_t& bytes_sent(unsigned sg_id) {
    static thread_local std::array<uint64_t, max_scheduling_groups()> bytes_sent;
    return bytes_sent[sg_id];
}

?

Contributor

Just to clarify, I think we agreed above that static-thread-local-in-a-function isn't any better than a global in terms of (a) removing ifdefs or (b) performance. It behaves identically to an "inline global" in those respects.

So you want that for scoping reasons?

Member

I just want the #ifdef gone and not to compromise on static library performance.

I think that the problem with global inline only happens when it is visible to both the shared library and the main executable, which isn't the case. So I think we can go back to global inline.

Contributor

I just want the #ifdef gone and not to compromise on static library performance.

As discussed above, the function-local static provides no benefit in that case. It also suffers from the "duplicate copies in .so" bug (as evidenced by #1494, which removed exactly that pattern) if shared across TUs; and if there is no need to share across TUs, we can just put the global in the .cc file, no problem.

So I think we can go back to global inline.

Yes, agree for this case (not in public includes).

Contributor Author

I have pushed "global inline" now. Let me know if I have misunderstood what you wanted.

@@ -960,6 +960,9 @@ reactor::task_queue::register_stats() {
}, sm::description("Total amount in milliseconds we were in violation of the task quota"),
{group_label}),
});

register_net_metrics_for_scheduling_group(new_metrics, _id, group_label);

Member

This is the only call, so new groups won't have metrics.

Member

Wait, that's completely wrong.

Contributor Author

Sorry I don't follow. This gets called whenever a new group is added (or renamed). It works effectively the same way as scheduling groups get their metrics right now.
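
(For context, a sketch of roughly what such a per-group registration helper could look like; the signature, group name, and body are guesses based on the call site above, not the patch itself:)

#include <array>
#include <cstdint>
#include <seastar/core/metrics.hh>
#include <seastar/core/scheduling.hh>

// Illustrative redeclaration of the per-group, per-shard counters from the diff above.
inline thread_local std::array<uint64_t, seastar::max_scheduling_groups()> bytes_sent = {};
inline thread_local std::array<uint64_t, seastar::max_scheduling_groups()> bytes_received = {};

// Hypothetical shape of the helper called from task_queue::register_stats().
// sg_id indexes the per-group counter arrays; group_label carries the group
// name so these series line up with the other per-group metrics.
void register_net_metrics_for_scheduling_group(seastar::metrics::metric_groups& new_metrics,
                                               unsigned sg_id,
                                               const seastar::metrics::label_instance& group_label) {
    namespace sm = seastar::metrics;
    new_metrics.add_group("network", {
        sm::make_counter("bytes_sent", [sg_id] { return bytes_sent[sg_id]; },
                sm::description("Bytes sent by the networking stack"), {group_label}),
        sm::make_counter("bytes_received", [sg_id] { return bytes_received[sg_id]; },
                sm::description("Bytes received by the networking stack"), {group_label}),
    });
}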

Member

"that's completely wrong" referred to my previous comment

@travisdowns (Contributor)

This strengthens the case for network scheduling.

+1

Adds two basic metrics that track bytes sent and received by the
networking stack.

This can give better per-application insight than host-wide metrics
(for example via node_exporter) can.

Further, it can serve as a comparison point against host-based metrics
for possible write amplification (though this is less likely on the
network side).

The patch follows a similar pattern to how the memory- and disk-based
metrics work.

Note that one might be inclined to group the metrics by interface or
source IP, similar to how we have mountpoints and/or groups on the
disk side.

While this sounds useful at first, in practice it would be less so.
On the major cloud providers, and similarly in self-hosted
environments, there is often only a single interface/source IP, and
routing happens at a later point (switches, routers, etc.). Further,
adding this separation would make the implementation more expensive in
either compute or memory.

We link the metrics to the CPU scheduling group, which gives a more
detailed picture of where the network usage is coming from.

Further, it sets up for future network scheduling at the group level.
@StephanDollberg force-pushed the stephan/net-tx-rx-metrics branch from ed98fa8 to 882c0a9 (January 31, 2025 18:36)