net: Add bytes sent/received metrics #2629
base: master
Conversation
Force-pushed from 975e5ac to 70ccf9c.
src/net/native-stack.cc
Outdated
}
void increment_bytes_received(uint64_t bytes) {
    net::bytes_received += bytes;
}
This must be in the same translation unit as it's used in; statistics must compile to a single ADD instruction.
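To illustrate the constraint (a rough sketch, not the patch's actual layout; whether it really compiles down to a single add depends on the TLS model):

#include <cstdint>

namespace net {

// Counter defined in the same translation unit that increments it, so the
// compiler can emit a plain add on the thread-local slot instead of a call
// through a helper in another TU.
thread_local uint64_t bytes_received = 0;

inline void increment_bytes_received(uint64_t bytes) {
    net::bytes_received += bytes;   // ideally a single add into TLS
}

} // namespace net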
src/net/native-stack-impl.hh
Outdated
@@ -193,6 +201,7 @@ public:
        : _conn(std::move(conn)) {}
    using data_sink_impl::put;
    virtual future<> put(packet p) override {
        internal::increment_bytes_sent(p.len());
Better place is the network interface class.
You mean to measure the bytes or to store the counter? I have put the counter into a separate class now as most of the classes in the native stack are templates so it's not immediately easy to put static members there.
You mean to measure the bytes or to store the counter?
Both.
I have put the counter into a separate class now as most of the classes in the native stack are templates so it's not immediately easy to put static members there.
It's okay to make the counters namespace-scope.
For write amplification, you can compare TCP bytes sent vs interface bytes sent (not sure such a counter exists), or just count packets and multiply by a typical header size. It would be more interesting to make these counters per scheduling group. As it is, the difference between this and the host counters is too trivial.
Force-pushed from 70ccf9c to 0ab3604.
This sounds interesting, yes, but I think it goes a bit against the "single ADD instruction", as the streams are not directly linked to any scheduling group, so we'd always have to get the current group (maybe via the scheduling_group_specific stuff)?
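For reference, a minimal sketch of the per-group variant being discussed (assuming seastar::current_scheduling_group() for the lookup; the index helper and the array name here are illustrative, not the patch's final code):

#include <seastar/core/scheduling.hh>
#include <array>
#include <cstdint>

// Sketch: one counter slot per scheduling group, indexed by the group that is
// current when the bytes are sent.
inline thread_local std::array<uint64_t, seastar::max_scheduling_groups()> bytes_sent = {};

inline void increment_bytes_sent(uint64_t bytes) {
    auto sg = seastar::current_scheduling_group();
    // scheduling-group index lookup: assumed accessor, the real patch may differ
    bytes_sent[seastar::internal::scheduling_group_index(sg)] += bytes;
}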
Just last night I was having a very sweet dream 😴 that seastar had network in/out metrics, which were in my top 5 metrics along with aio write/write-bytes, reactor utilization and tasks processed, helping me crush performance problems almost as soon as they materialize. I no longer had to beg end users to separately ship me host-level metrics, I never had to argue with them that something else on the host was slamming the network interface, I never had to guess which shard was doing all the network IO, I just had net metrics there along with all my other seastar metrics, with the same labels and aggregation rules. Then I woke up 😞.

[1] Admittedly, this doesn't happen for bytes as far as I'm aware, but it does occur for other metrics like "connections": node_exporter doesn't (and won't) include all connections in their metrics, since the /proc files they query only include connections for the root cgroup, and not for the cgroups where the action is happening, so your "host level" metrics report very few connections, even if you have 1 million hidden in a pod. So this would be a great case for a seastar-level connection metric.
Force-pushed from 0ab3604 to c6c9ae7.
Looking at it, it seems to take about 2-3 instructions.
That's reasonable. The idea is that those instructions don't block the rest of the instruction flow since nothing depends on them.
It is good to have dreams to inspire you and as an escape from dreadful reality.
We found that connection counts are better when associated with a scope. For example, client-server connection counts vs node-node connection counts.
btw, this would be useful for network scheduling. Looking at AWS i7ie instances that have 0.5 Gbps (50 MB/s) per vcpu and 6 Gbps (600 MB/s) per disk, the bottleneck is no longer in either the CPU or disk, but the network. So soon we'll have to apply scheduling group shares to networking. I'm not sure we'll do it at user level - perhaps we can offload it to Linux traffic control. But having the metrics is useful.
Indeed, we have application level connection metrics as well. Still, the host metrics have proven useful there as a second "totally inclusive check" and because there are some states where the connection is still known by the kernel but not by the application or vice-versa (e.g., TCP keepalive has detected that the connection has dropped but the application hasn't noticed because it hasn't sent anything). Annoyingly, connection-related metrics, including stuff we can't do at the application layer like some connection memory use, aren't available in host metrics because they're per-cgroup and not rolled up into parent cgroups, unlike most other metrics.
Yeah, I guess this is the spiritual successor to i3en, but they've dropped the "n", so bandwidth is halved, and for a mixed RW load I think disk throughput is up even more than advertised because these 2nd and 3rd generation nitro SSDs look like they are almost totally full-duplex, compared to the old ones which weren't. OTOH even with older instance types we could be network limited (usually network egress) in cases where workloads hit in cache, so bytes out over the network didn't necessarily come from disk at all.
Ah, I thought they had to drop the "n" because the second "i" ate its budget.
Sure, you're moving bytes around while we're managing (usually small) cells, so you're more likely to hit the network limit first. This strengthens the case for network scheduling.
Makes sense. Did you report the Linux accounting problem upstream?
Force-pushed from c6c9ae7 to eb61401.
I have pushed an update now that tracks the metrics per scheduling group.
I feel like it is "by design" for the way the networking namespaces work in cgroups, so it's very unlikely to be fixed (I guess networking namespaces are not even really in a hierarchy). I did go to report it for node_exporter (which I think needs to do the hard work of aggregating these values across all namespaces), but it was already reported and closed as won't-fix. However, in that latter case there were sort of two issues mixed together: whether node exporter should report "per pod/cgroup" values (definitely no), and whether NE should aggregate across all cgroups for values that aren't rolled up to the top level. So maybe it's worth another shot on the NE side.
uint64_t bytes_received() const {
    return _bytes_received;
};
};
Minor, but such pure data classes are better as structs with exposed data members and inline initializers for all data members. Less noisy.
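i.e., roughly this shape (the struct name is hypothetical):

// Plain aggregate with exposed members and inline initializers, instead of a
// class with private members plus trivial accessors.
struct network_stats {
    uint64_t bytes_sent = 0;
    uint64_t bytes_received = 0;
};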
We did investigate the struct option on our end, but it added 1 instruction to the hot path and, as we knew Avi-instruction-counter would be reviewing this, we went with the separated approach.
The extra instruction is probably only on x86: at 16 bytes the struct can't be indexed with scaled indexed addressing, which maxes out at *8, so it needs a separate shl to do the indexing.
How can struct/class possibly matter here? The layout is the same.
Sorry @avikivity, I didn't look carefully enough; I thought you were referring to the separate read/write TLS variables, which are not bundled into a struct, not this one which is used to expose the metrics.
changed
src/net/native-stack-impl.hh
Outdated
#else
static inline thread_local std::array<unsigned, max_scheduling_groups()> bytes_sent = {};
static inline thread_local std::array<unsigned, max_scheduling_groups()> bytes_received = {};
#endif
Doesn't the inline option work for both modes?
Doesn't the inline option work for both modes?
Well, this is a question I also had, but "no" according to this (where inline was removed from thread_locals in shared lib builds).
However, there it was just presented as fact that this doesn't work, but is that a bug in the compilers or an implementation limitation of thread_local? Even across shared libs, non-thread_local globals will correctly resolve down to a single symbol across the process at runtime, via weak symbol magic (the same thing that makes static members inside templates work), but I guess that didn't extend to thread_local, even though I think the C++ standard requires it.
If static inline doesn't work, why not use static non-inline instead of splitting the implementation?
Same reason as #1494 does it (presumably): the static inline is more efficient since it allows direct access to the thread-local from other TUs. If you use non-inline, other TUs will need to indirect through an expensive general-purpose get_tls_blah method.
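A sketch of the two access patterns being contrasted (counter names hypothetical; the exact codegen depends on the TLS model and linkage):

#include <cstdint>

// (a) static inline thread_local visible in the header: other TUs reference
//     the variable directly, so the increment stays a plain TLS add.
inline thread_local uint64_t bytes_sent_inline = 0;
inline void add_inline(uint64_t n) { bytes_sent_inline += n; }

// (b) non-inline definition hidden in one .cc, exposed through an accessor:
//     every other TU pays for a function call on the hot path.
uint64_t& bytes_sent_accessor();   // declaration visible to all TUs
// in exactly one .cc file:
// uint64_t& bytes_sent_accessor() { static thread_local uint64_t v = 0; return v; }
inline void add_via_accessor(uint64_t n) { bytes_sent_accessor() += n; }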
FTR we already went through the same confusion internally and concluded that the split approach in #1494 is the right thing per above: I can dig up the godbolt links from our internal discussion if you want.
So to confirm, you think that for now we should do:
inline uint64_t& bytes_sent(unsigned sg_id) {
static thread_local std::array<uint64_t, max_scheduling_groups()> bytes_sent;
return bytes_sent[sg_id];
}
?
Just to clarify, I think we agreed above that static-thread-local-in-a-function isn't any better than a global in terms of (a) removing ifdefs or (b) performance. It behaves identically to an "inline global" in those respects.
So you want that for scoping reasons?
I just want the #ifdef gone and not to compromise on static library performance.
I think that the problem with global inline only happens when it is visible to both the shared library and the main executable, which isn't the case. So I think we can go back to global inline.
I just want the #ifdef gone and not to compromise on static library performance.
As discussed above, the function-local static provides no benefit in that case. It also suffers from the "duplicate copies in .so" bug if shared across TUs, as evidenced by #1494 which removed exactly that pattern; and if there is no need to share across TUs, we can just put the global in the .cc file, no problem.
So I think we can go back to global inline.
Yes, agree for this case (not in public includes).
I have pushed "global inline" now. Let me know if I have misunderstood what you wanted.
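Presumably the #ifdef block above then collapses to plain namespace-scope definitions, something like (element type and names as in the earlier snippet; this is the shape, not necessarily the exact code pushed):

inline thread_local std::array<unsigned, max_scheduling_groups()> bytes_sent = {};
inline thread_local std::array<unsigned, max_scheduling_groups()> bytes_received = {};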
@@ -960,6 +960,9 @@ reactor::task_queue::register_stats() {
        }, sm::description("Total amount in milliseconds we were in violation of the task quota"),
        {group_label}),
    });

    register_net_metrics_for_scheduling_group(new_metrics, _id, group_label);
This is the only call, so new groups won't have metrics.
Wait that's completely wrong.
Sorry I don't follow. This gets called whenever a new group is added (or renamed). It works effectively the same way as scheduling groups get their metrics right now.
"that's completely wrong" referred to my previous comment
+1
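For context, a rough sketch of what such a per-group registration helper could look like with the Seastar metrics API (sm::make_counter, sm::description and label_instance are the existing API; the metric/group names and the counter arrays here are illustrative assumptions, not the patch's exact code):

#include <seastar/core/metrics.hh>
#include <seastar/core/scheduling.hh>
#include <array>
#include <cstdint>

namespace sm = seastar::metrics;

namespace net {
// Illustrative per-group counters (the real patch's storage may differ).
inline thread_local std::array<uint64_t, seastar::max_scheduling_groups()> bytes_sent = {};
inline thread_local std::array<uint64_t, seastar::max_scheduling_groups()> bytes_received = {};
}

// Registers the two per-group counters under a hypothetical "network" metric group,
// labelled with the scheduling group, so they show up next to the other task-queue metrics.
void register_net_metrics_for_scheduling_group(sm::metric_groups& metrics,
                                               unsigned sg_id,
                                               const sm::label_instance& group_label) {
    metrics.add_group("network", {
        sm::make_counter("bytes_sent",
            [sg_id] { return net::bytes_sent[sg_id]; },
            sm::description("Total bytes sent by this scheduling group"), {group_label}),
        sm::make_counter("bytes_received",
            [sg_id] { return net::bytes_received[sg_id]; },
            sm::description("Total bytes received by this scheduling group"), {group_label}),
    });
}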
Force-pushed from eb61401 to 2646a55.
Force-pushed from 2646a55 to ed98fa8.
Adds two basic metrics that track bytes sent and received from the networking stack. This can give better per-application insight than what host-wide metrics (for example via node exporter) can do. Further, it can serve as a comparison point with host-based metrics for possible write amplification (though this is less likely on the network side). The patch follows a similar pattern to how memory- and disk-based metrics work.

Note one might be inclined to introduce some grouping into the metrics in relation to either the interface or source IP, similar to how we have mountpoints and/or groups on the disk side. While this sounds useful at first, in practice it would be less so. Often for the major cloud providers, and similarly in self-hosted environments, there is only a single interface/source IP and routing happens at a later point (switches, routers etc.). Further, adding this separation would make the implementation more expensive in either compute or memory space.

We link the metrics to the CPU scheduling group, which allows for getting a more detailed picture of where the network usage is coming from. Further, it sets up for future network scheduling at the group level.
Force-pushed from ed98fa8 to 882c0a9.
Adds two basic metrics that track bytes sent and received from the
networking stack.
This can give better per-application insight than what host-wide
metrics (for example via node exporter) can do.
Further, it can serve as a comparison point with host-based metrics for
possible write amplification (though this is less likely on the network
side).
The patch follows a similar pattern to how memory- and disk-based
metrics work.
Note one might be inclined to introduce some grouping into the metrics
in relation to either the interface or source IP, similar to how we
have mountpoints and/or groups on the disk side.
While this sounds useful at first, in practice it would be less so.
Often for the major cloud providers, and similarly in self-hosted
environments, there is only a single interface/source IP and routing
happens at a later point (switches, routers etc.). Further, adding this
separation would make the implementation more expensive in either
compute or memory space.
A possible improvement would be to add tagging to the API such that
users could be explicit about attribution. However, this is left as a
future improvement and this patch keeps it simple for now.