Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: collector.arp.netlink is super slow to retrieve metrics #3075

Closed
maxime1907 opened this issue Jul 17, 2024 · 14 comments · Fixed by #3133
Closed

Bug: collector.arp.netlink is super slow to retrieve metrics #3075

maxime1907 opened this issue Jul 17, 2024 · 14 comments · Fixed by #3133
Labels

Comments

@maxime1907
Copy link

maxime1907 commented Jul 17, 2024

Host operating system: output of uname -a

Linux node 6.1.96-flatcar #1 SMP PREEMPT_DYNAMIC Mon Jul  1 23:29:55 -00 2024 x86_64 AMD EPYC 9654 96-Core Processor AuthenticAMD GNU/Linux

node_exporter version: output of node_exporter --version

node_exporter, version 1.8.1 (branch: HEAD, revision: 400c3979931613db930ea035f39ce7b377cdbb5b)
  build user:       root@7afbff271a3f
  build date:       20240521-18:36:22
  go version:       go1.22.3
  platform:         linux/amd64
  tags:             unknown

node_exporter command line flags

            - '--path.procfs=/host/proc'
            - '--path.sysfs=/host/sys'
            - '--path.rootfs=/host/root'
            - '--path.udev.data=/host/root/run/udev/data'
            - '--web.listen-address=[$(HOST_IP)]:9100'
            - '--collector.textfile.directory=/var/lib/node_exporter'
            - '--no-collector.ipvs'
            - '--collector.netclass.netlink'

node_exporter log output

ts=2024-07-17T14:32:56.130Z caller=node_exporter.go:193 level=info msg="Starting node_exporter" version="(version=1.8.1, branch=HEAD, revision=400c3979931613db930ea035f39ce7b377cdbb5b)"
ts=2024-07-17T14:32:56.131Z caller=node_exporter.go:194 level=info msg="Build context" build_context="(go=go1.22.3, platform=linux/amd64, user=root@7afbff271a3f, date=20240521-18:36:22, tags=unknown)"
ts=2024-07-17T14:32:56.133Z caller=filesystem_common.go:111 level=info collector=filesystem msg="Parsed flag --collector.filesystem.mount-points-exclude" flag=^/(dev|proc|run/credentials/.+|sys|var/lib/docker/.+|var/lib/containers/storage/.+)($|/)
ts=2024-07-17T14:32:56.133Z caller=filesystem_common.go:113 level=info collector=filesystem msg="Parsed flag --collector.filesystem.fs-types-exclude" flag=^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$
ts=2024-07-17T14:32:56.135Z caller=diskstats_common.go:111 level=info collector=diskstats msg="Parsed flag --collector.diskstats.device-exclude" flag=^(z?ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p)\d+$
ts=2024-07-17T14:32:56.138Z caller=node_exporter.go:111 level=info msg="Enabled collectors"
ts=2024-07-17T14:32:56.138Z caller=node_exporter.go:118 level=info collector=arp
ts=2024-07-17T14:32:56.138Z caller=node_exporter.go:118 level=info collector=bcache
ts=2024-07-17T14:32:56.138Z caller=node_exporter.go:118 level=info collector=bonding
ts=2024-07-17T14:32:56.138Z caller=node_exporter.go:118 level=info collector=btrfs
ts=2024-07-17T14:32:56.138Z caller=node_exporter.go:118 level=info collector=conntrack
ts=2024-07-17T14:32:56.138Z caller=node_exporter.go:118 level=info collector=cpu
ts=2024-07-17T14:32:56.138Z caller=node_exporter.go:118 level=info collector=cpufreq
ts=2024-07-17T14:32:56.138Z caller=node_exporter.go:118 level=info collector=diskstats
ts=2024-07-17T14:32:56.138Z caller=node_exporter.go:118 level=info collector=dmi
ts=2024-07-17T14:32:56.138Z caller=node_exporter.go:118 level=info collector=edac
ts=2024-07-17T14:32:56.138Z caller=node_exporter.go:118 level=info collector=entropy
ts=2024-07-17T14:32:56.138Z caller=node_exporter.go:118 level=info collector=fibrechannel
ts=2024-07-17T14:32:56.138Z caller=node_exporter.go:118 level=info collector=filefd
ts=2024-07-17T14:32:56.138Z caller=node_exporter.go:118 level=info collector=filesystem
ts=2024-07-17T14:32:56.138Z caller=node_exporter.go:118 level=info collector=hwmon
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=infiniband
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=loadavg
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=mdadm
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=meminfo
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=netclass
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=netdev
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=netstat
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=nfs
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=nfsd
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=nvme
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=os
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=powersupplyclass
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=pressure
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=rapl
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=schedstat
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=selinux
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=sockstat
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=softnet
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=stat
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=tapestats
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=textfile
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=thermal_zone
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=time
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=timex
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=udp_queues
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=uname
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=vmstat
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=watchdog
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=xfs
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=zfs
ts=2024-07-17T14:32:56.182Z caller=tls_config.go:313 level=info msg="Listening on" address=[::]:9100
ts=2024-07-17T14:32:56.182Z caller=tls_config.go:316 level=info msg="TLS is disabled." http2=false address=[::]:9100

Are you running node_exporter in Docker?

No, we use containerd directly

What did you do that produced an error?

Run node exporter which uses by default the collector collector.arp.netlink

What did you expect to see?

Normal scrape times like 300ms

What did you see instead?

Scrape times of more than 20 seconds

Here is a graph of before and after the deactivation of the arp netlink collector with this flag:

            - '--no-collector.arp.netlink'

image

Here is the output of pprof with:

go tool pprof -svg http://localhost:9100/debug/pprof/profile > node_exporter.svg

node_exporter

@chengjoey
Copy link
Contributor

you are not running in docker, it's the --path.procfs correct?
Or is there any error in the log?

maybe it's dial cost too many time

conn, err := rtnetlink.Dial(nil)
if err != nil {
return nil, err
}
defer conn.Close()

@dswarbrick
Copy link
Contributor

The pprof SVG appears to be truncated, and only shows the collection taking 1.84s (which is admittedly still kinda slow). I can't think of a reason off the top of my head for the rtnetlink variant of the arp collector to be slow. If anything, it should be faster than parsing /proc/net/arp, especially on systems with a large number of ARP entries.

Can you please attach a complete pprof, so that it reveals where that ~1.8s is being consumed by the arpCollector?

The gaps in your Grafana graphs also suggest that you're getting the occasional scrape timeout. Is anything unusual logged when that happens?

@discordianfish
Copy link
Member

Goign to close this due to inactivity. @maxime1907 feel free to re-open and attach the full pprof if this is still an issue

@maxime1907
Copy link
Author

maxime1907 commented Sep 23, 2024

Sorry for the late response i was busy these days and yes this is still an issue

@chengjoey
yes we mount /proc from the host machine to /host/proc inside the container as defined in the chart that we use here:

@dswarbrick
i have reuploaded the full pprof here:
This one is when we enable arp.netlink
node_exporter_arp_netlink_enabled
This one is when it is disabled:
node_exporter_arp_netlink_disabled

i think that the file being truncated is just github having some problems to render the whole svg but you can right click then save to see the full content

@dswarbrick
Copy link
Contributor

dswarbrick commented Sep 23, 2024

@maxime1907 Strange, clicking on the SVG images still shows them truncated. I can save the first one locally and it appears complete when I open the locally saved copy. If I try to save the second one, I get a zero-byte file. GitHub seems to be really messing with them.

The 1.98s consumed by Syscall6 is not solely caused by the ARP netlink collector, since all IO operations (including basic file IO) also ultimately go via a syscall to the kernel, and node_exporter does a lot of file IO when parsing the various procfs / sysfs entries. To really focus solely on the ARP collector and eliminate any other "noise" from the other collectors that ultimately culminate in a syscall, you would need to disable all other collectors and only enable the ARP collector - and then try switching between the netlink / procfs variants of the ARP collector.

Everything beyond the syscall is kernel-space, so outside the visibility of pprof. Do you have a very large number of network interfaces on the system, or a very large number of ARP entries in the cache? How long does it take to list ARP entries via a command like ip -4 neigh?

Are you able to try running a local build from the current master branch? I also don't want to rule out the possibility of a compiler quirk. A syscall of this nature should be extremely fast.

@dswarbrick
Copy link
Contributor

@maxime1907 Can you also try running node_exporter with --log.level debug and pasting the arp collector execution time, with and without netlink enabled, e.g.

ts=2024-09-23T11:33:33.678Z caller=collector.go:173 level=debug msg="collector succeeded" name=arp duration_seconds=3.5476e-05

@SuperQ
Copy link
Member

SuperQ commented Sep 23, 2024

Please post the pprof raw data to pprof.me. It's much easier to examine than the images.

@dswarbrick
Copy link
Contributor

dswarbrick commented Sep 23, 2024

Ok, on a small test system with half a dozen ARP entries, I see that the netlink method is measurably slower:

$ go run node_exporter.go --log.level=debug --collector.disable-defaults --collector.arp 2>&1 | grep arp
ts=2024-09-23T11:53:22.211Z caller=node_exporter.go:118 level=info collector=arp
ts=2024-09-23T11:53:27.002Z caller=collector.go:173 level=debug msg="collector succeeded" name=arp duration_seconds=0.00022538
ts=2024-09-23T11:53:27.907Z caller=collector.go:173 level=debug msg="collector succeeded" name=arp duration_seconds=0.000206112
ts=2024-09-23T11:53:28.899Z caller=collector.go:173 level=debug msg="collector succeeded" name=arp duration_seconds=0.000181347
ts=2024-09-23T11:53:29.787Z caller=collector.go:173 level=debug msg="collector succeeded" name=arp duration_seconds=0.00014375
ts=2024-09-23T11:53:30.689Z caller=collector.go:173 level=debug msg="collector succeeded" name=arp duration_seconds=0.00016240
$ go run node_exporter.go --log.level=debug --collector.disable-defaults --collector.arp --no-collector.arp.netlink 2>&1 | grep arp
ts=2024-09-23T11:52:37.021Z caller=node_exporter.go:118 level=info collector=arp
ts=2024-09-23T11:53:00.165Z caller=collector.go:173 level=debug msg="collector succeeded" name=arp duration_seconds=4.8105e-05
ts=2024-09-23T11:53:01.056Z caller=collector.go:173 level=debug msg="collector succeeded" name=arp duration_seconds=5.5741e-05
ts=2024-09-23T11:53:01.937Z caller=collector.go:173 level=debug msg="collector succeeded" name=arp duration_seconds=3.9537e-05
ts=2024-09-23T11:53:02.770Z caller=collector.go:173 level=debug msg="collector succeeded" name=arp duration_seconds=5.1522e-05
ts=2024-09-23T11:53:03.562Z caller=collector.go:173 level=debug msg="collector succeeded" name=arp duration_seconds=5.5472e-05

I don't have enough ARP entries to conclusively say whether the execution time scales proportionally to the number of ARP entries. However I doubt it, since the number of netlink operations to get the neighbors is constant. The only thing that is variable is the net.InterfaceByIndex() call, since that will perform a netlink operation to resolve the interface index to a device name, for each interface which has a non-zero ARP entry count. This interface name resolution is however only performed after summing the ARP entries per interface, i.e. it is not performed for each ARP entry.

	ifIndexEntries := make(map[uint32]uint32)

	for _, n := range neighbors {
		// Neighbors will also contain IPv6 neighbors, but since this is purely an ARP collector,
		// restrict to AF_INET. Also skip entries which have state NUD_NOARP to conform to output
		// of /proc/net/arp.
		if n.Family == unix.AF_INET && n.State&unix.NUD_NOARP == 0 {
			ifIndexEntries[n.Index]++
		}
	}

	enumEntries := make(map[string]uint32)

	// Convert interface indexes to names.
	for ifIndex, entryCount := range ifIndexEntries {
		iface, err := net.InterfaceByIndex(int(ifIndex))
		if err != nil {
			if errors.Unwrap(err).Error() == "no such network interface" {
				continue
			}
			return nil, err
		}

		enumEntries[iface.Name] = entryCount
	}

Also tested with current master branch, and both Go 1.22.7 and 1.23.1.

If I spam my ARP cache with (failed) entries, e.g., for x in {1..254}; do ping -c1 -W 0.01 192.168.1.$x; done, there is no appreciable increase in collector execution time. This is in line with the fact that there is only one netlink operation to fetch the neighbors (conn.Neigh.List()) , regardless of their number.

If I fudge 50K interfaces in the code to force a large number of net.InterfaceByIndex() calls, I can get the collector to take just under 0.5s.

	for x := 1; x < 50000; x++ {
		ifIndexEntries[uint32(x)] = 42
	}
ts=2024-09-23T12:25:09.746Z caller=collector.go:173 level=debug msg="collector succeeded" name=arp duration_seconds=0.466097157

That's still a long way from ~1.98s however, and I strongly doubt that many systems will have that many interfaces.

@maxime1907
Copy link
Author

maxime1907 commented Sep 23, 2024

@dswarbrick
Listing arp entries is instant

$ time ip -4 neigh | wc -l
278

real	0m0.011s
user	0m0.001s
sys	0m0.012s

About the number of interfaces its not that much

$ ip -o link show | wc -l
182

From the grafana dashboards in the issue description we can see that it take less than 1 second when arp.netlink is disabled and enabled it takes more than 20 seconds

I will try executing it with --debug tomorrow to see if it differs from our graphs

@SuperQ
i will post the full pprof with the website you gave tomorrow

@dswarbrick
Copy link
Contributor

@maxime1907 Something is really fishy if your scrapes are taking 20s with the ARP netlink collector enabled. Neither your number of ARP entries nor your number of interfaces would be sufficient to require more than a mere fraction of a second to collect via netlink. The timing of ip -4 neigh is more or less what I'd expect, but is that executed from within a container equivalent to that in which node_exporter runs?

Since I'm starting to run out of ideas what could cause such slow scrapes, what else can you tell us about the container environment in which node_exporter is running? Is there an selinux policy which could be interfering with netlink sockets?

Something possibly worth trying is an strace with just the ARP collector enable, filtering just the netlink-related stuff. For example:

strace -e sendto,recvfrom,sendmsg,recvmsg -f -r node_exporter --collector.disable-defaults --collector.arp

The -r switch will print relative timestamps between the syscalls.

@dswarbrick
Copy link
Contributor

And just to play devil's advocate, it's not unheard of for drivers to cause long delays in netlink sockets: https://unix.stackexchange.com/questions/584401/what-could-cause-af-netlink-socket-transactions-to-take-many-seconds

Since this issue is the first that I've heard of where the ARP netlink collector is misbehaving, I would be highly suspicious of the environment, rather than the code.

@maxime1907
Copy link
Author

@dswarbrick
i tried executing directly on the host machine and inside the node_exporter container same thing

i tried only executing the arp collector and still the same thing

here is the full strace output with the command you provided
output.zip

@dswarbrick
Copy link
Contributor

@maxime1907 Thanks for providing the strace output. That really clarified what is going on behind the scenes. I don't know if that output was just a single scrape, or whether you let strace run for a while, but I was a bit shocked to see roughly 18K syscalls in that output. It's no wonder that Go is spending so much time in Syscall6.

After digging through some of Go's net package source, it's apparent why there is such a huge amount of rtnetlink traffic in your strace. The node_exporter arp collector calls net.InterfaceByIndex(index int) to resolve an interface index to an interface name, once for each interface index that it encounters in the neighbor list. Behind the scenes, Go is eventually calling this private function:

// If the ifindex is zero, interfaceTable returns mappings of all
// network interfaces. Otherwise it returns a mapping of a specific
// interface.
func interfaceTable(ifindex int) ([]Interface, error) {
	tab, err := syscall.NetlinkRIB(syscall.RTM_GETLINK, syscall.AF_UNSPEC)
...
		if ifindex == 0 || ifindex == int(ifim.Index) {
			// (add to []interface result)

Despite the fact that this function takes an ifindex param, it is not actually used in the RTM_GETLINK syscall. Go is fetching the entire interface link table, and filtering that table later in the function.

There is a similar function in the jsimonetti/rtnetlink package, https://pkg.go.dev/github.com/jsimonetti/rtnetlink#LinkService.Get, which does pass the interface index to the rtnetlink syscall, resulting in a much more concise reply.

Here is a trace of the Go net.InterfaceByIndex() call:

[pid 16677]      0.000213 sendto(9, [{nlmsg_len=17, nlmsg_type=RTM_GETLINK, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1, nlmsg_pid=0}, {ifi_family=AF_UNSPEC, ...}], 17, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, 12) = 17

and here is jsimonetti/rtnetlink's equivalent:

[pid 16677]      0.000000 sendmsg(8, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=32, nlmsg_type=RTM_GETLINK, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP_FILTERED, nlmsg_seq=1063860054, nlmsg_pid=16677}, {ifi_family=AF_UNSPEC, ifi_type=ARPHRD_NETROM, ifi_index=if_nametoindex("wlp0s20f3"), ifi_flags=0, ifi_change=0}], iov_len=32}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 32

The jsimonetti/rtnetlink method is a larger rtnetlink request, because it specifies the interface index by which we want to filter (note the NLM_F_DUMP_FILTERED flag), compared to the Go net.InterfaceByIndex method NLM_F_DUMP give-me-everything flag. However, the jsimonetti/rtnetlink method results in rtnetlink replying with just the interface we asked about, instead of the entire link table as with Go's net package.

I count 183 caliNNN interfaces in your strace log, so the collector is fetching the entire link table (containing 183 interfaces worth of info), 183 times 🤦‍♂️

It appears that the kernel is packing two interfaces into each recvfrom response in your strace output. So doing the math, we have:

  • sendto syscall to fetch the link table
  • 183 / 2 = 92 recvfrom syscalls to read the response

All that, multiplied by 183, works out to ~17K syscalls, more or less in line with the log that you supplied.

The amount of rtnetlink response traffic will increase exponentially with the number of interfaces in the system.

So, we have two options:

  1. Use the low-level jsimonetti/rtnetlink method of fetching the interface name. This would still require an individual rtnetlink syscall for each interface, although the response to each would be small and concise - around 366 syscalls in the case of 183 interfaces (one sendto and one recvfrom for each interface)
  2. Fetch the interface table once with net.Interfaces(), and do our own filtering / matching. This would require significantly fewer syscalls - around 93 in the case of 183 interfaces (one sendto, and packed response comprised of 92 recvfrom)

In terms of context switching penalty, the latter option would be preferable.

dswarbrick added a commit to dswarbrick/node_exporter that referenced this issue Sep 25, 2024
dswarbrick added a commit to dswarbrick/node_exporter that referenced this issue Oct 4, 2024
github.com/jsimonetti/rtnetlink provides a high level rtnl wrapper
around the lower level rtnetlink functions, which essentially does all
that we need. The rtnl.Conn.Neighbors uses an internal cache for
resolving interface indexes to names, so it makes at most one rtnetlink
call per interface to resolve the name.

Using this high level wrapper hugely simplifies our code and makes it
easier to understand and maintain.

Fixes: prometheus#3075

Signed-off-by: Daniel Swarbrick <[email protected]>
SuperQ pushed a commit that referenced this issue Jan 10, 2025
github.com/jsimonetti/rtnetlink provides a high level rtnl wrapper
around the lower level rtnetlink functions, which essentially does all
that we need. The rtnl.Conn.Neighbors uses an internal cache for
resolving interface indexes to names, so it makes at most one rtnetlink
call per interface to resolve the name.

Using this high level wrapper hugely simplifies our code and makes it
easier to understand and maintain.

Fixes: #3075

Signed-off-by: Daniel Swarbrick <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants