Handle NetlinkDumpInterrupted, fix worker metrics going stale after exceptions #137
This hardens the worker's connected-peers metrics collection threads by handling exceptions gracefully instead of crashing the thread, which previously caused the metrics to go stale over time, domain after domain.
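A minimal sketch of that hardening, with hypothetical `collect`/`publish` callables standing in for the worker's actual metrics code: any exception raised in one iteration is logged and swallowed so the thread survives and can pick up again on the next pass.

```python
import logging
import time

log = logging.getLogger("metrics")

def metrics_loop(collect, publish, interval=10.0, max_iterations=None):
    """Hypothetical worker loop: an exception in one iteration is logged
    and swallowed instead of killing the thread, so metrics collection
    resumes on the next iteration rather than going permanently stale.

    `max_iterations` exists only to make the sketch testable; the real
    loop would run until shutdown.
    """
    n = 0
    while max_iterations is None or n < max_iterations:
        n += 1
        try:
            publish(collect())
        except Exception:
            log.exception("metrics collection failed; retrying next iteration")
        time.sleep(interval)
```

The broad `except Exception` is deliberate here: for a background metrics thread, a missed data point is preferable to a dead thread.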
This also fixes one particular cause of such exceptions, `pyroute2.netlink.exceptions.NetlinkDumpInterrupted`, which is apparently an "expected" exception: it is thrown whenever the underlying data changes while the kernel is returning it. This signals to the userspace application that the data might be useless and that the netlink request should be retried. It probably happens when a new peer is added while we are asking for the interface data.
We retry once, and if that still doesn't succeed we just give up, skip publishing new metrics data, and wait for the next loop iteration.
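The retry-once-then-skip pattern can be sketched like this. The `NetlinkDumpInterrupted` class below is a stand-in for the real `pyroute2.netlink.exceptions.NetlinkDumpInterrupted` (so the sketch is self-contained), and `fetch` is a hypothetical callable wrapping the actual pyroute2 query:

```python
class NetlinkDumpInterrupted(Exception):
    """Stand-in for pyroute2.netlink.exceptions.NetlinkDumpInterrupted."""

def get_peer_data(fetch):
    """Fetch interface data, retrying once if the dump is interrupted.

    Returns None when both attempts fail; the caller then skips
    publishing metrics for this loop iteration instead of crashing.
    """
    for attempt in range(2):
        try:
            return fetch()
        except NetlinkDumpInterrupted:
            if attempt == 0:
                continue  # kernel data changed mid-dump; retry once
            return None  # still interrupted: give up until next loop
```

Capping at a single retry keeps the loop bounded even if the interface data is churning constantly, e.g. during a mass reconnect.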
We could additionally do our own mutex locking across all places where we read/write wg interface data, especially if this happens more often during e.g. mass reconnects, where up-to-date peer numbers would be important as well. I'll monitor this and see whether the additional complexity is needed/worth it.