Handle NetlinkDumpInterrupted, fix worker metrics going stale after exceptions #137

Merged
merged 1 commit into from
Apr 23, 2024

Member

@DasSkelett DasSkelett commented Mar 28, 2024

This hardens the worker's connected-peers metrics collection threads by handling exceptions gracefully instead of letting them crash the thread. A crashed thread caused the metrics to go stale over time, one domain after another.
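The idea can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `run_metrics_loop`, the `publish` callable, the `stop` event and the `interval` parameter are all hypothetical names chosen for the example.

```python
import logging
import threading
from typing import Callable

logger = logging.getLogger(__name__)


def run_metrics_loop(
    publish: Callable[[], None], stop: threading.Event, interval: float
) -> None:
    """Call publish() repeatedly until stop is set (hypothetical helper).

    Any exception from publish() is logged and swallowed, so one failed
    iteration does not end the thread and leave the metrics stale.
    """
    while not stop.is_set():
        try:
            publish()
        except Exception:
            # Log the traceback and keep the loop (and thus the thread) alive.
            logger.exception("Publishing metrics failed, retrying next iteration")
        # Sleep for the interval, but wake up early if stop is set.
        stop.wait(interval)
```

Such a loop would typically run as the target of one `threading.Thread` per domain, with the event set on shutdown.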

This also fixes one particular cause of exceptions, pyroute2.netlink.exceptions.NetlinkDumpInterrupted, which apparently is an "expected" exception: it is raised whenever the underlying data changes while the kernel is returning it. It signals to the userspace application that the data might be inconsistent and that the netlink request should be retried.
This probably happens when a new peer is added while the interface data is being queried.

Mar 28 22:25:07 gw04.in.ffmuc.net wgkex[3226905]: Exception in thread Thread-31:
Mar 28 22:25:07 gw04.in.ffmuc.net wgkex[3226905]: Traceback (most recent call last):
Mar 28 22:25:07 gw04.in.ffmuc.net wgkex[3226905]:   File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
Mar 28 22:25:07 gw04.in.ffmuc.net wgkex[3226905]:     self.run()
Mar 28 22:25:07 gw04.in.ffmuc.net wgkex[3226905]:   File "/usr/lib/python3.8/threading.py", line 870, in run
Mar 28 22:25:07 gw04.in.ffmuc.net wgkex[3226905]:     self._target(*self._args, **self._kwargs)
Mar 28 22:25:07 gw04.in.ffmuc.net wgkex[3226905]:   File "/srv/wgkex/wgkex/wgkex/worker/mqtt.py", line 209, in publish_metrics_loop
Mar 28 22:25:07 gw04.in.ffmuc.net wgkex[3226905]:     publish_metrics(client, topic, domain)
Mar 28 22:25:07 gw04.in.ffmuc.net wgkex[3226905]:   File "/srv/wgkex/wgkex/wgkex/worker/mqtt.py", line 230, in publish_metrics
Mar 28 22:25:07 gw04.in.ffmuc.net wgkex[3226905]:     peer_count = get_connected_peers_count(iface)
Mar 28 22:25:07 gw04.in.ffmuc.net wgkex[3226905]:   File "/srv/wgkex/wgkex/wgkex/worker/netlink.py", line 226, in get_connected_peers_count
Mar 28 22:25:07 gw04.in.ffmuc.net wgkex[3226905]:     msgs = wg.info(wg_interface)
Mar 28 22:25:07 gw04.in.ffmuc.net wgkex[3226905]:   File "/srv/wgkex/wgkex/venv/lib/python3.8/site-packages/pyroute2/netlink/generic/wireguard.py", line 274, in info
Mar 28 22:25:07 gw04.in.ffmuc.net wgkex[3226905]:     return self.nlm_request(
Mar 28 22:25:07 gw04.in.ffmuc.net wgkex[3226905]:   File "/srv/wgkex/wgkex/venv/lib/python3.8/site-packages/pyroute2/netlink/nlsocket.py", line 870, in nlm_request
Mar 28 22:25:07 gw04.in.ffmuc.net wgkex[3226905]:     return tuple(self._genlm_request(*argv, **kwarg))
Mar 28 22:25:07 gw04.in.ffmuc.net wgkex[3226905]:   File "/srv/wgkex/wgkex/venv/lib/python3.8/site-packages/pyroute2/netlink/generic/__init__.py", line 126, in nlm_request
Mar 28 22:25:07 gw04.in.ffmuc.net wgkex[3226905]:     return tuple(super().nlm_request(*argv, **kwarg))
Mar 28 22:25:07 gw04.in.ffmuc.net wgkex[3226905]:   File "/srv/wgkex/wgkex/venv/lib/python3.8/site-packages/pyroute2/netlink/nlsocket.py", line 1257, in nlm_request
Mar 28 22:25:07 gw04.in.ffmuc.net wgkex[3226905]:     raise defer
Mar 28 22:25:07 gw04.in.ffmuc.net wgkex[3226905]: pyroute2.netlink.exceptions.NetlinkDumpInterrupted: (-1, 'dump interrupted')

We retry once; if it still doesn't succeed, we ignore the error, don't publish new metrics data, and wait for the next loop iteration.
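A rough sketch of that retry-once behaviour, under stated assumptions: `DumpInterrupted` is a local stand-in for pyroute2's NetlinkDumpInterrupted so the example runs without pyroute2, and `call_with_one_retry` and `read` are hypothetical names, not functions from wgkex.

```python
import logging
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


class DumpInterrupted(Exception):
    """Stand-in for pyroute2.netlink.exceptions.NetlinkDumpInterrupted (assumption)."""


def call_with_one_retry(read: Callable[[], T]) -> Optional[T]:
    """Call read(), retrying once if the dump was interrupted.

    Returns None when both attempts fail, so the caller can skip
    publishing and wait for the next loop iteration.
    """
    for attempt in (1, 2):
        try:
            return read()
        except DumpInterrupted:
            logger.warning("Netlink dump interrupted (attempt %d of 2)", attempt)
    return None
```

A caller would then publish only when the result is not None.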

We can also think about additionally doing our own mutex locking across all places where we read or write wg interface data, especially if this happens more often during, e.g., mass reconnects, where up-to-date peer numbers would be important as well. I'll monitor this and see whether the additional complexity is needed and worth it.

@DasSkelett DasSkelett added the bug Something isn't working label Mar 28, 2024
@T0biii
Member

T0biii commented Mar 28, 2024

ModuleNotFoundError: No module named 'pyroute2.netlink.exceptions'; 'pyroute2.netlink' is not a package - https://github.com/freifunkMUC/wgkex/actions/runs/8474567354/job/23221187018?pr=137#step:4:61
I guess something in https://github.com/freifunkMUC/wgkex/blob/main/wgkex/worker/netlink_test.py needs to be adjusted as well?
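For context: Python raises "... is not a package" when the parent entry in `sys.modules` has no `__path__` attribute, which is the case for a plain mock object placed there. The sketch below shows one possible way to stub the module hierarchy so that the import works in a test. It is an assumption about the cause, not necessarily the change made in this PR; `FakeNetlinkDumpInterrupted`, `build_stub_modules` and `import_under_stubs` are hypothetical names.

```python
import sys
import types
from unittest import mock


class FakeNetlinkDumpInterrupted(Exception):
    """Stand-in exception class; a real class so it can be raised and caught."""


def build_stub_modules() -> dict:
    """Build stub modules for pyroute2, pyroute2.netlink and its exceptions."""
    pkg = types.ModuleType("pyroute2")
    netlink = types.ModuleType("pyroute2.netlink")
    exceptions = types.ModuleType("pyroute2.netlink.exceptions")
    # A __path__ attribute is what makes the import system treat a module
    # as a package that may contain submodules.
    pkg.__path__ = []
    netlink.__path__ = []
    exceptions.NetlinkDumpInterrupted = FakeNetlinkDumpInterrupted
    pkg.netlink = netlink
    netlink.exceptions = exceptions
    return {
        "pyroute2": pkg,
        "pyroute2.netlink": netlink,
        "pyroute2.netlink.exceptions": exceptions,
    }


def import_under_stubs():
    """Import the exception while the stubs are registered in sys.modules."""
    with mock.patch.dict(sys.modules, build_stub_modules()):
        from pyroute2.netlink.exceptions import NetlinkDumpInterrupted
        return NetlinkDumpInterrupted
```

`mock.patch.dict` restores `sys.modules` when the block exits, so the stubs do not leak into other tests.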

@T0biii
Member

T0biii commented Mar 28, 2024

What is coverage/coveralls for?

@DasSkelett DasSkelett marked this pull request as draft March 29, 2024 00:03
@DasSkelett
Member Author

What is coverage/coveralls for?

It's a test coverage checker. Basically, it complains that a handful of new lines are not covered by unit tests.
I might have an idea for an easy test to add.

@DasSkelett DasSkelett force-pushed the fix/netlink-interrupt branch 3 times, most recently from 4efe903 to 5dd2d33 Compare April 2, 2024 18:54
@DasSkelett DasSkelett marked this pull request as ready for review April 2, 2024 18:57
@DasSkelett
Member Author

Well this was a bit of a fight with the mock library, but got there in the end. This should be ready now.

Member

@T0biii T0biii left a comment


LGTM (catching Exceptions)

@DasSkelett DasSkelett merged commit 9483fc0 into freifunkMUC:main Apr 23, 2024
4 checks passed
@DasSkelett DasSkelett deleted the fix/netlink-interrupt branch April 23, 2024 20:14
Labels
bug Something isn't working worker