Handle NetlinkDumpInterrupted, fix worker metrics going stale after exceptions #137
This hardens the worker's connected-peers metrics collection threads by handling exceptions gracefully instead of crashing the thread, which previously caused the metrics to go stale over time, domain after domain.
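A minimal sketch of that hardening, with hypothetical `collect`/`publish` callables standing in for the worker's actual metrics code: any exception raised in one iteration is logged and swallowed so the thread survives and can pick up again on the next pass.

```python
import logging
import time

log = logging.getLogger("metrics")

def metrics_loop(collect, publish, interval=10.0, max_iterations=None):
    """Hypothetical worker loop: an exception in one iteration is logged
    and swallowed instead of killing the thread, so metrics collection
    resumes on the next iteration rather than going permanently stale.

    `max_iterations` exists only to make the sketch testable; the real
    loop would run until shutdown.
    """
    n = 0
    while max_iterations is None or n < max_iterations:
        n += 1
        try:
            publish(collect())
        except Exception:
            log.exception("metrics collection failed; retrying next iteration")
        time.sleep(interval)
```

The broad `except Exception` is deliberate here: for a background metrics thread, a missed data point is preferable to a dead thread.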
This also fixes one particular cause of such exceptions, `pyroute2.netlink.exceptions.NetlinkDumpInterrupted`, which is apparently an "expected" exception: it is thrown whenever the underlying data changes while the kernel is returning it. This signals to the userspace application that the data might be useless and that the netlink request should be retried. It probably happens when a new peer is added while we are asking for the interface data.
We retry once, and if that still doesn't succeed we just give up, skip publishing new metrics data, and wait for the next loop iteration.
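The retry-once-then-skip pattern can be sketched like this. The `NetlinkDumpInterrupted` class below is a stand-in for the real `pyroute2.netlink.exceptions.NetlinkDumpInterrupted` (so the sketch is self-contained), and `fetch` is a hypothetical callable wrapping the actual pyroute2 query:

```python
class NetlinkDumpInterrupted(Exception):
    """Stand-in for pyroute2.netlink.exceptions.NetlinkDumpInterrupted."""

def get_peer_data(fetch):
    """Fetch interface data, retrying once if the dump is interrupted.

    Returns None when both attempts fail; the caller then skips
    publishing metrics for this loop iteration instead of crashing.
    """
    for attempt in range(2):
        try:
            return fetch()
        except NetlinkDumpInterrupted:
            if attempt == 0:
                continue  # kernel data changed mid-dump; retry once
            return None  # still interrupted: give up until next loop
```

Capping at a single retry keeps the loop bounded even if the interface data is churning constantly, e.g. during a mass reconnect.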
We could additionally do our own mutex locking across all places where we read/write wg interface data, especially if this happens more often during e.g. mass reconnects, where up-to-date peer numbers would be important as well. I'll monitor this and see whether the additional complexity is needed/worth it.