-
Notifications
You must be signed in to change notification settings - Fork 486
Send SIGUSR1 to dnsmasq periodically #644
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Skipping CI for Draft Pull Request. |
/test all |
The failed test fails identically at HEAD (issue #646). |
/cc @aojea |
I think you have several changes in the same commit that makes it hard to review. In addition, what is the motivation to make this configurable, I don't think we should allow people to use this feature if is not for debugging, and if that is the case, you can perfectly do it with a simple bash script |
Thank you very much for a quick reaction, Antonio!
Yeah, I replace
You've mentioned running a bash script. Do I understand correctly that users can also run such a script sending SIGUSR1 to a dnsmasq instance? If so, why don't you want to allow them to use this feature?
We can hardcode some interval, but I thought it would be safer to have things configurable, even just for turning the feature on/off. The general motivation is to improve dnsmasq observability and not to require users to run custom scripts (or reach out to their cloud providers) to gain some insights about their kubedns load. Does it make sense to you? |
I can understand if you expose these metrics with a metrics/prometheus endpoints, that allows you to set alerts and integrate with common monitoring stacks ... but you still needs to evaluate the performance implications on sending a periodic signal to the process |
Does dnsmasq handle syscall interruption gracefully? e.g. on receipt of signal, the syscall is canceled with EINTR. |
Yes, dnsmasq mentions specific actions on
The statistics logged by dnsmasq on
For what you describe, the first step is still the current PR, which makes dnsmasq print statistics to logs, which we can later parse and expose with metrics/prometheus endpoints or set up log-based metrics. Do you think the current PR is only sensible if extended with these follow-up steps? |
I don't know, it really feels awkward running a process in production that you are sigkilling periodically ... but I don't know if this pattern is also used elsewhere |
For anyone reading this without much context, Antonio meant sending a signal periodically. The signal In any case, PRs to this repo are very rare, which means the resources are limited (two PRs with vulnerability fixes have been waiting for attention since July). The present PR (which actually was on my to-do list for months) is a rather low-hanging fruit to improve kubedns observability which is among priorities to help with supportability. Let's not rush and see what others have to say. |
problems on typing fast, thanks for clarifying, I meant what Damian said 😅 |
+1, Good work, Damian! |
If we can do a test and include the results in the commit message to demonstrate that sending the signal is benign:
|
38ac3af
to
b0ca20f
Compare
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale I need a bit of free time to run the benchmark suggested by Bowei. |
b0ca20f
to
ac59c53
Compare
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: DamianSawicki The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
ac59c53
to
bc910f0
Compare
I benchmarked the change with SIGUSR1 sent as frequently as every second for both UDP and TCP (with up to 50 concurrent TCP connections) and observed no negative performance impact. Added details to commit description. Please review 🙏 |
This moves the logging line for errors during killing dnsmasq and adds logging for errors during starting dnsmasq.
This adds a feature flag --logInterval for periodic triggering dnsmasq to log its statistics by sending SIGUSR1 to it. The new motivation for adding this feature comes from the recently added dnsmasq flag --max-tcp-connections and the related statistics of TCP connection utilisation being added to the log output triggered by SIGUSR1. ------ Benchmarking ------ The change has been benchmarked by running a dnsmasq instance, periodically sending SIGUSR1 signals to it, and verifying performance with dnsperf. Benchmarking was done for both, tcp and udp protocols, with a large set of queries (to guarantee that quries need to be forwarded to upstream nameservers). Despite the fact that signals were sent as frequently as every second, no negative performance impact was observed. The dnsmasq instance was run with the following parameters --port=12362 --max-tcp-connections=200 --no-negcache --max-ttl=30 --max-cache-ttl=30 --dns-forward-max=1500 --cache-size=1000 -S [IP-of-the-DNS-server-in-my-network] --- Before --- - UDP - [Status] Command line: dnsperf -d dnsperf/fromWeb -c 50 -q 100 -T 20 -p 12362 -m udp -l 300 ... [Status] Testing complete (time limit) Statistics: Queries sent: 402334 Queries completed: 401488 (99.79%) Queries lost: 846 (0.21%) Response codes: NOERROR 273885 (68.22%), SERVFAIL 6954 (1.73%), NXDOMAIN 120610 (30.04%), REFUSED 39 (0.01%) Average packet size: request 38, response 103 Run time (s): 303.836407 Queries per second: 1321.395299 Average Latency (s): 0.064198 (min 0.000043, max 4.957514) Latency StdDev (s): 0.235460 - TCP - [Status] Command line: dnsperf -d dnsperf/fromWeb -c 50 -q 100 -T 20 -p 12362 -m tcp -l 300 ... [Status] Testing complete (time limit) Statistics: Queries sent: 110594 Queries completed: 108527 (98.13%) Queries lost: 2067 (1.87%) Response codes: NOERROR 73895 (68.09%), SERVFAIL 1945 (1.79%), NXDOMAIN 32609 (30.05%), REFUSED 78 (0.07%) Average packet size: request 38, response 109 Run time (s): 303.499665 Queries per second: 357.585238 Average Latency (s): 0.180993 (min 0.000051, max 4.998068) Latency StdDev (s): 0.453856 Connection Statistics: Connection attempts: 1118 (1118 successful, 100.00%) Average Latency (s): 0.000653 (min 0.000040, max 0.021616) Latency StdDev (s): 0.002854 --- After --- - UDP - [Status] Command line: dnsperf -d dnsperf/fromWeb -c 50 -q 100 -T 20 -p 12362 -m udp -l 300 ... [Status] Testing complete (time limit) Statistics: Queries sent: 414633 Queries completed: 413764 (99.79%) Queries lost: 869 (0.21%) Response codes: NOERROR 282299 (68.23%), SERVFAIL 7150 (1.73%), NXDOMAIN 124275 (30.04%), REFUSED 40 (0.01%) Average packet size: request 38, response 103 Run time (s): 302.939248 Queries per second: 1365.831607 Average Latency (s): 0.062068 (min 0.000038, max 4.993021) Latency StdDev (s): 0.232104 - TCP - [Status] Command line: dnsperf -d dnsperf/fromWeb -c 50 -q 100 -T 20 -p 12362 -m tcp -l 300 ... [Status] Testing complete (time limit) Statistics: Queries sent: 116190 Queries completed: 114125 (98.22%) Queries lost: 2065 (1.78%) Response codes: NOERROR 77695 (68.08%), SERVFAIL 2047 (1.79%), NXDOMAIN 34316 (30.07%), REFUSED 67 (0.06%) Average packet size: request 38, response 108 Run time (s): 302.931036 Queries per second: 376.735912 Average Latency (s): 0.172552 (min 0.000053, max 4.998624) Latency StdDev (s): 0.445567 Connection Statistics: Connection attempts: 1174 (1174 successful, 100.00%) Average Latency (s): 0.000553 (min 0.000038, max 0.018632) Latency StdDev (s): 0.002394
bc910f0
to
9e83aaf
Compare
This is a (+70, -3) change, but since Antonio mentioned above that it includes several logical changes, I split it into 3 commits to further facilitate review:
|
/ok-to-test |
Hey @aojea @bowei @MrHohn! The PR is now ready for the final approval (and review perhaps, although there are no code changes compared to October: just rebased, added benchmarking results to a commit message, and spit the original commit in 3 parts), but since the PR has been inactive for almost 5 months, I'm pinging you so that it's not missed. |
This adds a feature flag
--logInterval
for periodic triggering dnsmasq to log its statistics by sendingSIGUSR1
to it.The new motivation for adding this feature comes from dnsmasq changes in
v2.90
: the new flag--max-tcp-connections
and the related statistics of TCP connection utilisation added to the log output triggered bySIGUSR1
.The change was benchmarked (see the commit description) and no negative impact was observed.