Skip to content

Send SIGUSR1 to dnsmasq periodically #644

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

DamianSawicki
Copy link
Collaborator

@DamianSawicki DamianSawicki commented Oct 6, 2024

This adds a feature flag --logInterval for periodic triggering dnsmasq to log its statistics by sending SIGUSR1 to it.

The new motivation for adding this feature comes from dnsmasq changes in v2.90: the new flag --max-tcp-connectionsand the related statistics of TCP connection utilisation added to the log output triggered by SIGUSR1.

The change was benchmarked (see the commit description) and no negative impact was observed.

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 6, 2024
@k8s-ci-robot k8s-ci-robot requested review from bowei and kl52752 October 6, 2024 10:47
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 6, 2024
@DamianSawicki
Copy link
Collaborator Author

/test all

@DamianSawicki
Copy link
Collaborator Author

The failed test fails identically at HEAD (issue #646).

@DamianSawicki DamianSawicki marked this pull request as ready for review October 6, 2024 18:33
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 6, 2024
@DamianSawicki
Copy link
Collaborator Author

/cc @aojea

@k8s-ci-robot k8s-ci-robot requested a review from aojea October 6, 2024 18:34
@aojea
Copy link
Member

aojea commented Oct 6, 2024

I think you have several changes in the same commit that makes it hard to review.

In addition, what is the motivation to make this configurable, I don't think we should allow people to use this feature if is not for debugging, and if that is the case, you can perfectly do it with a simple bash script

@DamianSawicki
Copy link
Collaborator Author

Thank you very much for a quick reaction, Antonio!

I think you have several changes in the same commit that makes it hard to review.

Yeah, I replace nanny.X() with if err := nanny.X(); err != nil { log } for X in {Kill, Start}, which I'm happy to move to a separate commit, but everything else belongs together: there is a functionality change with the corresponding test, and the new test case requires expanding the testing fixture.

I don't think we should allow people to use this feature

You've mentioned running a bash script. Do I understand correctly that users can also run such a script sending SIGUSR1 to a dnsmasq instance? If so, why don't you want to allow them to use this feature?

In addition, what is the motivation to make this configurable

We can hardcode some interval, but I thought it would be safer to have things configurable, even just for turning the feature on/off.

The general motivation is to improve dnsmasq observability and not to require users to run custom scripts (or reach out to their cloud providers) to gain some insights about their kubedns load. Does it make sense to you?

@aojea
Copy link
Member

aojea commented Oct 6, 2024

I can understand if you expose these metrics with a metrics/prometheus endpoints, that allows you to set alerts and integrate with common monitoring stacks ... but you still needs to evaluate the performance implications on sending a periodic signal to the process

@bowei
Copy link
Member

bowei commented Oct 7, 2024

Does dnsmasq handle syscall interruption gracefully? e.g. on receipt of signal, the syscall is canceled with EINTR.

@DamianSawicki
Copy link
Collaborator Author

@bowei

Does dnsmasq handle syscall interruption gracefully? e.g. on receipt of signal, the syscall is canceled with EINTR.

Yes, dnsmasq mentions specific actions on SIGUSR1, SIGUSR2, andSIGHUP in its linux man page and it is a very mature piece of software, so I cannot imagine it handling signals incorrectly. Just to be 100% sure, I've just checked the codebase, and there are many of instances of errno == EINTR and errno != EINTR throughout.

@aojea

you still needs to evaluate the performance implications on sending a periodic signal to the process

The statistics logged by dnsmasq on SIGUSR1 contain not just the current utilisation, but also the maximal utilisation since the last SIGUSR1, so I was thinking about using ~5 min intervals in practice (1 min at least, while 1 h would still be better than nothing). The function dump_cache handling SIGUSR1 is cheap (unless dnsmasq runs in debug mode when it does a full cache dump). With such infrequent and minimal additional load and correct handling of EINTR, performance implications should be negligible.

I can understand if you expose these metrics with a metrics/prometheus endpoints, that allows you to set alerts and integrate with common monitoring stacks

For what you describe, the first step is still the current PR, which makes dnsmasq print statistics to logs, which we can later parse and expose with metrics/prometheus endpoints or set up log-based metrics. Do you think the current PR is only sensible if extended with these follow-up steps?

@aojea
Copy link
Member

aojea commented Oct 7, 2024

I don't know, it really feels awkward running a process in production that you are sigkilling periodically ... but I don't know if this pattern is also used elsewhere

@DamianSawicki
Copy link
Collaborator Author

DamianSawicki commented Oct 7, 2024

that you are sigkilling periodically

For anyone reading this without much context, Antonio meant sending a signal periodically. The signal SUGUSR1 does not kill the process, it's just a way of communicating with it. Dnsmasq is a rather old piece of software, and that's one way of communication that it supports, but I agree that it may feel awkward today.

In any case, PRs to this repo are very rare, which means the resources are limited (two PRs with vulnerability fixes have been waiting for attention since July). The present PR (which actually was on my to-do list for months) is a rather low-hanging fruit to improve kubedns observability which is among priorities to help with supportability. Let's not rush and see what others have to say.

@aojea
Copy link
Member

aojea commented Oct 7, 2024

problems on typing fast, thanks for clarifying, I meant what Damian said 😅

@pwdtr
Copy link

pwdtr commented Oct 7, 2024

+1, Good work, Damian!

@bowei
Copy link
Member

bowei commented Oct 25, 2024

If we can do a test and include the results in the commit message to demonstrate that sending the signal is benign:

  • Send constant stream of dnsperf requests to dnsmasq
  • Enable the periodic interval of SIGUSR1

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 23, 2025
@DamianSawicki
Copy link
Collaborator Author

/remove-lifecycle stale

I need a bit of free time to run the benchmark suggested by Bowei.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 24, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: DamianSawicki

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 27, 2025
@DamianSawicki DamianSawicki marked this pull request as draft March 27, 2025 11:37
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 27, 2025
@DamianSawicki
Copy link
Collaborator Author

If we can do a test and include the results in the commit message to demonstrate that sending the signal is benign:

  • Send constant stream of dnsperf requests to dnsmasq
  • Enable the periodic interval of SIGUSR1

I benchmarked the change with SIGUSR1 sent as frequently as every second for both UDP and TCP (with up to 50 concurrent TCP connections) and observed no negative performance impact. Added details to commit description.

Please review 🙏

@DamianSawicki DamianSawicki marked this pull request as ready for review March 27, 2025 12:49
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 27, 2025
@k8s-ci-robot k8s-ci-robot requested a review from MrHohn March 27, 2025 12:50
This moves the logging line for errors during killing dnsmasq and adds
logging for errors during starting dnsmasq.
This adds a feature flag --logInterval for periodic triggering dnsmasq
to log its statistics by sending SIGUSR1 to it.

The new motivation for adding this feature comes from the recently
added dnsmasq flag --max-tcp-connections and the related statistics of
TCP connection utilisation being added to the log output triggered
by SIGUSR1.

------ Benchmarking ------

The change has been benchmarked by running a dnsmasq instance,
periodically sending SIGUSR1 signals to it, and verifying performance
with dnsperf. Benchmarking was done for both, tcp and udp protocols,
with a large set of queries (to guarantee that quries need to be
forwarded to upstream nameservers). Despite the fact that signals were
sent as frequently as every second, no negative performance impact was
observed.

The dnsmasq instance was run with the following parameters
 --port=12362
 --max-tcp-connections=200
 --no-negcache
 --max-ttl=30
 --max-cache-ttl=30
 --dns-forward-max=1500
 --cache-size=1000
 -S [IP-of-the-DNS-server-in-my-network]

--- Before ---

- UDP -

[Status] Command line: dnsperf -d dnsperf/fromWeb -c 50 -q 100 -T 20 -p 12362 -m udp -l 300
...
[Status] Testing complete (time limit)

Statistics:

  Queries sent:         402334
  Queries completed:    401488 (99.79%)
  Queries lost:         846 (0.21%)

  Response codes:       NOERROR 273885 (68.22%), SERVFAIL 6954 (1.73%), NXDOMAIN 120610 (30.04%), REFUSED 39 (0.01%)
  Average packet size:  request 38, response 103
  Run time (s):         303.836407
  Queries per second:   1321.395299

  Average Latency (s):  0.064198 (min 0.000043, max 4.957514)
  Latency StdDev (s):   0.235460

- TCP -

[Status] Command line: dnsperf -d dnsperf/fromWeb -c 50 -q 100 -T 20 -p 12362 -m tcp -l 300
...
[Status] Testing complete (time limit)

Statistics:

  Queries sent:         110594
  Queries completed:    108527 (98.13%)
  Queries lost:         2067 (1.87%)

  Response codes:       NOERROR 73895 (68.09%), SERVFAIL 1945 (1.79%), NXDOMAIN 32609 (30.05%), REFUSED 78 (0.07%)
  Average packet size:  request 38, response 109
  Run time (s):         303.499665
  Queries per second:   357.585238

  Average Latency (s):  0.180993 (min 0.000051, max 4.998068)
  Latency StdDev (s):   0.453856

Connection Statistics:

  Connection attempts:  1118 (1118 successful, 100.00%)

  Average Latency (s):  0.000653 (min 0.000040, max 0.021616)
  Latency StdDev (s):   0.002854

--- After ---

- UDP -

[Status] Command line: dnsperf -d dnsperf/fromWeb -c 50 -q 100 -T 20 -p 12362 -m udp -l 300
...
[Status] Testing complete (time limit)

Statistics:

  Queries sent:         414633
  Queries completed:    413764 (99.79%)
  Queries lost:         869 (0.21%)

  Response codes:       NOERROR 282299 (68.23%), SERVFAIL 7150 (1.73%), NXDOMAIN 124275 (30.04%), REFUSED 40 (0.01%)
  Average packet size:  request 38, response 103
  Run time (s):         302.939248
  Queries per second:   1365.831607

  Average Latency (s):  0.062068 (min 0.000038, max 4.993021)
  Latency StdDev (s):   0.232104

- TCP -

[Status] Command line: dnsperf -d dnsperf/fromWeb -c 50 -q 100 -T 20 -p 12362 -m tcp -l 300
...
[Status] Testing complete (time limit)

Statistics:

  Queries sent:         116190
  Queries completed:    114125 (98.22%)
  Queries lost:         2065 (1.78%)

  Response codes:       NOERROR 77695 (68.08%), SERVFAIL 2047 (1.79%), NXDOMAIN 34316 (30.07%), REFUSED 67 (0.06%)
  Average packet size:  request 38, response 108
  Run time (s):         302.931036
  Queries per second:   376.735912

  Average Latency (s):  0.172552 (min 0.000053, max 4.998624)
  Latency StdDev (s):   0.445567

Connection Statistics:

  Connection attempts:  1174 (1174 successful, 100.00%)

  Average Latency (s):  0.000553 (min 0.000038, max 0.018632)
  Latency StdDev (s):   0.002394
@DamianSawicki
Copy link
Collaborator Author

This is a (+70, -3) change, but since Antonio mentioned above that it includes several logical changes, I split it into 3 commits to further facilitate review:

@DamianSawicki
Copy link
Collaborator Author

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Mar 27, 2025
@DamianSawicki
Copy link
Collaborator Author

Hey @aojea @bowei @MrHohn! The PR is now ready for the final approval (and review perhaps, although there are no code changes compared to October: just rebased, added benchmarking results to a commit message, and spit the original commit in 3 parts), but since the PR has been inactive for almost 5 months, I'm pinging you so that it's not missed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants