🐛 BUG: Nebula cannot obtain the correct DNS server address from the system #909

Open
aa51513 opened this issue Jun 21, 2023 · 9 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.

Comments

@aa51513

aa51513 commented Jun 21, 2023

What version of nebula are you using?

1.7.2

What operating system are you using?

Linux ( Arm64 )

Describe the Bug

When starting nebula, an error is reported:
ERRO[0000] DNS resolution failed for static_map host error="lookup mynebula.server.com on [::1]:53: read udp [::1]:39679->[::1]:53: read: connection refused" hostname=mynebula.server.com network=ip4

It looks like Nebula cannot get the correct DNS server address from the system, yet the dig command works normally:

~ $ dig www.google.com

; <<>> DiG 9.16.41 <<>> www.google.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2327
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.google.com.			IN	A

;; ANSWER SECTION:
www.google.com.		128	IN	A	104.244.46.52

;; Query time: 30 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Wed Jun 21 13:19:00 CST 2023
;; MSG SIZE  rcvd: 48

For reference, here are the contents of /etc/resolv.conf on my server:

nameserver 8.8.8.8
nameserver 8.8.4.4

I am confused about why Nebula uses [::1]:53 as the DNS server address regardless of the system configuration.

Please consider adding an optional DNS server address setting to the configuration file.

Logs from affected hosts

ERRO[0000] DNS resolution failed for static_map host error="lookup mynebula.server.com on [::1]:53: read udp [::1]:39679->[::1]:53: read: connection refused" hostname=mynebula.server.com network=ip4

Config files from affected hosts

pki:
  ca: /etc/nebula/ca.crt
  cert: /etc/nebula/nebula.crt
  key: /etc/nebula/nebula.key
static_host_map:
  "10.10.10.1": ["mynebula.server.com:45445"]
lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - "10.10.10.1"
listen:
  host: "::"
  port: 45445
punchy:
  punch: true
relay:
  am_relay: true
  use_relays: true
tun:
  disabled: true
  dev: nebula
  drop_local_broadcast: false
  drop_multicast: false
  tx_queue: 500
  mtu: 1300
  routes:
  unsafe_routes:
logging:
  level: warning
  format: text
firewall:
  outbound_action: drop
  inbound_action: drop
  conntrack:
    tcp_timeout: 12m
    udp_timeout: 3m
    default_timeout: 10m
  outbound:
    - port: any
      proto: any
      host: any
  inbound:
    - port: any
      proto: icmp
      host: any
@johnmaguire
Collaborator

Hi @aa51513 -

Allowing configuration of a DNS resolver in the config file sounds like a good idea to me. That being said, I'm unsure why the settings in /etc/resolv.conf would be ignored. From reading Go docs I see this:

The method for resolving domain names, whether indirectly with functions like Dial or directly with functions like LookupHost and LookupAddr, varies by operating system.

On Unix systems, the resolver has two options for resolving names. It can use a pure Go resolver that sends DNS requests directly to the servers listed in /etc/resolv.conf, or it can use a cgo-based resolver that calls C library routines such as getaddrinfo and getnameinfo.

By default the pure Go resolver is used, because a blocked DNS request consumes only a goroutine, while a blocked C call consumes an operating system thread. When cgo is available, the cgo-based resolver is used instead under a variety of conditions: on systems that do not let programs make direct DNS requests (OS X), when the LOCALDOMAIN environment variable is present (even if empty), when the RES_OPTIONS or HOSTALIASES environment variable is non-empty, when the ASR_CONFIG environment variable is non-empty (OpenBSD only), when /etc/resolv.conf or /etc/nsswitch.conf specify the use of features that the Go resolver does not implement, and when the name being looked up ends in .local or is an mDNS name.

However, we disable CGO for Nebula builds so I suspect that only the pure Go resolver is in use. If that's the case, and the comment above is correct, I am surprised to hear that your /etc/resolv.conf settings are not being respected. Are you using a .local or mDNS name?

@johnmaguire johnmaguire added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Jun 25, 2023
@johnmaguire
Collaborator

Hi @aa51513, are you able to provide the requested information? Thanks!

@aa51513
Author

aa51513 commented Jul 18, 2023

ERRO[0000] DNS resolution failed for static_map host error="lookup mynebula.server.com on [::1]:53: read udp [::1]:39679->[::1]:53: read: connection refused" hostname=mynebula.server.com network=ip4

I'm sorry I didn't reply sooner; I was dealing with some personal matters.
When the above issue occurred, I was using a normal ".com" domain name, neither a .local nor an mDNS name.
I was able to add CNAME, A, and AAAA records on the domain management page.
I also accessed my domain from my mobile phone over 4G and the webpage opened normally, so the problem should not be with the domain name itself.

@johnmaguire johnmaguire removed the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Jul 19, 2023
@johnmaguire johnmaguire added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jul 26, 2023
@maggie44

maggie44 commented Nov 12, 2023

I am also having a lot of DNS issues. On Linux, I sometimes get a long delay before the connection initiates. After around 30 seconds, I will get an error:

ERRO[13680] DNS resolution failed for static_map host     error="lookup url.xyz: i/o timeout" hostname=url.xyz network=ip4

Sometimes it will then just sit there disconnected, although more often than not there will be another message saying the DNS results changed for host list, and it will go and connect.

On other occasions (also Linux) it connects OK, but then the following error messages loop at regular intervals:

ERRO[13290] DNS resolution failed for static_map host     error="lookup url: i/o timeout" hostname= url.xyz network=ip4
INFO[13290] DNS results changed for host list             newSet="map[]" origSet="&map[x.x.x.x:10102:{}]"
INFO[13320] DNS results changed for host list             newSet="map[x.x.x.x:10102:{}]" origSet="&map[]"
ERRO[13680] DNS resolution failed for static_map host     error="lookup url.xyz: i/o timeout" hostname= url.xyz network=ip4
INFO[13680] DNS results changed for host list             newSet="map[]" origSet="&map[x.x.x.x:10102:{}]"
INFO[13710] DNS results changed for host list             newSet="map[x.x.x.x:10102:{}]" origSet="&map[]"
ERRO[14070] DNS resolution failed for static_map host     error="lookup url.xyz: i/o timeout" hostname= url.xyz network=ip4
INFO[14070] DNS results changed for host list             newSet="map[]" origSet="&map[x.x.x.x:10102:{}]"
INFO[14100] DNS results changed for host list             newSet="map[x.x.x.x:10102:{}]" origSet="&map[]"
ERRO[14460] DNS resolution failed for static_map host     error="lookup url.xyz: i/o timeout" hostname= url.xyz network=ip4

On Mac I haven't been able to connect at all, but my Mac is such a mix of different interfaces and experiments it's been really difficult to debug. If I run Nebula in a Docker container on the same system though, it performs the same as above.

The IP and URL that I removed from above are all standard ipv4 (although there is an ipv6 option on there, the IP in Nebula logs is the ipv4 one) and a subdomain. Domain has been active for months so has propagated fully.

Being able to specify DNS servers would be a good step.

@maggie44

maggie44 commented Nov 22, 2023

Still trying to explore this. I can replicate it by changing the DNS entries in resolv.conf on my Mac and see it when using a slow connection. Connecting to a VPN changes resolv.conf and also helps replicate this. After the DNS change on occasion it reports:

ERRO[0060] DNS resolution failed for static_map host     error="lookup 123.xyz: no such host" hostname=123.xyz network=ip4

Then eventually:

INFO[0090] DNS results changed for host list             newSet="map[123.23.23.23:10102:{}]" origSet="&map[]"

and then after another 30 seconds it connects.

I see there is a retry cadence:

https://github.com/slackhq/nebula/pull/879/files

I haven't delved into the criteria for "DNS results changed for host list", but it might help if the retry interval were shorter while there has not yet been a successful DNS lookup, and then reverted to the 30s cadence for subsequent lookups. It might also help for a connect to be triggered directly after "DNS results changed for host list" if there is not yet a live connection. At the moment, it looks like Nebula is very slow at connecting to lighthouses, but I think it is merely the timing of the retries.

I also wonder if the timeout of 200ms is too low for slow connections. I haven't been able to see any improvement by increasing it, but I'm also not sure if there is much benefit to it being that low when slower connections may be using Nebula.

@Frederic-Zhou

This comment was marked as off-topic.

@johnmaguire
Collaborator

johnmaguire commented Aug 29, 2024

@maggie44 The error you are seeing is different from the error in the original ticket. Have you tried increasing static_map.lookup_timeout? This is the value associated with the "i/o timeout" message from a slow DNS server. If that doesn't work, let's move the "i/o timeout" issue to a separate ticket.

@johnmaguire
Collaborator

@maggie44 FYI, I've posted a PR here that may improve time-to-recovery in the situation you described. I would like to improve this further in the future (mentioned in the PR): #1260

@haras-unicorn

I have a peculiar case where it only fails on system startup.
My systemd service has After=network-online.target.
I have set lookup_timeout to 10s, and it never fails when I manually restart the service after system startup.
I can provide my systemd unit file and Nebula config if needed.
