mqtt.local went down and didn't come back up cleanly after reboot #1210

Open
amcewen opened this issue Aug 8, 2019 · 86 comments

Comments

@amcewen
Member

amcewen commented Aug 8, 2019

Since yesterday the Liverbird hasn't been showing our energy usage.

Doing a bit of poking into it, I found that mqtt.local was offline. @ajlennon power-cycled it, which has brought it back up, but it's failing to connect to its influxdb instance.

@ajlennon
Contributor

ajlennon commented Aug 8, 2019

The SSH server is also not accepting connections.

@ajlennon
Contributor

ajlennon commented Aug 8, 2019

My suspicion is we finally filled up the uSD card with data

@MatthewCroughan
Member

Don't wipe anything before backing up the grafana dashboard json!

@ajlennon
Contributor

ajlennon commented Aug 8, 2019

Do you hear the bell tolling @MatthewCroughan ?

@ajlennon
Contributor

ajlennon commented Aug 8, 2019

All seems to be running ok

58128060-3FD1-4DCC-8587-CE13C3BAC865

@ajlennon
Contributor

ajlennon commented Aug 8, 2019

So, I think this is working ok. Dunno why we were having a problem. Might just have been the m-DNS

We DO urgently need to look at dealing with the contents of the InfluxDB as we're on about 75% disk usage.

So changing the issue name for now. Change it back if it falls over again

Not sure whether we

  • put in a bigger drive
  • archive old data
  • other ?

@ajlennon ajlennon changed the title from "Local MQTT server is poorly" to "Need to urgently create more space for MQTT server database" Aug 8, 2019
@MatthewCroughan
Member

Will set this up tonight on the NAS Alex. Perhaps we should really be running a large grafana server here.

@amcewen
Member Author

amcewen commented Aug 8, 2019

I'd had the mqtt.local Node RED open in a tab all afternoon, and just spotted there'd been a bunch of the "failed to connect to host" errors and a few connection refused ones too. Plus the power usage messages aren't being generated and it had crashed, so I don't think it's just an "it might fill the disk" issue

@johnmckerrell
Member

johnmckerrell commented Aug 8, 2019 via email

@MatthewCroughan
Member

I got sidetracked tonight, I'll be providing NFS storage for this tomorrow, you've also mentioned needing storage for other things you're doing.

@ajlennon
Contributor

ajlennon commented Aug 9, 2019

If it happens again @amcewen can you check if the IP is still working?

@amcewen
Member Author

amcewen commented Aug 9, 2019

What do you mean by "check if the IP is still working"? The Pi will still have been on the network when it was reporting the errors, as it'll have had a websocket connection to my browser for the debug output.

Happy to check things if I spot the problem (hasn't happened so far today), but not sure what I'm checking ;-)

@ajlennon
Contributor

ajlennon commented Aug 9, 2019

Checking the IP address is responding rather than checking mqtt.local is responding

@MatthewCroughan
Member

@ajlennon in 2 hours I'm free, so I'll be setting it up then. What is it exactly that you need?

@MatthewCroughan
Member

NFS, or Influx? Because I can probably make a 1TB influx container that's available network-wide if that suits the architecture.

@ajlennon
Contributor

ajlennon commented Aug 9, 2019

Dunno really. We need a policy on archiving data in the database...

@MatthewCroughan
Member

MatthewCroughan commented Aug 9, 2019

Well it seems to me that it would be as simple as running a single process and making it available on the network, and giving influx access to storage, wherever that may be. So this could be done one of two ways:

Nfs + Influx, where the Pi's all run influx servers and clients, and all talk to a single NFS
or
Influx + Direct I/O (Dedicated hardware for influxd, and also storage)

I don't know how realistic of a concern timestamping would be if we were to be writing to the database using a FUSE filesystem VS actually writing to Influx via HTTP as intended. I feel they're identical outcomes, but that writing to Influx via the network is more intended than via a networked filesystem, I'm sure the timestamps are preserved either way.

But what if the NFS were to go down? The failure mode is probably catastrophic for Influx, since it wouldn't know how to handle the filesystem not responding or going missing, or its response would be generic and unhelpful in debugging, whereas the http scenario probably has a sane response and is a documented scenario.

I think we should run Influx in a container on a machine with large storage, rather than providing arbitrary NFS to the Pi's for now, to make the problem easy to tackle, and provide some serious reliability. I'll take responsibility for maintaining that storage and server, although I can hand out ssh access to the container to anyone at DoES.

If we want to mesh this setup, we could have a bunch of Pi's running replicated instances and a few load balancer Pi's, along with some sort of notification/status page that lets us see the status of each server. I don't like the idea of this data going missing, so we do need to set up fail-over.

We should be able to handle the load balancing and stuff like that with Balena easily shouldn't we @ajlennon ?

@ajlennon
Contributor

ajlennon commented Aug 9, 2019

Firstly we want to ask @goatchurchprime if we want to retain all the data or bucket it up for archival somehow
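
One way to picture "bucketing it up" is to average the raw readings into fixed time windows before archiving. This is only a sketch in plain Python: the `downsample` helper, the one-hour default window and the (timestamp, value) layout are assumptions for illustration, not how the InfluxDB on mqtt.local is actually organised.

```python
from collections import defaultdict
from statistics import mean


def downsample(points, bucket_seconds=3600):
    """Average (unix_timestamp, value) points into fixed-width time buckets.

    Returns a list of (bucket_start_timestamp, mean_value) pairs sorted by
    bucket start. Hypothetical helper, not part of any InfluxDB client.
    """
    buckets = defaultdict(list)
    for timestamp, value in points:
        # Round the timestamp down to the start of its bucket.
        bucket_start = int(timestamp) - int(timestamp) % bucket_seconds
        buckets[bucket_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]
```

Hourly means keep the overall shape of an energy-usage graph while storing far fewer points; whether that loses detail anyone needs is the open question.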

@MatthewCroughan
Member

MatthewCroughan commented Aug 9, 2019

The server in question is already mirrored between two 4TB drives. Although we only have 4TB in that server at the moment which may need to increase, and no offsite backup. I have a bunch of 1TB drives going spare if we want to set up a decentralized 1TB setup between multiple nodes with Pi's. @ajlennon

How much data is currently stored that we're having trouble with? I'm going to guess 32GB? What's the rate at which we were accumulating data? 16GB/month?

In fact, the data rate should be a metric in influx itself if we could prevent that from being a feedback loop, so we can see how fast we're ballooning up our storage.
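
The growth rate can be estimated from two disk-usage samples taken some days apart. A rough sketch, assuming the database lives on the filesystem being sampled; the function names are made up for illustration, and any numbers fed in would need to be real measurements rather than the guesses above.

```python
import shutil


def sample_used_bytes(path="/"):
    """Return the bytes currently used on the filesystem containing `path`."""
    return shutil.disk_usage(path).used


def growth_per_day(used_then, used_now, days_between):
    """Average growth in bytes per day between two samples."""
    return (used_now - used_then) / days_between


def days_until_full(total, used_now, bytes_per_day):
    """Days left at the current growth rate, or None if usage is not growing."""
    if bytes_per_day <= 0:
        return None
    return (total - used_now) / bytes_per_day
```

Logging `sample_used_bytes()` on a schedule would provide the samples; `days_until_full(total, used_now, growth_per_day(used_then, used_now, 7))` then gives a rough estimate from two readings a week apart.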

@amcewen
Member Author

amcewen commented Aug 9, 2019

But if mqtt.local is responding then presumably the IP address is also responding, no?

@ajlennon
Contributor

ajlennon commented Aug 9, 2019

In any sane world that would of course be true but no. I’ve chatted to @goatchurchprime about this as sometimes the m-DNS mapping somehow fails but the IP itself is reachable. So I’m interested to know if this might be happening here
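
The two checks can be separated in a few lines of Python. This is a sketch under assumptions: the helper names are invented here, and the port to probe (for example 1883, the standard MQTT port) depends on what is actually listening on the Pi.

```python
import socket


def resolves(hostname):
    """True if the system resolver (mDNS or DNS) returns any address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def port_open(ip, port, timeout=3.0):
    """True if a TCP connection to ip:port succeeds within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def diagnose(name_resolves, ip_reachable):
    """Turn the two observations into a one-line verdict."""
    if name_resolves and ip_reachable:
        return "ok"
    if ip_reachable:
        return "name resolution problem (host is up)"
    if name_resolves:
        return "host or service down (name still resolves)"
    return "unreachable by name and by IP"
```

Calling `diagnose(resolves("mqtt.local"), port_open(ip, 1883))` with the Pi's last known address as `ip` would distinguish "mDNS mapping failed" from "the box is actually down".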

@MatthewCroughan
Member

@ajlennon @amcewen I think that's because of the way the router is working. There's some sort of cache, I'm not familiar with why this happens, and it'll be a setting somewhere in the networking hardware we have.

image

The gateway remembers the MAC address of the hardware unit, in this case the Pi, and responds on both hangspot.local (its previously recorded hostname) and also on mqtt.local

image

This is a caching feature, somewhere.

@amcewen
Member Author

amcewen commented Aug 9, 2019

yes, the mDNS mapping doesn't always work, so there are times when the mDNS doesn't work but the IP address does. However, I've not encountered the reverse where the mDNS works but the IP address doesn't. (Mostly because that's impossible :-D)

When I see the errors, they're transmitted over the network to my browser, so the IP address must be responding, no? They show up in the debug window of Node RED. I'm not doing any mDNS lookups.

@ajlennon
Contributor

ajlennon commented Aug 9, 2019

Maybe I’m misunderstanding. Is your nodered flow talking to the IP address or MQTT.local?

@amcewen
Member Author

amcewen commented Aug 9, 2019

It's not my nodered flow, it's whatever you or @goatchurchprime set up.

Have had a look and it seems to be talking to influxdb. That doesn't resolve at present on my machine (either with ping influxdb or ping influxdb.local), so I don't know which IP address I'd need to try

@johnmckerrell
Member

And.. on further discussion I've removed those IP addresses from the DHCP configuration, but we can say that those two IP addresses are allocated to this purpose, so should be manually assigned on the box itself (doesn't matter to me if you don't use both of them but I'll record them as being for this purpose on the network documentation).

@MatthewCroughan
Member

MatthewCroughan commented Sep 27, 2019

Things that previously used .localdomain are essentially not able to be pinged by their hostname on the network it seems, as I can no longer reach them. This includes Alex's Octoprint instance. They used to coexist, which is quite strange in and of itself and shouldn't be possible, but now they do not. Devices that used localdomain are now unresponsive on anything but their ip.

This device was previously accessible at octopi.localdomain but is now only accessible at its IP at 10.0.39.51

# Generated by resolvconf
domain local
nameserver 10.0.0.1
nameserver 1.1.1.1
nameserver 1.0.0.1

My resolv.conf now shows domain local rather than domain localdomain, which is the default on a lot of Linux/FreeBSD systems. It may be that @ajlennon's Pi, whether set up with Balena or otherwise, is permanently configured to use localdomain, which is something that the network is no longer respecting.

I have no idea how the router could have anything to do with this other than DROPPING the packets that are related to .localdomain. I've reconfigured a bunch of my devices and they mostly all changed to local on their own.

@MatthewCroughan
Member

MatthewCroughan commented Sep 27, 2019

image

If we've put local in the domain field of whatever the equivalent of this setting is in our router software, we have definitely made a big mistake, as PFSense outlines in its general settings page.

Do not use 'local' as a domain name. It will cause local hosts running mDNS (avahi, bonjour, etc.) to be unable to resolve local hosts not running mDNS.

This is definitely the problem I'm observing, as I've had to install avahi-daemon on a bunch of machines that I did not previously.

Now, if this is true, we are in a situation where every device must install something equivalent to avahi-daemon, despite the fact that the DNS server on the router can resolve these names just fine without clients needing their own instance of avahi-daemon.

Somewhere in the base networking protocols, without avahi or mDNS, hostnames are transferred to the router. If our domain is set to .local rather than something else like .localdomain or .lan, it means we can't resolve hosts that aren't running mDNS.

Say we set the domain to .local and have a device with a hostname of foo that isn't running mDNS. It has obtained a DHCP lease from the router, so the router now knows its hostname, as configured on the device, via some part of DHCP. If the router receives a lookup for foo.local then it will return the IP address of foo.local successfully.

However, if you try to look up foo.local from a separate device that is running mDNS via avahi-daemon, the lookup will fail because the mDNS daemon is preferred. It will not be able to return foo.local's IP address, because foo is not running mDNS.

Not using .local as the router's domain avoids this scenario. It lets all devices look up each other's hostnames and supports devices that aren't running mDNS daemons, rather than not supporting them at all, as would be the case if we chose to enforce .local as the router's domain, which PFSense, OpenWRT and more warn against.
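
The foo.local scenario can be modelled in a few lines. This is a deliberately simplified sketch, assuming the client resolves hosts in the common `mdns_minimal [NOTFOUND=return] dns` order; the function names and the 10.0.0.50 address are invented for illustration and nothing here touches the network.

```python
def mdns_minimal(name, mdns_hosts):
    """Simplified model: only names ending in .local are handled by mDNS."""
    if not name.endswith(".local"):
        return "UNAVAIL", None      # not an mDNS name, let the next source try
    if name in mdns_hosts:
        return "SUCCESS", mdns_hosts[name]
    return "NOTFOUND", None         # nobody answered the multicast query


def router_dns(name, dhcp_leases):
    """Model of the router answering from hostnames learned via DHCP."""
    if name in dhcp_leases:
        return "SUCCESS", dhcp_leases[name]
    return "NOTFOUND", None


def lookup(name, mdns_hosts, dhcp_leases):
    """Resolve in the order 'mdns_minimal [NOTFOUND=return] dns'."""
    status, address = mdns_minimal(name, mdns_hosts)
    if status == "SUCCESS":
        return address
    if status == "NOTFOUND":
        return None                 # [NOTFOUND=return]: the router is never asked
    return router_dns(name, dhcp_leases)[1]
```

With the router domain set to local and foo not running mDNS, `lookup("foo.local", {}, {"foo.local": "10.0.0.50"})` gives up before the router is consulted, whereas `lookup("foo.localdomain", {}, {"foo.localdomain": "10.0.0.50"})` falls through to the router and succeeds.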

@MatthewCroughan
Member

Now, what has occurred is that you cannot ping hostnames unless you have an mDNS daemon installed on your system, and vice-versa. This is not the way it should be done and explains why all the devices that had .localdomain are no longer visible to even the router itself. All we have done is invalidate the utility of the router's DNS, as it can no longer report back a lookup to a hostname at all.

When you run dig and specify .local, it makes a point of warning you that .local is reserved for Multicast DNS. mDNS is not supposed to be implemented or enforced by the router's domain.

; <<>> DiG 9.14.5 <<>> matt-octoprint.local @10.0.0.1
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 59060
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;matt-octoprint.local.          IN      A

;; Query time: 1 msec
;; SERVER: 10.0.0.1#53(10.0.0.1)
;; WHEN: Fri Sep 27 03:54:24 BST 2019
;; MSG SIZE  rcvd: 49

@goatchurchprime This is why .localdomain is a thing, or exists at all. So that devices without mDNS can still look up hostnames without mDNS.

@skos-ninja
Member

I can have a look at this later today, but UniFi supports rebroadcasting mDNS responses so that they still work in this case.

@johnmckerrell
Member

"you cannot ping [local] hostnames unless you have an mDNS daemon" Given that's the whole point of mDNS I don't think there's anything particularly non-standard going on here. Having the router's internal hidden DNS proxy also happen to return results for things random people have told it on DHCP sounds a bit more non-standard but what do I know?

I've turned back on the DHCP results showing in DNS and set the network's domain to localdomain, I also tried does.localdomain but that didn't seem to work either. I'll leave it as is for now, maybe it'll only work for things when they renew their DHCP leases.

@MatthewCroughan
Member

MatthewCroughan commented Sep 27, 2019

@johnmckerrell What I meant to say is that you can't ping DNS (hostnames over DHCP leasing, was a thing before mDNS existed) if you use mDNS on your system. Which is a problem, since whatever has been changed means you can't:

ping mqtt
ping mqtt.localdomain
ping mqtt.local

UNLESS you have an mDNS daemon on your computer. And that will only respect .local, since it's mDNS. And if you are not running an mDNS daemon, the network doesn't respond if the router domain is .local, because that's reserved for mDNS. When using a .local router domain, mDNS is all that can be used, which means it will no longer respect, look up or return non-mDNS hostnames, for reasons I'm not 100% aware of, but something about conflicts.

mDNS is not the only way of getting a hostname, it's fairly modern and it just makes things easier when it's added onto a network. By using .local for the router domain it makes it impossible to use regular DNS for hostnames. This is why localdomain exists as a convention.

A person was trying to use Alex's printer earlier but couldn't because octopi.localdomain is no longer accessible, because he's running an mDNS daemon, and mDNS doesn't see .localdomain, but if the router domain was anything else, be it .localdomain or .lan, it would return the address of that machine regardless since the mDNS would failover to the gateway's DNS resolver, which of course knows about it because of its DHCP lease.

Devices that do not have an mDNS daemon cannot advertise their hostnames on the network in this configuration.

No mDNS daemon on your system = can't see anything
mDNS = Can only see .local

octopi.localdomain works even if the device is not running avahi, because hostnames are transferred via DHCP without any mDNS functionality, which is great. This functionality is made impossible when the router domain is .local as PFSense and OpenWRT outline.
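
The three pings above can be run as one check from any client. A minimal sketch using only the Python standard library; `candidate_names` and `resolve_all` are illustrative names, and `socket.gethostbyname` only reports IPv4 results.

```python
import socket


def candidate_names(host, suffixes=("", ".localdomain", ".local")):
    """Build the bare, .localdomain and .local forms of a hostname."""
    return [host + suffix for suffix in suffixes]


def resolve_all(names, resolver=socket.gethostbyname):
    """Map each name to its resolved address, or None if lookup fails."""
    results = {}
    for name in names:
        try:
            results[name] = resolver(name)
        except OSError:
            # socket.gaierror and socket.herror are both OSError subclasses.
            results[name] = None
    return results
```

Running `resolve_all(candidate_names("mqtt"))` on a machine with an mDNS daemon and on one without would show directly which forms each kind of client can see.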

@johnmckerrell
Member

@MatthewCroughan given you were talking about ARP records yesterday it seems like this is new knowledge to you too. I have already made the changes to mostly re-enable what we had previously just with a network domain of localdomain rather than the conflicting local and did so before your recent comments. Can you maybe now wait until you've been able to test before trying to teach me about this?

@MatthewCroughan
Member

MatthewCroughan commented Sep 27, 2019

@johnmckerrell I'm not trying to teach you about anything. I've just been discussing it all night with a friend online and am coming to realise why localdomain is a thing. I'll curb the enthusiasm, sorry :)

The ARP record comment yesterday was made before reading into any of this, or looking at my own PFSense and reading their documentation on how mDNS, caching options and more work. The Ubiquiti firmware looks like it has way more niche and non-standard features though, so there's probably a million things going on that I have no idea about.

@amcewen
Member Author

amcewen commented Oct 1, 2019

@johnmckerrell, you said:

WiFi b8:27:eb:cb:96:8c - was configured to be 10.0.100.1 in the DHCP - now configured to 10.0.100.2
Wired b8:27:eb:9e:c3:d9 - wasn't configured, now configured to be 10.0.100.1

Does that mean that mqtt.local should be resolving to one (or both) of those IP addresses? At present neither of those IP addresses is responding to pings, and it seems to be resolving to 10.0.30.194 at the moment?!?

$ ping mqtt.local
PING mqtt.local (10.0.30.194) 56(84) bytes of data.
64 bytes from 10.0.30.194 (10.0.30.194): icmp_seq=1 ttl=64 time=3.86 ms
64 bytes from 10.0.30.194 (10.0.30.194): icmp_seq=2 ttl=64 time=6.38 ms
64 bytes from 10.0.30.194 (10.0.30.194): icmp_seq=3 ttl=64 time=2.50 ms
64 bytes from 10.0.30.194 (10.0.30.194): icmp_seq=4 ttl=64 time=5.46 ms

@johnmckerrell
Member

@amcewen I also said "And.. on further discussion I've removed those IP addresses from the DHCP configuration"

It seemed like the box was statically configured and to help with portability elsewhere we thought that would be best, but it seems like it might not be the case.

@ajlennon
Contributor

ajlennon commented Oct 1, 2019

It seemed like the box was statically configured and to help with portability elsewhere we thought that would be best, but it seems like it might not be the case.

Not by me. @goatchurchprime? @MatthewCroughan ?

@MatthewCroughan
Member

@ajlennon @johnmckerrell Are we saying that there's a box somewhere with an mDNS Daemon mqtt.local that is statically configured, that is not @ajlennon's balena pi that we're otherwise not aware of?

@johnmckerrell
Member

No, I don't think so.

WiFi b8:27:eb:cb:96:8c - was configured to be 10.0.100.1 in the DHCP - now configured to 10.0.100.2
Wired b8:27:eb:9e:c3:d9 - wasn't configured, now configured to be 10.0.100.1

I think when we looked, the wired interface had 10.0.100.1, and the WiFi one was trying to get it and having issues so we figured that the wired one was manually configured. It seems like that might not be the case?

@ajlennon
Contributor

ajlennon commented Oct 1, 2019

Historically mqtt.local has changed its IP address - I think you found this @amcewen

My understanding is that it's changed its IP address again.

My belief is that it is picking up an IP address from the DHCP server on the network unless somebody else has been in there and changed things around.

I can double check this tomorrow.

@johnmckerrell
Member

johnmckerrell commented Oct 1, 2019 via email

@MatthewCroughan
Member

@johnmckerrell this caching issue is happening again.
image

@MatthewCroughan
Member

image

Despite the fact that the Pi is running avahi-daemon, it is not returning .local

I believe this is because whatever this feature is, it prevents mDNS discovery when a .localdomain addr is cached. I really hope this can get solved.

Whatever the case, not providing .local or .localdomain when interacting and only providing the hostname seems to work. ping ender3-octoprint will still work, which is all that matters.

@amcewen
Member Author

amcewen commented Oct 14, 2019

An additional datapoint...

I haven't had any problems talking to a number of Pis with my Museum in a Box stuff over the past week or two. They're all configured with a hostname of box - there's the one on the bookcase by the main door, which has been up for 9 days now - and then three more Pis which have been on and off repeatedly while I've been testing things (although only one of them on at any one time, but I've been switching between them lots)

I haven't had any problems talking to them with ssh [email protected] and ssh [email protected], and similarly talking to the Node RED instances in a browser. The one on the bookcase has been both box.local and box-2.local at various points, but the other Pi (the one I've been trying to contact during the testing) has always responded at the other name. I basically run uptime when I've logged in to double-check I'm on the right Pi.

I don't ever try connecting to them without the .local bit, and haven't ever tried .localdomain until just now, when it worked fine.

@ajlennon
Contributor

OK so I have restarted mqtt.local with only the wired interface supported. It appears to be responding to mqtt.local on the expected IP address

@johnmckerrell
Member

I can ping ender3-octoprint.local, also .localdomain, I can't ping it without those because then it tries to resolve to my work vpn network.

@MatthewCroughan
Member

MatthewCroughan commented Oct 14, 2019

@johnmckerrell My understanding is that if you have an avahi-daemon running, /etc/resolv.conf is going to be pointing to some sort of private network which is the avahi-daemon. If that fails it'll then query the router DNS to see if the machine exists (the default if you don't have an avahi-daemon). The problem is that the ubiquiti feature I think is masking .local some of the time for the same reason it sometimes provides the wrong hostname.

@johnmckerrell
Member

3.14. Host Name Option
This option specifies the name of the client. The name may or may not be qualified with the local domain name

Well all I'm wondering is if the device is telling the router that it is foo.local and the router is then reporting this back, but I'm unclear on whether the documentation above (from the RFC) just means "when you later try to use this hostname it may or may not be qualified with the local domain name" or does it mean "you can pass a domain name in with the hostname". I would expect the former really.

Just to confirm, the router has its domain set to localdomain so it "shouldn't" be trying to do anything with the .local domain, unless as it says it is being told this by things requesting DHCP leases and then reporting that back out.
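
On the wire, the Host Name option (code 12 in RFC 2132) is just a code byte, a length byte and the name bytes, so the encoding itself does not stop a client from sending a qualified name. A small sketch of that layout; the helper names are invented here, and what a particular router does with a qualified name is outside what this shows.

```python
HOST_NAME_OPTION = 12  # RFC 2132, section 3.14


def encode_host_name_option(name):
    """Encode a DHCP Host Name option as code, length, then the name bytes."""
    data = name.encode("ascii")
    if not 1 <= len(data) <= 255:
        raise ValueError("host name must be 1 to 255 bytes long")
    return bytes([HOST_NAME_OPTION, len(data)]) + data


def decode_host_name_option(option):
    """Decode a single Host Name option back into a string."""
    code, length = option[0], option[1]
    if code != HOST_NAME_OPTION or len(option) != 2 + length:
        raise ValueError("not a well-formed Host Name option")
    return option[2:].decode("ascii")
```

Both `encode_host_name_option("foo")` and `encode_host_name_option("foo.local")` produce valid options, which is consistent with the "you can pass a domain name in with the hostname" reading being possible even if it is not the intended one.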

@MatthewCroughan
Member

MatthewCroughan commented Oct 14, 2019

@johnmckerrell My understanding is that outside of mDNS the device requests an IP and gives a hostname. The hostname that is given is usually specified in /etc/hosts like so:

127.0.0.1       localhost.local localhost       thinkpad
::1             localhost.local localhost       thinkpad

If I chose to request localhost.lan then ping thinkpad.lan should respond with the ip of my machine.

My theory is that this is the first thing that the router's feature caches, in the same way that Sams-Iphone.localdomain was causing a problem, it is returning .localdomain some of the time rather than allowing mDNS responses all of the time if both parties have an mDNS daemon.

This might still come down to your personal machine's configuration too. Since theoretically the mDNS daemon should be the first query, then the router's dns, but this may not be happening everywhere.

@MatthewCroughan
Member

MatthewCroughan commented Mar 1, 2020

@johnmckerrell After following this, I've got it working on my laptop. For some reason mqtt.local now returns an IPv6 address, whereas I believe I saw on @amcewen's machine it returns an IPv4 address. It all comes down to one's client configuration, which is actually really disappointing since it seems to vary so much between even two installations of Ubuntu.

https://unix.stackexchange.com/questions/43762/how-do-i-get-to-use-local-hostnames-with-arch-linux

The configuration in question is in /etc/nsswitch.conf

Configuration before following the guide:
hosts: files mymachines myhostname resolve [!UNAVAIL=return] dns
Configuration after following the guide, fixes it:
hosts: files mdns_minimal [NOTFOUND=return] dns myhostname
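
To compare machines quickly, the hosts line can be split into its sources and their bracketed actions. A sketch only: `parse_hosts_line` and `uses_mdns` are names made up here, and `uses_mdns` simply looks for sources whose name starts with "mdns".

```python
import re


def parse_hosts_line(line):
    """Split an nsswitch.conf 'hosts:' line into (source, [actions]) pairs."""
    database, _, rest = line.partition(":")
    if database.strip() != "hosts":
        raise ValueError("not a hosts line")
    entries = []
    # A token is either a whole [...] action group or a bare source name.
    for token in re.findall(r"\[[^\]]*\]|\S+", rest):
        if token.startswith("["):
            if not entries:
                raise ValueError("action group with no preceding source")
            entries[-1][1].extend(token[1:-1].split())
        else:
            entries.append((token, []))
    return entries


def uses_mdns(entries):
    """True if any source is an nss-mdns module such as mdns4_minimal."""
    return any(source.startswith("mdns") for source, _ in entries)
```

Applied to the two lines above, the "before" configuration has no mdns source at all, while the "after" one consults mdns_minimal directly after files.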

@MatthewCroughan
Member

MatthewCroughan commented Mar 1, 2020

mDNS also works just fine on the Vinyl Cutter PC, though there is some strange behaviour that I think is related to the wifi.

Discovery of mDNS on the vinyl cutter pc is strangely intermittent. I can't recreate it exactly, but I did observe it.

If I execute ping mqtt.local it will take some time (around 5 seconds) to resolve it. This will sometimes fail. Though after succeeding once it has no issue resolving subsequently. It will fail to resolve if the system were brought out of hibernation, but will work if you probe it enough.

This failure to resolve and massive resolve delay is not true of pinging the IP address of the machine directly, so it's definitely an mDNS related issue, whether that's down to configuration or the wifi hardware being slow. I do notice that the system has a massively variant ping response time when pinging local addresses. Pinging the router will result in anywhere from 10ms to 262ms.
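
The resolve delay can be measured separately from ping round-trip time by timing only the name lookup. A minimal sketch; `resolve_timings` is an illustrative helper, and `socket.gethostbyname` goes through whatever the system resolver is configured to use.

```python
import socket
import time


def resolve_timings(name, attempts=3, resolver=socket.gethostbyname):
    """Time repeated lookups; returns a list of (succeeded, seconds) pairs."""
    timings = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            resolver(name)
            succeeded = True
        except OSError:
            succeeded = False
        timings.append((succeeded, time.perf_counter() - start))
    return timings
```

If `resolve_timings("mqtt.local")` shows a slow or failed first attempt followed by fast ones, that matches the behaviour described above and points at name resolution rather than the link to the Pi.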

The configuration of /etc/nsswitch.conf on that machine which is a fresh Ubuntu 19.04 is:
hosts: files mdns4_minimal [NOTFOUND=return] dns

and it returns IPv4 addresses for all .local addresses. This is due to mdns4_minimal, which I confirmed by trying a switch to mdns_minimal. I later discovered that the 4 obviously means IPv4 explicitly.

https://askubuntu.com/questions/843943/how-to-replace-mdns4-minimal-with-bind

This gives us all the details related to what the different possible configurations are.
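
Which address families a given client actually gets back can be checked by grouping the results of `socket.getaddrinfo`. A sketch with an invented helper name; it only sorts the tuples that `getaddrinfo` returns and makes no lookups itself.

```python
import socket


def addresses_by_family(addrinfo):
    """Group getaddrinfo() results into unique IPv4 and IPv6 address lists."""
    result = {"IPv4": [], "IPv6": []}
    for family, _socktype, _proto, _canonname, sockaddr in addrinfo:
        if family == socket.AF_INET:
            key = "IPv4"
        elif family == socket.AF_INET6:
            key = "IPv6"
        else:
            continue  # ignore any other address families
        if sockaddr[0] not in result[key]:
            result[key].append(sockaddr[0])
    return result
```

Calling `addresses_by_family(socket.getaddrinfo("mqtt.local", None))` on each machine would make the mdns4_minimal versus mdns_minimal difference visible without reading ping output.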

@MatthewCroughan
Member

I've checked on Arthur's Win10 laptop, and it also seems to work. It returns IPv6 addresses. The same was not true however of my Win10 virtual machine until I enabled the avahi-daemon on the host machine, which is very interesting to me, not sure I understand what's happening there.
