More frequent disconnects of HomeKit Thread devices in 2022.12 (CoAP POST returned unexpected code) #83739
Comments
Hey there @Jc2k, @bdraco, mind taking a look at this issue as it has been labeled with an integration (homekit_controller)? (message by CodeOwnersMention) |
CC @roysjosh. Off the top of my head, the only changes this month that are likely to be relevant are the RST fix in CoAP and some fixes to how we get the IPv6 address from zeroconf if the cached address doesn't work at boot. It could be that the old behaviour was causing frequent connection drops, and therefore frequent key exchanges, and this was masking whatever process is now making the connection go wonky. By which I mean the bug was already there, but frequently resetting (RST) the connection meant we never hit it. Encryption relies on a counter. Every message we encrypt or decrypt bumps the counter. If we are out of sync, the counter is wrong. To account for ordering issues, @roysjosh's code tries some adjacent counter values. Do you have debug logs for aiohomekit during this? If you do, we should be able to see how hard it tries to recover. Also, is the Eve light switch battery powered? It'd be interesting to see if these errors were limited to battery-powered devices (we know the ACK timeout issue impacts battery-powered devices, as they have a sleep interval). |
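The counter recovery described above can be sketched roughly as follows. This is a toy illustration, not aiohomekit's actual code: an HMAC tag stands in for the real authenticated cipher, and the names `seal`, `open_with_window` and the `window` parameter are hypothetical.

```python
import hashlib
import hmac

TAG_LEN = 16  # illustrative tag length, not taken from the real protocol


def _tag(key: bytes, counter: int, payload: bytes) -> bytes:
    """Authenticate the payload under a nonce derived from the counter."""
    nonce = counter.to_bytes(8, "little")
    return hmac.new(key, nonce + payload, hashlib.sha256).digest()[:TAG_LEN]


def seal(key: bytes, counter: int, payload: bytes) -> bytes:
    """Sender side: append a tag bound to the sender's current counter."""
    return payload + _tag(key, counter, payload)


def open_with_window(key: bytes, expected: int, message: bytes, window: int = 2):
    """Receiver side: try the expected counter first, then its neighbours.

    Returns (counter_that_worked, payload), or None if no counter within
    the window authenticates the message.
    """
    payload, tag = message[:-TAG_LEN], message[-TAG_LEN:]
    # Candidate offsets ordered by distance from the expected counter: 0, -1, 1, -2, 2, ...
    for offset in sorted(range(-window, window + 1), key=abs):
        candidate = expected + offset
        if candidate < 0:
            continue
        if hmac.compare_digest(tag, _tag(key, candidate, payload)):
            return candidate, payload
    return None
```

In this sketch a receiver that is two messages behind still recovers, because counter 7 falls inside the window around 5, while a receiver that has drifted further gets `None` and would need a fresh key exchange.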
The Eve light switches are mains powered and act as routers, so they shouldn't have the long latencies of a battery device. I think what I'm seeing is some instability with my only border router (an Apple TV 4K 2nd gen). I've seen a few notifications from HomeKit on my phone that my hub was offline (followed by a back online notification). I restarted it earlier today, which totally trashed the entire Thread network (none of the devices were showing up in mdns). They appear to be back now. We'll see how things evolve. I've heard from other sources that Thread is still a bit shaky. iOS 16.2 should drop next week and that will bring its own set of changes. |
I don't have debug logs enabled right now. I'm planning to flip that on prior to said iOS update. |
I'm also seeing this issue on 2022.12.3. |
Took me a while to understand, but it seems that I have the same issue. |
@roysjosh has potentially fixed this. Hoping to have time to review his PRs this week so they can go into 2023.1.0... |
I assume you mean Jc2k/aiohomekit#272. It does indeed look promising. |
I think I have the same problem. I'm on 2023.1.4 and use two Eve Thermo Thread devices. They become unavailable every few minutes, and after 5 minutes or so they work again. The logs always show the same 3 errors:
Can you confirm it's the same issue, or should I open another one? Thanks, Paul |
It's not. Your device is sending data for an old encryption key that Home Assistant isn't using at the time it's received. It could be related, but it's distinct enough that it needs a separate ticket. |
I think I have the same issue. I have to power cycle my HomePod minis almost every day to get the Thread network back up :( |
You probably have a different error. The code to recover from this particular bug has been out for a while, and restarting the HomePod wouldn't have helped with it. If restarting your HomePod helps, I'd want to run some network diagnostics. I'd get an mDNS/zeroconf browser and get the IPv6 address of your devices. Then ping6 them from your HA environment. I'd also want to check your route table when everything is working and when everything breaks. If none of that makes sense, I'm currently working on some tooling to help trace where faults are. Hang in there. |
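One way to compare the route table "when everything is working and when everything breaks" is to save two snapshots and diff them. The sketch below assumes a Linux host with iproute2 (`ip -6 route show`); the function names are hypothetical, and it strips the `expires ...` field on the assumption that the countdown would otherwise make every line look changed.

```python
import re
import subprocess


def snapshot_routes() -> str:
    """Return the kernel's IPv6 route table as text (Linux, iproute2)."""
    result = subprocess.run(
        ["ip", "-6", "route", "show"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def _normalise(table: str) -> set:
    """Drop blank lines and the 'expires NNNsec' countdown from each route."""
    routes = set()
    for line in table.splitlines():
        line = re.sub(r"\s+expires \S+", "", line).strip()
        if line:
            routes.add(line)
    return routes


def diff_routes(working: str, broken: str):
    """Return (added, removed) routes between a good and a bad snapshot."""
    before, after = _normalise(working), _normalise(broken)
    return sorted(after - before), sorted(before - after)
```

Call `snapshot_routes()` once while the devices respond and once while they are unavailable, then pass both strings to `diff_routes`. A gateway address that changed between the two snapshots is a good place to start looking.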
Is it still worth investing further effort here, now that an update to the Matter integration allows adding Thread/Matter devices as well? |
There are some devices that don't support Matter and never will, but do support Thread. Also, Eve devices will likely continue to work better with HomeKit, because Matter doesn't have all the vendor-specific features that Eve built for HomeKit. Energy monitoring springs to mind, but I'm sure there are others. Also, a lot of these issues are problems with Thread, and will bite Matter just as hard. It's not HA's fault some of you have border routers that create dead routes. |
I've also seen these Eve Thread devices become unavailable. I've noticed this behavior especially when I'm working with Node-RED and deploying flows. I've also noticed that rebooting HA makes them available again. It could be an Apple Thread problem that the new HomeKit architecture in iOS 16.4 may help with, but it makes me wonder, because rebooting HA brings the devices back. Log entries are full of:
|
Looking forward to that diagnostic tool, let me know when it's ready so I can help by providing the logs needed |
The Thread diagnostic download is in the current stable release.
We now know that there are a multitude of bugs and config errors at play that all disrupt Thread, not just in Home Assistant but with Linux in general. This post assumes that the most basic config errors have been resolved: for example, no VLANs, and no WiFi repeaters disrupting mDNS or ICMPv6 (some are known to really break stuff).
We know that older versions of NetworkManager (which most desktop Linux uses, and HAOS 9.5 does too) had a bug I call ghost routes. Whenever a BR changes its link-local address, the Linux box remembers the old address and adds the new one to its route table. In time you can end up with 10 ghost routes. Depending on other settings in other parts of the Linux stack, these routes continue to be used as if they were valid. You can end up with a 10:1 chance of failure.
Newer versions of HAOS use newer versions of NetworkManager. These solve the problem... by only allowing a single border router. Every time a new border router announces itself, NetworkManager forgets the current one. With 3 BRs we expect to see the route table churn once a minute on average. If those changes were not atomic (for example, a remove was done and then an add), there would be a tiny window once a minute where the mesh would be inaccessible. This may not matter in practice but is a concern. More practically, it does mean there is no hope of failover until the next announcement. The HAOS 10 final release (it's not in rc2) will carry a patch to allow NM to track multiple BRs.
Of course, the next problem is that in some environments the kernel's Neighbour Unreachability Detection is not working. When a neighbour is considered stale by the kernel, it is probed with ICMPv6 packets. If 3 probes fail, it is marked as a failed neighbour. Failed neighbours are scored lower when making routing decisions. At least that's what's supposed to happen.
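A rough way to look for the ghost-route pattern described above is to group the output of `ip -6 route show` by destination prefix and flag prefixes reachable through more than one gateway. This is only a sketch with hypothetical function names. Several gateways for one prefix is also what a healthy network with several border routers looks like, so the result is a list of candidates to check by hand, not a verdict.

```python
from collections import defaultdict


def gateways_by_prefix(route_table: str) -> dict:
    """Map each destination prefix to the set of 'via' gateways seen for it."""
    gateways = defaultdict(set)
    current_prefix = None
    for line in route_table.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Multipath routes list extra gateways on indented 'nexthop' lines
        # that belong to the prefix printed on the preceding line.
        if fields[0] != "nexthop":
            current_prefix = fields[0]
        if current_prefix is not None and "via" in fields[:-1]:
            gateways[current_prefix].add(fields[fields.index("via") + 1])
    return dict(gateways)


def ghost_candidates(route_table: str) -> dict:
    """Return prefixes that are routed through more than one gateway."""
    return {
        prefix: sorted(gws)
        for prefix, gws in gateways_by_prefix(route_table).items()
        if len(gws) > 1
    }
```

Each gateway returned here can then be pinged from the HA host. A link-local address that no longer answers, but still has a route, fits the ghost-route description.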
If you have IP forwarding turned on (e.g. you are running an OTBR, or some container setups), the kernel disables this feature. In this scenario, even with a working NetworkManager, your network could go down for 30 minutes every time an IP changes. HAOS 10 final (again, not in rc2) will have a kernel patch to avoid this. This is not upstream yet, so it's broken for anyone running Supervised on their own OS (if they have forwarding enabled), and potentially for people running the container directly.
If you are running HA Core on a system without systemd-networkd or NetworkManager and you don't have forwarding enabled, you likely have a very reliable network for running Thread BRs. Oh wait, no. Because by default Linux actually drops route advertisements of the type Thread sends. So you need to manually configure sysctls.
We have also seen a weak mesh manifest in the same way. Turning off a BR with a weak mesh connection (often an Apple TV in a closet) can spring a mesh back to life with no further intervention.
Then there are the BR bugs. We are still seeing BRs rotating their mesh prefixes fairly rapidly. When everything keeps changing its IP, it's kinda hard to be stable.
I have had HA Core running like this since August and I still see a blip every one to two weeks. Restarting HA does help. That's probably a HA bug. But then again, sometimes waking a device (pushing a physical button) seems to get it back in line too. So it might not be. |
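The comment above does not name the sysctls involved. As a sketch, the check below reads the three per-interface IPv6 settings that I am assuming are the relevant ones (`accept_ra`, `forwarding` and `accept_ra_rt_info_max_plen`, all documented kernel sysctls). The thresholds, in particular 64 as the minimum accepted route prefix length, are my assumption and not something stated in this thread, and `ra_problems` is a hypothetical helper name.

```python
from pathlib import Path

DEFAULT_ROOT = "/proc/sys/net/ipv6/conf"


def read_sysctl(iface: str, name: str, root: str = DEFAULT_ROOT):
    """Read an integer per-interface IPv6 sysctl; None if missing or unreadable."""
    try:
        return int((Path(root) / iface / name).read_text().strip())
    except (OSError, ValueError):
        return None


def ra_problems(iface: str, root: str = DEFAULT_ROOT) -> list:
    """List settings that would make the kernel ignore advertised routes."""
    accept_ra = read_sysctl(iface, "accept_ra", root)
    forwarding = read_sysctl(iface, "forwarding", root)
    max_plen = read_sysctl(iface, "accept_ra_rt_info_max_plen", root)

    problems = []
    if accept_ra is None or accept_ra == 0:
        problems.append("accept_ra is off: router advertisements are ignored")
    elif forwarding == 1 and accept_ra != 2:
        # With forwarding on, only accept_ra=2 keeps RA processing enabled.
        problems.append("forwarding is on but accept_ra is not 2")
    if max_plen is None or max_plen < 64:
        # Assumed threshold: route information for /64 prefixes must be accepted.
        problems.append("accept_ra_rt_info_max_plen is below 64")
    return problems
```

For example, `ra_problems("eth0")` returns an empty list when nothing looks wrong for that interface. The `root` parameter exists only so the check can be pointed at a copy of the sysctl tree.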
Thank you for the fantastic overview and really valuable analysis. I hope your fixes do make it upstream. Are the patches posted somewhere for review or integration? |
The kernel patch was sent to the kernel mailing list, and the network manager patch was posted on the NM tracker (in an issue, as we wanted feedback on the approach). Sorry I don't have links, on phone. |
No worries. It's home-assistant/operating-system#2434 (see also home-assistant/operating-system#2333 (reply in thread)). I can find the mailing list threads. |
There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. |
The problem
Since 2022.12, I seem to be experiencing more disconnects for my HomeKit Thread devices (2 Eve light switches, 1 WeMo Scene remote). There is a desync error in the logs that is much more frequent than before. There has been no big change to my network architecture.
Home Assistant 2022.12.1
Supervisor 2022.11.2
Operating System 9.3
Frontend 20221208.0 - latest
What version of Home Assistant Core has the issue?
2022.12.1
What was the last working version of Home Assistant Core?
2022.11
What type of installation are you running?
Home Assistant OS
Integration causing the issue
homekit_controller
Link to integration documentation on our website
https://www.home-assistant.io/integrations/homekit_controller
Diagnostics information
No response
Example YAML snippet
No response
Anything in the logs that might be useful for us?
Additional information
No response