Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More frequent disconnects of HomeKit Thread devices in 2022.12 (CoAP POST returned unexpected code) #83739

Closed
jfroy opened this issue Dec 10, 2022 · 22 comments

Comments

@jfroy
Copy link
Contributor

jfroy commented Dec 10, 2022

The problem

Since 2022.12, I seem to be experiencing more disconnects for my HomeKit Thread devices (2 Eve light switches, 1 WeMo Scene remote). There is a desync error in the logs that is much more frequent than before. There has been no big change to my network architecture.

Home Assistant 2022.12.1
Supervisor 2022.11.2
Operating System 9.3
Frontend 20221208.0 - latest

What version of Home Assistant Core has the issue?

2022.12.1

What was the last working version of Home Assistant Core?

2022.11

What type of installation are you running?

Home Assistant OS

Integration causing the issue

homekit_controller

Link to integration documentation on our website

https://www.home-assistant.io/integrations/homekit_controller

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

Logger: aiohomekit.controller.coap.connection
Source: components/homekit_controller/connection.py:726
First occurred: 14:21:40 (23 occurrences)
Last logged: 14:40:48

Decryption failed, desynchronized? Counter=0/3
Pair verify timed out
Decryption failed, desynchronized? Counter=11/13
CoAP POST returned unexpected code <aiocoap.Message at 0x7f5a909390: ACK 4.04 Not Found (MID 36576, token 53b4) remote <UDP6EndpointAddress [fdbc:107:de33:0:2fbc:f566:6fe8:4bd3] (locally fd9e:af89:770f:40f8:dfd4:1c17:ed42:3138%eth0)>>
Failed flailing attempts to resynchronize, self-destructing in 3, 2, 1...

Additional information

No response

@home-assistant
Copy link

Hey there @Jc2k, @bdraco, mind taking a look at this issue as it has been labeled with an integration (homekit_controller) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of homekit_controller can trigger bot actions by commenting:

  • @home-assistant close Closes the issue.
  • @home-assistant rename Awesome new title Change the title of the issue.
  • @home-assistant reopen Reopen the issue.
  • @home-assistant unassign homekit_controller Removes the current integration label and assignees on the issue, add the integration domain after the command.

(message by CodeOwnersMention)


homekit_controller documentation
homekit_controller source
(message by IssueLinks)

@Jc2k
Copy link
Member

Jc2k commented Dec 10, 2022

CC @roysjosh

Off the top of my head the only changes this month that are likely to be relevant are the RST fix in CoAP and some fixes to how we get the ipv6 address from zeroconf if the cached address doesn't work at boot.

It could be the old behaviour was causing frequent connection drops, so frequent key exchanges, and this was masking whatever process is now making the connection go wonky.

By which I mean, bug was already there but frequently RST the connection meant we never hit it.

Encryption relies on a counter. Every message we encrypt or decrypt bumps the counter. If we are out of sync, the counter is wrong. To account for ordering issues @roysjosh tries some adjacent counter values.

Do you have debug logs for aiohomekit during this? We should see how hard it tries to recover if you do.

Also, is the eve light switch battery powered? It'd be interesting to see if these errors were limited to battery powered stuff (we know the ACK timeout issue impacts battery powered devices as they have a sleep interval).

@jfroy
Copy link
Contributor Author

jfroy commented Dec 11, 2022

The Eve light switches are mains powered and act as routers, so they shouldn't have the long latencies of a battery device. I think what I'm seeing is some instability with my only border router (an Apple TV 4K 2nd gen). I've seen a few notifications from HomeKit on my phone that my hub was offline (followed by a back online notification). I restarted it earlier today, which totally trashed the entire Thread network (none of the devices were showing up in mdns). They appear to be back now. We'll see how things evolve. I've heard from other sources that Thread is still a bit shaky. iOS 16.2 should drop next week and that will bring its own set of changes.

@jfroy
Copy link
Contributor Author

jfroy commented Dec 11, 2022

I don't have debug logs enabled right now. I'm planning to flip that on prior to said iOS update.

@riley206
Copy link

I'm also seeing this issue on 2022.12.3.

@freps
Copy link

freps commented Dec 26, 2022

Took me a while to understand, but it seems that I have the same issue.
https://community.home-assistant.io/t/connection-issues-with-eve-motion-and-homepod-mini-over-thread/506584/2
HA > HomePod Mini > Eve Motion .. all latest versions, and also many connection issues, same as mentioned.

@Jc2k
Copy link
Member

Jc2k commented Dec 26, 2022

@roysjosh has potentially fixed this. Hoping to have time to review his PRs this week so they can go into 2023.1.0...

@jfroy
Copy link
Contributor Author

jfroy commented Dec 26, 2022

I assume you mean Jc2k/aiohomekit#272. It does indeed look promising.

@paull27
Copy link

paull27 commented Jan 16, 2023

I think I have the same problem. I‘m on 2023.1.4 and use two Eve Thermo Thread devices. They become unavailable ever few minutes and after 5 minutes or so it is working again. The logs always show the same 3 errors:

First occurred: 17:03:46 (8 occurrences) 
Last logged: 17:10:49

Decryption failed, desynchronized? Counter=19/22
Failed flailing attempts to resynchronize, self-destructing in 3, 2, 1...
Decryption failed, desynchronized? Counter=18/21
Decryption failed, desynchronized? Counter=14/17

Logger: homeassistant.components.homekit_controller.connection
Source: components/homekit_controller/connection.py:752 
Integration: HomeKit Controller (documentation, issues) 
First occurred: 17:03:46 (4 occurrences) 
Last logged: 17:10:49

Unexpected exception from <bound method HKDevice.async_update of <homeassistant.components.homekit_controller.connection.HKDevice object at 0x7fa88597b0>>
Unexpected exception from <bound method HKDevice.async_update of <homeassistant.components.homekit_controller.connection.HKDevice object at 0x7fa8990a00>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/aiohomekit/controller/coap/connection.py", line 118, in _decrypt_response
    return self.decrypt(response.payload)
  File "/usr/local/lib/python3.10/site-packages/aiohomekit/controller/coap/connection.py", line 95, in decrypt
    dec_data = self.recv_ctx.decrypt(
  File "/usr/local/lib/python3.10/site-packages/cryptography/hazmat/primitives/ciphers/aead.py", line 62, in decrypt
    return aead._decrypt(backend, self, nonce, data, [associated_data], 16)
  File "/usr/local/lib/python3.10/site-packages/cryptography/hazmat/backends/openssl/aead.py", line 207, in _decrypt
    raise InvalidTag
cryptography.exceptions.InvalidTag

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/helpers/debounce.py", line 109, in _handle_timer_finish
    await task
  File "/usr/src/homeassistant/homeassistant/components/homekit_controller/connection.py", line 716, in async_update
    new_values_dict = await self.get_characteristics(
  File "/usr/src/homeassistant/homeassistant/components/homekit_controller/connection.py", line 752, in get_characteristics
    return await self.pairing.get_characteristics(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/aiohomekit/controller/coap/pairing.py", line 197, in get_characteristics
    return await self.connection.read_characteristics(characteristics)
  File "/usr/local/lib/python3.10/site-packages/aiohomekit/controller/coap/connection.py", line 507, in read_characteristics
    pdu_results = await self.enc_ctx.post_all(OpCode.CHAR_READ, iids, data)
  File "/usr/local/lib/python3.10/site-packages/aiohomekit/controller/coap/connection.py", line 195, in post_all
    res_pdu = await self.post_bytes(req_pdu)
  File "/usr/local/lib/python3.10/site-packages/aiohomekit/controller/coap/connection.py", line 181, in post_bytes
    return await self._decrypt_response(response)
  File "/usr/local/lib/python3.10/site-packages/aiohomekit/controller/coap/connection.py", line 158, in _decrypt_response
    await self.coap_ctx.shutdown()
AttributeError: 'NoneType' object has no attribute 'shutdown'

Logger: aiohomekit.controller.coap.pdu
Source: components/homekit_controller/connection.py:752 
First occurred: 17:04:49 (4 occurrences) 
Last logged: 17:11:51

Transaction 0 failed with error 6 (Invalid request

Can you confirm it‘s the same issue or should I open another one?

Thanks, Paul

@Jc2k
Copy link
Member

Jc2k commented Jan 16, 2023

It's not, your device is sending data for an old encryption key that homeassistant isn't using at the time it's received. It could be related but it's distinct enough it needs a separate ticket.

@rimonhanna
Copy link

I think I have the same issue, I have to power cycle my homepods mini almost everyday to get the thread network back up :(

@Jc2k
Copy link
Member

Jc2k commented Feb 12, 2023

You probably have a different error. The code to recover from this particular bug has been out for a while, and restarting the HomePod wouldn't have helped with it.

If restarting your HomePod helps I'd want to run some network diagnostics. I'd get a mdns/zeroconf browser and get the ipv6 address of your devices. Then ping6 them - from your HA environment. I'd also want to check your route table when everything is working and when everything breaks.

If none of that makes some sense, im currently working on some tooling to help trace where faults are. Hang in there.

@freps
Copy link

freps commented Feb 12, 2023

Is it still worth to invest further effort here since there came in an update of the Matter integration which allows adding Thread/Matter devices as well?

@Jc2k
Copy link
Member

Jc2k commented Feb 12, 2023

There are some devices that don't support matter and won't, but do support thread.

Also eve devices will likely eventually work better with homekit, because matter doesn't have all the vendor specific features that eve built for homekit. Energy monitoring springs to mind, but I'm sure there are others.

Also a lot of these issues are problems with thread, and will bite matter just as hard. It's not HAs fault some of you have border routers that create dead routes.

@cayossarian
Copy link
Contributor

I've also seen these eve thread devices become unavailable. I've noticed this behavior especially if I'm working with node-red and deploying flows. What I also notice is rebooting HA makes them available again.

Could be an Apple thread problem that iOS 16.4 new HomeKit architecture may help but it makes me wonder because rebooting HA brings the devices back. Log entries are full of:

2023-03-28 20:58:01.658 WARNING (MainThread) [homeassistant.config_entries] Config entry 'Eve Door AD40' for homekit_controller integration not ready yet: failed to connect; Retrying in background
2023-03-28 20:58:02.447 WARNING (MainThread) [aiohomekit.controller.coap.pdu] Transaction 0 failed with error 6 (Invalid request

@rimonhanna
Copy link

You probably have a different error. The code to recover from this particular bug has been out for a while, and restarting the HomePod wouldn't have helped with it.

If restarting your HomePod helps I'd want to run some network diagnostics. I'd get a mdns/zeroconf browser and get the ipv6 address of your devices. Then ping6 them - from your HA environment. I'd also want to check your route table when everything is working and when everything breaks.

If none of that makes some sense, im currently working on some tooling to help trace where faults are. Hang in there.

Looking forward to that diagnostic tool, let me know when it's ready so I can help by providing the logs needed

@Jc2k
Copy link
Member

Jc2k commented Mar 31, 2023

The thread diagnostic download is in the current stable release.

We now know that there are a multitude of bugs and config errors at play that all disrupt thread, not just in home assistant but with Linux in general.

This post assumes that the most basic config errors have been resolved. For example, no vlans, and that WiFi repeaters aren't disrupting mdns or icmp6 (some are known to really break stuff).

We know that older versions of NetworkManager (which most desktop Linux uses, and HAOS 9.5 does too) had a bug I call ghost routes. Whenever a BR changes link local address, the Linux box remembers the old address and adds the new one to its route table. In time you can end up with 10 ghost routes. Depending on other settings in other parts of the Linux stack these routes continue to be used as if they were valid. You can end up with a 10:1 chance of failure.

Newer versions of HAOS use newer versions of NetworkManager. These solve the problem... by only allowing a single border router. Every time a new border router announces itself it forgets the current one. With 3 BRs we expect to see the route table churn once a minute on average. If those changes were not atomic (for example, a remove was done and then an add) then there would be a tiny window once a minute where the mesh would be inaccessible. This may not matter in practice but is a concern. More practicality, it does mean there is no hope of failover until the next announcement.

HAOS 10 final release (it's not in rc2) will carry a patch to allow NM to track multiple BRs.

Of course the next problem is that for some environments the kernels Neighbour Unreachable Detection is not working. When a neighbour is considered stale by the kernel it is probed by icmp6 packets. If 3 probes fail, it is marked as a failed neighbour. Failed neighbours are scored lower when making routing decisions. At least that's what's supposed to happen.

If you have ip forwarding turned on (eg you are running an OTBR, or some container setups) the kernel disables this feature. In this scenario, with a working network manager, your network could go down for 30 minutes every time an ip changes.

HAOS10 final (again not in rc2) will have a kernel patch to avoid this. This is not upstream yet. So it's broken for anyone running supervised on their own OS (if they have forwarding enabled). Potentially for people running the container directly.

If you are running HA Core on a system without systemd-networkd or NetworkManager and you don't have forwarding enabled, you likely have a very reliable network for running thread BRs. Oh wait no. Because by default Linux actually drops route advertisements of the type Thread sends. So you need to manually configure sysctls.

We have also seen weak mesh manifest in the same way. Turning off a Br with weak mesh (often an Apple TV in a closet) can spring a mesh back to life with no further intervention.

Then there are the BR bugs. We are still seeing Brs rotating their mesh prefixes fairly rapidly. When everything keeps changing it's ip it's kinda hard to be stable.

I have had HA core running like this since August and I still see a blip every one to two weeks. Restarting HA does help. That's probably a HA bug. But then again, sometimes waking a device (pushing a physical button) seems to get it back in line too. So it might not be.

@jfroy
Copy link
Contributor Author

jfroy commented Mar 31, 2023

Thank you for the fantastic overview and really valuable analysis. I hope your fixes do make it upstream.

Are the patches posted somewhere for review or integration?

@Jc2k
Copy link
Member

Jc2k commented Mar 31, 2023

The kernel patch was sent to the kernel mailing list, and the network manager patch was posted on the NM tracker (in an issue, as we wanted feedback on the approach). Sorry I don't have links, on phone.

@jfroy
Copy link
Contributor Author

jfroy commented Mar 31, 2023

No worries. It's home-assistant/operating-system#2434 (see also home-assistant/operating-system#2333 (reply in thread)). I can find the mailing list threads.

@jfroy
Copy link
Contributor Author

jfroy commented Mar 31, 2023

@issue-triage-workflows
Copy link

There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates.
Please make sure to update to the latest Home Assistant version and check if that solves the issue. Let us know if that works for you by adding a comment 👍
This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

@issue-triage-workflows issue-triage-workflows bot closed this as not planned Won't fix, can't repro, duplicate, stale Jul 6, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Aug 5, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

8 participants