Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wifi stack hangs (IDFGH-9724) (IDFGH-9725) #11060

Closed
3 tasks done
rojer opened this issue Mar 27, 2023 · 21 comments
Closed
3 tasks done

Wifi stack hangs (IDFGH-9724) (IDFGH-9725) #11060

rojer opened this issue Mar 27, 2023 · 21 comments
Assignees
Labels
Awaiting Response awaiting a response from the author Resolution: NA Issue resolution is unavailable Status: Done Issue is done internally Status: In Progress Work is in progress Type: Bug bugs in IDF

Comments

@rojer
Copy link
Contributor

rojer commented Mar 27, 2023

Answers checklist.

  • I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
  • I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

IDF version.

4.4 release, tried latest 4.4 latest as well

Operating System used.

Linux

How did you build your project?

Command line with Make

If you are using Windows, please specify command line type.

None

Development Kit.

ESP32-PICO-KIT

Power Supply used.

USB

What is the expected behavior?

Wifi connects every time

What is the actual behavior?

Sometimes (approximately 1 in hundreds to a thousand) the wifi stack gets into an inconsistent state and is unable to connect or scan properly.
The tell-tale sign is appearance of W (67573) wifi:m f probe req l=0 messages on the console,
and additionally, wifi can no longer be properly stopped, after 5 seconds of blocking it prints the following:

[Mar 24 09:26:53.884] W (19912322) wifi:TX Q not empty: 500, TXQ_BLOCK=0
[Mar 24 09:26:53.884] W (19912322) wifi:force witi stop
[Mar 24 09:26:53.890] I (19912322) wifi:flush txq
[Mar 24 09:26:53.890] I (19912332) wifi:stop sw txq
[Mar 24 09:26:53.897] I (19912332) wifi:lmac stop hw txq
[Mar 24 09:26:53.897] W (19912332) wifi:sw tx 0 state not idle, potential error!
[Mar 24 09:26:53.903] W (19912342) wifi:force sw tx 0 state to idle, ebuf flag=3009

restarting does not solve the issue, and deinit causes a core dump due to what looks like a double-free on some internal buffer:

[Mar 24 09:26:53.909] I (19912342) wifi:Deinit lldesc rx mblock:4
[Mar 24 09:26:53.915].
[Mar 24 09:26:53.915] assert failed: tlsf_free heap_tlsf.c:964 (!block_is_free(block) && "block already marked as free")

Steps to reproduce.

I have firmware that reboots 5 seconds after wifi connects and it will get into this state within an hour or two.

Debug Logs.

No response

More Information.

No response

@rojer rojer added the Type: Bug bugs in IDF label Mar 27, 2023
@github-actions github-actions bot changed the title Wifi memory corruption Wifi memory corruption (IDFGH-9724) Mar 27, 2023
@espressif-bot espressif-bot added the Status: Opened Issue is new label Mar 27, 2023
@github-actions github-actions bot changed the title Wifi memory corruption (IDFGH-9724) Wifi memory corruption (IDFGH-9724) (IDFGH-9725) Mar 27, 2023
@rojer rojer changed the title Wifi memory corruption (IDFGH-9724) (IDFGH-9725) Wifi stack hangs (IDFGH-9724) (IDFGH-9725) Mar 27, 2023
@rojer
Copy link
Contributor Author

rojer commented Mar 27, 2023

here is a typical "good" boot log: https://gist.github.com/rojer/85c3abd0157a5b7935ca7249ff04ea1c
and a bad one (this is one, in total i have 3 occurences out of approx 8000 samples so far): https://gist.github.com/rojer/f2adba071ad424167bce5930d44f2f11
things to note:

  1. normally a device is online within a few seconds, not so in the bad case
  2. it does manage to associate but gets its wires crossed somewhere before obtaining a DHCP lease
  3. at line 138 wifi is restarted but it already exhibits the unhealthy behavior - "forced stop", "potential error", etc. and never recovers. scans are failing and eventually the "probe req l=0" message appears.

@espressif-bot espressif-bot added Resolution: Duplicate This issue or pull request already exists Status: Done Issue is done internally and removed Status: Opened Issue is new labels Mar 28, 2023
@rojer
Copy link
Contributor Author

rojer commented Mar 28, 2023

this seems to suggest that this has been fixed already. can i get a confirmation on whether that is the case?

@igrr
Copy link
Member

igrr commented Apr 5, 2023

No, that "duplicate" label is just a bug of our issue tracking system. Somehow two internal issues got created for this GitHub issue; the second one was marked as "duplicate", and the resolution got synced back to this issue. I'll clear the label.

@igrr igrr added Status: Opened Issue is new and removed Status: Done Issue is done internally Resolution: Duplicate This issue or pull request already exists labels Apr 5, 2023
@AxelLin
Copy link
Contributor

AxelLin commented May 2, 2023

No, that "duplicate" label is just a bug of our issue tracking system. Somehow two internal issues got created for this GitHub issue; the second one was marked as "duplicate", and the resolution got synced back to this issue. I'll clear the label.

The question is this issue is duplicated with which issue? I don't find the link.

@rojer
Copy link
Contributor Author

rojer commented May 3, 2023

@AxelLin it's not duplicated with anything, labeling was a mistake. it's still waiting for investigation.

@lszelejewski
Copy link

lszelejewski commented Jul 13, 2023

I can confirm the same issue but on 4.2.4

Exactly same symptoms related to frequent network connection and disconnection event. Happens more frequently when WiFi signal is weak compared to sites with better network coverage.

W (21654) wifi:TX Q not empty: 500, TXQ_BLOCK=0 W (21654) wifi:force witi stop I (21654) wifi:flush txq I (21654) wifi:stop sw txq I (21654) wifi:lmac stop hw txq W (21654) wifi:sw tx 0 state not idle, potential error! W (21654) wifi:force sw tx 0 state to idle, ebuf flag=0

esp_restart() does not help - only power cycle.

@AxelLin
Copy link
Contributor

AxelLin commented Jul 13, 2023

I can confirm the same issue but on 4.2.4

v4.2 is already EOL, maybe you can try with newer esp-idf version?

@lszelejewski
Copy link

v4.4 has been tried as well. Same issue.
The biggest problem is that esp_restart does not help and the only solution is to power cycle device manually - very problematic for our customers.

@AxelLin
Copy link
Contributor

AxelLin commented Jul 13, 2023

No, that "duplicate" label is just a bug of our issue tracking system. Somehow two internal issues got created for this GitHub issue; the second one was marked as "duplicate", and the resolution got synced back to this issue. I'll clear the label.

This issue has been reported for several months, but there is no response from other espressif developers?

@Xiehanxin
Copy link

hi @rojer , I have seen your log and I would like to confirm that if you have used the BLE and Wi-Fi coex? and could you share the minimum reproduce code to us

@Xiehanxin
Copy link

hi @lszelejewski could you describe your application scenario in details

@tomasznowik
Copy link

On behalf of @lszelejewski I will describe our problem. Majority of our devices uses v4.2.4.
We found this issue because due to our bug the device frees already freed pointer and calls esp_restart just after dropped wifi and reconnecting. We see

CORRUPT HEAP: multi_heap.c:540 detected at 0x3ffbb094
abort() was called at PC 0x40094971 on core 0

before the reboot.

But the problem is that WiFi can't connect anymore after startup. Software watchdog resets the device with esp_restart but issue persists until power cycle.
We can recreate this error only with weak wifi signal when connection is dropped from time to time so our bug corrupts the heap.

So it looks like this:

  1. device works in environment with low wifi signal
  2. before the issue the device lost wifi connection, connects again but obtaining address from DHCP didn't finish
  3. abort() and CORRUPT HEAP occurred due to double free call (our bug) and then reboot
  4. after reboot the device was unable to connect to wifi
  5. software watchdog resets the device each 3 minutes but it doesn't help
  6. powercycle helped.

The log:

E (1894) UDP Server: Socket created
I (1894) UDP Server: Socket bound, port 1234
I (1894) UDP Server: Waiting for data
I (1984) wifi:new:<1,0>, old:<1,0>, ap:<255,255>, sta:<1,0>, prof:1
I (1984) wifi:state: init -> auth (b0)
I (2384) WATCHDOG: Subscribe watchdog to task esp_timer
I (2984) wifi:state: auth -> init (200)
I (2984) wifi:new:<1,0>, old:<1,0>, ap:<255,255>, sta:<1,0>, prof:1
W (2984) WiFi: wifi event handler 5
I (2984) WiFi: Retry to connect to the AP
I (2994) EventHandlers: NETWORKING event ID 1
I (4294) wifi:new:<6,0>, old:<1,0>, ap:<255,255>, sta:<6,0>, prof:1
I (4294) wifi:state: init -> auth (b0)

01:44:06.678	I (5294) wifi:state: auth -> init (200)
I (5294) wifi:new:<6,0>, old:<6,0>, ap:<255,255>, sta:<6,0>, prof:1
W (5294) WiFi: wifi event handler 5
I (5294) WiFi: Retry to connect to the AP
I (5294) EventHandlers: NETWORKING event ID 1

01:44:08.036	I (7704) wifi:new:<11,0>, old:<6,0>, ap:<255,255>, sta:<11,0>, prof:1
I (7704) wifi:state: init -> auth (b0)

01:44:09.395	I (8704) wifi:state: auth -> init (200)
I (8704) wifi:new:<11,0>, old:<11,0>, ap:<255,255>, sta:<11,0>, prof:1
W (8704) WiFi: wifi event handler 5
I (8714) WiFi: Retry to connect to the AP
I (8714) EventHandlers: NETWORKING event ID 1

01:44:12.419	W (12074) WiFi: wifi event handler 5
I (12074) WiFi: Retry to connect to the AP
I (12074) EventHandlers: NETWORKING event ID 1

01:44:13.783	I (12124) wifi:new:<1,0>, old:<11,0>, ap:<255,255>, sta:<1,0>, prof:1
I (12124) wifi:state: init -> auth (b0)
I (13124) wifi:state: auth -> init (200)
I (13124) wifi:new:<1,0>, old:<1,0>, ap:<255,255>, sta:<1,0>, prof:1
W (13124) WiFi: wifi event handler 5
I (13124) WiFi: Retry to connect to the AP
I (13124) EventHandlers: NETWORKING event ID 1

01:44:15.135	I (14424) wifi:new:<6,0>, old:<1,0>, ap:<255,255>, sta:<6,0>, prof:1
I (14434) wifi:state: init -> auth (b0)

01:44:16.494	I (15434) wifi:state: auth -> init (200)
I (15434) wifi:new:<6,0>, old:<6,0>, ap:<255,255>, sta:<6,0>, prof:1
W (15434) WiFi: wifi event handler 5
I (15434) WiFi: Retry to connect to the AP
I (15434) EventHandlers: NETWORKING event ID 1
W (15674) wifi:m f probe req l=0

W (15914) wifi:m f probe req l=0

W (16154) wifi:m f probe req l=0


01:44:17.846	W (16394) wifi:m f probe req l=0

W (16634) wifi:m f probe req l=0

W (16874) wifi:m f probe req l=0

W (17114) wifi:m f probe req l=0

W (17354) wifi:m f probe req l=0


01:44:19.192	W (17604) wifi:m f probe req l=0

W (17844) wifi:m f probe req l=0

I (17944) wifi:new:<11,0>, old:<6,0>, ap:<255,255>, sta:<11,0>, prof:1
I (17944) wifi:state: init -> auth (b0)

01:44:20.545	I (18944) wifi:state: auth -> init (200)
I (18944) wifi:new:<11,0>, old:<11,0>, ap:<255,255>, sta:<11,0>, prof:1
W (18944) WiFi: wifi event handler 5
W (18944) WiFi: Using fallback network

01:44:24.286	W (23954) wifi:TX Q not empty: 500, TXQ_BLOCK=0
W (23954) wifi:force witi stop
I (23954) wifi:flush txq
I (23954) wifi:stop sw txq
I (23954) wifi:lmac stop hw txq
W (23954) wifi:sw tx 0 state not idle, potential error!
W (23954) wifi:force sw tx 0 state to idle, ebuf flag=412

It looks like inconsistent state is kept during reset with esp_restart.

Is there a change that triggering hardware watchdog would reset WiFi better than esp_restart()?

@Xiehanxin
Copy link

hi @tomasznowik from your log, there might be somerhing wrong with phy or the Phy uses the wrong calibration data , could you tell me
1.how much device have this problem
2.how often the problem recurs and tell me the reproduce steps if it is convient.
3.provide your sdkconfig.

@tomasznowik
Copy link

  1. It seems that many random devices has this issues so it's rather software issue. We managed to create bad wifi environment that 4 out of 10 devices hanged due this issue in 24 hours. Normally it almost never happens.
  2. It's hard to recreate this problem but it can happen once a day when there's weak wifi signal. But we are sure it was triggered due to our bug in the code that corrupted heap memory and triggered esp_restart in other thread. Steps to reproduce are in previous message.
  3. Please see attachment:
    sdkconfig.zip

@tomasznowik
Copy link

@Xiehanxin We still observe the issue. After fixing our own bug the device doesn't reset so often so it looked like the issue is solved but it's not. After writing a configuration to flash we call esp_system_abort to reboot. And rarely the devices can't connect to wifi anymore because of the error I described previously.

W 19585 | wifi             force witi stop
I 19585 | wifi             flush txq
I 19585 | wifi             stop sw txq
I 19585 | wifi             lmac stop hw txq
W 19585 | wifi             sw tx 0 state not idle, potential error!
W 19585 | wifi             force sw tx 0 state to idle, ebuf flag=412

You wrote "there might be somerhing wrong with phy or the Phy uses the wrong calibration data". Is there anything we could try to fix this?
We use:

CONFIG_ESP32_PHY_CALIBRATION_AND_DATA_STORAGE=y
# CONFIG_ESP32_PHY_INIT_DATA_IN_PARTITION is not set
CONFIG_ESP32_PHY_MAX_WIFI_TX_POWER=20
CONFIG_ESP32_PHY_MAX_TX_POWER=20

@AxelLin
Copy link
Contributor

AxelLin commented Dec 24, 2023

@Xiehanxin Any further feedback about #11060 (comment) and #11060 (comment) ?

@espressif-bot espressif-bot added Status: In Progress Work is in progress and removed Status: Opened Issue is new labels Mar 22, 2024
@Espressif-liuuuu
Copy link
Collaborator

Hi @rojer , we find the cause of Wi-Fi connect issue after reset when coexisting with BLE scan. #13598 (comment)
If its similiar in your case, you can try to verify it on v5.2 as well.

@rojer
Copy link
Contributor Author

rojer commented Jun 27, 2024

@Espressif-liuuuu great news, thanks! we've been carrying a really hacky workarond for this for a while. i'll try to repro it with our current code (w/o workaround) and then try the fix.

@espressif-bot espressif-bot added Awaiting Response awaiting a response from the author Status: Reviewing Issue is being reviewed labels Jul 17, 2024
@AxelLin
Copy link
Contributor

AxelLin commented Aug 14, 2024

i'll try to repro it with our current code (w/o workaround) and then try the fix.

@rojer Does the fix work for you?

@Sherry616
Copy link
Collaborator

Thanks for reporting, will close due to short of feedback, feel free to reopen with more updates.
Thanks for using our Espressif product!

@Sherry616 Sherry616 closed this as not planned Won't fix, can't repro, duplicate, stale Sep 5, 2024
@espressif-bot espressif-bot added Status: Opened Issue is new Status: Done Issue is done internally Resolution: NA Issue resolution is unavailable and removed Status: Reviewing Issue is being reviewed Status: Opened Issue is new labels Sep 5, 2024
@rojer
Copy link
Contributor Author

rojer commented Sep 5, 2024

yeah, i'm sorry - i have actually tried to reproduce it, but wasn't able to. something has changed somewhere, but the bottom line is we don't see these crashes anymore, and that's good.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Awaiting Response awaiting a response from the author Resolution: NA Issue resolution is unavailable Status: Done Issue is done internally Status: In Progress Work is in progress Type: Bug bugs in IDF
Projects
None yet
Development

No branches or pull requests

10 participants