Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ESP32S3] v4.4 WIFI crashes (IDFGH-10466) #11713

Closed
3 tasks done
KonssnoK opened this issue Jun 20, 2023 · 69 comments
Closed
3 tasks done

[ESP32S3] v4.4 WIFI crashes (IDFGH-10466) #11713

KonssnoK opened this issue Jun 20, 2023 · 69 comments
Assignees
Labels
Resolution: Done Issue is done internally Status: Done Issue is done internally

Comments

@KonssnoK
Copy link
Contributor

KonssnoK commented Jun 20, 2023

Answers checklist.

  • I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
  • I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

General issue report

Hello,
we are going to use this issue to report multiple WIFI crashes our field devices are reporting.
v4.4 based on 3cec3a0

Since then, only one commit on the WIFI library, so these errors should be all still there.

Our application uses

  • Wifi
  • Mesh
  • MQTT
@espressif-bot espressif-bot added the Status: Opened Issue is new label Jun 20, 2023
@github-actions github-actions bot changed the title [ESP32S3] v4.4 WIFI crashes [ESP32S3] v4.4 WIFI crashes (IDFGH-10466) Jun 20, 2023
@KonssnoK
Copy link
Contributor Author

Crash n.1

#0  block_is_free (block=0xdeadbeef) at /opt/esp-idf/components/heap/heap_tlsf_block_functions.h:90
#1  block_merge_prev (block=0xdeadbeef, control=0x3d800014) at /opt/esp-idf/components/heap/heap_tlsf.c:344
#2  tlsf_free (tlsf=0x3d800014, ptr=0x3da28dc8) at /opt/esp-idf/components/heap/heap_tlsf.c:966
#3  0x4038c48d in multi_heap_free_impl (heap=0x3d800000, p=0x3da28dc8) at /opt/esp-idf/components/heap/multi_heap.c:212
#4  0x40376b0d in heap_caps_free (ptr=0x3da28dc8) at /opt/esp-idf/components/heap/heap_caps.c:382
#5  0x4038d2dc in free (ptr=0x3da28dc8) at /opt/esp-idf/components/newlib/heap.c:39
#6  0x420c5034 in mem_free (rmem=0x3da28dc8) at /opt/esp-idf/components/lwip/lwip/src/core/mem.c:264
#7  0x420c594d in pbuf_free (p=0x3da28dc8) at /opt/esp-idf/components/lwip/lwip/src/core/pbuf.c:789
#8  0x420c66f0 in tcp_seg_free (seg=0x3da2c8cc) at /opt/esp-idf/components/lwip/lwip/src/core/tcp.c:1628
#9  0x420c7ba7 in tcp_free_acked_segments (pcb=0x3da2a730, seg_list=0x0, dbg_list_name=<optimized out>, dbg_other_seg_list=0x0) at /opt/esp-idf/components/lwip/lwip/src/core/tcp_in.c:1120
#10 0x420c7e24 in tcp_receive (pcb=0x3da2a730) at /opt/esp-idf/components/lwip/lwip/src/core/tcp_in.c:1289
#11 0x420c8ca4 in tcp_process (pcb=0x3da2a730) at /opt/esp-idf/components/lwip/lwip/src/core/tcp_in.c:987
#12 0x420c93ac in tcp_input (p=<optimized out>, inp=<optimized out>) at /opt/esp-idf/components/lwip/lwip/src/core/tcp_in.c:438
#13 0x420ce25d in ip4_input (p=0x3da2d3c4, inp=0x3da2095c) at /opt/esp-idf/components/lwip/lwip/src/core/ipv4/ip4.c:803
#14 0x420cfa20 in ethernet_input (p=0x3da2d3c4, netif=0x3da2095c) at /opt/esp-idf/components/lwip/lwip/src/netif/ethernet.c:186
#15 0x420c1ed0 in tcpip_thread_handle_msg (msg=0x3da2d43c) at /opt/esp-idf/components/lwip/lwip/src/api/tcpip.c:180
#16 0x420c1f53 in tcpip_thread (arg=0x0) at /opt/esp-idf/components/lwip/lwip/src/api/tcpip.c:154
#17 0x40388805 in vPortTaskWrapper (pxCode=0x420c1f18 <tcpip_thread>, pvParameters=0x0) at /opt/esp-idf/components/freertos/port/xtensa/port.c:142

wifi_crash_1.txt

@KonssnoK
Copy link
Contributor Author

Crash n.2

#0  insert_free_block (sl=29, fl=23, block=0x3da204d8, control=0x3d800014) at /opt/esp-idf/components/heap/heap_tlsf.c:237
#1  block_insert (block=0x3da204d8, control=0x3d800014) at /opt/esp-idf/components/heap/heap_tlsf.c:263
#2  tlsf_free (tlsf=0x3d800014, ptr=<optimized out>) at /opt/esp-idf/components/heap/heap_tlsf.c:968
#3  0x4038c48d in multi_heap_free_impl (heap=0x3d800000, p=0x3da2054c) at /opt/esp-idf/components/heap/multi_heap.c:212
#4  0x40376b0d in heap_caps_free (ptr=0x3da2054c) at /opt/esp-idf/components/heap/heap_caps.c:382
#5  0x4037f594 in esp_mbedtls_mem_free (ptr=0x3da2054c) at /opt/esp-idf/components/mbedtls/port/esp_mem.c:46
#6  0x420ee871 in mbedtls_free (ptr=0x3da2054c) at /opt/esp-idf/components/mbedtls/mbedtls/library/platform.c:66
#7  0x420ec366 in mbedtls_md_free (ctx=0x3fcf6f70) at /opt/esp-idf/components/mbedtls/mbedtls/library/md.c:330
#8  0x420be376 in pbkdf2_sha1 (passphrase=<optimized out>, ssid=0x3fcb62a4 <s_wifi_nvs+8> \"\", ssid_len=14, iterations=4096, buf=<optimized out>, buflen=32) at /opt/esp-idf/components/wpa_supplicant/src/crypto/crypto_mbedtls.c:687
#9  0x42146920 in wpa_set_passphrase (passphrase=<optimized out>, ssid=<optimized out>, ssid_len=<optimized out>) at /opt/esp-idf/components/wpa_supplicant/src/rsn_supp/wpa.c:2492
#10 0x42146db8 in wpa_set_bss (macddr=0x3fcf70a0 \"\254\\027T\", bssid=<optimized out>, pairwise_cipher=<optimized out>, group_cipher=<optimized out>, passphrase=0x3fcb62cb <s_wifi_nvs+47> \"\", ssid=0x3fcb62a4 <s_wifi_nvs+8> \"\", ssid_len=14) at /opt/esp-idf/components/wpa_supplicant/src/rsn_supp/wpa.c:2462
#11 0x4213f1f1 in wpa_config_bss (bssid=0x3fcb696c <g_cnxMgr+12> \"\") at /opt/esp-idf/components/wpa_supplicant/esp_supplicant/src/esp_wpa_main.c:92
#12 0x4213f207 in wpa_sta_connect (bssid=0x3fcb696c <g_cnxMgr+12> \"\") at /opt/esp-idf/components/wpa_supplicant/esp_supplicant/src/esp_wpa_main.c:174
#13 0x42132049 in ?? ()
#14 0x4213253d in ?? ()
#15 0x421332fe in ?? ()
#16 0x4213337b in cnx_start_handoff_cb ()
#17 0x4212d01f in ?? ()
#18 0x4212d496 in ?? ()
#19 0x42131a2e in chm_end_op ()
#20 0x42131a4d in chm_end_op_timeout_process ()
#21 0x421315af in ?? ()
#22 0x42131844 in ieee80211_timer_do_process ()
#23 0x4038d5b4 in ppTask ()
#24 0x40388805 in vPortTaskWrapper (pxCode=0x4038d50c <ppTask>, pvParameters=0x0) at /opt/esp-idf/components/freertos/port/xtensa/port.c:142

wifi_crash_2.txt

@KonssnoK
Copy link
Contributor Author

Crash n.3 (internal 5)

#0  0x403763bd in panic_abort (details=0x3fcf6d8b \"assert failed: remove_free_block heap_tlsf.c:205 (prev && \\\"prev_free field can not be null\\\")\") at /opt/esp-idf/components/esp_system/panic.c:408
#1  0x40384588 in esp_system_abort (details=0x3fcf6d8b \"assert failed: remove_free_block heap_tlsf.c:205 (prev && \\\"prev_free field can not be null\\\")\") at /opt/esp-idf/components/esp_system/esp_system.c:137
#2  0x4038d2ac in __assert_func (file=0x3c1874ed \"heap_tlsf.c\", line=<optimized out>, func=<optimized out>, expr=0x3c187764 \"prev && \\\"prev_free field can not be null\\\"\") at /opt/esp-idf/components/newlib/assert.c:85
#3  0x4038ba2a in remove_free_block (sl=<optimized out>, fl=<optimized out>, block=0x3da2c5f0, control=0x3d800014) at /opt/esp-idf/components/heap/heap_tlsf.c:205
#4  block_remove (block=0x3da2c5f0, control=0x3d800014) at /opt/esp-idf/components/heap/heap_tlsf.c:255
#5  block_merge_next (block=<optimized out>, control=0x3d800014) at /opt/esp-idf/components/heap/heap_tlsf.c:361
#6  tlsf_free (tlsf=0x3d800014, ptr=<optimized out>) at /opt/esp-idf/components/heap/heap_tlsf.c:967
#7  0x4038c48d in multi_heap_free_impl (heap=0x3d800000, p=0x3da2c58c) at /opt/esp-idf/components/heap/multi_heap.c:212
#8  0x40376b0d in heap_caps_free (ptr=0x3da2c58c) at /opt/esp-idf/components/heap/heap_caps.c:382
#9  0x4037f594 in esp_mbedtls_mem_free (ptr=0x3da2c58c) at /opt/esp-idf/components/mbedtls/port/esp_mem.c:46
#10 0x420ee871 in mbedtls_free (ptr=0x3da2c58c) at /opt/esp-idf/components/mbedtls/mbedtls/library/platform.c:66
#11 0x420ec352 in mbedtls_md_free (ctx=0x3fcf6f70) at /opt/esp-idf/components/mbedtls/mbedtls/library/md.c:324
#12 0x420be376 in pbkdf2_sha1 (passphrase=<optimized out>, ssid=0x3fcb62a4 <s_wifi_nvs+8> \"\", ssid_len=11, iterations=4096, buf=<optimized out>, buflen=32) at /opt/esp-idf/components/wpa_supplicant/src/crypto/crypto_mbedtls.c:687
#13 0x42146920 in wpa_set_passphrase (passphrase=<optimized out>, ssid=<optimized out>, ssid_len=<optimized out>) at /opt/esp-idf/components/wpa_supplicant/src/rsn_supp/wpa.c:2492
#14 0x42146db8 in wpa_set_bss (macddr=0x3fcf70a0 \"\254\\027T\", bssid=<optimized out>, pairwise_cipher=<optimized out>, group_cipher=<optimized out>, passphrase=0x3fcb62cb <s_wifi_nvs+47> \"\", ssid=0x3fcb62a4 <s_wifi_nvs+8> \"\", ssid_len=11) at /opt/esp-idf/components/wpa_supplicant/src/rsn_supp/wpa.c:2462
#15 0x4213f1f1 in wpa_config_bss (bssid=0x3fcb696c <g_cnxMgr+12> \"\") at /opt/esp-idf/components/wpa_supplicant/esp_supplicant/src/esp_wpa_main.c:92
#16 0x4213f207 in wpa_sta_connect (bssid=0x3fcb696c <g_cnxMgr+12> \"\") at /opt/esp-idf/components/wpa_supplicant/esp_supplicant/src/esp_wpa_main.c:174
#17 0x42132049 in ?? ()

wifi_crash_5.txt

@KonssnoK
Copy link
Contributor Author

i think we have more, we'll continue to check.

@KonssnoK
Copy link
Contributor Author

Crash n.4 (internal id 9)

#0  block_is_free (block=0x5378430) at /opt/esp-idf/components/heap/heap_tlsf_block_functions.h:90
#1  block_merge_prev (block=0x5378430, control=0x3d800014) at /opt/esp-idf/components/heap/heap_tlsf.c:344
#2  tlsf_free (tlsf=0x3d800014, ptr=0x3da292e0) at /opt/esp-idf/components/heap/heap_tlsf.c:966
#3  0x4038c48d in multi_heap_free_impl (heap=0x3d800000, p=0x3da292e0) at /opt/esp-idf/components/heap/multi_heap.c:212
#4  0x40376b0d in heap_caps_free (ptr=0x3da292e0) at /opt/esp-idf/components/heap/heap_caps.c:382
#5  0x4038d2dc in free (ptr=0x3da292e0) at /opt/esp-idf/components/newlib/heap.c:39
#6  0x4038d7c0 in esf_buf_recycle ()
#7  0x4003f3ce in ?? ()
#8  0x4003a280 in ?? ()
#9  0x4038d068 in sta_input ()
#10 0x4038d0a4 in sta_rx_cb ()
#11 0x4003f4c0 in ?? ()
#12 0x4038d5e1 in ppTask ()
#13 0x40388805 in vPortTaskWrapper (pxCode=0x4038d50c <ppTask>, pvParameters=0x0) at /opt/esp-idf/components/freertos/port/xtensa/port.c:142

wifi_crash_9.txt

@KonssnoK
Copy link
Contributor Author

Crash n.5 (internal id 10)
@zhangyanjiaoesp this one seems related to mesh

#0  0x403763bd in panic_abort (details=0x3fcea71b \"assert failed: tlsf_free heap_tlsf.c:964 (!block_is_free(block) && \\\"block already marked as free\\\")\") at /opt/esp-idf/components/esp_system/panic.c:408
#1  0x40384588 in esp_system_abort (details=0x3fcea71b \"assert failed: tlsf_free heap_tlsf.c:964 (!block_is_free(block) && \\\"block already marked as free\\\")\") at /opt/esp-idf/components/esp_system/esp_system.c:137
#2  0x4038d2ac in __assert_func (file=0x3c1874ed \"heap_tlsf.c\", line=<optimized out>, func=<optimized out>, expr=0x3c187998 \"!block_is_free(block) && \\\"block already marked as free\\\"\") at /opt/esp-idf/components/newlib/assert.c:85
#3  0x4038b83a in tlsf_free (tlsf=0x3d800014, ptr=0x3da2759c) at /opt/esp-idf/components/heap/heap_tlsf.c:964
#4  0x4038c48d in multi_heap_free_impl (heap=0x3d800000, p=0x3da2759c) at /opt/esp-idf/components/heap/multi_heap.c:212
#5  0x40376b0d in heap_caps_free (ptr=0x3da2759c) at /opt/esp-idf/components/heap/heap_caps.c:382
#6  0x4038d2dc in free (ptr=0x3da2759c) at /opt/esp-idf/components/newlib/heap.c:39
#7  0x4211af7e in mesh_free ()
#8  0x4211b02c in esp_mesh_free_context ()
#9  0x420f97ec in esp_mesh_send ()
#10 0x42074f94 in mesh_driver_transmit_from_node_sta (h=<optimized out>, buffer=0x3da27562, len=54) at ../components/mesh/mesh_driver.c:148
#11 0x42074fb8 in mesh_driver_transmit_from_node_sta_wrap (h=0x3d817c7c, buffer=0x3da27562, len=54, netstack_buf=0x3da27548) at ../components/mesh/mesh_driver.c:157
#12 0x4216b1b0 in esp_netif_transmit_wrap (esp_netif=0x3d817bf4, data=0x3da27562, len=54, pbuf=0x3da27548) at /opt/esp-idf/components/esp_netif/lwip/esp_netif_lwip.c:961
#13 0x420d25a6 in low_level_output (netif=<optimized out>, p=0x3da27548) at /opt/esp-idf/components/lwip/port/esp32/netif/wlanif.c:137
#14 0x420cfaad in ethernet_output (netif=0x3da25c80, p=0x3da27548, src=<optimized out>, dst=0x3fcb4aec <arp_table+36>, eth_type=<optimized out>) at /opt/esp-idf/components/lwip/lwip/src/netif/ethernet.c:312
#15 0x420cd124 in etharp_output_to_arp_index (netif=0x3da25c80, q=0x3da27548, arp_idx=<optimized out>) at /opt/esp-idf/components/lwip/lwip/src/core/ipv4/etharp.c:785
#16 0x420cd425 in etharp_output (netif=0x3da25c80, q=0x3da27548, ipaddr=<optimized out>) at /opt/esp-idf/components/lwip/lwip/src/core/ipv4/etharp.c:884
#17 0x420ce514 in ip4_output_if_opt_src (p=0x3da27548, src=<optimized out>, dest=0x3da2d5b8, ttl=255 '\\377', tos=0 '\\000', proto=<optimized out>, netif=netif@entry=0x3da25c80, ip_options=ip_options@entry=0x0, optlen=optlen@entry=0) at /opt/esp-idf/components/lwip/lwip/src/core/ipv4/ip4.c:1108
#18 0x420ce568 in ip4_output_if_opt (p=0x3da27548, src=0x3da2d5b4, dest=0x3da2d5b8, ttl=255 '\\377', tos=0 '\\000', proto=<optimized out>, netif=netif@entry=0x3da25c80, ip_options=ip_options@entry=0x0, optlen=optlen@entry=0) at /opt/esp-idf/components/lwip/lwip/src/core/ipv4/ip4.c:909
#19 0x420ce590 in ip4_output_if (p=0x3da27548, src=0x3da2d5b4, dest=0x3da2d5b8, ttl=255 '\\377', tos=0 '\\000', proto=6 '\\006', netif=netif@entry=0x3da25c80) at /opt/esp-idf/components/lwip/lwip/src/core/ipv4/ip4.c:886
#20 0x420c9b45 in tcp_output_control_segment (pcb=0x3da2d5b4, p=0x3da27548, src=0x3da2d5b4, dst=0x3da2d5b8) at /opt/esp-idf/components/lwip/lwip/src/core/tcp_out.c:1951
#21 0x420ca4b8 in tcp_send_empty_ack (pcb=0x3da2d5b4) at /opt/esp-idf/components/lwip/lwip/src/core/tcp_out.c:2057
#22 0x420ca534 in tcp_output (pcb=0x3da2d5b4) at /opt/esp-idf/components/lwip/lwip/src/core/tcp_out.c:1277
#23 0x420c7194 in tcp_fasttmr () at /opt/esp-idf/components/lwip/lwip/src/core/tcp.c:1497
#24 0x420c7702 in tcp_tmr () at /opt/esp-idf/components/lwip/lwip/src/core/tcp.c:237
#25 0x420cab3e in tcpip_tcp_timer (arg=0x0) at /opt/esp-idf/components/lwip/lwip/src/core/timeouts.c:161
#26 0x420cac00 in sys_check_timeouts () at /opt/esp-idf/components/lwip/lwip/src/core/timeouts.c:411
#27 0x420c1e7d in tcpip_timeouts_mbox_fetch (mbox=0x3fcb4544 <tcpip_mbox>, msg=0x3fceac40) at /opt/esp-idf/components/lwip/lwip/src/api/tcpip.c:115
#28 0x420c1f39 in tcpip_thread (arg=0x0) at /opt/esp-idf/components/lwip/lwip/src/api/tcpip.c:148
#29 0x40388805 in vPortTaskWrapper (pxCode=0x420c1f18 <tcpip_thread>, pvParameters=0x0) at /opt/esp-idf/components/freertos/port/xtensa/port.c:142

wifi_crash_10_mesh.txt

@igrr
Copy link
Member

igrr commented Jun 20, 2023

@KonssnoK all the posted backtraces do indicate that the program aborts due to a heap corruption. Please follow the recommendations listed in the docs to narrow down the root cause of heap corruption: https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-reference/system/heap_debug.html#heap-corruption-detection

Even though the backtrace originates from a Wi-Fi related function, this typically only means that the heap corruption has already occurred and that the next allocation has ran into the corrupted heap structure. Enabling heap debugging options can help detect the issue sooner and hopefully identify which part of the code is responsible for it.

@KonssnoK
Copy link
Contributor Author

@KonssnoK all the posted backtraces do indicate that the program aborts due to a heap corruption. Please follow the recommendations listed in the docs to narrow down the root cause of heap corruption: https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-reference/system/heap_debug.html#heap-corruption-detection

Even though the backtrace originates from a Wi-Fi related function, this typically only means that the heap corruption has already occurred and that the next allocation has ran into the corrupted heap structure. Enabling heap debugging options can help detect the issue sooner and hopefully identify which part of the code is responsible for it.

Hi @igrr , sadly there are a few issues:

  • these crashes are occurring on the field.. we cannot activate heap poisoning and monitoring on remote devices
  • we already found some heap corruption in the espressif closed source code (solved thanks to your mesh team) which was a pain to reproduce
  • the main allocators in our code are related to TCP/Mesh/WIFI
  • we have no idea of what is happening to trigger such conditions, we never seen them in the lab for now

Therefore we need:

  • some idea on how to reduce the scope of the error in remote devices so that we can try to reproduce locally

@KonssnoK
Copy link
Contributor Author

@igrr another wifi crash, not related to heap

Guru Meditation Error: Core  1 panic'ed (LoadProhibited). Exception was unhandled.

Core  1 register dump:
PC      : 0x42123db8  PS      : 0x00060630  A0      : 0x82124b7e  A1      : 0x3fcf71c0
0x42123db8: ieee80211_set_max_rate at ??:?

A2      : 0x00000000  A3      : 0x3d816ed0  A4      : 0x00000032  A5      : 0x00000000
A6      : 0x00000000  A7      : 0x00000018  A8      : 0x00000035  A9      : 0x0000001d
A10     : 0x00000020  A11     : 0x3fcf7264  A12     : 0x00000006  A13     : 0x3fcf72c6
A14     : 0x00000001  A15     : 0x00000000  SAR     : 0x00000010  EXCCAUSE: 0x0000001c
EXCVADDR: 0x00000019  LBEG    : 0x40056f5c  LEND    : 0x40056f72  LCOUNT  : 0xffffffff


Backtrace: 0x42123db5:0x3fcf71c0 0x42124b7b:0x3fcf72c0 0x42123b51:0x3fcf7320 0x4003f4dd:0x3fcf7340 |<-CORRUPTED
0x42123db5: ieee80211_set_max_rate at ??:?

0x42124b7b: ieee80211_parse_rsn at ??:?

0x42123b51: ieee80211_decap at ??:?

@zhangyanjiaoesp
Copy link
Collaborator

@igrr another wifi crash, not related to heap

Guru Meditation Error: Core  1 panic'ed (LoadProhibited). Exception was unhandled.

Core  1 register dump:
PC      : 0x42123db8  PS      : 0x00060630  A0      : 0x82124b7e  A1      : 0x3fcf71c0
0x42123db8: ieee80211_set_max_rate at ??:?

A2      : 0x00000000  A3      : 0x3d816ed0  A4      : 0x00000032  A5      : 0x00000000
A6      : 0x00000000  A7      : 0x00000018  A8      : 0x00000035  A9      : 0x0000001d
A10     : 0x00000020  A11     : 0x3fcf7264  A12     : 0x00000006  A13     : 0x3fcf72c6
A14     : 0x00000001  A15     : 0x00000000  SAR     : 0x00000010  EXCCAUSE: 0x0000001c
EXCVADDR: 0x00000019  LBEG    : 0x40056f5c  LEND    : 0x40056f72  LCOUNT  : 0xffffffff


Backtrace: 0x42123db5:0x3fcf71c0 0x42124b7b:0x3fcf72c0 0x42123b51:0x3fcf7320 0x4003f4dd:0x3fcf7340 |<-CORRUPTED
0x42123db5: ieee80211_set_max_rate at ??:?

0x42124b7b: ieee80211_parse_rsn at ??:?

0x42123b51: ieee80211_decap at ??:?

@KonssnoK Can you provide the .elf file on this failure test?

@zhangyanjiaoesp
Copy link
Collaborator

@KonssnoK About the heap corruption issue, can you provide more logs before the problem happen? This will help us analyze in what scenario does this happen.

@KonssnoK
Copy link
Contributor Author

Hi @zhangyanjiaoesp ,

sadly the crashes are happening on field devices, so we don't really know what is going on. we do have logs but they are related to the device functionality, not the network itself.
We are investigating ways to increase repeatability and understand more!

Sorry i can't share the elf :( but that crash happened on one of our testing devices, so i might be able to reproduce it.
I'll post more when we'll have more data available.

@espressif-bot espressif-bot added Status: In Progress Work is in progress and removed Status: Opened Issue is new labels Jul 5, 2023
@KonssnoK
Copy link
Contributor Author

@zhangyanjiaoesp we are still working on this, we are bit puzzled because it seems that the heap poisoning works only for free blocks, ignoring allocated ones.
we are working trying to trigger the checks in a way that can be used on the field.

@KonssnoK
Copy link
Contributor Author

KonssnoK commented Jul 25, 2023

Hello @zhangyanjiaoesp ,
while we are still working on isolating the heap corruptions we found another crash related to the MTX task.
Sadly we cannot look at it because it's part of the closed source code.

Please note that the whole mutex is handled internally by espressif, so our code is not involved

==================== CURRENT THREAD STACK =====================
#0  0x403763e9 in panic_abort (details=0x3fccd2cb \"assert failed: spinlock_acquire spinlock.h:123 ((result == SPINLOCK_FREE) == (lock->count == 0))\") at /opt/esp-idf/components/esp_system/panic.c:408
#1  0x40384688 in esp_system_abort (details=0x3fccd2cb \"assert failed: spinlock_acquire spinlock.h:123 ((result == SPINLOCK_FREE) == (lock->count == 0))\") at /opt/esp-idf/components/esp_system/esp_system.c:137
#2  0x4038d3b4 in __assert_func (file=0x3c181b47 \"spinlock.h\", line=<optimized out>, func=<optimized out>, expr=0x3c182a44 \"(result == SPINLOCK_FREE) == (lock->count == 0)\") at /opt/esp-idf/components/newlib/assert.c:85
#3  0x40388b2c in spinlock_acquire (timeout=<optimized out>, lock=0x3d819d44) at /opt/esp-idf/components/esp_hw_support/include/soc/spinlock.h:123
#4  xPortEnterCriticalTimeout (mux=0x3d819d44, timeout=<optimized out>) at /opt/esp-idf/components/freertos/port/xtensa/port.c:301
#5  0x403858f0 in vPortEnterCritical (mux=0x3d819d44) at /opt/esp-idf/components/freertos/port/xtensa/include/freertos/portmacro.h:578
#6  xQueueSemaphoreTake (xQueue=0x3d819cf0, xTicksToWait=<optimized out>) at /opt/esp-idf/components/freertos/queue.c:1563
#7  0x40385a53 in xQueueTakeMutexRecursive (xMutex=0x3d819cf0, xTicksToWait=4294967295) at /opt/esp-idf/components/freertos/queue.c:731
#8  0x40376780 in mutex_lock_wrapper (mutex=0x3d819cf0) at /opt/esp-idf/components/esp_wifi/esp32s3/esp_adapter.c:390
#9  0x4211e402 in mesh_mutex_lock ()
#10 0x4211be87 in mesh_tx_task_main ()
#11 0x4038890d in vPortTaskWrapper (pxCode=0x4211bae8 <mesh_tx_task_main>, pvParameters=0x0) at /opt/esp-idf/components/freertos/port/xtensa/port.c:142

Attached the dump

v3_dump_FFFFAC17540004AC_1690286435.txt

We currently work on top of 4fc8964

No idea on how to reproduce.

@KonssnoK
Copy link
Contributor Author

Another mesh crash we cannot investigate

==================== CURRENT THREAD STACK =====================
#0  0x42106922 in esp_mesh_scan_done ()
#1  0x4210cb2d in mesh_nwk_task_main ()
#2  0x4038890d in vPortTaskWrapper (pxCode=0x4210c1fc <mesh_nwk_task_main>, pvParameters=0x0) at /opt/esp-idf/components/freertos/port/xtensa/port.c:142

v3_dump_FFFFAC1754000474_1690351584.txt

@zhangyanjiaoesp
Copy link
Collaborator

@KonssnoK Can you provide the .elf file and the backtrace info ?

@KonssnoK
Copy link
Contributor Author

@zhangyanjiaoesp the elf i would send it via mail, if you can provide it.
For the backtrace, do you mean that the file generated by coredump.py is not enough?

@Xiehanxin
Copy link

hi @KonssnoK you can send it to the following email [email protected] and include your elf file. Please make sure to mention the GitHub link and provide a brief description in the email. Alternatively, you can notify me (@mention) after sending the email, that we can get the information as soon as possible

@zhangyanjiaoesp
Copy link
Collaborator

zhangyanjiaoesp commented Jul 27, 2023

@KonssnoK coredump.py is not enough, can you provide more context for the logs?

@KonssnoK
Copy link
Contributor Author

@zhangyanjiaoesp sadly no. the coredumps are retrieved from devices in the field placed in customers houses, so this is all we can see...
We are also trying to trigger those crashes but no luck for now

What kind of context would you need?

@zhangyanjiaoesp
Copy link
Collaborator

@KonssnoK We want to get the runtime logs until the crash, it may help us to analyze the crash scenario. It will help us also if you can provide the .elf file. The mail is as @Xiehanxin mentioned in the comment.

@KonssnoK
Copy link
Contributor Author

sorry, somehow i missed the @Xiehanxin comment!

I sent an initial email with one of the most recent crashes related to

==================== CURRENT THREAD STACK =====================
#0  0x42135e1d in ?? ()
#1  0x421364be in cnx_csa_fn_process ()
#2  0x42135063 in ?? ()
#3  0x4213520c in ieee80211_timer_do_process ()
#4  0x4038dbe4 in ppTask ()
#5  0x40388f05 in vPortTaskWrapper (pxCode=0x4038db3c <ppTask>, pvParameters=0x0) at /opt/esp-idf/components/freertos/port/xtensa/port.c:142

let's start with one, we'll see the next ones.

Meanwhile, by tomorrow we should have some results of our heap-checking enhanced firmware.

@KonssnoK
Copy link
Contributor Author

Hello, @zhangyanjiaoesp , @Xiehanxin.

I have a question regarding one other crash

================== CURRENT THREAD REGISTERS ===================
exccause       0x9 (LoadStoreAlignmentCause)
excvaddr       0x4037ea59
epc1           0x420b815b

Apparently a misaligned access to the memory occurs in the WIFI task.

By looking at who is doing the access tho, it is still the allocation.

==================== CURRENT THREAD STACK =====================
#0  insert_free_block (sl=0, fl=24, block=0x3df24894, control=0x3df00014) at /opt/esp-idf/components/heap/heap_tlsf.c:237
#1  block_insert (block=0x3df24894, control=0x3df00014) at /opt/esp-idf/components/heap/heap_tlsf.c:263
#2  block_trim_free (size=<optimized out>, block=<optimized out>, control=0x3df00014) at /opt/esp-idf/components/heap/heap_tlsf.c:377
#3  block_prepare_used (size=<optimized out>, block=<optimized out>, control=0x3df00014) at /opt/esp-idf/components/heap/heap_tlsf.c:454
#4  tlsf_malloc (tlsf=0x3df00014, size=<optimized out>) at /opt/esp-idf/components/heap/heap_tlsf.c:861
#5  0x4038c78c in multi_heap_malloc_impl (heap=0x3df00000, size=24) at /opt/esp-idf/components/heap/multi_heap.c:187
#6  0x4038c944 in multi_heap_malloc (heap=0x3df00000, size=8) at /opt/esp-idf/components/heap/multi_heap_poisoning.c:289
#7  0x403769d0 in heap_caps_malloc_base (size=8, caps=5120) at /opt/esp-idf/components/heap/heap_caps.c:175
#8  0x40376b22 in heap_caps_malloc_prefer (size=8, num=1) at /opt/esp-idf/components/heap/heap_caps.c:290
#9  0x403766c0 in wifi_malloc (size=8) at /opt/esp-idf/components/esp_wifi/esp32s3/esp_adapter.c:71
#10 0x421352aa in ieee80211_timer_process ()
#11 0x421235cc in ?? ()
#12 0x4200a075 in timer_process_alarm (dispatch_method=ESP_TIMER_TASK) at /opt/esp-idf/components/esp_timer/src/esp_timer.c:396
#13 0x4200a0c8 in timer_task (arg=0x0) at /opt/esp-idf/components/esp_timer/src/esp_timer.c:422
#14 0x40388f05 in vPortTaskWrapper (pxCode=0x4200a0b4 <timer_task>, pvParameters=0x0) at /opt/esp-idf/components/freertos/port/xtensa/port.c:142

Could it be that an aligned flag is missing in the wifi function allocating the memory?
Or a non-aligned malloc is never permitted?

Thanks

v3_dump_FFFFAC1754000054_1690791144.txt

@zhangyanjiaoesp
Copy link
Collaborator

@KonssnoK All of these crash issues occurred on the same test version and code?

@KonssnoK
Copy link
Contributor Author

@zhangyanjiaoesp we'll try to check where we call it.
Meanwhile, we got the following

==================== CURRENT THREAD STACK =====================
#0  0x421365d5 in ?? ()
#1  0x42136c86 in cnx_csa_fn_process ()
#2  0x4213581b in ?? ()
#3  0x421359c4 in ieee80211_timer_do_process ()
#4  0x4038d870 in ppTask ()
#5  0x40388aad in vPortTaskWrapper (pxCode=0x4038d7c8 <ppTask>, pvParameters=0x0) at /opt/esp-idf/components/freertos/port/xtensa/port.c:142

v3_dump_FFFFAC1754000A28_1693384425.txt

@zhangyanjiaoesp
Copy link
Collaborator

@zhangyanjiaoesp we'll try to check where we call it. Meanwhile, we got the following

==================== CURRENT THREAD STACK =====================
#0  0x421365d5 in ?? ()
#1  0x42136c86 in cnx_csa_fn_process ()
#2  0x4213581b in ?? ()
#3  0x421359c4 in ieee80211_timer_do_process ()
#4  0x4038d870 in ppTask ()
#5  0x40388aad in vPortTaskWrapper (pxCode=0x4038d7c8 <ppTask>, pvParameters=0x0) at /opt/esp-idf/components/freertos/port/xtensa/port.c:142

v3_dump_FFFFAC1754000A28_1693384425.txt

Do you have the elf file of this crash issue?

@KonssnoK
Copy link
Contributor Author

KonssnoK commented Sep 4, 2023

@zhangyanjiaoesp mmm i could send it to the esp mail

@KonssnoK
Copy link
Contributor Author

KonssnoK commented Sep 5, 2023

@zhangyanjiaoesp who should i reference the mail to?

@zhangyanjiaoesp
Copy link
Collaborator

#11713 (comment)

@KonssnoK This comment has given the contact email.

@KonssnoK
Copy link
Contributor Author

KonssnoK commented Sep 5, 2023

@zhangyanjiaoesp done

@KonssnoK
Copy link
Contributor Author

KonssnoK commented Sep 5, 2023

@zhangyanjiaoesp well.. i had to resend the mail twice because :


Error:               550     5.0.350 One or more of the attachments in your email is of a file type that     is NOT allowed by the recipient's organization.                        Message     rejected by:               TYZPR04MB5654.apcprd04.prod.outlook.com | Error: | 550     5.0.350 One or more of the attachments in your email is of a file type that     is NOT allowed by the recipient's organization. | Message     rejected by: | TYZPR04MB5654.apcprd04.prod.outlook.com
-- | -- | -- | -- | --
Error: | 550     5.0.350 One or more of the attachments in your email is of a file type that     is NOT allowed by the recipient's organization.

how should i attach the files?
i tried a

  • zip -> refused
  • zip renamed txt -> refused
  • no extension -> refused
  • not compressed -> too big
  • 7z -> refused
  • only elf -> too big
  • only dump -> invalid file

@KonssnoK
Copy link
Contributor Author

KonssnoK commented Sep 5, 2023

i can attach the dump here, but not the elf of the firmware

v3_dump_FFFFAC1754000590_1693906362.zip

@zhangyanjiaoesp
Copy link
Collaborator

@zhangyanjiaoesp well.. i had to resend the mail twice because :


Error:               550     5.0.350 One or more of the attachments in your email is of a file type that     is NOT allowed by the recipient's organization.                        Message     rejected by:               TYZPR04MB5654.apcprd04.prod.outlook.com | Error: | 550     5.0.350 One or more of the attachments in your email is of a file type that     is NOT allowed by the recipient's organization. | Message     rejected by: | TYZPR04MB5654.apcprd04.prod.outlook.com
-- | -- | -- | -- | --
Error: | 550     5.0.350 One or more of the attachments in your email is of a file type that     is NOT allowed by the recipient's organization.

how should i attach the files? i tried a

  • zip -> refused
  • zip renamed txt -> refused
  • no extension -> refused
  • not compressed -> too big
  • 7z -> refused
  • only elf -> too big
  • only dump -> invalid file

@KonssnoK I remember you once sent an email to attach elf files, how did it work before? #11713 (comment)

@KonssnoK
Copy link
Contributor Author

KonssnoK commented Sep 8, 2023

@zhangyanjiaoesp yeah, i remember sending an elf too.

I've now sent it directly to Caijin, let's see..

Edit:
no, same issue. Apparently it's impossible to send a zip or an elf to you guys.

@zhangyanjiaoesp
Copy link
Collaborator

@KonssnoK Just received the elf file from my colleague

@zhangyanjiaoesp
Copy link
Collaborator

@KonssnoK

Does this should be .txt file ?
image
And the corresponding .elf file is the firmware_1.1.7.elf, right?
image

I can't open the v3_dump_FFFFAC1754000590_1693906362.elf even I rename it to .txt file.
image

@KonssnoK
Copy link
Contributor Author

@zhangyanjiaoesp we parse the dump file with

python components\espcoredump\espcoredump.py --chip esp32s3 info_corefile --debug 0 --core "./v3_dump_FFFFAC1754000590_1693906362.elf" "./firmware_1.1.7.elf"

in this case i didn't upload the txt one

@zhangyanjiaoesp
Copy link
Collaborator

ok, I will try it.

@zhangyanjiaoesp
Copy link
Collaborator

@KonssnoK

  1. Do you know what the board was doing when the crash happened?
  2. Is the firmware_1.1.7.elf file compiled directly?

@KonssnoK
Copy link
Contributor Author

  1. no, these are boards on the field, we cannot get wifi logs because they do not use the nominal IDF log module (which instead we can remap to our log function)
  2. yes, it's the direct result of the compilation

@zhangyanjiaoesp
Copy link
Collaborator

@KonssnoK
I'm sorry, according to the disassembly results still cannot determine the root cause of the problem. It's hard to debug this issue if we don't know what the board was doing when the crash happened.

@KonssnoK
Copy link
Contributor Author

@zhangyanjiaoesp
yes i can imagine.. i don't know what we can do to face this issue, consider that we currently have this version running on 120 devices and only some have this crash... :\

@arnoutdekimo
Copy link

FYI, I just reported #12261
Given that you guys also have crashes with esp_netif_list_lock corruption, chances are this is related.
The esp_netif_list lock design is just wrong.

@GerryBriggs
Copy link

I'm also getting heap corruption errors using pretty common ADF code - and it's only a problem with some HTTPS audio streams but never with simple HTTP streams. After a few seconds of stream on these bad HTTPS links I will get the error : assert failed: remove_free_block tlsf.c:330 (prev && "prev_free field can not be null") and then the core panic. This is on ESP32C3 so single core. IDF 5.01 environment

@zhangyanjiaoesp
Copy link
Collaborator

I'm also getting heap corruption errors using pretty common ADF code - and it's only a problem with some HTTPS audio streams but never with simple HTTP streams. After a few seconds of stream on these bad HTTPS links I will get the error : assert failed: remove_free_block tlsf.c:330 (prev && "prev_free field can not be null") and then the core panic. This is on ESP32C3 so single core. IDF 5.01 environment

@GerryBriggs Since you are based on ADF and your crash issue is not same as @KonssnoK 's,please create a new ticket to record your problem, thanks.

@KonssnoK
Copy link
Contributor Author

btw since the last lwip memory fix, our crashes considerably reduced.
(we still have few tho)

@GerryBriggs
Copy link

btw since the last lwip memory fix, our crashes considerably reduced. (we still have few tho)

Oh I will give that a try. I didnt know that the LWIP was updated. Most of my code is using IDF libraries not ADF. But the central events process is ADF audio "pipeline". I will start a breakout ticket tonight and post my results with the new LWIP. I will also give a list of HTTPS streams (connections) that cause the heap corruption, and a list of HTTPS streams that do not cause it.

@GerryBriggs
Copy link

OK wow that worked for me. The new LWIP seems to have got rid of my heap corruption core panics on various HTTPS connections. Upgraded to 5.1.2 from 5.0.1. Wow. Thank you

@KonssnoK
Copy link
Contributor Author

OK wow that worked for me. The new LWIP seems to have got rid of my heap corruption core panics on various HTTPS connections. Upgraded to 5.1.2 from 5.0.1. Wow. Thank you

Glad it helped!

@KonssnoK
Copy link
Contributor Author

i will close this for now and open a new one in case we find new issues

@espressif-bot espressif-bot added Status: Done Issue is done internally Resolution: Done Issue is done internally and removed Status: In Progress Work is in progress labels Dec 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Resolution: Done Issue is done internally Status: Done Issue is done internally
Projects
None yet
Development

No branches or pull requests

7 participants