[nrf fromlist] ipc: ipc_service: icbmsg backend: workaround endpoint binding deadlock #1676

kapi-no · 2024-04-30T13:07:07Z

This change works around a semaphore timeout during Bluetooth HCI
driver initialization when the bt_enable function is called in the
context of the System Workqueue thread. The issue only affects
platforms that use the IPC service with its ICBMsg backend
(e.g. the nRF54H20 DK target).

The bt_enable function, when called in the System Workqueue context,
results in a deadlock, as the waiting semaphore of the Bluetooth HCI
driver times out:

bt_hci_driver: Endpoint binding failed with -11

During the Bluetooth HCI driver open operation, which runs as part of
the bt_enable function, the driver waits on a semaphore for the IPC
service to finish binding the endpoint. The issue occurs when this wait
happens in the System Workqueue context. During endpoint registration,
the ICBMsg backend of the IPC service submits a work item to the System
Workqueue, and that work item finalizes the binding operation. As the
waiting Bluetooth HCI driver keeps the System Workqueue context busy,
the ICBMsg backend cannot complete the endpoint binding before the HCI
driver semaphore times out.
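
The mechanism above can be modeled in portable C without any Zephyr dependency. The sketch below is an illustration only, not the backend's code: `struct work_queue`, `run_one`, and `open_from_sys_workq` are hypothetical names, a boolean stands in for the driver's semaphore, and a tick loop stands in for the semaphore timeout. The `-11` in the log is consistent with `-EAGAIN`, which is what a timed-out semaphore wait reports in Zephyr. The second mode models handing the binding work to a separate queue, along the lines of the `ep_bound_work_q` that appears in this PR's diff.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_DEPTH 4 /* hypothetical capacity, enough for this model */

typedef void (*work_fn)(void);

/* Run-to-completion queue: its thread executes one handler at a time. */
struct work_queue {
	work_fn pending[QUEUE_DEPTH];
	size_t count;
	bool busy; /* true while the queue's thread is inside a handler */
};

static bool ep_bound; /* stands in for the HCI driver's semaphore */

/* Work item that finalizes the endpoint binding. */
static void finish_binding(void)
{
	ep_bound = true;
}

static void submit(struct work_queue *q, work_fn fn)
{
	assert(q->count < QUEUE_DEPTH);
	q->pending[q->count++] = fn;
}

/* One scheduling step: an idle queue thread runs its oldest item. */
static void run_one(struct work_queue *q)
{
	if (q->busy || q->count == 0) {
		return;
	}
	work_fn fn = q->pending[0];
	for (size_t i = 1; i < q->count; i++) {
		q->pending[i - 1] = q->pending[i];
	}
	q->count--;
	q->busy = true;
	fn();
	q->busy = false;
}

/*
 * Models the driver open path when it runs inside a system work queue
 * handler. Returns 0 once the endpoint is bound, -EAGAIN on timeout.
 */
int open_from_sys_workq(bool use_dedicated_queue, int timeout_ticks)
{
	struct work_queue sys_q = { .busy = true }; /* we are in its handler */
	struct work_queue bind_q = { .busy = false };

	ep_bound = false;
	submit(use_dedicated_queue ? &bind_q : &sys_q, finish_binding);

	/* The wait: other queue threads may run, the blocked one may not. */
	for (int t = 0; t < timeout_ticks; t++) {
		run_one(&sys_q);
		run_one(&bind_q);
		if (ep_bound) {
			return 0;
		}
	}
	return -EAGAIN;
}
```

In this model, `open_from_sys_workq(false, 10)` times out because the only thread that could run `finish_binding` is the one that is waiting, while `open_from_sys_workq(true, 10)` succeeds on the first tick.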

Upstream PR: zephyrproject-rtos/zephyr#72377

This change is a temporary workaround to streamline the development of applications for the nRF54H20 DK target that use the App Event Manager module and operate in the System Workqueue context (e.g. the nRF Desktop application). A proper solution is planned for the Bluetooth initialization process; it will replace the current workaround.

Ticket for a more long-term solution:

https://nordicsemi.atlassian.net/browse/NCSDK-27318

subsys/ipc/ipc_service/backends/ipc_icbmsg.c

MarekPieta · 2024-05-06T06:53:44Z

subsys/ipc/ipc_service/backends/ipc_icbmsg.c


 	dev_data->conf = conf;
 	dev_data->is_initiator = (conf->rx.blocks_ptr < conf->tx.blocks_ptr);
 	k_mutex_init(&dev_data->mutex);
+	k_work_queue_init(&dev_data->ep_bound_work_q);


Maybe we could keep the workq common among devices? (Then we should keep it global and initialize it once.) Alternatively, we would need to define separate stacks for the workqueues (I think the current implementation may lead to really tricky errors).

With the common workq, we could also easily add a temporary Kconfig option to avoid increasing memory usage when it is not needed (an app that needs the workaround would explicitly enable it).

If we treat it as a temporary solution, I would avoid adding a new Kconfig option. With a new Kconfig option, we expose a new configuration API to the user, which may later be removed and would then need to follow the deprecation process.

I think you could add an experimental Kconfig option to avoid the deprecation process (anyway, I don't think it's necessary).
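
The "keep it global and initialize once" option from this thread can be sketched in portable C as follows. This is a simplified model under stated assumptions, not the code that was merged: `struct work_queue`, `struct backend_data`, `backend_init`, and `BIND_STACK_SIZE` are hypothetical stand-ins, and the guard is not thread-safe, so it assumes that backend instances are initialized sequentially.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BIND_STACK_SIZE 1024 /* hypothetical stack size */

/* Minimal stand-in for a work queue that owns a thread stack. */
struct work_queue {
	unsigned char *stack;
	size_t stack_size;
	bool started;
};

/* One stack and one queue shared by every backend instance. */
static unsigned char bind_stack[BIND_STACK_SIZE];
static struct work_queue ep_bound_work_q;
static int queue_start_count; /* instrumentation for this sketch only */

/* Per-device data keeps only a reference to the shared queue. */
struct backend_data {
	struct work_queue *bind_q;
};

/* Start the shared queue on first use; later calls are no-ops. */
static void ensure_bind_queue_started(void)
{
	if (ep_bound_work_q.started) {
		return;
	}
	ep_bound_work_q.stack = bind_stack;
	ep_bound_work_q.stack_size = sizeof(bind_stack);
	ep_bound_work_q.started = true;
	queue_start_count++;
}

void backend_init(struct backend_data *dev)
{
	assert(dev != NULL);
	ensure_bind_queue_started();
	dev->bind_q = &ep_bound_work_q;
}

int bind_queue_start_count(void)
{
	return queue_start_count;
}
```

With this layout, two backend instances end up pointing at the same queue, and only one stack is allocated regardless of how many devices are initialized.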

subsys/ipc/ipc_service/backends/ipc_icbmsg.c

doki-nordic

This is acceptable as a temporary solution. The long-term solution requires reusing the dedicated work queue from the ICMsg module. I will implement it later in upstream Zephyr.

kapi-no · 2024-05-06T12:43:16Z

This is acceptable as a temporary solution. The long-term solution requires reusing the dedicated work queue from the ICMsg module. I will implement it later in upstream Zephyr.

Great, please make sure to revert this change when you bring the proper solution upstream (to NCS)

carlescufi

why not fromlist?

kapi-no · 2024-05-06T14:49:40Z

why not fromlist?

Upstream PR is available here:

zephyrproject-rtos/zephyr#72377

I will soon adapt the commit description in this PR

hubertmis · 2024-05-07T06:08:52Z

This is acceptable as a temporary solution. The long-term solution requires reusing the dedicated work queue from the ICMsg module. I will implement it later in upstream Zephyr.

I don't think the ICMsg module should expose workqueues. It's not the responsibility of the ICMsg module. ICMsg should be about passing messages between processors.
If we want to have a shared IPC workqueue, we should extract it from ICMsg and create a new module like ipc_workqueue or icmsg_workqueue used by both ICMsg and icbmsg, and potentially other IPC users. We would need to clearly define when blocking of this workqueue is permitted to avoid deadlocks like the one being fixed here.

doki-nordic · 2024-05-08T10:52:13Z

I don't think the ICMsg module should expose workqueues. It's not the responsibility of the ICMsg module. ICMsg should be about passing messages between processors. If we want to have a shared IPC workqueue, we should extract it from ICMsg and create a new module like ipc_workqueue or icmsg_workqueue used by both ICMsg and icbmsg, and potentially other IPC users. We would need to clearly define when blocking of this workqueue is permitted to avoid deadlocks like the one being fixed here.

I agree that moving the work queue into a common module is the best approach. Looking at the current implementation of ICMsg and ICBMsg, we should assume that this new shared work queue cannot be blocked.

With that assumption in mind, the "bound" callback of ipc_service becomes problematic. If we call it from our new work queue, we must clearly state that the user cannot block in the "bound" callback of ipc_service. If we redirect the callback to the system work queue, we will end up with the same deadlock, because the "bound" callback from ipc_hci cannot be called from the system work queue, as kapi-no pointed out at the beginning of this PR.

…binding deadlock

This change works around the issue with the semaphore timeout during
the Bluetooth HCI driver initialization when the bt_enable function
is called in the context of the System Workqueue thread. This issue
only affects platforms that use the IPC service and its ICBMsg backend
(e.g. the nRF54H20 DK target).

The bt_enable function, when called in the System Workqueue context,
results in a deadlock, as the waiting semaphore of the Bluetooth HCI
driver times out:

bt_hci_driver: Endpoint binding failed with -11

During the Bluetooth HCI driver open operation in the context of the
bt_enable function, the driver code waits using the semaphore for the
endpoint binding process of the IPC service module to finalize. The
issue occurs when the waiting occurs in the System Workqueue context.
The ICBMsg backend from the IPC service schedules a system work during
the endpoint registration, in which it finalizes the binding operation,
also in the System Workqueue context. As the Bluetooth HCI driver
with its wait operation keeps the System Workqueue context busy, the
endpoint binding cannot be completed by the ICBMsg backend before the
HCI driver semaphore timeout.

Upstream PR: zephyrproject-rtos/zephyr#72377

Signed-off-by: Kamil Piszczek <[email protected]>

kapi-no requested review from pdunaj, MarekPieta, zycz, doki-nordic and alstrzebonski April 30, 2024 13:07

NordicBuilder added the area: IPC label Apr 30, 2024

kapi-no mentioned this pull request Apr 30, 2024

Desktop nrf54h nrfconnect/sdk-nrf#15080

Merged

NordicBuilder mentioned this pull request Apr 30, 2024

manifest: update zephyr to pull in ipc workaround nrfconnect/sdk-nrf#15142

Merged

MarekPieta requested changes May 6, 2024

kapi-no force-pushed the nrf_desktop_nrf54h20_ipc_sys_workq_workaround branch from 13fbbf0 to a023240 Compare May 6, 2024 08:51

kapi-no requested a review from MarekPieta May 6, 2024 08:51

alstrzebonski approved these changes May 6, 2024

doki-nordic approved these changes May 6, 2024

zycz approved these changes May 6, 2024

MarekPieta approved these changes May 6, 2024

carlescufi requested changes May 6, 2024

kapi-no mentioned this pull request May 7, 2024

IPC service: ICBMsg backend: workaround for initialization in the System Workqueue context zephyrproject-rtos/zephyr#72377

Merged

kapi-no force-pushed the nrf_desktop_nrf54h20_ipc_sys_workq_workaround branch from a023240 to ad0fa2f Compare May 8, 2024 14:50

kapi-no changed the title ~~[nrf noup] subsys: ipc: workaround endpoint register thread deadlock~~ [nrf fromlist] ipc: ipc_service: icbmsg backend: workaround endpoint binding deadlock May 8, 2024

kapi-no force-pushed the nrf_desktop_nrf54h20_ipc_sys_workq_workaround branch from ad0fa2f to 9388fd0 Compare May 8, 2024 14:55

kapi-no requested a review from carlescufi May 8, 2024 17:38

carlescufi approved these changes May 8, 2024

carlescufi merged commit 5c19b37 into nrfconnect:main May 8, 2024
15 checks passed
