-
Notifications
You must be signed in to change notification settings - Fork 644
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[nrf fromlist] ipc: ipc_service: icbmsg backend: workaround endpoint binding deadlock #1676
[nrf fromlist] ipc: ipc_service: icbmsg backend: workaround endpoint binding deadlock #1676
Conversation
|
||
dev_data->conf = conf; | ||
dev_data->is_initiator = (conf->rx.blocks_ptr < conf->tx.blocks_ptr); | ||
k_mutex_init(&dev_data->mutex); | ||
k_work_queue_init(&dev_data->ep_bound_work_q); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could keep the workq common among devices? (then we should keep it global and initialize once) Alternatively we would need to define separate stacks for the workqueues (current implementation may lead to really tricky errors I think)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the common workq, we could also easily add a temporary Kconfig to avoid increasing memory usage if not needed (app that needs the workaround would explicitly enable it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we treat it as a temporary solution, I would avoid adding a new Kconfig option. With the new Kconfig option, we expose a new configuration API to the user which may be later on removed and will need to follow the deprecation process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could add an experimental Kconfig to avoid deprecation process I think (anyway I don't think it's necessary)
13fbbf0
to
a023240
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is acceptable as temporary solution. The long therm solution requires reusing dedicated work queue from the ICMsg module. I will implement it later in the upstream Zephyr.
Great, please make sure to revert this change when you bring the proper solution upstream (to NCS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not fromlist
?
Upstream PR is available here: zephyrproject-rtos/zephyr#72377 I will soon adapt the commit description in this PR |
I don't think the ICMsg module should expose workqueues. It's not the responsibility of the ICMsg module. ICMsg should be about passing messages between processors. |
I agree that taking work queue into some common module is the best approach. Looking at the current implementation of ICMsg and ICBMsg, we should assume that this new shared work queue cannot be blocked. With that assumption in mind, the "bounded" callback of ipc_service becomes problematic. If we call it from our new work queue, we must clearly state that user cannot block in the "bounded" callback of ipc_service. If we redirect the callback to system work queue, we will end up with the same dead lock, because "bounded" callback from ipc_hci cannot be called from system work queue as kapi-no pointed out at the beginning of this PR. |
a023240
to
ad0fa2f
Compare
…binding deadlock This change works around the issue with the semaphore timeout during the Bluetooth HCI driver initialization when the bt_enable function is called in the context of the System Workqueue thread. This issue only affects platform that use the IPC service and its ICBMsg backend (e.g. the nRF54H20 DK target). The bt_enable function, when called in the System Workqueue context, results in a deadlock, as the waiting semaphore of the Bluetooth HCI driver times out: bt_hci_driver: Endpoint binding failed with -11 During the Bluetooth HCI driver open operation in the context of the bt_enable function, the driver code waits using the semaphore for the endpoint binding process of the IPC service module to finalize. The issue occurs when the waiting occurs in the System Workqueue context. The ICBMsg backend from the IPC service schedules a system work during the endpoint registration, in which it finalizes the binding operation - also in the System Workqueue context. As the Bluetooth HCI driver with its wait operation keeps the System Workqueue context busy, the endpoint binding cannot be completed by the ICBMsg backend before the HCI driver semaphore timeout. Upstream PR: zephyrproject-rtos/zephyr#72377 Signed-off-by: Kamil Piszczek <[email protected]>
ad0fa2f
to
9388fd0
Compare
This change works around the issue with the semaphore timeout during
the Bluetooth HCI driver initialization when the bt_enable function
is called in the context of the System Workqueue thread. This issue
only affects platform that use the IPC service and its ICBMsg backend
(e.g. the nRF54H20 DK target).
The bt_enable function, when called in the System Workqueue context,
results in a deadlock, as the waiting semaphore of the Bluetooth HCI
driver times out:
bt_hci_driver: Endpoint binding failed with -11
During the Bluetooth HCI driver open operation in the context of the
bt_enable function, the driver code waits using the semaphore for the
endpoint binding process of the IPC service module to finalize. The
issue occurs when the waiting occurs in the System Workqueue context.
The ICBMsg backend from the IPC service schedules a system work during
the endpoint registration, in which it finalizes the binding operation,
also in the System Workqueue context. As the Bluetooth HCI driver
with its wait operation keeps the System Workqueue context busy, the
endpoint binding cannot be completed by the ICBMsg backend before the
HCI driver semaphore timeout.
Upstream PR: zephyrproject-rtos/zephyr#72377
This change is a temporary workaround solution to streamline the development of applications for the nRF54H20 DK target that use the App Event Manager module and operate in the System Workqueue context (e.g. nRF Desktop application). The proper solution is planned for the Bluetooth initialization process that will replace the current workaround.
Ticket for a more long-term solution:
https://nordicsemi.atlassian.net/browse/NCSDK-27318