Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[nrf fromlist] ipc: ipc_service: icbmsg backend: workaround endpoint binding deadlock #1676

Conversation

kapi-no
Copy link
Contributor

@kapi-no kapi-no commented Apr 30, 2024

This change works around the issue with the semaphore timeout during
the Bluetooth HCI driver initialization when the bt_enable function
is called in the context of the System Workqueue thread. This issue
only affects platform that use the IPC service and its ICBMsg backend
(e.g. the nRF54H20 DK target).

The bt_enable function, when called in the System Workqueue context,
results in a deadlock, as the waiting semaphore of the Bluetooth HCI
driver times out:

bt_hci_driver: Endpoint binding failed with -11

During the Bluetooth HCI driver open operation in the context of the
bt_enable function, the driver code waits using the semaphore for the
endpoint binding process of the IPC service module to finalize. The
issue occurs when the waiting occurs in the System Workqueue context.
The ICBMsg backend from the IPC service schedules a system work during
the endpoint registration, in which it finalizes the binding operation,
also in the System Workqueue context. As the Bluetooth HCI driver
with its wait operation keeps the System Workqueue context busy, the
endpoint binding cannot be completed by the ICBMsg backend before the
HCI driver semaphore timeout.

Upstream PR: zephyrproject-rtos/zephyr#72377

This change is a temporary workaround solution to streamline the development of applications for the nRF54H20 DK target that use the App Event Manager module and operate in the System Workqueue context (e.g. nRF Desktop application). The proper solution is planned for the Bluetooth initialization process that will replace the current workaround.

Ticket for a more long-term solution:

https://nordicsemi.atlassian.net/browse/NCSDK-27318


dev_data->conf = conf;
dev_data->is_initiator = (conf->rx.blocks_ptr < conf->tx.blocks_ptr);
k_mutex_init(&dev_data->mutex);
k_work_queue_init(&dev_data->ep_bound_work_q);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could keep the workq common among devices? (then we should keep it global and initialize once) Alternatively we would need to define separate stacks for the workqueues (current implementation may lead to really tricky errors I think)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the common workq, we could also easily add a temporary Kconfig to avoid increasing memory usage if not needed (app that needs the workaround would explicitly enable it)

Copy link
Contributor Author

@kapi-no kapi-no May 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we treat it as a temporary solution, I would avoid adding a new Kconfig option. With the new Kconfig option, we expose a new configuration API to the user which may be later on removed and will need to follow the deprecation process.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could add an experimental Kconfig to avoid deprecation process I think (anyway I don't think it's necessary)

@kapi-no kapi-no force-pushed the nrf_desktop_nrf54h20_ipc_sys_workq_workaround branch from 13fbbf0 to a023240 Compare May 6, 2024 08:51
@kapi-no kapi-no requested a review from MarekPieta May 6, 2024 08:51
Copy link
Contributor

@doki-nordic doki-nordic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is acceptable as temporary solution. The long therm solution requires reusing dedicated work queue from the ICMsg module. I will implement it later in the upstream Zephyr.

@kapi-no
Copy link
Contributor Author

kapi-no commented May 6, 2024

This is acceptable as temporary solution. The long therm solution requires reusing dedicated work queue from the ICMsg module. I will implement it later in the upstream Zephyr.

Great, please make sure to revert this change when you bring the proper solution upstream (to NCS)

Copy link
Contributor

@carlescufi carlescufi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not fromlist?

@kapi-no
Copy link
Contributor Author

kapi-no commented May 6, 2024

why not fromlist?

Upstream PR is available here:

zephyrproject-rtos/zephyr#72377

I will soon adapt the commit description in this PR

@hubertmis
Copy link
Contributor

This is acceptable as temporary solution. The long therm solution requires reusing dedicated work queue from the ICMsg module. I will implement it later in the upstream Zephyr.

I don't think the ICMsg module should expose workqueues. It's not the responsibility of the ICMsg module. ICMsg should be about passing messages between processors.
If we want to have a shared IPC workqueue, we should extract it from ICMsg and create a new module like ipc_workqueue or icmsg_workqueue used by both ICMsg and icbmsg, and potentially other IPC users. We would need to clearly define when blocking of this workqueue is permitted to avoid deadlocks like the one being fixed here.

@doki-nordic
Copy link
Contributor

I don't think the ICMsg module should expose workqueues. It's not the responsibility of the ICMsg module. ICMsg should be about passing messages between processors. If we want to have a shared IPC workqueue, we should extract it from ICMsg and create a new module like ipc_workqueue or icmsg_workqueue used by both ICMsg and icbmsg, and potentially other IPC users. We would need to clearly define when blocking of this workqueue is permitted to avoid deadlocks like the one being fixed here.

I agree that taking work queue into some common module is the best approach. Looking at the current implementation of ICMsg and ICBMsg, we should assume that this new shared work queue cannot be blocked.

With that assumption in mind, the "bounded" callback of ipc_service becomes problematic. If we call it from our new work queue, we must clearly state that user cannot block in the "bounded" callback of ipc_service. If we redirect the callback to system work queue, we will end up with the same dead lock, because "bounded" callback from ipc_hci cannot be called from system work queue as kapi-no pointed out at the beginning of this PR.

@kapi-no kapi-no force-pushed the nrf_desktop_nrf54h20_ipc_sys_workq_workaround branch from a023240 to ad0fa2f Compare May 8, 2024 14:50
@kapi-no kapi-no changed the title [nrf noup] subsys: ipc: workaround endpoint register thread deadlock [nrf fromlist] ipc: ipc_service: icbmsg backend: workaround endpoint binding deadlock May 8, 2024
…binding deadlock

This change works around the issue with the semaphore timeout during
the Bluetooth HCI driver initialization when the bt_enable function
is called in the context of the System Workqueue thread. This issue
only affects platform that use the IPC service and its ICBMsg backend
(e.g. the nRF54H20 DK target).

The bt_enable function, when called in the System Workqueue context,
results in a deadlock, as the waiting semaphore of the Bluetooth HCI
driver times out:

bt_hci_driver: Endpoint binding failed with -11

During the Bluetooth HCI driver open operation in the context of the
bt_enable function, the driver code waits using the semaphore for the
endpoint binding process of the IPC service module to finalize. The
issue occurs when the  waiting occurs in the System Workqueue context.
The ICBMsg backend from the IPC service schedules a system work during
the endpoint registration, in which it finalizes the binding operation
- also in the System Workqueue context. As the Bluetooth HCI driver
with its wait operation keeps the System Workqueue context busy, the
endpoint binding cannot be completed by the ICBMsg backend before the
HCI driver semaphore timeout.

Upstream PR: zephyrproject-rtos/zephyr#72377

Signed-off-by: Kamil Piszczek <[email protected]>
@kapi-no kapi-no force-pushed the nrf_desktop_nrf54h20_ipc_sys_workq_workaround branch from ad0fa2f to 9388fd0 Compare May 8, 2024 14:55
@kapi-no kapi-no requested a review from carlescufi May 8, 2024 17:38
@carlescufi carlescufi merged commit 5c19b37 into nrfconnect:main May 8, 2024
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants