bootutil: Fix device bricked after power failure during swap-move revert #2100

Merged on Dec 11, 2024 (1 commit)

taltenbach (Contributor) commented Oct 16, 2024

This PR proposes a fix for #1966, which describes a scenario where a device can be bricked if a revert is interrupted when using swap-move.

As suggested in this message, a very straightforward fix might be enough, and this PR implements it.

The idea is to perform a revert regardless of the state of the magic number in the secondary slot's trailer, provided the copy-done flag is set in the primary slot but the image-ok flag is not. The copy-done flag is set only after an upgrade or revert has completed, so if copy-done is set but image-ok is unset, an upgrade is guaranteed to have been performed without the new image having been confirmed, which implies a revert is needed.
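
The rule above can be sketched as a pair of predicates. This is a minimal illustration under stated assumptions, not MCUboot's actual code: the `slot_trailer` struct, the enum names, and both functions are hypothetical, and the real swap-type table may check further fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified trailer model (not MCUboot's definitions). */
enum magic_state { MAGIC_GOOD, MAGIC_BAD, MAGIC_UNSET };
enum flag_state  { FLAG_SET, FLAG_UNSET };

struct slot_trailer {
    enum magic_state magic;
    enum flag_state  copy_done;
    enum flag_state  image_ok;
};

/* Rule before the fix: the secondary slot's magic must be unset. */
static bool revert_needed_old(const struct slot_trailer *primary,
                              const struct slot_trailer *secondary)
{
    assert(primary != NULL && secondary != NULL);
    return primary->copy_done == FLAG_SET &&
           primary->image_ok == FLAG_UNSET &&
           secondary->magic == MAGIC_UNSET;
}

/* Rule after the fix: the secondary slot's magic is ignored. */
static bool revert_needed_new(const struct slot_trailer *primary)
{
    assert(primary != NULL);
    return primary->copy_done == FLAG_SET &&
           primary->image_ok == FLAG_UNSET;
}
```

For a primary slot with copy-done set and image-ok unset, and a secondary slot whose magic is bad, `revert_needed_old` returns false while `revert_needed_new` returns true.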

That looks good to me, but perhaps I missed some corner cases that would justify using BOOT_MAGIC_UNSET instead of BOOT_MAGIC_ANY. @utzig @d3zd3z do you have any input on that point?

Fixes #1966

taltenbach changed the title from "bootutil: Fix device brick after power failure during swap-move revert" to "bootutil: Fix device bricked after power failure during swap-move revert" on Oct 16, 2024
de-nordic added the bug label on Oct 25, 2024
d3zd3z (Member) commented Oct 29, 2024

My only request would be to see if we can come up with a test for the simulator that provokes this failure. The simulator should be interrupting between each flash operation, although because that is really slow, it might only just be doing it randomly.

Otherwise, the fix seems reasonable to me.
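
The "interrupt between each flash operation" idea can be illustrated with a toy model. This is only a sketch and not the MCUboot simulator's code: `sim_flash`, `sim_write`, `update`, and `count_unrecovered` are hypothetical names, and the operation under test is a trivial resumable copy rather than a real swap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define FLASH_SIZE 8

/* Toy flash that counts operations and can cut power at a chosen one. */
struct sim_flash {
    unsigned char data[FLASH_SIZE];
    int ops_done;   /* flash operations completed so far */
    int fail_at;    /* operation index at which power is cut, -1 = never */
};

static void sim_reset(struct sim_flash *f, int fail_at)
{
    memset(f->data, 0xFF, FLASH_SIZE);  /* erased state */
    f->ops_done = 0;
    f->fail_at = fail_at;
}

/* Returns false if power is cut before the write takes effect. */
static bool sim_write(struct sim_flash *f, size_t off, unsigned char v)
{
    assert(off < FLASH_SIZE);
    if (f->ops_done == f->fail_at) {
        return false;
    }
    f->data[off] = v;
    f->ops_done++;
    return true;
}

/* Hypothetical operation under test: a copy that skips bytes already written. */
static bool update(struct sim_flash *f, const unsigned char *src)
{
    for (size_t i = 0; i < FLASH_SIZE; i++) {
        if (f->data[i] == src[i]) {
            continue;
        }
        if (!sim_write(f, i, src[i])) {
            return false;
        }
    }
    return true;
}

/* Interrupts before every operation in turn, "reboots", and checks recovery.
 * Returns the number of interruption points that did not recover. */
static int count_unrecovered(const unsigned char *src)
{
    struct sim_flash f;
    int total, bad = 0;

    sim_reset(&f, -1);
    (void)update(&f, src);      /* uninterrupted run, to count operations */
    total = f.ops_done;

    for (int k = 0; k < total; k++) {
        sim_reset(&f, k);
        (void)update(&f, src);  /* power is cut before operation k */
        f.fail_at = -1;         /* reboot: no further failures */
        if (!update(&f, src) || memcmp(f.data, src, FLASH_SIZE) != 0) {
            bad++;
        }
    }
    return bad;
}
```

Testing every interruption point costs one full run per flash operation, which is why a slow test suite might fall back to sampling the points randomly.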

de-nordic (Collaborator) commented

We need a documentation update for that, because the state description there will no longer match the code.

Let's suppose after an upgrade you have a non-functional image in the
primary slot. The image won't be confirmed, leading to a revert at next
boot. At the beginning of the revert process, fixup_revert is invoked,
which rewrites the trailer in the secondary slot so that the revert
looks like a permanent upgrade. Normally, after the execution of this
routine, the secondary slot has a valid trailer, in particular with a
valid magic number.

Let's imagine a power failure occurs during the writing of the trailer's
magic, i.e. in boot_write_magic. The magic number in the secondary slot
is in an undefined state and might be partially written, which implies
at next boot it will be considered in BOOT_MAGIC_BAD state.

So, at next boot, we have the following state:
Primary slot: magic=good, copy-done=set, image-ok=unset
Secondary slot: magic=bad, copy-done=unset, image-ok=set

This doesn't match any state that triggers an upgrade or revert, which
means MCUboot will not perform the revert and will instead attempt to
boot from the primary slot, which contains a non-functional image.
Hence, the device is bricked unless the secondary slot can be reflashed
without relying on a functional image.

To avoid this issue, a revert is performed no matter the state of the
magic number in the secondary slot's trailer, provided the copy-done
flag is set in the primary slot but the image-ok flag is not. The
copy-done flag is set only after having completed an upgrade or
revert process so if the copy-done flag is set but the image-ok is
unset, it is guaranteed an upgrade has been performed but the new image
has not been confirmed, which implies a revert is needed.

Signed-off-by: Thomas Altenbach <[email protected]>
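
To illustrate why the interrupted write described in the commit message leaves the magic in the "bad" state, here is a minimal sketch. It assumes erased flash reads as 0xFF and that a magic counts as good when it matches the expected value, unset when fully erased, and bad otherwise. The 16-byte placeholder value and the function names are hypothetical, not MCUboot's.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAGIC_SZ 16

enum magic_state { MAGIC_GOOD, MAGIC_BAD, MAGIC_UNSET };

/* Placeholder magic value for illustration only. */
static const unsigned char expected_magic[MAGIC_SZ] = {
    0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08,
    0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10,
};

/* Classifies the raw bytes read back from a trailer's magic field. */
static enum magic_state classify_magic(const unsigned char *buf)
{
    assert(buf != NULL);
    if (memcmp(buf, expected_magic, MAGIC_SZ) == 0) {
        return MAGIC_GOOD;
    }
    for (size_t i = 0; i < MAGIC_SZ; i++) {
        if (buf[i] != 0xFF) {
            return MAGIC_BAD;   /* neither erased nor the expected value */
        }
    }
    return MAGIC_UNSET;
}

/* Models a power cut after `written` bytes were programmed into an
 * erased magic field, then classifies what the next boot would read. */
static enum magic_state magic_after_partial_write(size_t written)
{
    unsigned char buf[MAGIC_SZ];

    assert(written <= MAGIC_SZ);
    memset(buf, 0xFF, sizeof(buf));
    memcpy(buf, expected_magic, written);
    return classify_magic(buf);
}
```

In this model, any interruption that lands strictly between the first and last byte yields `MAGIC_BAD`, which is the secondary-slot state listed in the commit message.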
taltenbach force-pushed the fix/swap-move-revert-brick branch from 0d58e4a to bf93d4e on November 2, 2024 19:32
taltenbach (Contributor, Author) commented Nov 2, 2024

> My only request would be to see if we can come up with a test for the simulator that provokes this failure. The simulator should be interrupting between each flash operation, although because that is really slow, it might only just be doing it randomly.

@d3zd3z I created an issue (#2108) on that point, with some ideas we can discuss. I will try to find time to implement a solution in the coming weeks.

> We need a documentation update for that, because the state description there will no longer match the code.

@de-nordic You're right, it should be better now :)

nordicjm merged commit 4f39356 into mcu-tools:main on Dec 11, 2024
58 checks passed
Development

Successfully merging this pull request may close these issues.

swap-move: Power failure during the writing of the magic could brick the device
4 participants