Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New OVERWRITE_ONLY_KEEP_BACKUP option #1906

Merged

Conversation

PetrBuchtaNXP
Copy link
Contributor

I needed to create s solution based on OVERWRITE_ONLY mode that would keep a backup of the active image in the secondary slot. The main motivation was a use of MCU that doesn't support available swap modes due to its flash features. The code modification was very simple, however, I'm not 100% sure these changes don't clash with something else.

I needed to cover these scenarios:

  • Both primary and secondary images are valid -> boot
  • Primary image invalid -> do a recovery from the backup in the secondary slot -> boot
  • Secondary image invalid -> do a backup recovery from the primary slot -> boot

The commit message explains further:

It builds on top of OVERWRITE_ONLY mode and uses secondary slot as a backup of the primary slot. The main difference is that after image copy to the primary slot the secondary slot is NOT erased. This is meant to be used together with BOOTSTRAP option that will recover the primary image from the backup in case it's not valid. The backup is checked on every boot. In case the backup is invalid the image from the primary slot is used to create a valid backup again.

@de-nordic de-nordic requested review from d3zd3z and nordicjm March 1, 2024 13:36
@de-nordic
Copy link
Collaborator

Why would backup get damaged?

@PetrBuchtaNXP
Copy link
Contributor Author

Why would backup get damaged?

For the same reason the main image gets damaged. I can think of several reasons:

  • flash bits get flipped, that's why we need ECC flash when required, not likely but possible, it surely happens in real life
  • bug in the running app can erase/damage the backup in the secondary slot
  • failed download of new image that may not repeat in the near future will leave the backup invalid

@PetrBuchtaNXP
Copy link
Contributor Author

Maybe the backup recovery routine should be under another config. I can image leaving the image in the secondary slot as a backup without the backup health checks would be enough in majority of scenarios. The backup health check and recovery only when desired. So it would be split under two defines: OVERWRITE_ONLY_KEEP_BACKUP + OVERWRITE_ONLY_BACKUP_RECOVERY

@de-nordic
Copy link
Collaborator

Why would backup get damaged?

For the same reason the main image gets damaged. I can think of several reasons:

* flash bits get flipped, that's why we need ECC flash when required, not likely but possible, it surely happens in real life

The same reason MCUboot can get damaged?

* bug in the running app can erase/damage the backup in the secondary slot

Unless you have MMU that can block write of the MCUboot and primary, it can brick your device; why would such bug select secondary, chance is that any can be picked.

* failed download of new image that may not repeat in the near future will leave the backup invalid

That actually invalidates implementation of an app feature that would be able to continue download after failed attempt. But, I guess, there is no chance to continue anything because the device you use requires erase by large blocks, so you have to basically start from the beginning.

This is also problematic because there is not check for reason of failure, which means that if the flash is dying and flickering for some other reason, this may keep updating backup on every start.

I am not sure whether this fixes anything.

@d3zd3z @nordicjm What do you think?

@PetrBuchtaNXP
Copy link
Contributor Author

Let's say that the part with the backup health check & recovery is questionable.

What about the part that prevents erasing the header of the secondary image after it has been copied to the primary slot, so it can stay as a backup. Does this make sense to you?

@nordicjm
Copy link
Collaborator

nordicjm commented Mar 4, 2024

Let's say that the part with the backup health check & recovery is questionable.

What about the part that prevents erasing the header of the secondary image after it has been copied to the primary slot, so it can stay as a backup. Does this make sense to you?

Not really no, because it won't be used unless it is marked for test/confirmation

@PetrBuchtaNXP
Copy link
Contributor Author

Let's say that the part with the backup health check & recovery is questionable.
What about the part that prevents erasing the header of the secondary image after it has been copied to the primary slot, so it can stay as a backup. Does this make sense to you?

Not really no, because it won't be used unless it is marked for test/confirmation

It will be used by the BOOTSTRAP mechanism (I described it above) if the primary image is invalid. There's no need to mark it. The BOOTSTRAP option takes whatever is valid in the secondary slot, the backup in this case.

@nordicjm
Copy link
Collaborator

nordicjm commented Mar 5, 2024

I vaguely remember something about the bootstrap mechanism allowing bypass of downgrades but might be wrong on that @d3zd3z might know. The whole point still stand though of if your flash gets corrupted in slot0 and it's before the lifespan of the device, then you can't trust any of the flash, and if it's after the lifespan of the device then the flash cells are gone in slot0 so no matter how many times you attempt to write the image to it, the image data will not be there.

@PetrBuchtaNXP
Copy link
Contributor Author

There are situations when flash bits get flipped in a healthy flash. I personally think that in a situation when the primary image is corrupted then to try at least something is better than to simply give up. Of course I can end up in a loop trying to recover the image from its backup but the result is effectively the same - dead device.

@nordicjm
Copy link
Collaborator

nordicjm commented Mar 6, 2024

There are situations when flash bits get flipped in a healthy flash

In NOR flash? I can understand this in NAND flash technology but not NOR. And if there is a read error in NAND flash, that can occur anywhere in the flash, be it the bootloader slot, slot 0 or slot 1 which is why MCUs that use NAND flash (which in my opinion is just wrong from the get go, it is the wrong technology) employ ECC to account for those issues.

I have heard of this before only in one vendor's chips from someone using them who was very unhappy and they had to fork MCUboot to add "hacks" to deal with the chips losing data and that was atmel

@PetrBuchtaNXP
Copy link
Contributor Author

PetrBuchtaNXP commented Mar 6, 2024

There are NOR flash memories with ECC so I take it it's there for some reason. However, I hear your arguments.

I created this patch for LPC55s69 that has sectors of the same size as pages, 512 bytes, which is also the minimal write size. Since it has ECC there's no chance for multiple writes in a single page after it had been erased That's why I cannot use the SWAP mode because the algorithm simply doesn't support flashes like this one. I'm fine with OVERWRITE_ONLY mode but there's a typical requirement from customers that they need to have a backup. That's why I wanted to prevent the erasing of the header after copying the image to the primary slot - to leave it there as a backup.

I'm aware that in the SWAP modes once the image in the primary slot is confirmed the image in the secondary slot is no longer considered as a backup, but together with the BOOTSTRAP option it can be used this way. With this PR I wanted the same behavior also for the OVERWRITE_ONLY mode.

I agree that the backup recovery routine may be redundant, but I still see a certain benefit in having a chance to decide what will happen with the image in the secondary slot once it's copied to the primary slot - to decide if I want to keep it there or have it erased.

@de-nordic
Copy link
Collaborator

de-nordic commented Mar 6, 2024

There are NOR flash memories with ECC so I take it it's there for some reason. However, I hear your arguments.

Yest. that is actually true, and other type of storage also does that.

LPC55s69

Oh, So you have basically a "block device".

I was researching possibility of doing swap algorithm for such devices but did not have time to follow with it.
My idea was to use some blocks, that in swap are used for storing swap record, to store array of CRC32 calculated for all previous blocks, so if you have N block slot, then you would have array of CRC32(0, 0), CRC32(0, 1), CRC32(0, 2) ... CRC32(0, N-1).
When swap starts you calculate such list to the swap status area, then you follow with swap-move algorithm, If anything breaks you start with looking at swap status and start calculating the crc, once you hit a difference you know that this block has not been yet moved so you replace it with the one from image you are swapping in and proceed with swapping. The position in CRC array would show you how far you did go with the swap.
I was still weighting whether you need two arrays, one for primary and one for secondary image, but I guess one would be enough.
If you find what I have written sane, maybe you could help with making this happen.

As you probably understand the difference between what I have described and what we currently have is that in above solution we calculate and store the outcome we expect, in current solution we mark how far we have already gone.

I guess this may be better for your customer as it actually makes the MCUboot work with what you offer.

And the CRC32 is just an idea, probably some better sum could be used.

@PetrBuchtaNXP
Copy link
Contributor Author

If you find what I have written sane, maybe you could help with making this happen.

Yes I was actually thinking about the same thing. Be it CRC32 or SHAxyz, the basic idea it still the same. It only got as far as a simple concept in my head because I have the same problem - not enough time. I might soon have some time to create at least some basic proof of concept so I could then share it with you for a discussion.
However, another problem with "block devices", as you described it, might be that 512 bytes is a bit far from 32 bytes that MCUBoot currently supports for aligning its trailer structures. So coming up with an optimal swapping algorithm based on hashing might not be the only problem.

@de-nordic
Copy link
Collaborator

de-nordic commented Mar 6, 2024

If you find what I have written sane, maybe you could help with making this happen.

Yes I was actually thinking about the same thing. Be it CRC32 or SHAxyz, the basic idea it still the same. It only got as far as a simple concept in my head because I have the same problem - not enough time. I might soon have some time to create at least some basic proof of concept so I could then share it with you for a discussion. However, another problem with "block devices", as you described it, might be that 512 bytes is a bit far from 32 bytes that MCUBoot currently supports for aligning its trailer structures. So coming up with an optimal swapping algorithm based on hashing might not be the only problem.

Not really. The alignment is required because that is how you write flags. If you pre-calculate the CRC, you can as well do that in your DFU app and mark the block as you wish. If it is confirm then the CRC block is removed by MCUboot once it successfully does swap. If it is test and app confirms, then the app removes the block (erases it). If you
This way MCUboot does not have to write any flags. Does this sound sane?

Alt he MCUboot will be doing is moving data by blocks, 512 bytes in your case, or erasing them.

Ah crap. Revert. Ok so once the MCUboot moves complete image, it has to replace the CRC block with the revert block (the CRC of secondary slot). We can figure this out.

@PetrBuchtaNXP
Copy link
Contributor Author

PetrBuchtaNXP commented Mar 6, 2024

Ah crap. Revert. Ok so once the MCUboot moves complete image, it has to replace the CRC block with the revert block (the CRC of secondary slot). We can figure this out.

Yes, that's what I meant - the way it handles swap_info, copy_done, image_ok flags in the trailer. They are currently aligned by the minimal write size up to 32 bytes. It would work to align them using the block size but it's far from optimal.

Of course we could collaborate. I will get back once I have the concept I currently have in mind. However, I don't want to highjack this PR so we could then create another thread.

Back to my previous question if I may - being able to decide using the OVERWIRTE_ONLY mode whether I want to keep the source image in the secondary slot or invalidate it (the current situation). It is just a matter of one line of code and provides for at least some kind of backup, although not perfect, but better than none.

Copy link
Collaborator

@nvlsianpu nvlsianpu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general patch looks fine to me. I do understand rationale behind.

I'm expecting that bit flipping can happens everywhere in the flash with equal probability, so if this patch hardens the application which can covers 90% of executable memory then only 10% (mcuboot) is not hardened. So if there are bit flipping scenarios this aim to covers: gain on reliability can be calculated.
Moreover - mcuboot can be protected against modification by SW. Secondary slot might be also protected somehow.

If this would be accepted then I will ask for some documentation on that option in design document.

@butok
Copy link
Contributor

butok commented Mar 14, 2024

As an alternative.
We can add a Recovery mode HAL function, implemented in the TFM MCUBoot: https://git.trustedfirmware.org/TF-M/trusted-firmware-m.git/commit/?id=436326c7376d92d238bc78b4d11af88fdc2c95a6

Copy link
Collaborator

@nordicjm nordicjm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really think this is useful because it's trying to resolve a theoretical issue and doesn't actually address the issue, if the flash had bit errors before the lifetime of the flash (endurance and expected retention lifetime of flash data which should be 40 years or more) then that indicates something is very wrong with it, and if that is the case then where you get a flash error is entirely random, it could occur in any sector in the flash and this is only protecting data in one of those parts. If the memory is not resilient enough and just loses the state of memory or bits flip at random in normal operation of the device then the device is either defective in that there is a manufacturing fault with it or the underlying design of the device is defective, especially when talking about NOR memory.

That being said I'm okay with this going in and am not blocking it

Comment on lines 1291 to 1294
assert(rc == 0);

rc = flash_area_open(FLASH_AREA_IMAGE_SECONDARY(image_index), &fap_secondary_slot);
assert(rc == 0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is bad as assert(rc == 0); will go to nothing if asserts are disabled


rc = boot_copy_region(state, fap_primary_slot, fap_secondary_slot, 0, 0, size);
if (rc != 0) {
BOOT_LOG_ERR("Failed to backup primary image!");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this will still continue regardless

@butok
Copy link
Contributor

butok commented Mar 14, 2024

The article "Flash or EEPROM memory damage" describes some cases: https://microchip.my.site.com/s/article/Flash-Memory-Corruption

@nordicjm
Copy link
Collaborator

nordicjm commented Mar 15, 2024

Brownout not being enabled in an application where the Vdd operating voltage dips, typically in battery applications.
Code execution bugs which result in Flash or EEPROM being accidentally erased.
Device being operated out of specifications.
Unstable behavior for flash memory at high temperature. The Flash devices are failing / get corrupted while heated. After the unit cools down or a chip erase is executed, the issue disappears and the flash is working properly.

This sums up what I said, brownout would only apply if you attempted a flash write (and you should not be attempting to erase/write the slot image from the running image, that makes no sense), it would not happen during a flash read. The second should be fixed properly by use of MPU, third would be a bad design and forth is using device way outside the specification. None of these are things that should be expected in ordinary operation of a well-designed system so you're adding workaround for problems that do not actually exist

@butok
Copy link
Contributor

butok commented Mar 15, 2024

Brownout not being enabled in an application where the Vdd operating voltage dips, typically in battery applications.
Code execution bugs which result in Flash or EEPROM being accidentally erased.
Device being operated out of specifications.
Unstable behavior for flash memory at high temperature. The Flash devices are failing / get corrupted while heated. After the unit cools down or a chip erase is executed, the issue disappears and the flash is working properly.

This sums up what I said, brownout would only apply if you attempted a flash write (and you should not be attempting to erase/write the slot image from the running image, that makes no sense), it would not happen during a flash read. The second should be fixed properly by use of MPU, third would be a bad design and forth is using device way outside the specification. None of these are things that should be expected in ordinary operation of a well-designed system so you're adding workaround for problems that do not actually exist

OK ;)
How about Solar/Cosmic/Neutron activity and all other reasons described here
https://blog.robertelder.org/causes-of-bit-flips-in-computer-memory/

@nordicjm
Copy link
Collaborator

How about Solar/Cosmic/Neutron activity and all other reasons described here

That would be bad design caused by not using a radiation-hardened device, and the worst part of that would be use of a single CPU which would have bits flipped in its core without having other redundant cores (i.e. 2 backups) running the exact same code in order to identify such errors.

@PetrBuchtaNXP
Copy link
Contributor Author

I think the best approach now is to close this current PR and create a new one WITHOUT the backup recovery feature. It will only consist of the following change to be able to decide if I want to invalidate the image in the secondary slot after copying it to the primary slot or keep it valid as a backup.

#ifndef MCUBOOT_OVERWRITE_ONLY_KEEP_BACKUP
    /*
     * Erases header and trailer. The trailer is erased because when a new
     * image is written without a trailer as is the case when using newt, the
     * trailer that was left might trigger a new upgrade.
     */
    BOOT_LOG_DBG("erasing secondary header");
    rc = boot_erase_region(fap_secondary_slot,
                           boot_img_sector_off(state, BOOT_SECONDARY_SLOT, 0),
                           boot_img_sector_size(state, BOOT_SECONDARY_SLOT, 0));
    assert(rc == 0);
#endif

@nordicjm
Copy link
Collaborator

I think the best approach now is to close this current PR and create a new one WITHOUT the backup recovery feature. It will only consist of the following change to be able to decide if I want to invalidate the image in the secondary slot after copying it to the primary slot or keep it valid as a backup.

#ifndef MCUBOOT_OVERWRITE_ONLY_KEEP_BACKUP
    /*
     * Erases header and trailer. The trailer is erased because when a new
     * image is written without a trailer as is the case when using newt, the
     * trailer that was left might trigger a new upgrade.
     */
    BOOT_LOG_DBG("erasing secondary header");
    rc = boot_erase_region(fap_secondary_slot,
                           boot_img_sector_off(state, BOOT_SECONDARY_SLOT, 0),
                           boot_img_sector_size(state, BOOT_SECONDARY_SLOT, 0));
    assert(rc == 0);
#endif

Can update this PR

@PetrBuchtaNXP PetrBuchtaNXP force-pushed the overwrite_mode_with_backup_v2.0.0 branch from 919733e to 4fd9623 Compare March 25, 2024 13:45
@PetrBuchtaNXP
Copy link
Contributor Author

I updated the PR. Now it only consists of the MCUBOOT_OVERWRITE_ONLY_KEEP_BACKUP definition that will allow to skip erasing of the image header in the secondary slot to be able to keep it as a backup.

@PetrBuchtaNXP
Copy link
Contributor Author

Thanks for approval, @de-nordic. BTW - is it "normal" that some of the build checks fail?

@de-nordic
Copy link
Collaborator

Thanks for approval, @de-nordic. BTW - is it "normal" that some of the build checks fail?

No. We had board renames in Zephyr to improve how SoC or even single core is addressed during build, but MCUboot has not yet been updated. There is a PR to do that.

@nordicjm
Copy link
Collaborator

CI should be passing fine, you need to rebase

@PetrBuchtaNXP PetrBuchtaNXP force-pushed the overwrite_mode_with_backup_v2.0.0 branch from 4fd9623 to 9641172 Compare April 4, 2024 07:25
@PetrBuchtaNXP
Copy link
Contributor Author

Just rebased. Any tips on getting another approval? :) Thanks

It builds on top of OVERWRITE_ONLY mode and uses secondary slot
as a backup of the primary slot. The main difference is that after
image copy to the primary slot the secondary slot is not erased.
This is meant to be used together with BOOTSTRAP option that will
reinstall the primary image with the backup in case it's not valid.

Signed-off-by: Petr Buchta <[email protected]>
@PetrBuchtaNXP PetrBuchtaNXP force-pushed the overwrite_mode_with_backup_v2.0.0 branch from 9641172 to ff5ce1f Compare April 4, 2024 08:23
@de-nordic de-nordic requested review from nvlsianpu and nordicjm April 5, 2024 07:53
@d3zd3z d3zd3z merged commit c5a528b into mcu-tools:main Apr 5, 2024
55 checks passed
@butok
Copy link
Contributor

butok commented Apr 8, 2024

Hi All,
If it works for new OVERWRITE_ONLY_KEEP_BACKUP why it may not work for the reset of cases?
Can we delete this erasing code unconditionally?
So not needed to add additional configuration parameters.

What do you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants