-
Notifications
You must be signed in to change notification settings - Fork 709
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New OVERWRITE_ONLY_KEEP_BACKUP option #1906
New OVERWRITE_ONLY_KEEP_BACKUP option #1906
Conversation
Why would backup get damaged? |
For the same reason the main image gets damaged. I can think of several reasons:
|
Maybe the backup recovery routine should be under another config. I can image leaving the image in the secondary slot as a backup without the backup health checks would be enough in majority of scenarios. The backup health check and recovery only when desired. So it would be split under two defines: OVERWRITE_ONLY_KEEP_BACKUP + OVERWRITE_ONLY_BACKUP_RECOVERY |
The same reason MCUboot can get damaged?
Unless you have MMU that can block write of the MCUboot and primary, it can brick your device; why would such bug select secondary, chance is that any can be picked.
That actually invalidates implementation of an app feature that would be able to continue download after failed attempt. But, I guess, there is no chance to continue anything because the device you use requires erase by large blocks, so you have to basically start from the beginning. This is also problematic because there is not check for reason of failure, which means that if the flash is dying and flickering for some other reason, this may keep updating backup on every start. I am not sure whether this fixes anything. |
Let's say that the part with the backup health check & recovery is questionable. What about the part that prevents erasing the header of the secondary image after it has been copied to the primary slot, so it can stay as a backup. Does this make sense to you? |
Not really no, because it won't be used unless it is marked for test/confirmation |
It will be used by the BOOTSTRAP mechanism (I described it above) if the primary image is invalid. There's no need to mark it. The BOOTSTRAP option takes whatever is valid in the secondary slot, the backup in this case. |
I vaguely remember something about the bootstrap mechanism allowing bypass of downgrades but might be wrong on that @d3zd3z might know. The whole point still stand though of if your flash gets corrupted in slot0 and it's before the lifespan of the device, then you can't trust any of the flash, and if it's after the lifespan of the device then the flash cells are gone in slot0 so no matter how many times you attempt to write the image to it, the image data will not be there. |
There are situations when flash bits get flipped in a healthy flash. I personally think that in a situation when the primary image is corrupted then to try at least something is better than to simply give up. Of course I can end up in a loop trying to recover the image from its backup but the result is effectively the same - dead device. |
In NOR flash? I can understand this in NAND flash technology but not NOR. And if there is a read error in NAND flash, that can occur anywhere in the flash, be it the bootloader slot, slot 0 or slot 1 which is why MCUs that use NAND flash (which in my opinion is just wrong from the get go, it is the wrong technology) employ ECC to account for those issues. I have heard of this before only in one vendor's chips from someone using them who was very unhappy and they had to fork MCUboot to add "hacks" to deal with the chips losing data and that was atmel |
There are NOR flash memories with ECC so I take it it's there for some reason. However, I hear your arguments. I created this patch for LPC55s69 that has sectors of the same size as pages, 512 bytes, which is also the minimal write size. Since it has ECC there's no chance for multiple writes in a single page after it had been erased That's why I cannot use the SWAP mode because the algorithm simply doesn't support flashes like this one. I'm fine with OVERWRITE_ONLY mode but there's a typical requirement from customers that they need to have a backup. That's why I wanted to prevent the erasing of the header after copying the image to the primary slot - to leave it there as a backup. I'm aware that in the SWAP modes once the image in the primary slot is confirmed the image in the secondary slot is no longer considered as a backup, but together with the BOOTSTRAP option it can be used this way. With this PR I wanted the same behavior also for the OVERWRITE_ONLY mode. I agree that the backup recovery routine may be redundant, but I still see a certain benefit in having a chance to decide what will happen with the image in the secondary slot once it's copied to the primary slot - to decide if I want to keep it there or have it erased. |
Yest. that is actually true, and other type of storage also does that.
Oh, So you have basically a "block device". I was researching possibility of doing swap algorithm for such devices but did not have time to follow with it. As you probably understand the difference between what I have described and what we currently have is that in above solution we calculate and store the outcome we expect, in current solution we mark how far we have already gone. I guess this may be better for your customer as it actually makes the MCUboot work with what you offer. And the CRC32 is just an idea, probably some better sum could be used. |
Yes I was actually thinking about the same thing. Be it CRC32 or SHAxyz, the basic idea it still the same. It only got as far as a simple concept in my head because I have the same problem - not enough time. I might soon have some time to create at least some basic proof of concept so I could then share it with you for a discussion. |
Not really. The alignment is required because that is how you write flags. If you pre-calculate the CRC, you can as well do that in your DFU app and mark the block as you wish. If it is confirm then the CRC block is removed by MCUboot once it successfully does swap. If it is test and app confirms, then the app removes the block (erases it). If you Alt he MCUboot will be doing is moving data by blocks, 512 bytes in your case, or erasing them. Ah crap. Revert. Ok so once the MCUboot moves complete image, it has to replace the CRC block with the revert block (the CRC of secondary slot). We can figure this out. |
Yes, that's what I meant - the way it handles swap_info, copy_done, image_ok flags in the trailer. They are currently aligned by the minimal write size up to 32 bytes. It would work to align them using the block size but it's far from optimal. Of course we could collaborate. I will get back once I have the concept I currently have in mind. However, I don't want to highjack this PR so we could then create another thread. Back to my previous question if I may - being able to decide using the OVERWIRTE_ONLY mode whether I want to keep the source image in the secondary slot or invalidate it (the current situation). It is just a matter of one line of code and provides for at least some kind of backup, although not perfect, but better than none. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general patch looks fine to me. I do understand rationale behind.
I'm expecting that bit flipping can happens everywhere in the flash with equal probability, so if this patch hardens the application which can covers 90% of executable memory then only 10% (mcuboot) is not hardened. So if there are bit flipping scenarios this aim to covers: gain on reliability can be calculated.
Moreover - mcuboot can be protected against modification by SW. Secondary slot might be also protected somehow.
If this would be accepted then I will ask for some documentation on that option in design document.
As an alternative. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really think this is useful because it's trying to resolve a theoretical issue and doesn't actually address the issue, if the flash had bit errors before the lifetime of the flash (endurance and expected retention lifetime of flash data which should be 40 years or more) then that indicates something is very wrong with it, and if that is the case then where you get a flash error is entirely random, it could occur in any sector in the flash and this is only protecting data in one of those parts. If the memory is not resilient enough and just loses the state of memory or bits flip at random in normal operation of the device then the device is either defective in that there is a manufacturing fault with it or the underlying design of the device is defective, especially when talking about NOR memory.
That being said I'm okay with this going in and am not blocking it
boot/bootutil/src/loader.c
Outdated
assert(rc == 0); | ||
|
||
rc = flash_area_open(FLASH_AREA_IMAGE_SECONDARY(image_index), &fap_secondary_slot); | ||
assert(rc == 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is bad as assert(rc == 0);
will go to nothing if asserts are disabled
boot/bootutil/src/loader.c
Outdated
|
||
rc = boot_copy_region(state, fap_primary_slot, fap_secondary_slot, 0, 0, size); | ||
if (rc != 0) { | ||
BOOT_LOG_ERR("Failed to backup primary image!"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And this will still continue regardless
The article "Flash or EEPROM memory damage" describes some cases: https://microchip.my.site.com/s/article/Flash-Memory-Corruption |
This sums up what I said, brownout would only apply if you attempted a flash write (and you should not be attempting to erase/write the slot image from the running image, that makes no sense), it would not happen during a flash read. The second should be fixed properly by use of MPU, third would be a bad design and forth is using device way outside the specification. None of these are things that should be expected in ordinary operation of a well-designed system so you're adding workaround for problems that do not actually exist |
OK ;) |
That would be bad design caused by not using a radiation-hardened device, and the worst part of that would be use of a single CPU which would have bits flipped in its core without having other redundant cores (i.e. 2 backups) running the exact same code in order to identify such errors. |
I think the best approach now is to close this current PR and create a new one WITHOUT the backup recovery feature. It will only consist of the following change to be able to decide if I want to invalidate the image in the secondary slot after copying it to the primary slot or keep it valid as a backup.
|
Can update this PR |
919733e
to
4fd9623
Compare
I updated the PR. Now it only consists of the MCUBOOT_OVERWRITE_ONLY_KEEP_BACKUP definition that will allow to skip erasing of the image header in the secondary slot to be able to keep it as a backup. |
Thanks for approval, @de-nordic. BTW - is it "normal" that some of the build checks fail? |
No. We had board renames in Zephyr to improve how SoC or even single core is addressed during build, but MCUboot has not yet been updated. There is a PR to do that. |
CI should be passing fine, you need to rebase |
4fd9623
to
9641172
Compare
Just rebased. Any tips on getting another approval? :) Thanks |
It builds on top of OVERWRITE_ONLY mode and uses secondary slot as a backup of the primary slot. The main difference is that after image copy to the primary slot the secondary slot is not erased. This is meant to be used together with BOOTSTRAP option that will reinstall the primary image with the backup in case it's not valid. Signed-off-by: Petr Buchta <[email protected]>
9641172
to
ff5ce1f
Compare
Hi All, What do you think? |
I needed to create s solution based on OVERWRITE_ONLY mode that would keep a backup of the active image in the secondary slot. The main motivation was a use of MCU that doesn't support available swap modes due to its flash features. The code modification was very simple, however, I'm not 100% sure these changes don't clash with something else.
I needed to cover these scenarios:
The commit message explains further:
It builds on top of OVERWRITE_ONLY mode and uses secondary slot as a backup of the primary slot. The main difference is that after image copy to the primary slot the secondary slot is NOT erased. This is meant to be used together with BOOTSTRAP option that will recover the primary image from the backup in case it's not valid. The backup is checked on every boot. In case the backup is invalid the image from the primary slot is used to create a valid backup again.