Flash corruption on SAMD51 #217

Open
sjev opened this issue Jul 14, 2024 · 14 comments

Comments

@sjev

sjev commented Jul 14, 2024

This issue is similar (and probably related) to #170

The main difference is that in this case it occurs on a SAMD51 and the memory corruption is at 0x4000, with the result that the board loses its CircuitPython install.

Hardware used is based on Feather M4 CAN, schematics are here

I haven't measured the power-up and power-down voltage curves, but I suspect it's a manifestation of the brownout issue.

Note: before I ran update-bootloader-feather_m4_can-v3.16.0.uf2 on the v3.16 bootloader, multiple devices were bricking with memory corruption at 0x000000, the same behavior as described in #170. After running the update, one board shows no problems after ~200 power cycles, while this one particular board fails within 20. I'm not sure what exactly changed between 3.16..bin and update-.. .uf2

Symptoms

  1. device was 'soft bricked' by a power cycle.
  2. restored bootloader to v3.16 with programmer, device loaded UF2 bootloader and appeared as "FTHRCANBOOT".
  3. updated bootloader with update_....uf2
  4. put circuitpython on it, code started running.
  5. removed USB, power-cycled approx. 5 times. After a few 'normal' restarts the device jumped into reset mode and appeared as "FTHRCANBOOT" again. Rebooting and pressing the reset button had no effect. Essentially, the Python install was lost.
  6. put circuitpython on it, repeated the previous step with the same result.

I have reproduced this several times, the issue usually occurs within 20 power cycles.

Clearing and setting BOD33_DIS bit with the programmer did not change anything.

Analysis

I've compared a corrupted memory dump to a working one. There is a difference at address 0x4000. Just one line in the Intel HEX files is different.

Summarized, data diff at address 0x4000

F8 FF 02 20 CD 57 00 00 75 DD 04 00 69 DF 04 00  (working)
F8 FF 02 00 00 00 00 00 75 DD 00 00 60 DF 04 00  (broken)

Below is output from my analysis script with details:

working

original hex file:
[line 1025] :10400000F8FF0220CD57000075DD040069DF0400D1

Decoded HEX line:
Byte Count: 16
Address: 0x4000
Record Type: 0 (Data)
Data: F8 FF 02 20 CD 57 00 00 75 DD 04 00 69 DF 04 00
Checksum: 0xD1 (Valid)

Decoded Data:
  Address 0x4000: 0xF8
  Address 0x4001: 0xFF
  Address 0x4002: 0x02
  Address 0x4003: 0x20
  Address 0x4004: 0xCD
  Address 0x4005: 0x57
  Address 0x4006: 0x00
  Address 0x4007: 0x00
  Address 0x4008: 0x75
  Address 0x4009: 0xDD
  Address 0x400A: 0x04
  Address 0x400B: 0x00
  Address 0x400C: 0x69
  Address 0x400D: 0xDF
  Address 0x400E: 0x04
  Address 0x400F: 0x00

broken

corrupted .hex file:
[line 1025] :10400000F8FF02000000000075DD000060DF040022

Decoded HEX line:
Byte Count: 16
Address: 0x4000
Record Type: 0 (Data)
Data: F8 FF 02 00 00 00 00 00 75 DD 00 00 60 DF 04 00
Checksum: 0x22 (Valid)

Decoded Data:
  Address 0x4000: 0xF8
  Address 0x4001: 0xFF
  Address 0x4002: 0x02
  Address 0x4003: 0x00
  Address 0x4004: 0x00
  Address 0x4005: 0x00
  Address 0x4006: 0x00
  Address 0x4007: 0x00
  Address 0x4008: 0x75
  Address 0x4009: 0xDD
  Address 0x400A: 0x00
  Address 0x400B: 0x00
  Address 0x400C: 0x60
  Address 0x400D: 0xDF
  Address 0x400E: 0x04
  Address 0x400F: 0x00
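
The comparison above can be reproduced with a few lines of Python. This is a minimal sketch, not the original analysis script: it decodes the two records quoted above, verifies their checksums, lists the differing bytes, and checks whether every change is a 1 -> 0 bit transition. Reading the 16 bytes as four little-endian words labelled with the first Cortex-M vector table entries is an assumption (it presumes the application image starts at 0x4000); the function names are illustrative.

```python
def decode_record(line: str) -> dict:
    """Decode one Intel HEX record and verify its checksum."""
    raw = bytes.fromhex(line.strip().lstrip(":"))
    count = raw[0]
    address = int.from_bytes(raw[1:3], "big")
    record_type = raw[3]
    data = raw[4:4 + count]
    # A record is valid when all bytes, checksum included, sum to 0 mod 256.
    checksum_ok = sum(raw) % 256 == 0
    return {"address": address, "type": record_type,
            "data": data, "checksum_ok": checksum_ok}


def words(data: bytes) -> list:
    """Split data into 32-bit little-endian words."""
    return [int.from_bytes(data[i:i + 4], "little")
            for i in range(0, len(data), 4)]


working = decode_record(":10400000F8FF0220CD57000075DD040069DF0400D1")
broken = decode_record(":10400000F8FF02000000000075DD000060DF040022")

# Byte-level diff as (address, working byte, broken byte).
diff = [(working["address"] + i, w, b)
        for i, (w, b) in enumerate(zip(working["data"], broken["data"]))
        if w != b]

# True when no changed byte gained a set bit, i.e. bits were only cleared.
only_cleared = all((b & ~w & 0xFF) == 0 for _, w, b in diff)

for address, w, b in diff:
    print(f"0x{address:04X}: 0x{w:02X} -> 0x{b:02X}")
print("only 1 -> 0 bit changes:", only_cleared)

# Assumption: these are the first four Cortex-M vector table entries.
names = ("initial SP", "Reset", "NMI", "HardFault")
for name, w, b in zip(names, words(working["data"]), words(broken["data"])):
    print(f"{name:10s} 0x{w:08X} -> 0x{b:08X}")
```

For these two records, every differing byte only has bits cleared. That pattern would be consistent with bits being programmed to 0 rather than a block being erased (erased flash reads as 0xFF), but this is an observation from the data, not a confirmed root cause.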

Follow-up

I've taken a look at main.c; there seems to be a section for brownout protection for SAMD51.

I'm willing to invest some time to fix this, but fiddling with bootloaders is not something that I've done before...

I'd like to discuss possible solutions here before I start (randomly) changing stuff.

@sjev
Author

sjev commented Jul 15, 2024

A bit further on I found the cause of this and was able to reliably avoid and reproduce the issue.

Short description:

  1. measured 3.3V rise time during turn-on: it looked great, with a linear rise time of 1.5 ms.
  2. looked at the fall time during turn-off: the voltage fell within 5 ms to about 1V and then stayed there for a long time.
  3. added a 150 ohm load to 3.3V; this resulted in a quicker fall below the 1V level.
  4. tried to reproduce the issue with the load: it did not occur after around 100 cycles.
  5. removed the load: the issue occurred within 10 cycles.

So the hypothesis is that brownout protection is not acting as it should during switch off.
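
To put the 150 ohm load in perspective, here is a small Ohm's-law sketch. It assumes a purely resistive load connected directly between the 3.3V rail and ground; the function names are illustrative.

```python
def load_current_ma(voltage_v: float, resistance_ohm: float) -> float:
    """Current through a resistive load, in milliamps (I = V / R)."""
    return voltage_v / resistance_ohm * 1000.0


def load_power_mw(voltage_v: float, resistance_ohm: float) -> float:
    """Power dissipated in a resistive load, in milliwatts (P = V^2 / R)."""
    return voltage_v ** 2 / resistance_ohm * 1000.0


BLEED_OHM = 150.0

# Continuous cost of the load while the rail is at its nominal 3.3 V.
print(f"at 3.3 V: {load_current_ma(3.3, BLEED_OHM):.1f} mA, "
      f"{load_power_mw(3.3, BLEED_OHM):.1f} mW")

# Discharge current still available at the ~1 V plateau seen at turn-off.
print(f"at 1.0 V: {load_current_ma(1.0, BLEED_OHM):.1f} mA")
```

The load draws about 22 mA continuously at 3.3V and still sinks a few milliamps at the 1V plateau, which is what pulls the rail down faster.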

In the screenshot below the traces show voltage curves at turn-off with and without 150 ohm load.

[image data_59266: 3.3V voltage curves at turn-off, with and without the 150 ohm load]

@sjev
Author

sjev commented Jul 15, 2024

image

current configuration bits.

Note: I've changed BOD33_ACTION to "RESET" later and was still able to cause the issue.

@sjev
Author

sjev commented Jul 21, 2024

@dhalbert, I saw your comment on the Adafruit forum. As you suggested, let's discuss further in this thread.

by danhalbert »
Mon Jul 15, 2024 12:37 pm

Hi - are you literally using a Feather M4 CAN, or are you reproducing the design on your own board?

The "bricking" you describe is some kind of problem in the SAMD51 chip design: if there is a power glitch at the right time, an internal flash write or erase can occur that erases the first unprotected block in flash. This is not related to CircuitPython per se: if you wrote an Arduino program, for instance, it might have the same problem.

I would suggest doing oscilloscope monitoring of the power-on waveform, and also whether noise is getting into the line.

  • I'm using a design based on Feather CAN M4; schematics are open source. We've tried to keep the design in line with the Feather, but the power design is different as it needs to be powered from 24 V.
  • I did measure power-on and power-off curves on the oscilloscope. Power-on was a perfect linear rise from 0 to 3.3V within 2 ms. No oscillations or noise.
  • The power-down curve is attached earlier in this thread and was found to cause the issue without the 150 ohm bleed resistor (bright curve in the graph).

I'm pretty sure that the corruption occurs on the power-down cycle, as I was able to reliably reproduce it without the bleed resistor and could not with it. Still, a bleed resistor is just a short-term workaround.

One possible cause that I can think of is a current leak through one of the input pins; we'll remove it in the next iteration of the design and see if it has any effect.

@dhalbert

dhalbert commented Jul 21, 2024

Your testing is interesting.

What firmware are you using in normal operation? Is it CircuitPython, Arduino, or something else? Perhaps the firmware is changing with the brownout detection.

I looked at the bootloader code again. It enables a BOD33 level at around 2.7V. It does not enable hysteresis, which maybe it should. We could try bumping up the BOD33 level and enabling hysteresis. On the SAMD21, hysteresis is just on or off. At around 2.7V BOD33, it looks like it's about 70mV. On the SAMD51, there is a 4-bit field with 6mV steps. I don't have any experience in choosing this value, but we could try about the same 70mV.
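
As a rough check of that arithmetic, here is a sketch that picks the field value closest to a target hysteresis. The 4-bit width and 6mV step are taken from the description above; the linear mapping (value times step) is an assumption that has not been checked against the datasheet, and the function name is illustrative.

```python
def hysteresis_field(target_mv: float, step_mv: float = 6.0, bits: int = 4):
    """Return (field value, resulting hysteresis in mV) nearest to target_mv.

    Assumes hysteresis = value * step_mv, clamped to the field's range.
    """
    max_value = (1 << bits) - 1
    value = min(max_value, max(0, round(target_mv / step_mv)))
    return value, value * step_mv


value, actual_mv = hysteresis_field(70.0)
print(f"field value {value} gives about {actual_mv:.0f} mV")
```

Under these assumptions a target of 70mV rounds to a field value of 12 (about 72mV), and the largest setting would be 15 (about 90mV).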

When I discussed this kind of problem with Microchip in the past, it was an issue about power glitches on power-up. That was the motivation for the current code, which is all about waiting enough time for the voltage to stabilize on power-up. Your problem seems to be on power-down. Your scope trace does not show any glitches on power down, but I wonder if the longer timebase chosen is hiding something, though I don't see any evidence of that.

What kind of power supply are you using? Have you tried a different power supply to see if that makes any difference?

Microchip also said they had seen this flash erase problem when there was insufficient decoupling capacitance on Vddcore. Are your decoupling caps close to the SAMD51 chip?

Are the power pins on the SAMD51 wired the same way as the Adafruit board, or are they somewhat different? We go by the reference designs in the datasheet.

Is it possible to test this on a board other than yours, with the same power supply and external connections? For instance, do you see this problem on the SAMD51 Feather CAN?

Here is the Feather CAN power arrangement. There are more decoupling caps not shown here as well:

[image: Feather CAN power arrangement]

@sjev
Author

sjev commented Jul 21, 2024

@dhalbert Thanks for your input!

What firmware are you using in normal operation? Is it CircuitPython, Arduino, or something else? Perhaps the firmware is changing with the brownout detection.

I'm using CircuitPython 9.0.5 with the latest version of the UF2 bootloader (3.16, updated with update-xxx.uf2). The code that I'm running is just a blinky on the NeoPixel, with no write access whatsoever and not touching the fuses.

...

When I discussed this kind of problem with Microchip in the past, it was an issue about power glitches on power-up. That was the motivation for the current code, which is all about waiting enough time for the voltage to stabilize on power-up. Your problem seems to be on power-down. Your scope trace does not show any glitches on power down, but I wonder if the longer timebase chosen is hiding something, though I don't see any evidence of that.

I'll record some longer traces, just to be sure.

What kind of power supply are you using? Have you tried a different power supply to see if that makes any difference?

I'm using two different lab supplies, with the same results. It's important to note that I'm turning the device on and off in a rough manner, manually connecting and disconnecting power wires.

Microchip also said they had seen this flash erase problem when there was insufficient decoupling capacitance on Vddcore. Are your decoupling caps close to the SAMD51 chip?

Yes, but they may be smaller than on a Feather. The schematics are here, btw.

Are the power pins on the SAMD51 wired the same way as the Adafruit board, or are they somewhat different? We go by the reference designs in the datasheet.

Yes, we also try to follow the reference and Feather designs as closely as possible.

Is it possible to test this on a board other than yours, with the same power supply and external connections? For instance, do you see this problem on the SAMD51 Feather CAN?

This should be possible, but I'd need to somehow simulate the power-down curve on the Feather. This would require some hacking, and I don't have a function generator at the moment that I could use for that.

@dhalbert

Important to note that I'm turning the device on and off in a rough manner, manually connecting and disconnecting power wires.

That could cause power glitches, though you didn't trace any. I've seen that myself just bobbling a USB plug a bit.

The scope trace picture that you posted, is that TEST_VDDCORE1, or is it VCC3V3?

There is a lot going on in your power supply circuitry, and there is opportunity for noise, maybe pins going out of range. Is it possible to supply just 3.3V to the SAMD51 and see if you can duplicate the problem?

Is it possible to test this on a board other than yours, with the same power supply and external connections? For instance, do you see this problem on the SAMD51 Feather CAN?

This should be possible, but I'd need to somehow simulate the power-down curve on the Feather. This would require some hacking, and I don't have a function generator at the moment that I could use for that.

At least measure the power-down curve, and see if you can reproduce the flash erasure problem. But we haven't had reports of flash erasure since we re-did the bootloader.

CircuitPython does set BOD33, but it sets it to the same value as the bootloader setting.

Another thing to try would be to write an Arduino program that's equally simple and see if you get the same problem. Probably yes, but that would eliminate CircuitPython itself as cause.

@dhalbert

Another small possibility: the BOD12 calibration value is set at the factory. From the datasheet:

Brown-out detector internal to the voltage regulator for VDDCORE. BOD12 is calibrated in production and its
calibration parameters are stored in the NVM User Row. This data should not be changed if the User Row is
written to in order to assure correct behavior.

If you have accidentally erased this value when doing initial chip programming, that might cause a problem.

@sjev
Author

sjev commented Jul 24, 2024

@dhalbert thank you so much for these pointers. I'll definitely investigate these further when I get back to this issue. That will probably be in a couple of weeks from now, as I'm waiting for more boards to be made.
I'll probably start with an automated setup that switches power with a relay and waits for some feedback from the board that is tested.
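
A minimal sketch of what such a test loop could look like. Everything hardware-specific is an assumption: `set_power` and `board_alive` are hypothetical callables standing in for the relay driver and for whatever feedback the board under test provides, and the timing values are placeholders rather than measured settings.

```python
import random
import time
from typing import Callable, Optional


def run_power_cycles(set_power: Callable[[bool], None],
                     board_alive: Callable[[], bool],
                     max_cycles: int,
                     off_range_s=(0.5, 2.0),
                     boot_wait_s: float = 3.0,
                     sleep: Callable[[float], None] = time.sleep,
                     rng: Optional[random.Random] = None) -> Optional[int]:
    """Power-cycle the board until it stops responding.

    Returns the 1-based cycle number of the first failure, or None if the
    board survived max_cycles cycles.
    """
    rng = rng or random.Random()
    for cycle in range(1, max_cycles + 1):
        set_power(False)                    # open the relay
        sleep(rng.uniform(*off_range_s))    # random off-time
        set_power(True)                     # close the relay
        sleep(boot_wait_s)                  # give the firmware time to start
        if not board_alive():               # no feedback: treat as corrupted
            return cycle
    return None


if __name__ == "__main__":
    # Simulated run: a fake board that stops answering on the 4th cycle.
    answers = iter([True, True, True, False])
    failed_at = run_power_cycles(lambda on: None, lambda: next(answers),
                                 max_cycles=10, sleep=lambda s: None)
    print("first failure at cycle:", failed_at)
```

Passing the relay and feedback functions in as parameters keeps the loop independent of the actual hardware, so it can be dry-run with fakes as in the simulated example.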

@sjev
Author

sjev commented Sep 3, 2024

A quick update -
We've completed our 0-series and I've used one of them to randomly cycle power on 5 other boards. They went through 4k cycles without any issues. But just when I was about to call this issue an 'incident', one of the boards failed after 5.5k cycles.

The positive news is that with the latest bootloader the board does not get bricked; only the CircuitPython install gets corrupted. Dropping a new UF2 file fixes the issue.
As a short-term solution we've added a bleed resistor that can be enabled with a solder jumper. I've enabled it on the board that failed; we'll see how it holds.

20240902_194522

@sjev
Author

sjev commented Sep 3, 2024

BTW, is it possible to protect flash memory where cpy resides?

@dhalbert

dhalbert commented Sep 3, 2024

BTW, is it possible to protect flash memory where cpy resides?

We haven't provided a mechanism to do that. But the NVMCTRL.RUNLOCK NVM LOCKS bits on the User Page allow you to lock regions of flash. You could try changing these bits manually after loading CircuitPython. See sections 9.4 and 25.6.2 in the datasheet.
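
To get a feel for which lock bits would be involved, here is a sketch that maps an address range to region indices and a lock-bit mask. It rests on assumptions that should be checked against the datasheet sections mentioned above: 32 equal-sized regions, a 512 KiB flash, and a lock bit value of 0 meaning "locked". The helper names are illustrative, and the code only does the arithmetic; it does not touch any hardware.

```python
def lock_regions(start: int, end: int, flash_size: int, n_regions: int = 32):
    """Indices of the lock regions overlapping the byte range [start, end)."""
    region_size = flash_size // n_regions
    first = start // region_size
    last = (end - 1) // region_size
    return list(range(first, last + 1))


def lock_mask(regions, n_regions: int = 32) -> int:
    """Lock word with the given regions locked (assumes 0 = locked)."""
    mask = (1 << n_regions) - 1          # all ones: nothing locked
    for region in regions:
        mask &= ~(1 << region)           # clear the bit to lock the region
    return mask


FLASH_SIZE = 512 * 1024                  # assumed flash size

# Region holding the 16 bytes at 0x4000 that were corrupted in this issue.
regions = lock_regions(0x4000, 0x4010, FLASH_SIZE)
print("regions:", regions)
print(f"lock word: 0x{lock_mask(regions):08X}")
```

Under these assumptions each region is 16 KiB, so the corrupted bytes at 0x4000 fall at the very start of region 1, and region 0 covers 0x0000 to 0x3FFF.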

@sjev
Author

sjev commented Sep 4, 2024

Another thing to try would be to write an Arduino program that's equally simple and see if you get the same problem. Probably yes, but that would eliminate CircuitPython itself as cause.

Done, with the expected result (flash corruption), so it's not CircuitPython.

Quick summary of issue occurrence (6 test boards):

  • 3 boards did not show any issues (20k power cycles)
  • 2 boards failed within 5k power cycles and were patched with a 150 ohm bleed resistor (3v3->gnd). No issues after 15k cycles.
  • 1 board (original issue trigger) keeps failing within ca. 20 cycles. This is actually good, because the issue can now be easily reproduced with this one.

@sjev
Author

sjev commented Sep 4, 2024

You could try changing these bits manually after loading CircuitPython.

@dhalbert could you please provide a pointer on how to set the register from CP?

@dhalbert

dhalbert commented Sep 4, 2024

@dhalbert could you please provide a pointer on how to set the register from CP?

There's no way to do that from CircuitPython code. These are "fuse" bits, so there's special setup needed to change them. There is code in the bootloader to do this: see the code that corrects errors in the fuses.

I meant that after the CircuitPython UF2 is loaded, you could connect to the board with, say, a hardware debugger and set those bits. For instance, I think you can write a script using a J-Link utility to do this. It's also possible to do it by hand from the Microchip IDEs. Or you could make a special build of CircuitPython that checks the bits and sets them if necessary. Undoing them is also needed if you want to be able to update CircuitPython. But I don't have a recipe for you off the bat.

It still sounds like there might be something marginal about the power supply or the decoupling capacitors, which is causing a power dip.

Is there any difference in the date codes of the SAMD51s that indicates the one bad board has a different rev chip?

I think this is also something you could bring up with Microchip as a support case. They might have some advice for you. Also read the datasheet errata carefully.
