Hi,
We have a SPIFFS partition in the on-chip flash of a Texas Instruments microcontroller. It's a pretty small instance.
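Roughly, the configuration and mount call look like the sketch below; the sizes and addresses are placeholders rather than our exact values, but the shape is the same:

```c
/* Sketch only -- the sizes/addresses below are placeholders, not our
 * actual partition geometry. */
#include "spiffs.h"

static s32_t hal_read(u32_t addr, u32_t size, u8_t *dst);
static s32_t hal_write(u32_t addr, u32_t size, u8_t *src);
static s32_t hal_erase(u32_t addr, u32_t size);

static spiffs fs;
static u8_t spiffs_work[2 * 64];            /* two logical pages            */
static u8_t spiffs_fds[32 * 4];             /* room for a few descriptors  */
static u8_t spiffs_cache[(64 + 32) * 4];

static s32_t fs_mount(void) {
  spiffs_config cfg;
  cfg.phys_size        = 64 * 1024;         /* placeholder partition size  */
  cfg.phys_addr        = 0x00060000;        /* placeholder offset in flash */
  cfg.phys_erase_block = 4096;              /* on-chip flash sector size   */
  cfg.log_block_size   = 4096;              /* logical block = one sector  */
  cfg.log_page_size    = 64;                /* small logical pages         */
  cfg.hal_read_f       = hal_read;
  cfg.hal_write_f      = hal_write;
  cfg.hal_erase_f      = hal_erase;
  return SPIFFS_mount(&fs, &cfg, spiffs_work,
                      spiffs_fds, sizeof(spiffs_fds),
                      spiffs_cache, sizeof(spiffs_cache), 0);
}
```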
After hundreds of devices had been operating in the field for 1+ years with no complaints, I found a fascinating filesystem corruption. At the API level we were getting SPIFFS_ERR_DELETED on read operations that were within the file bounds. I managed to extract the entire filesystem image for analysis.
For a given file (id 0x20) that receives a lot of reads and writes over the lifetime of the device, we ended up in a situation where two live pages both claimed to be the same object index page. Both pages are marked as live in the block header (0x8020), and both are marked as a used, final, non-deleted index page in the page header (flags 0xf8).
The file is ~6 kbytes, so there are five index pages (1 object index header + 4 object index pages); index page spix=3 is the one with the issue.
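In case it helps to see where "1 header + 4 index" comes from, the count follows from the configured logical page size; here is a sketch in terms of the macros spiffs itself defines in spiffs_nucleus.h (illustration only, the exact numbers depend on the geometry):

```c
/* Sketch: how many object index pages a file of a given size occupies.
 * Uses spiffs' internal macros from spiffs_nucleus.h; the result depends
 * on the configured logical page size. */
#include "spiffs.h"
#include "spiffs_nucleus.h"

static u32_t index_pages_for(spiffs *fs, u32_t file_size) {
  u32_t data_pages = (file_size + SPIFFS_DATA_PAGE_SIZE(fs) - 1)
                     / SPIFFS_DATA_PAGE_SIZE(fs);
  if (data_pages <= SPIFFS_OBJ_HDR_IX_LEN(fs))
    return 1;                                  /* header page holds them all */
  u32_t rest = data_pages - SPIFFS_OBJ_HDR_IX_LEN(fs);
  /* 1 object index header + enough plain object index pages for the rest */
  return 1 + (rest + SPIFFS_OBJ_IX_LEN(fs) - 1) / SPIFFS_OBJ_IX_LEN(fs);
}
```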
Here are two blocks from different offsets in the flash:
I produced this with od -Ad -w64 -t x2 spiffs.dat | grep '8020 0003 fff8'

There are about 150 other blocks which are also 8020 0003, but they are erased; this is expected, as a lot of write operations are hitting this file.

It appears to me that the filesystem is not able to recover from this situation, because once the index page is forked, the two instances live their own individual lives. The index lookup function picks one or the other in a non-deterministic way, depending on what's in the cache variables, and further writes may end up cloning one version or the other depending on which one got picked.
It is not entirely clear to me how this situation came about. The flash driver is pretty simple, since we have on-chip flash and only need to call a simple function to program or erase it. The MTBF is pretty large, considering the number of devices that have been operating correctly for a long while.
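To give an idea of what "simple" means here, the HAL handed to SPIFFS is essentially the shape below. Reads come straight from memory-mapped flash; vendor_flash_program()/vendor_flash_erase_sector() are placeholder names standing in for the actual TI flash calls, not the real API:

```c
/* Sketch of the flash HAL, with placeholder names for the vendor calls.
 * On-chip flash is memory mapped, so reads are a plain memcpy. */
#include <stdint.h>
#include <string.h>
#include "spiffs.h"

extern int vendor_flash_program(uint32_t addr, const uint8_t *src, uint32_t len); /* placeholder */
extern int vendor_flash_erase_sector(uint32_t addr);                              /* placeholder */

#define FLASH_SECTOR_SIZE 4096u   /* placeholder on-chip sector size */

static s32_t hal_read(u32_t addr, u32_t size, u8_t *dst) {
  memcpy(dst, (const void *)(uintptr_t)addr, size);
  return SPIFFS_OK;
}

static s32_t hal_write(u32_t addr, u32_t size, u8_t *src) {
  return vendor_flash_program(addr, src, size) == 0 ? SPIFFS_OK : -1;
}

static s32_t hal_erase(u32_t addr, u32_t size) {
  /* spiffs erases whole logical blocks; size is a multiple of the sector size */
  for (u32_t off = 0; off < size; off += FLASH_SECTOR_SIZE) {
    if (vendor_flash_erase_sector(addr + off) != 0) return -1;
  }
  return SPIFFS_OK;
}
```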
I can provide more information, including the entire filesystem dump, if that's of interest.
thanks
Balazs