
Improve documentation around userspace tool behaviour and safety #863

Merged (8 commits, Nov 4, 2024)

Conversation

@csnover csnover (Contributor) commented Aug 6, 2024

Hi there,

On the mailing list for the past few days I have been shouting into the void about some issues I encountered trying to deal with a hardware failure on a btrfs raid-1 filesystem. In my last mail I made several recommendations, and while I won’t be able to become familiar enough with the internals of the btrfs tools themselves to send patches for some of them, I am at least capable of writing a few paragraphs of documentation. So, here are some patches for that.

In the scrub documentation change I made, I am not sure if there are other edge cases beyond the split-brain one I document where running scrub would be dangerous. I adapted the line about its safety from the comment at #816 (comment). I think that comment probably contains more useful information that could be integrated into the documentation too, but I am currently out of energy to make additional changes.

I also think I may be documenting some bugs here, such as btrfs-check looking at metadata from some place other than the specified block device, which seems like something it should not be doing since it does not take a filesystem path. But I would prefer the current behaviour to be documented rather than have one more person be shocked when they see errors that go away after a read-write scrub.

I hope these changes are helpful. Please feel free to change or adapt as necessary, I do not anticipate focusing much more on this right now because I have other obligations.

Thank you for writing and maintaining btrfs and I hope you have a great week!

.. warning::
   Files with the NOCOW (``chattr +C``) attribute imply ``nodatasum``.
   Errors in these files cannot be detected or corrected by scrub. This means
   the NOCOW attribute is not currently safe in the presence of replicas since
Collaborator commented:

I'd say the "not safe" claim is not ideal.
It amounts to calling every filesystem without data checksums unsafe.

And even for btrfs, there are known IO patterns (abuse) that can lead to mismatched data checksums directly.
Namely, incorrectly handled writeback (modifying a direct IO buffer while it's still under writeback).

@csnover csnover (Contributor Author) commented Aug 8, 2024

Sure, let me make sure there’s a mutual understanding of both the purpose of this warning and the operation of btrfs, so I can update the wording to communicate the right thing without being inaccurate.

The purpose of this warning is to prevent users from experiencing data loss when they think btrfs is protecting their files against single replica corruption. It is not clear right now that simply having a healthy replica is insufficient to avoid data loss when using NOCOW. For example, this Reddit thread.

In my case, I had one journald file end up corrupted, and did not understand how that was possible unless my whole filesystem was corrupted due to a horrible btrfs defect, until someone else pointed out that journald sets those files to NOCOW, and that is why only that file ended up being irreparably damaged, and why it was that way despite having a healthy replica.
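(For anyone who wants to check whether this applies to their own system, something like the following should show the flag; the journal path is the typical systemd layout and may differ:)

    # a 'C' in the flags column means No_COW: no data checksums, so scrub
    # cannot detect or repair corruption in these files
    lsattr /var/log/journal/*/system.journal

    # directories can carry the flag too, in which case newly created files inherit it
    lsattr -d /var/log/journal /var/log/journal/*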

With NOCOW, my understanding is that btrfs does not even meet standards of traditional raid1 systems which would have marked the unhealthy replica as tainted and only read from the healthy one, thus avoiding corruption in this scenario. As such, I picked this language to express that at least with the single profile, your file might be silently corrupted if there is a crash, but you know that already because you have no replica. With replicas, it is “not safe” because having a copy committed to a healthy replica does not (apparently, unless it’s a serious unfixed bug) prevent corruption with NOCOW.

It is not clear to me how btrfs handles NOCOW exactly since e.g. #134 talks about multiple copies being possible to restore, but I would expect that only to be true until btrfs notices the metadata is inconsistent and fixes it up using the one with the correct checksum/transid and marks the blocks from the bad metadata as free again. (Clearly I do not understand the architecture of btrfs very much at all and am trying to potentially apply knowledge from other domains that do not apply, so please correct and explain so I can understand and be smarter :-))

Collaborator commented:

With NOCOW, my understanding is that btrfs does not even meet standards of traditional raid1 systems which would have marked the unhealthy replica as tainted and only read from the healthy one, thus avoiding corruption in this scenario.

For NOCOW, other RAID1 implementations do mark the bad device with extra flags, and I believe the btrfs part can indeed be improved.

But if the data corruption happens without triggering any read error (e.g. for whatever reason the on-disk data simply changed), then it's no better than NODATACOW btrfs RAID1: just like NODATACOW btrfs, the block-layer RAID1 won't detect any error, and so goes down the same path as btrfs (load-balancing the read).

That's the main reason I won't call it unsafe: for the above case the block-layer RAID1 implementation is just as bad as btrfs.

However, I also won't deny that btrfs is pretty bad at detecting unreliable disks, and thus it won't avoid reading from the bad disk.

I'd prefer to make it clear that btrfs doesn't have any ability to detect bad disks (which can be very complex by itself, and I'm very curious how other RAID1 implementations handle it), so for the NODATACOW + bad disk case users will get data corruption; that part is correct.

But it doesn't mean the other RAID1 implementations are that safe; they just mask their lack of data csums by detecting whether the disk is bad.

It is not clear to me how btrfs handles NOCOW exactly since e.g. #134 talks about multiple copies being possible to restore, but I would expect that only to be true until btrfs notices the metadata is inconsistent and fixes it up using the one with the correct checksum/transid and marks the blocks from the bad metadata as free again.

First of all, NOCOW can only be applied to data, thus we always call it NODATACOW.
Metadata is always checksummed, and always COWed. So metadata is always safe.
(as long as the disks implement the barrier operations correctly)

And unfortunately, even with RAID1C3/C4, btrfs won't compare all the mirrors and choose the correct one (the one that matches the most mirrors), mostly because it would bring a huge performance drop.

It's back to the bad disks detection which is lacking in btrfs, unfortunately.

@csnover csnover (Contributor Author) commented Aug 8, 2024

For NOCOW, other RAID1 implementations do mark the bad device with extra flags, and I believe the btrfs part can indeed be improved.

Yes, if some enhancement like this is done then I think btrfs raid1 will be guaranteed to never be worse than traditional software raid1 in all failure modes, which would be great.

But if the data corruption happens without triggering any read error (e.g. for whatever reason the on-disk data simply changed), then it's no better than NODATACOW btrfs RAID1

Sure, this is why I personally opted to use btrfs raid1 instead of mdadm. :-) If I understand correctly, it is only the combination of NOCOW + media failure where btrfs ends up in a worse state, due to what you mention about bad-disk detection and load balancing then reading the wrong data. Unfortunately, media failure is probably the most common failure mode. :-(

just like NODATACOW btrfs, the block-layer RAID1 won't detect any error, and so goes down the same path as btrfs (load-balancing the read)

So I think this is right, but just to confirm: if a user purges the OS file cache and retries, eventually they will read the good data, since the metadata is always kept consistent on all replicas, and the corruption is just due to one disk not writing out the data blocks pointed to by the metadata? This makes a lot of sense and is straightforward enough that I may provide this additional context in the documentation.
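(Roughly what I have in mind, assuming that dropping the page cache forces a fresh read from disk and that each retry runs as a new process with a new pid; the file path is just a placeholder:)

    # flush dirty data and drop the page cache so the next read hits the disk
    sync
    echo 3 > /proc/sys/vm/drop_caches

    # each invocation is a new pid, so with pid-based balancing a retry
    # may be served from the other mirror
    sha256sum /path/to/suspect-nocow-file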

If I got that right, is it possible to force a read from one device (without mounting degraded with only the single healthy device connected)? It would be good to mention in the documentation if this is the case. It would be very sad if users are “losing” data in this situation only because they can’t tell btrfs to force read/overwrite the file from the known good disk. I understand #482 is a request to add something akin to this to scrub, but I wonder if it is possible to do it in an ugly way now?

I did not remove the reference to the bad journal file on my filesystem yet, so I wonder if I could test this. So far it seems like either the file content is cached, or I get unlucky and never hit the right PID to balance to the healthy device, or journald read the bad log on reboot, processed it into garbage, then deleted the original inode, so now it is actually a file that is just bad forever. I have no idea!

I'd prefer to make it clear that btrfs doesn't have any ability to detect bad disks (which can be very complex by itself, and I'm very curious how other RAID1 implementations handle it), so for the NODATACOW + bad disk case users will get data corruption; that part is correct.

OK, I think I can rewrite to be more like this since that feels better to you and it seems OK to me.

But it doesn't mean the other RAID1 implementations are that safe; they just mask their lack of data csums by detecting whether the disk is bad.

Yes, I understand. The “not safe” phrase was meant really to compare to btrfs with CoW data mode, not to any others. :-)

First of all, NOCOW can only be applied to data, thus we always call it NODATACOW.

Sure, this makes sense. I used the terminology NOCOW here because that is how I see the +C attribute described elsewhere (or else in the chattr man page, No_COW). I think it will be good to also be very clear in this documentation that checksums always apply to metadata, so I will try to work this in too.

And unfortunately, even with RAID1C3/C4, btrfs won't compare all the mirrors and choose the correct one (the one that matches the most mirrors), mostly because it would bring a huge performance drop.

This also seems like good information to include in the docs, just for anyone who wonders about it, so I will try to work that in too.

Changes should come tomorrow and I will drop a note when that happens.

Collaborator commented:

if a user purges the OS file cache and retries, eventually they will read the good data, since the metadata is always kept consistent on all replicas, and the corruption is just due to one disk not writing out the data blocks pointed to by the metadata?

Firstly, for metadata reads, if we hit anything bad (csum/generation/level mismatch), we immediately try the other mirrors. Either we get a good one, then write the good content back to the bad mirror, or we exhaust all copies and return an error.

Then for data, if it has a checksum (and checksums are stored in metadata), we repair the same way as above.
If it has no checksum, we just return the content we read out (which mirror that is depends on the pid-based load balancing).

So for data with checksum, we will always grab a good mirror if possible, and a read on the bad mirror will also cure the bad mirror (if there is a good one).

If I got that right, is it possible to force a read from one device (without mounting degraded with only the single healthy device connected)?
Following my previous answer, if there is a data checksum, no forcing is needed; we immediately repair any bad data copy as long as we have a good mirror remaining.
(For metadata it's always repaired since it has csum/level/generation checks.)

For data without csum, whether you can get a good mirror really depends on the pid of the reading process.
It's not as good as "forcing btrfs to read a certain copy", but if you really want to go that path, it's still possible to use the "btrfs-map-logical" command to manually read a certain mirror of a logical bytenr and fetch the good copy.
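A rough, untested sketch of that path (the file path, logical address, and device are placeholders; option names as documented in btrfs-map-logical(8)):

    # find the extents' addresses; on btrfs, filefrag's "physical" offset is
    # the btrfs logical address, counted in filesystem blocks
    filefrag -v /path/to/nocow-file

    # dump 4 KiB of mirror copy 2 at that logical address into a file
    btrfs-map-logical -l 30490624 -c 2 -b 4096 -o /tmp/mirror2.bin /dev/sdX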

But NODATACOW also means we won't COW the data write, meaning that if your process reads out the bad content, modifies some bytes, and writes it back, the good copy is also gone.

So for your journal case, I do not think there is a good chance of getting the good mirror back, because later writes are definitely going to screw up the good mirror.

Contributor Author commented:

For data without csum, whether you can get a good mirror really depends on the pid of the reading process. It's not as good as "forcing btrfs to read a certain copy", but if you really want to go that path, it's still possible to use the "btrfs-map-logical" command to manually read a certain mirror of a logical bytenr and fetch the good copy.

But NODATACOW also means we won't COW the data write, meaning that if your process reads out the bad content, modifies some bytes, and writes it back, the good copy is also gone.

So for your journal case, I do not think there is a good chance of getting the good mirror back, because later writes are definitely going to screw up the good mirror.

After finally performing a device replacement to get rid of the unhealthy disk, the ‘corrupted’ file is readable again, so it didn’t get modified at all. chattr -C results in “Operation not permitted” on this file and three others; I am not sure if that’s intended or why it would be restricted, and no output anywhere says why. Luckily they are small enough to copy, and unimportant enough to lose, but that may not be the case for other users who get burnt by No_COW.

A command which accepts a file path plus a devid or block device path and outputs to stdout would have been good here, since btrfs-map-logical was sufficiently non-trivial that I didn’t bother with it. For users who actually need the data more than I did, it would be good to have a better story. I think this is tracked obliquely in #482, but would it be useful at all to open a separate ticket to track enhancing a separate utility for this? Let me know, thanks.

Contributor commented:

chattr -C results in “Operation not permitted” on this file and three others; I am not sure if that’s intended or why it would be restricted, and no output anywhere says why.

cow/nocow mode cannot be toggled on files that already have data. The man page for chattr states:

Note: For btrfs, the ’C’ flag should be set on new or empty files.
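In practice that means something like the following (a sketch; the paths are examples only):

    # the flag can only be applied while the file is still empty
    touch newfile && chattr +C newfile

    # or set it on a directory, so that newly created files inherit it
    mkdir -p nocow-dir && chattr +C nocow-dir

    # to get a checksummed (CoW) copy of existing No_COW data, copy the
    # contents into a fresh file in a directory without the C flag
    cp --reflink=never nocow-dir/data.db /srv/cow-dir/data.db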

@csnover csnover (Contributor Author) commented Aug 14, 2024

The man page for chattr states: […]

Sure, but I was trying to clear the flag, not set it, you see :-)

By the way, thank you for your wiki content, I hope that much of it makes its way into the official documentation someday; it is quite well organised and the examples for how to handle disk replacement were the best instructions I found and I think the official documentation deserves that level of clarity.

Contributor commented:

The man page for chattr states: […]

Sure, but I was trying to clear the flag, not set it, you see :-)

True, we could always improve documentation and be better at providing good UX for users.

By the way, thank you for your wiki content, I hope that much of it makes its way into the official documentation someday; it is quite well organised and the examples for how to handle disk replacement were the best instructions I found and I think the official documentation deserves that level of clarity.

Thanks! I am glad the wiki is useful. I started it mostly as a way to learn Btrfs and write down how to do things in an easy way. It's great to see so much improvement that has gone into the official documentation since then 🙏

With NOCOW, my understanding is that btrfs does not even meet standards of traditional raid1 systems which would have marked the unhealthy replica as tainted and only read from the healthy one.

IMHO, it may not always be optimal to mark a disk as bad when it clearly can read some data because it could happen that data on the good disk also becomes corrupt. Still, some logic to determine how and when a disk should be dropped/marked bad would be useful. Probably best implemented as a user-space daemon that monitors error rate, etc.

Also, to clarify: md/dm raid also would not detect corruption. They would detect read errors, but so would Btrfs.

And unfortunately, even with RAID1C3/C4, btrfs won't compare all the mirrors and choose the correct one (the one that matches the most mirrors), mostly because it would bring a huge performance drop.

Such a feature could probably be added to scrub. It could let users detect and choose which copy to preserve. #482

@csnover csnover (Contributor Author) commented Aug 15, 2024

IMHO, it may not always be optimal to mark a disk as bad when it clearly can read some data because it could happen that data on the good disk also becomes corrupt. Still, some logic to determine how and when a disk should be dropped/marked bad would be useful. Probably best implemented as a user-space daemon that monitors error rate, etc.

Sure, I can see arguments on both sides of this, but it can only really be decided once btrfs has a functioning write-intent bitmap. Reads from a failed device—which mdadm will do during recovery when the failed device is still accessible, to avoid stressing other disks into failure—are only safe with a write-intent bitmap. It is unsound for btrfs to continue reading from a disk with failed writes because it has no write-intent bitmap to skip/fix outdated blocks. My understanding is that this is absolutely true for nodatasum data, but it is also true even for datasum data due to the possibility of hash collision, as mentioned by @Zygo—though the risk is low enough in practice that you can probably get away with it a lot of the time (I managed).

Also, to clarify: md/dm raid also would not detect corruption. They would detect read errors, but so would Btrfs.

mdadm maintains a write-intent bitmap, so it does detect one form of “corruption” that btrfs does not: “corruption” due to writes that only made it to one device.

I tried to drive it home in my mailing list post, and I try to do the same here: btrfs needs a write-intent bitmap urgently. Without a write-intent bitmap, btrfs is like a super-advanced car with the most advanced airbags and collision-avoidance sensor technology, but no seatbelts.


Notably, `systemd sets NOCOW on journals by default <https://github.com/systemd/systemd/commit/11689d2a021d95a8447d938180e0962cd9439763>`_,
and `libvirt ≥ 6.6 sets NOCOW on storage pool directories by default <https://www.libvirt.org/news.html#v6-6-0-2020-08-02>`_.
Other distributions may also enable NOCOW on database files or directories to
Collaborator commented:

And it's not only the user's choice; certain IO patterns can lead to csum mismatches by themselves (although that counts as abuse).

@zatricky zatricky commented Sep 8, 2024

I feel like this "feature" being made a default is a huge failure on the part of the libvirt developers. Checksums are super important for data integrity and they are disabled by this. For me to only find out about it after 4 years thanks to this PR feels embarrassing.

I checked all my servers that use btrfs and found that most did not have this problem - but that is likely because the image subvolumes and files were created via automation before the storage pool configurations were created. However, whenever manual action was taken, there was always a chance that the storage pool configuration would create the path with nocow.

@adam900710 (Collaborator) commented:

The latest version looks good to me.

I'll still leave the PR open for any extra feedback until it gets merged.

replicas in a replicated filesystem. This is not a problem normally because
damage is detected by checksum validation and a mirror copy is used, but
because ``No_COW`` files are not protected by checksum, bad data may be
returned even if a good copy exists on another replica. Which replica is used
Review comment:

Which replica is used is determined by the read policy, which can be changed through sysfs; however, the only currently implemented read policy in upstream Linux is 'pid'.
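(For reference, on kernels that expose it, the policy lives in sysfs; the fsid in the path below is a placeholder:)

    # show the read policies; the active one is shown in brackets, e.g. [pid]
    cat /sys/fs/btrfs/<FSID>/read_policy

    # select a policy (only 'pid' exists in mainline at the time of writing)
    echo pid > /sys/fs/btrfs/<FSID>/read_policy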

So this isn't wrong today, but one day it may be.

It might be better to say "The filesystem's configured read policy determines which replica will be used."

Contributor Author commented:

Sure, I will clarify.

Contributor Author commented:

I’ve reworded this to mention the read policy setting.

@csnover csnover (Contributor Author) commented Sep 17, 2024

Why is this not merged yet?

@adam900710 (Collaborator) commented:

This is conflicting with the latest devel branch.

Mind rebasing the branch onto the latest devel?

@csnover csnover (Contributor Author) commented Oct 20, 2024

This PR was broken by whatever happened at #863 (commits) that caused extra stuff to be added to the branch by someone else. Per your request I have done a git rebase --skip over these commits and force-pushed.

@csnover csnover (Contributor Author) commented Oct 20, 2024

CI is failing because the base branch is broken, not due to changes in this PR.

from disk (:doc:`Tree-checker`) but it's not extensive and cannot substitute
full :doc:`btrfs-check` run.
Scrub is not a filesystem checker (fsck). It can only detect filesystem damage
using the (:doc:`Tree-checker`) and checksum validation, and it can only repair
Collaborator commented:

Unfortunately scrub does not utilize the tree-checker for metadata.

Thus it only checks:

  • Bytenr
  • Fsid and chunk tree uuid
  • Checksum
  • Generation

So it's not even as strong as tree-checker.

information loss. However, if a valid copy exists on another replica, btrfs will
transparently correct the read error, and running ``btrfs scrub`` in read-write
mode will fix the error permanently by copying the valid metadata block over the
invalid one. Otherwise, the recovery tool ``btrfs restore`` is able to ignore
Collaborator commented:

Nowadays we have the -o rescue=all,ro mount option, which is a more useful solution compared to btrfs restore.

Thus we may want to promote the new solution and use btrfs restore as the last resort (in most cases it's not much better than rescue=all).
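For example (a sketch; device and paths are placeholders):

    # read-only rescue mount: tolerates as much metadata damage as possible
    mount -o ro,rescue=all /dev/sdX /mnt

    # then copy the data out with ordinary tools
    cp -a /mnt/important-data /some/other/disk/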

Even though the filesystem checker requires a device argument, it scans for all
devices belonging to the same filesystem and may report metadata errors from other
devices that are correctable by :command:`btrfs scrub`. In this case, run scrub
first to ensure any correctable metadata errors are fixed to avoid false-positives.
Collaborator commented:

I do not think this note is needed.

Btrfs check handles metadata errors just like the kernel: if a mirror is corrupted but another good one is still available, btrfs check won't report it as an error. E.g.:

Opening filesystem to check...
checksum verify failed on 30490624 wanted 0xcdcdcdcd found 0x86f5c8d8 << Mirror 1 corrupted for csum root
Checking filesystem on test.img
UUID: f01a583a-df1e-414b-a24c-7fe8bf2ef019
[1/8] checking log skipped (none written)
[2/8] checking root items
[3/8] checking extents
[4/8] checking free space tree
[5/8] checking fs roots
[6/8] checking only csums items (without verifying data)
[7/8] checking root refs
[8/8] checking quota groups skipped (not enabled on this FS)
found 147456 bytes used, no error found <<< Still no error.
total csum bytes: 0
total tree bytes: 147456
total fs tree bytes: 32768
total extent tree bytes: 16384
btree space waste bytes: 140595
file data blocks allocated: 0
 referenced 0

So there is no false positive. And btrfs scrub has a much higher bar to clear (an RW mount), while btrfs check is always the most comprehensive check tool.

Contributor Author commented:

This directly contradicts my experience when using btrfs-progs 6.6.3. In which version was this changed/fixed?

Collaborator commented:

IIRC it has always been the case from the beginning.

Mind giving an example where btrfs-check reports recoverable corrupted mirrors as a critical error?
You can use btrfs-map-logical to find out the physical address of a metadata copy and corrupt it.
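Something along these lines (an untested sketch against a throw-away test image; the logical/physical numbers are placeholders and the exact output format may differ):

    # pick a metadata bytenr (e.g. from btrfs inspect-internal dump-super or
    # dump-tree), then map it to its physical copies while the fs is unmounted
    btrfs-map-logical -l 30490624 /dev/loop0
    #   mirror 1 logical 30490624 physical 30490624 device /dev/loop0
    #   mirror 2 logical 30490624 physical 296747008 device /dev/loop1

    # overwrite one copy to simulate corruption of a single mirror
    dd if=/dev/urandom of=/dev/loop0 bs=1 seek=30490624 count=4096 conv=notrunc

    # check should still come back clean, served from the good mirror
    btrfs check /dev/loop0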

Contributor Author commented:

The history of the problem is linked in the PR description. Briefly: a disk failed and temporarily dropped from the array. Therefore, the two disks were no longer in sync after rebooting. btrfs check claimed the filesystem was corrupted and printed messages to that end.

In consultation with @Zygo on IRC I manually verified with btrfs inspect-internal that btrfs check was reporting invalid node types/values that existed only on the bad disk. The kernel logs showed the kernel driver was noticing the invalid metadata nodes on the bad disk and switching to the good one, but btrfs check was claiming filesystem corruption even though the good disk was not corrupted and the filesystem metadata was fine. There was no indication by btrfs check that it was checking (only? due to PID load-balancing?) the bad device even when explicitly given the path to the good disk. Only after a read-write btrfs scrub fixed up the metadata on the bad disk did btrfs check report no errors.

Collaborator commented:

Do you still have that fs?

I'm checking the btrfs_read_extent_buffer() function and still cannot find a good way for the split-brain case to make btrfs-check go to the bad copy without trying the good one.

I think it's more likely a bug than expected behavior.

@csnover csnover (Contributor Author) commented Oct 21, 2024

Do you still have that fs?

I do not, btrfs replace was used to replace the defective hardware a long time ago. However, until something is done to fix the lost writes problem (like this MVP idea from Oct 2), it seems like a synthetic reproducer should be trivial: create a raid1 fs from two disks, forcibly remove one disk, ensure btrfs continues writing to the remaining disk for a little bit, then unmount the fs, reconnect the forcibly removed disk, and remount.
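(An untested sketch of what I mean, using loop devices; degraded mounts can be picky about stale device scans, so treat this as the shape of the test rather than a working script:)

    # two-device raid1 filesystem on loop devices
    truncate -s 2G disk-a.img disk-b.img
    DEV_A=$(losetup --find --show disk-a.img)
    DEV_B=$(losetup --find --show disk-b.img)
    mkfs.btrfs -f -d raid1 -m raid1 "$DEV_A" "$DEV_B"

    # "fail" one disk: detach it, then mount the survivor degraded and keep
    # writing so the two copies diverge
    losetup -d "$DEV_B"
    btrfs device scan --forget
    mount -o degraded "$DEV_A" /mnt
    dd if=/dev/urandom of=/mnt/file bs=1M count=64
    umount /mnt

    # reconnect the stale disk and see what the checker makes of the split brain
    DEV_B=$(losetup --find --show disk-b.img)
    btrfs check "$DEV_A"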

I'm checking the btrfs_read_extent_buffer() function and still cannot find a good way for the split-brain case to make btrfs-check go to the bad copy without trying the good one.

If the problem can’t be found in the latest version then maybe it was fixed between 6.6.3 and current head? This was btrfs-progs 6.6.3 (which is still the latest available packaged version for Debian) on kernel 6.9.10.

I think it's more likely a bug than expected behavior.

Sure, this seems likely to me.

@adam900710 (Collaborator) commented:

Although I have merged this series, I'm going to do a lot of updates on the wording and of course add SoB lines to all involved patches, and re-order the patches (some patches are fixing things from a previous one, which should not happen for out-of-tree patches).

I was a little too optimistic with this merge, and the GitHub web UI is not a good way to review patches; it hides the author's SoB line, which makes it harder to spot a missing one.

I'll let you know when the devel branch is updated so you can verify whether the modifications are fine.

@adam900710 (Collaborator) commented:

The branch is now updated (devel branch). I have done the following modifications:

  • Add an SoB line to all patches
    I'm using <[email protected]> as your mail address.
    If you want to use a different address, please let me know.

  • Merge all scrub-related commits into a single one

  • Remove the btrfs-check related warnings
    As I explained, btrfs-check is designed to choose the good copy (for metadata at least) just like the kernel.
    So the warning that a bad mirror will cause false alerts describes behaviour that is, at least, not expected.

  • Explain scrub in more detail
    It's done on a per-device basis, and a full fs scrub is done by scrubbing each device in parallel.

    Furthermore, the repair behaviour of scrub (using a good mirror) also applies to all regular metadata/data reads.
    There is really nothing that special about scrub, thus a lot of the recommendations to run btrfs scrub before
    btrfs-check have been removed.

  • Remove the pid-based load balance patch
    It's bad timing, as we're also implementing other read policies.
    I'd prefer to remove all mention of the pid-based read balancing for now.

@csnover csnover (Contributor Author) commented Nov 5, 2024

To get the easier administrative part out of the way: I didn’t actually sign off on whatever was changed, so you probably shouldn’t list me in such a field, but I also don’t really care, so using that email address is totally fine if you need the field to be there to get these changes published.

Regarding the other feedback: I’m starting to wonder if there is an irreconcilable difference in philosophy here, but I will try advocating for my position one last time in the hope of at least achieving a mutual understanding of why I don’t think the outright removal of some of the notes I added is OK.

First, please know that it is not my intent to attack anyone or their work personally. I understand the many limitations caused by the under-resourced and voluntary nature of OSS work. If my feedback ever feels overly harsh or critical, it is because of my conviction that this is extremely important to get across, and so I am speaking very directly.

So: What is the purpose of end-user documentation? Is it supposed to help users to understand and operate the system safely as it actually is, or is it supposed to be a white paper that describes an idealised future version of the system? I believe it to be the former. The feedback I’m receiving on this ticket suggests a belief that it is closer to the latter.

Just because something will be changed in the future doesn’t mean its current or past behaviour should be left undocumented. btrfs is not an evergreen SaaS. People will be using the old versions for years and they need to know how those old versions work too. If the behaviour changes, then the documentation changes to read “in versions before N”. Descriptions of old behaviour can be moved to an “obsolete” page when N-1 is not still in supported LTS releases, but retaining such information somewhere official is necessary to clarify whatever outdated information is still floating out there on the internet.

Similarly, if the purpose is to help understanding and safe operation, then just because something is a bug doesn’t mean it should not be documented. There is even an industry term for this: errata. Even when a bug is self-evident (e.g. a crash), it still warrants documenting if there is a known workaround. In the case of btrfs check returning false positives, there are two reasons why I find it more critical to document than a run-of-the-mill defect:

  1. btrfs check appears to be working when it is not, thus it is not self-evident there is a bug causing false positives; and
  2. The need for clarity is heightened in high-stress situations like potential data loss/disaster recovery scenarios.

Whether the behaviour is intended is irrelevant to what the end user needs to know to understand and operate the system, and such information needs to be written down somewhere, in a place where end-users will think to look, which is official and up-to-date.

By way of suggested alternatives: if the btrfs bug tracker wasn’t such an unkempt graveyard, I’d be more receptive to the argument that tickets provide a reasonable stand-in for errata. But, as it stands, there are at least three separate bug trackers I’m aware of (and I am guessing SUSE has an internal one, which would make four); the kernel tracker alone has 296 open bugs, 97% of which are still in NEW status. Many of them, like “Add option to check progress/status of btrfs-replace job”, are clearly obsolete. If no one is doing the minimum necessary work of project management to ensure bug trackers are reliable sources of information, they simply can’t be used as alternatives to documentation.

To look at this another way: Who is harmed by these notes? Do they suggest performing a dangerous operation? Do they enhance, or harm, the user’s understanding? Do they overwhelm other more important details, and if so, how can the same information be presented in a way that doesn’t do that, rather than just omitting unpleasant reality?

Given the inability to immediately find a code path that could be causing incorrect diagnostics from btrfs check and the false nature of many user reports of issues, I recognise the scepticism about my report that btrfs check did report errors on an array with only one corrupted mirror, which went away after btrfs scrub corrected the errors on the bad mirror. I don’t know what to say other than to fall back on my reputation as not being the kind of person who is generally incapable of telling whether a command is printing out errors or not. Trying to run btrfs check in lowmem mode even segfaulted! Surely that wouldn’t happen if the metadata it was reading was from the good mirror?

Let me know your thoughts, if you’d like.

Best,

@adam900710 (Collaborator) commented:

I think you miss one point: an RW mount on a possibly corrupted filesystem can cause problems by itself.
E.g. with a corrupted extent tree/free space cache/free space tree, which breaks COW, just RW mounting with the extra atime updates or whatever can already do further damage.

Thus I do not recommend RW scrub as a way to fix things, as it has the possibility to further corrupt the filesystem.

Except for certain out-of-sync devices, common corruption like transid mismatches cannot really be repaired by scrub. It's more likely that some device is not honoring FLUSH/FUA correctly, and that is a bigger problem by itself.

Furthermore, if scrub can fix it, then there should be no observable problem except the dmesg error messages.
As I have explained in the updated doc, regular kernel data/metadata reads will do the repair themselves, and we have several test cases to verify that behavior (including for scrub and for regular reads).

So the idea of running RW scrub to self-heal is not any better than just running the filesystem as usual.

Labels: docs (Changes in documentation or help text)
Projects: None yet
6 participants