ZFS out-of-order writes after system crash. ZIL partial txg replay? #12572
Unanswered
shantanukshire
asked this question in
Q&A
Replies: 0 comments
Problem: A user application issues pwrite() calls to certain zvol offsets, and the system crashes while application I/O is in progress. After the system is back online, the zvol data is inconsistent: the writes appear to have been applied out of order. The issue occurs specifically when the crash happens while a flush is in flight.
Here is an example of the application writes and fsync calls issued to the zvol.
After the crash, the ZIL is replayed at startup. However, since the flush was incomplete, we effectively see the replay of a partial txg. Parsing the ZIL content gives the offset data below.
My understanding is that after a flush call, ZFS starts writing all incomplete txgs to the ZIL (in the above scenario, txgs 29 and 30).
The data "BBA" should be written only after the previous writes, i.e. those at 0x1342fe0000, 0x1342ff0000 and 0x1343000000, but this is not what happens! The content of the common block (0x1300000000) seems to be taken from the open txg. This is an ordering problem.
Note that if we disable ZIL replay, the issue does not occur. So there is certainly a problem with partial txg replay.
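To make the ordering expectation concrete, here is a rough sketch of the check I have in mind: after a crash, the volume should look like the result of applying some prefix of the writes in issue order. Everything here is illustrative and assumed: the function names are made up, the image starts zero-filled, and the offsets and payloads are small placeholders rather than the real ones from my workload.

```python
def apply_writes(writes, size):
    """Apply (offset, data) writes in order to a zero-filled image of `size` bytes."""
    image = bytearray(size)
    for offset, data in writes:
        if offset + len(data) > size:
            raise ValueError("write falls outside the image")
        image[offset:offset + len(data)] = data
    return bytes(image)


def consistent_prefix_length(writes, observed):
    """Return the largest k such that `observed` equals the image after the
    first k writes, or None if no prefix of the write sequence produces it."""
    image = bytearray(len(observed))
    match = 0 if bytes(image) == observed else None
    for k, (offset, data) in enumerate(writes, start=1):
        if offset + len(data) > len(observed):
            raise ValueError("write falls outside the image")
        image[offset:offset + len(data)] = data
        if bytes(image) == observed:
            match = k
    return match


# Hypothetical sequence: the third write overwrites the block touched by the first.
writes = [(0, b"AA"), (4, b"BB"), (0, b"BA")]

# An image holding only the first two writes is a valid crash state.
in_order = apply_writes(writes[:2], 8)

# An image holding the third write but not the second is not: this mirrors
# the symptom described above, where later data shows up without the earlier writes.
out_of_order = b"BA\x00\x00\x00\x00\x00\x00"
```

With these placeholder values, `consistent_prefix_length(writes, in_order)` finds a matching prefix, while the out-of-order image matches none.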
I'm new to ZFS internals, so I have a few questions.
Note that we don't use sync=always. The fsync() calls are used sparingly for performance reasons.
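For reference, the general shape of the I/O pattern (several pwrite() calls, with an fsync() only now and then) looks roughly like the sketch below. This is not the actual application code: the helper names are hypothetical, a temporary file stands in for the zvol device node, and the offsets and payloads are placeholders.

```python
import os
import tempfile


def write_batch(fd, writes, flush=True):
    """Issue each (offset, data) pair with pwrite(), then optionally fsync() once."""
    for offset, data in writes:
        written = os.pwrite(fd, data, offset)
        if written != len(data):
            raise OSError(f"short write at offset {offset:#x}")
    if flush:
        # A single flush covers everything written on this fd so far.
        os.fsync(fd)


def demo():
    """Run two batches against a temp file and read one block back."""
    with tempfile.NamedTemporaryFile() as tmp:
        fd = os.open(tmp.name, os.O_RDWR)
        try:
            # First batch is left unflushed, as most of our writes are.
            write_batch(fd, [(0x0000, b"AAA"), (0x1000, b"BBB")], flush=False)
            # Second batch ends with the occasional fsync().
            write_batch(fd, [(0x2000, b"CCC")], flush=True)
            return os.pread(fd, 3, 0x1000)
        finally:
            os.close(fd)
```

A crash between the pwrite() calls and the completion of that fsync() is the window in which we see the problem.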
(The ZFS version in question is 0.7. Linux kernel: 3.10.0-1062.18.1.el7.x86_64)