
Linux: syncfs(2) should sync all cached files #16818

Draft · wants to merge 2 commits into master from fix-syncfs
Conversation

@snajpa (Contributor) commented Nov 28, 2024

Motivation and Context

While debugging a problem with data not being synced on umount for #16817, we noticed that even syncfs(2) doesn't help to get the data written out correctly. That is because it doesn't actually sync any of the live znodes.

Description

I've used zpl_fsync, unlike filemap_write_and_wait in #16817, since the former calls zil_commit and syncfs(2) is about on-disk permanence. I think that if the file is opened with O_SYNC, all the dirtied pages should be flushed right away, so perhaps filemap_write_and_wait would suffice, but I'm not 100% sure (and this is how we fixed it for the time being on our infra).

edit: it seems that writes to mmap'd pages aren't flushed immediately unless msync(2) is called. There is the MAP_SYNC mmap flag, but it isn't supported on filesystems without DAX support.

edit: see discussion below please

How Has This Been Tested?

Locally along with #16817

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@github-actions github-actions bot added the Status: Work in Progress Not yet ready for general review label Nov 28, 2024
@snajpa snajpa changed the title Linux: syncfs(2) should sync all open files Linux: syncfs(2) should sync all cached files Nov 28, 2024
@snajpa snajpa marked this pull request as ready for review November 28, 2024 22:19
@github-actions github-actions bot added Status: Code Review Needed Ready for review and testing and removed Status: Work in Progress Not yet ready for general review labels Nov 28, 2024
@snajpa snajpa force-pushed the fix-syncfs branch 2 times, most recently from 67d6731 to 88d1c8e Compare November 29, 2024 21:29
@snajpa (Contributor Author) commented Nov 29, 2024

This is from a run where I had still mistakenly used zfs_fsync directly instead of zpl_fsync as I intended, but I think it already shows that this approach won't work, or that I'd need to modify zil_commit{,_impl} to accept a flag enabling it to bail out if zilog->zl_suspend is set or when zil_commit_writer fails.

  [ 4554.954887] ZTS run /usr/share/zfs/zfs-tests/tests/functional/cli_root/zpool_import/zpool_import_status
  [ 4648.819602] WARNING: Pool 'testpool1' has encountered an uncorrectable I/O failure and has been suspended.
  [ 4793.115702] INFO: task txg_sync:451710 blocked for more than 120 seconds.
  [ 4793.119178]       Tainted: P           OE     -------- -  - 4.18.0-553.30.1.el8_10.x86_64 #1
  [ 4793.123190] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 4793.126920] task:txg_sync        state:D stack:0     pid:451710 ppid:2      flags:0x80004080
  [ 4793.130844] Call Trace:
  [ 4793.132278]  __schedule+0x2d1/0x870
  [ 4793.134255]  ? __wake_up_common+0x7a/0x190
  [ 4793.136426]  schedule+0x55/0xf0
  [ 4793.138112]  schedule_timeout+0x19f/0x320
  [ 4793.140215]  ? __next_timer_interrupt+0xf0/0xf0
  [ 4793.142444]  ? srso_alias_return_thunk+0x5/0xfcdfd
  [ 4793.144809]  ? srso_alias_return_thunk+0x5/0xfcdfd
  [ 4793.147180]  ? taskq_dispatch+0xab/0x380 [spl]
  [ 4793.149671]  ? srso_alias_return_thunk+0x5/0xfcdfd
  [ 4793.152164]  io_schedule_timeout+0x19/0x40
  [ 4793.154317]  __cv_timedwait_common+0x19e/0x2c0 [spl]
  [ 4793.156958]  ? finish_wait+0x80/0x80
  [ 4793.158901]  __cv_timedwait_io+0x15/0x20 [spl]
  [ 4793.161258]  zio_wait+0x1ad/0x4f0 [zfs]
  [ 4793.165534]  dsl_pool_sync_mos+0x37/0x130 [zfs]
  [ 4793.168155]  dsl_pool_sync+0x510/0x6c0 [zfs]
  [ 4793.170547]  spa_sync_iterate_to_convergence+0xcb/0x310 [zfs]
  [ 4793.173798]  spa_sync+0x362/0x8f0 [zfs]
  [ 4793.175966]  txg_sync_thread+0x27a/0x3b0 [zfs]
  [ 4793.178396]  ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
  [ 4793.181052]  ? spl_assert.constprop.0+0x20/0x20 [spl]
  [ 4793.183600]  thread_generic_wrapper+0x63/0x90 [spl]
  [ 4793.186095]  kthread+0x134/0x150
  [ 4793.187923]  ? set_kthread_struct+0x50/0x50
  [ 4793.190093]  ret_from_fork+0x35/0x40
  [ 4793.192111] INFO: task zfs:453754 blocked for more than 120 seconds.
  [ 4793.195206]       Tainted: P           OE     -------- -  - 4.18.0-553.30.1.el8_10.x86_64 #1
  [ 4793.199137] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 4793.202858] task:zfs             state:D stack:0     pid:453754 ppid:451674 flags:0x00004080
  [ 4793.206945] Call Trace:
  [ 4793.208339]  __schedule+0x2d1/0x870
  [ 4793.210158]  ? srso_alias_return_thunk+0x5/0xfcdfd
  [ 4793.212460]  schedule+0x55/0xf0
  [ 4793.214165]  io_schedule+0x12/0x40
  [ 4793.215949]  cv_wait_common+0x108/0x280 [spl]
  [ 4793.218189]  ? finish_wait+0x80/0x80
  [ 4793.220082]  txg_wait_synced_impl+0x12b/0x2b0 [zfs]
  [ 4793.222720]  txg_wait_synced+0xc/0x40 [zfs]
  [ 4793.225076]  zil_commit_impl+0x10f/0x120 [zfs]
  [ 4793.227504]  zfs_fsync+0x66/0x90 [zfs]
  [ 4793.229637]  zpl_sync_fs+0xc5/0x260 [zfs]
  [ 4793.231990]  __sync_filesystem+0x2c/0x60
  [ 4793.234068]  sync_filesystem+0x27/0x40
  [ 4793.236016]  generic_shutdown_super+0x22/0x110
  [ 4793.238349]  kill_anon_super+0x14/0x30
  [ 4793.240319]  deactivate_locked_super+0x34/0x70
  [ 4793.242613]  cleanup_mnt+0x3b/0x70
  [ 4793.244327]  task_work_run+0x8a/0xb0
  [ 4793.246402]  exit_to_usermode_loop+0xef/0x100
  [ 4793.248733]  do_syscall_64+0x195/0x1a0
  [ 4793.250710]  entry_SYSCALL_64_after_hwframe+0x66/0xcb

While debugging a problem with data not being synced on umount, we
noticed that even syncfs(2) doesn't help to get the data written
out. That is because it doesn't actually sync any of the live znodes.

Signed-off-by: Pavel Snajdr <[email protected]>
@snajpa (Contributor Author) commented Nov 29, 2024

So I'll use filemap_write_and_wait in the end, if nobody objects - making the change to allow committing to the ZIL directly here seems pretty intrusive. The problem is that we can't call txg_wait_synced in this path, as it deadlocks with dsl_pool_sync_mos.

I was probably overthinking it anyway: it seems that filemap_write_and_wait will result in our zpl_writepages being called with wbc->sync_mode = WB_SYNC_ALL, thus doing the zil_commit I wanted, O_SYNC or not.

@robn (Member) commented Nov 29, 2024

Yes, don't go down that road with zil_commit(): it takes you places you do not want to go. (I did it on 2.1 for a customer project; I'm still getting around to reworking it for upstream, and I did a whole presentation on it.)

I'm not sure I understand the premise here. Granted, it's not easy to know what syncfs() is supposed to be exactly, because it's Linux-specific. The manpage suggests "like sync() but for a single filesystem", in which case it's a request to sync, not a wait until completion. But the manpage also says Linux has always waited for sync(), so maybe it should wait.

If it's just a request, it seems like what we have should be fine. If it's a wait, then why can we not just call txg_wait_synced()?

Though, it sounds like you might be talking about getting mmap'd pages out as well? The only relevant ones are MAP_SHARED+PROT_WRITE, and for those I'm pretty sure you have to msync() or munmap() to get them committed; there's no standard interaction between maps and fsync(), and definitely none with sync().

Ahh, now I see zpl_fsync() is flushing maps as well, so maybe for consistency that's the right thing to do.

Yeah ok, I think your original is the right shape then (haven't thought about the locking). Flush out the maps if we can, then the rest of the filesystem.

@snajpa (Contributor Author) commented Nov 29, 2024

@robn getting the dirty pages onto the ZPL so it can deal with them <somehow> is what I would expect syncfs(2) to do - the question is what exactly that means, which actions. But since filemap_write_and_wait in combination with zpl_writepages already conveniently results in one common zil_commit, I think it's good enough from a consistency point of view. The potential problem with this is that it's always good enough for consistency, even when you don't need it...

It would seem that it doesn't matter whether a file was opened with O_SYNC as far as mmap is concerned; there's MAP_SYNC for that. But the FS has to support DAX for MAP_SYNC to be usable, if I understand it correctly; it's not supported otherwise.

@snajpa (Contributor Author) commented Nov 29, 2024

There's also this issue with snapshots - if you're taking a snapshot, wouldn't you expect it to contain all the dirty data at the moment the snapshot was taken?

@snajpa (Contributor Author) commented Nov 29, 2024

Hmm, there's another option: I could just call zpl_writepages with a custom wbc to initiate write_cache_pages with wbc->sync_mode = WB_SYNC_NONE or WB_SYNC_ALL as needed, and use that around the codebase to make sure dirty pages are synced as expected. For snapshots, ALL; for syncfs(2), probably NONE. Ideas/thoughts?

@robn (Member) commented Nov 30, 2024

> @robn getting the dirty pages onto the ZPL so it can deal with them would be what I would expect syncfs(2) should do.

I wouldn't necessarily expect that from the manpage, but I've just read the Linux code and (via sync_filesystem) I think that's what is expected. Fine.

> It would seem that it doesn't matter if a file was open with O_SYNC for mmap, there's MAP_SYNC for that - but the FS has to support DAX, if I understand it correctly, so MAP_SYNC can be used, it's not supported otherwise.

Yes, I don't think MAP_SYNC is relevant here.

As for O_SYNC (& O_DSYNC), the interactions between maps and the underlying file aren't really defined. However, on Linux, msync(MS_SYNC) will end up in vfs_fsync_range(), which for us is zpl_fsync(). Meanwhile, msync(MS_ASYNC) is an effective no-op on Linux; the underlying filesystem doesn't get involved at all.

So that's good: seems like we're doing the right thing wrt msync() and munmap(). So the question is simply about other operations that do or do not sync maps as a side-effect. Now I'm digging into the in-kernel filesystems to see what they think.

sync() and syncfs() both arrive in sync_filesystem(). That opens with writeback_inodes_sb(WB_REASON_SYNC) and later, sync_inodes_sb(). Both of these go through sb->s_bdi to issue any writeback work to the underlying device, interleaving calls to sync_fs. We barely set up s_bdi, definitely not enough for writeback_inodes_sb() to do anything, but I'm not sure if sync_inodes_sb() will still push out dirty pages without any particular setup. More investigation needed there (out of time right this minute), but I suspect if we accounted for dirty pages properly in s_bdi, we'd get page flushes "for free" on sync_fs().

> There's also this issue with snapshots - if you're taking a snapshot, wouldn't you expect it to contain all the dirty data of that moment, when the snapshot was taken?

I mean, I wouldn't expect it to contain anything that hasn't been explicitly committed, but I've got a bee in my bonnet about fsync(). But I suppose right now, it does wait until the txg is completed before creating the snapshot (dsl_sync_task(.., dsl_dataset_snapshot_sync)), so it will get all the regular file traffic.

So I guess it's easy to argue that a txg sync should also push out dirty mapped pages. It's also easy to argue that the existing semantics are clear: until you msync() or munmap(), those changes don't "exist". Or maybe you want to argue a middle ground, that snapshots are special.

Personally, I would be fine with flushing maps on syncfs(), since that appears to be the expected behaviour on Linux, but I wouldn't change anything else. Snapshots are about sticking a pin in what's already on-disk, and if you haven't explicitly committed, then you have no expectation that those changes are on disk.

@snajpa (Contributor Author) commented Nov 30, 2024

> sync() and syncfs() both arrive in sync_filesystem(). That opens with writeback_inodes_sb(WB_REASON_SYNC) and later, sync_inodes_sb(). Both of these go through sb->s_bdi to issue any writeback work to the underlying device, interleaving calls to sync_fs. We barely set up s_bdi, definitely not enough for writeback_inodes_sb() to do anything, but I'm not sure if sync_inodes_sb() will still push out dirty pages without any particular setup. More investigation needed there (out of time right this minute), but I suspect if we accounted for dirty pages properly in s_bdi, we'd get page flushes "for free" on sync_fs().

I'll look into this, thx!

> Personally, I would be fine with flushing maps on syncfs(), since that appears to be the expected behaviour on Linux, but I wouldn't change anything else. Snapshots are about sticking a pin in what's already on-disk, and if you haven't explicitly committed, then you have no expectation that those changes are on disk.

Agreed.

@snajpa (Contributor Author) commented Nov 30, 2024

@robn tbh I don't see from the code why it wouldn't work - what info are we supposed to fill out? I'll give it more than a few tens of minutes, of course, but if you already know, it could save me some time.

Comment on lines +124 to +134
	mutex_enter(&zfsvfs->z_znodes_lock);
	for (zp = list_head(&zfsvfs->z_all_znodes); zp;
	    zp = list_next(&zfsvfs->z_all_znodes, zp)) {
		if (zp->z_sa_hdl)
			error = zpl_writepages(ZTOI(zp)->i_mapping, &wbc);
		if (error != 0)
			break;
	}
	mutex_exit(&zfsvfs->z_znodes_lock);
	if (error == 0)
		error = -zfs_sync(sb, wait, cr);
Member:
Without looking at anything else, I wonder if it would be better to sync as much as we can, even in case of errors?

Contributor Author:

Probably, yeah - I'll just need to figure out a way that doesn't call zil_commit at all, because that can apparently deadlock if there's an I/O error.

Contributor:

The (admittedly rare) situation where my external USB drive cage occasionally drops a connection to a disk, causing an I/O error that freezes the pool and leaves me unable to reboot without a power cycle, isn't great, so thank you for trying to avoid deadlocks on I/O errors.

@snajpa snajpa marked this pull request as draft December 2, 2024 19:14
@github-actions github-actions bot added Status: Work in Progress Not yet ready for general review and removed Status: Code Review Needed Ready for review and testing labels Dec 2, 2024
@robn (Member) commented Dec 10, 2024

I've been reading code this afternoon, but it's very slow, and I don't have much time or brain for it, so I might be a while yet. Mostly, I'm trying to make sure we're definitely handling mmap and friends correctly.

Is there a small, isolated reproduction available? We'll need something we can put in the test suite anyway, but also, I haven't seen this myself, and I see mentions of Docker and/or containers in relation to this, which I'm not set up for.

But I also wonder if it's only seen in containers; that might be part of the problem. Nothing concrete yet, but it does seem like there's a bit more complication in the kernel when an alternate memcg is in play, enough to make me wonder whether this is only about containers.

So yeah, some easy repro would be great, and I'll do more reading when I can.

@snajpa (Contributor Author) commented Dec 20, 2024

@robn thanks for looking into it; a repro for this would be along the lines of this, only with syncfs(2) called on the dataset before umount. Given that syncfs(2) didn't help to get the data written out, I'd say it probably deserves some further attention. I haven't given up on this yet, I just needed to take a break for a few days :)

Labels: Status: Work in Progress (Not yet ready for general review)
4 participants