
Linux: syncfs(2) should sync all cached files #16818

Draft · wants to merge 2 commits into master from fix-syncfs
Conversation

@snajpa (Contributor) commented Nov 28, 2024

Motivation and Context

While debugging a problem with data not being synced on umount for #16817, we noticed that even syncfs(2) doesn't help to get the data written out correctly. That is because it doesn't actually sync any of the live znodes.

Description

I've used zpl_fsync, unlike filemap_write_and_wait in #16817, since the former calls zil_commit and syncfs(2) is about on-disk permanence. I think that if the file is opened with O_SYNC, all the dirtied pages should be flushed right away, so perhaps filemap_write_and_wait would suffice, but I'm not 100% sure (and this is how we fixed it for the time being on our infra).

edit: it seems that writes to mmap'd pages aren't flushed immediately unless msync(2) is called. There is the MAP_SYNC mmap flag, but it isn't supported on filesystems without DAX support.

edit: see discussion below please

How Has This Been Tested?

Locally along with #16817

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@github-actions github-actions bot added the Status: Work in Progress Not yet ready for general review label Nov 28, 2024
@snajpa snajpa changed the title Linux: syncfs(2) should sync all open files Linux: syncfs(2) should sync all cached files Nov 28, 2024
@snajpa snajpa marked this pull request as ready for review November 28, 2024 22:19
@github-actions github-actions bot added Status: Code Review Needed Ready for review and testing and removed Status: Work in Progress Not yet ready for general review labels Nov 28, 2024
@snajpa snajpa force-pushed the fix-syncfs branch 2 times, most recently from 67d6731 to 88d1c8e Compare November 29, 2024 21:29
@snajpa (Contributor Author) commented Nov 29, 2024

This is from a run where I had still mistakenly used zfs_fsync directly instead of zpl_fsync as I intended, but I think it already shows that this approach won't work, or that I'd need to modify zil_commit{,_impl} to accept a flag enabling it to bail out if zilog->zl_suspend is set or when zil_commit_writer fails.

  [ 4554.954887] ZTS run /usr/share/zfs/zfs-tests/tests/functional/cli_root/zpool_import/zpool_import_status
  [ 4648.819602] WARNING: Pool 'testpool1' has encountered an uncorrectable I/O failure and has been suspended.
  [ 4793.115702] INFO: task txg_sync:451710 blocked for more than 120 seconds.
  [ 4793.119178]       Tainted: P           OE     -------- -  - 4.18.0-553.30.1.el8_10.x86_64 #1
  [ 4793.123190] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 4793.126920] task:txg_sync        state:D stack:0     pid:451710 ppid:2      flags:0x80004080
  [ 4793.130844] Call Trace:
  [ 4793.132278]  __schedule+0x2d1/0x870
  [ 4793.134255]  ? __wake_up_common+0x7a/0x190
  [ 4793.136426]  schedule+0x55/0xf0
  [ 4793.138112]  schedule_timeout+0x19f/0x320
  [ 4793.140215]  ? __next_timer_interrupt+0xf0/0xf0
  [ 4793.142444]  ? srso_alias_return_thunk+0x5/0xfcdfd
  [ 4793.144809]  ? srso_alias_return_thunk+0x5/0xfcdfd
  [ 4793.147180]  ? taskq_dispatch+0xab/0x380 [spl]
  [ 4793.149671]  ? srso_alias_return_thunk+0x5/0xfcdfd
  [ 4793.152164]  io_schedule_timeout+0x19/0x40
  [ 4793.154317]  __cv_timedwait_common+0x19e/0x2c0 [spl]
  [ 4793.156958]  ? finish_wait+0x80/0x80
  [ 4793.158901]  __cv_timedwait_io+0x15/0x20 [spl]
  [ 4793.161258]  zio_wait+0x1ad/0x4f0 [zfs]
  [ 4793.165534]  dsl_pool_sync_mos+0x37/0x130 [zfs]
  [ 4793.168155]  dsl_pool_sync+0x510/0x6c0 [zfs]
  [ 4793.170547]  spa_sync_iterate_to_convergence+0xcb/0x310 [zfs]
  [ 4793.173798]  spa_sync+0x362/0x8f0 [zfs]
  [ 4793.175966]  txg_sync_thread+0x27a/0x3b0 [zfs]
  [ 4793.178396]  ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
  [ 4793.181052]  ? spl_assert.constprop.0+0x20/0x20 [spl]
  [ 4793.183600]  thread_generic_wrapper+0x63/0x90 [spl]
  [ 4793.186095]  kthread+0x134/0x150
  [ 4793.187923]  ? set_kthread_struct+0x50/0x50
  [ 4793.190093]  ret_from_fork+0x35/0x40
  [ 4793.192111] INFO: task zfs:453754 blocked for more than 120 seconds.
  [ 4793.195206]       Tainted: P           OE     -------- -  - 4.18.0-553.30.1.el8_10.x86_64 #1
  [ 4793.199137] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 4793.202858] task:zfs             state:D stack:0     pid:453754 ppid:451674 flags:0x00004080
  [ 4793.206945] Call Trace:
  [ 4793.208339]  __schedule+0x2d1/0x870
  [ 4793.210158]  ? srso_alias_return_thunk+0x5/0xfcdfd
  [ 4793.212460]  schedule+0x55/0xf0
  [ 4793.214165]  io_schedule+0x12/0x40
  [ 4793.215949]  cv_wait_common+0x108/0x280 [spl]
  [ 4793.218189]  ? finish_wait+0x80/0x80
  [ 4793.220082]  txg_wait_synced_impl+0x12b/0x2b0 [zfs]
  [ 4793.222720]  txg_wait_synced+0xc/0x40 [zfs]
  [ 4793.225076]  zil_commit_impl+0x10f/0x120 [zfs]
  [ 4793.227504]  zfs_fsync+0x66/0x90 [zfs]
  [ 4793.229637]  zpl_sync_fs+0xc5/0x260 [zfs]
  [ 4793.231990]  __sync_filesystem+0x2c/0x60
  [ 4793.234068]  sync_filesystem+0x27/0x40
  [ 4793.236016]  generic_shutdown_super+0x22/0x110
  [ 4793.238349]  kill_anon_super+0x14/0x30
  [ 4793.240319]  deactivate_locked_super+0x34/0x70
  [ 4793.242613]  cleanup_mnt+0x3b/0x70
  [ 4793.244327]  task_work_run+0x8a/0xb0
  [ 4793.246402]  exit_to_usermode_loop+0xef/0x100
  [ 4793.248733]  do_syscall_64+0x195/0x1a0
  [ 4793.250710]  entry_SYSCALL_64_after_hwframe+0x66/0xcb

While debugging a problem with data not being synced on umount, we
noticed that even syncfs(2) doesn't help to get the data written
out. That is because it doesn't actually sync any of the live znodes.

Signed-off-by: Pavel Snajdr <[email protected]>
@snajpa (Contributor Author) commented Nov 29, 2024

So I'll use filemap_write_and_wait in the end, if nobody objects - making the change to allow committing to the ZIL directly here seems pretty intrusive. The problem is that we can't call txg_wait_synced in this path, as it deadlocks with dsl_pool_sync_mos.

I was probably overthinking it anyway: it seems that filemap_write_and_wait will result in our zpl_writepages being called with wbc->sync_mode = WB_SYNC_ALL, thus doing the zil_commit I wanted, O_SYNC or not.

@robn (Member) commented Nov 29, 2024

Yes, don't go down that road with zil_commit(): it takes you places you do not want to go. (I did it on 2.1 for a customer project; I'm still getting around to reworking it for upstream, and I did a whole presentation on it.)

I'm not sure I understand the premise here. Granted, it's not easy to know what syncfs() is supposed to be exactly, because it's Linux-specific. The manpage suggests "like sync() but for a single filesystem", in which case it's a request to sync, not a wait until completion. But the manpage also says Linux has always waited for sync(), so maybe it should wait.

If it's just a request, it seems like what we have should be fine. If it's a wait, then why can we not just call txg_wait_synced()?

Though, it sounds like you might be talking about getting mmap'd pages out as well? The only relevant ones are MAP_SHARED+PROT_WRITE, and for those I'm pretty sure you have to msync() or munmap() to get them committed; there's no standard interaction between maps and fsync(), and definitely none with sync().

Ahh, now I see zpl_fsync() is flushing maps as well, so maybe for consistency that's the right thing to do.

Yeah ok, I think your original is the right shape then (haven't thought about the locking). Flush out the maps if we can, then the rest of the filesystem.

@snajpa (Contributor Author) commented Nov 29, 2024

@robn getting the dirty pages onto the ZPL so it can deal with them <somehow> is what I would expect syncfs(2) to do - the question is what exactly that means, which actions. But since filemap_write_and_wait in combination with zpl_writepages already conveniently results in one common zil_commit, I think it's good enough from a consistency point of view. The potential problem with this is that it's always good enough for consistency, even when you don't need it...

It would seem that it doesn't matter whether a file was opened with O_SYNC as far as mmap is concerned; there's MAP_SYNC for that. But the FS has to support DAX for MAP_SYNC to be usable, if I understand it correctly; it's not supported otherwise.

@snajpa (Contributor Author) commented Nov 29, 2024

There's also this issue with snapshots - if you're taking a snapshot, wouldn't you expect it to contain all the dirty data at the moment the snapshot was taken?

@snajpa (Contributor Author) commented Nov 29, 2024

Hmm, there's another option: I could just call zpl_writepages with a custom wbc to initiate write_cache_pages with wbc->sync_mode = WB_SYNC_NONE or WB_SYNC_ALL as needed, and use that around the codebase to make sure dirty pages are synced as expected. For snapshots, ALL; for syncfs(2), probably NONE. Ideas/thoughts?

@robn (Member) commented Nov 30, 2024

> @robn getting the dirty pages onto the ZPL so it can deal with them would be what I would expect syncfs(2) should do.

I wouldn't necessarily expect that from the manpage, but I've just read the Linux code and (via sync_filesystem) I think that's what is expected. Fine.

> It would seem that it doesn't matter if a file was open with O_SYNC for mmap, there's MAP_SYNC for that - but the FS has to support DAX, if I understand it correctly, so MAP_SYNC can be used, it's not supported otherwise.

Yes, I don't think MAP_SYNC is relevant here.

As for O_SYNC (& O_DSYNC), the interactions between maps and the underlying file aren't really defined. However, on Linux, msync(MS_SYNC) will end up in vfs_fsync_range(), which for us is zpl_fsync(). Meanwhile, msync(MS_ASYNC) is an effective no-op on Linux; the underlying filesystem doesn't get involved at all.

So that's good: seems like we're doing the right thing wrt msync() and munmap(). So the question is simply about other operations that do or do not sync maps as a side-effect. Now I'm digging into the in-kernel filesystems to see what they think.

sync() and syncfs() both arrive in sync_filesystem(). That opens with writeback_inodes_sb(WB_REASON_SYNC) and later, sync_inodes_sb(). Both of these go through sb->s_bdi to issue any writeback work to the underlying device, interleaving calls to sync_fs. We barely set up s_bdi, definitely not enough for writeback_inodes_sb() to do anything, but I'm not sure if sync_inodes_sb() will still push out dirty pages without any particular setup. More investigation needed there (out of time right this minute), but I suspect if we accounted for dirty pages properly in s_bdi, we'd get page flushes "for free" on sync_fs().

> There's also this issue with snapshots - if you're taking a snapshot, wouldn't you expect it to contain all the dirty data of that moment, when the snapshot was taken?

I mean, I wouldn't expect it to contain anything that hasn't been explicitly committed, but I've got a bee in my bonnet about fsync(). But I suppose right now, it does wait until the txg is completed before creating the snapshot (dsl_sync_task(.., dsl_dataset_snapshot_sync)), so it will get all the regular file traffic.

So I guess it's easy to argue that a txg sync should also push out dirty mapped pages. It's also easy to argue that the existing semantics are clear: until you msync() or munmap(), those changes don't "exist". Or maybe you want to argue a middle ground, that snapshots are special.

Personally, I would be fine with flushing maps on syncfs(), since that appears to be the expected behaviour on Linux, but I wouldn't change anything else. Snapshots are about sticking a pin in what's already on-disk, and if you haven't explicitly committed, then you have no expectation that those changes are on disk.

@snajpa (Contributor Author) commented Nov 30, 2024

> sync() and syncfs() both arrive in sync_filesystem(). That opens with writeback_inodes_sb(WB_REASON_SYNC) and later, sync_inodes_sb(). Both of these go through sb->s_bdi to issue any writeback work to the underlying device, interleaving calls to sync_fs. We barely set up s_bdi, definitely not enough for writeback_inodes_sb() to do anything, but I'm not sure if sync_inodes_sb() will still push out dirty pages without any particular setup. More investigation needed there (out of time right this minute), but I suspect if we accounted for dirty pages properly in s_bdi, we'd get page flushes "for free" on sync_fs().

I'll look into this, thx!

> Personally, I would be fine with flushing maps on syncfs(), since that appears to be the expected behaviour on Linux, but I wouldn't change anything else. Snapshots are about sticking a pin in what's already on-disk, and if you haven't explicitly committed, then you have no expectation that those changes are on disk.

Agreed.

@snajpa (Contributor Author) commented Nov 30, 2024

@robn tbh I don't see from the code why it wouldn't work - what info are we supposed to fill out? I'll give it more than a few tens of minutes, of course, but if you already know, it could save me some time.

Comment on lines +124 to +134
	mutex_enter(&zfsvfs->z_znodes_lock);
	for (zp = list_head(&zfsvfs->z_all_znodes); zp;
	    zp = list_next(&zfsvfs->z_all_znodes, zp)) {
		if (zp->z_sa_hdl)
			error = zpl_writepages(ZTOI(zp)->i_mapping, &wbc);
		if (error != 0)
			break;
	}
	mutex_exit(&zfsvfs->z_znodes_lock);
	if (error == 0)
		error = -zfs_sync(sb, wait, cr);
Member:
Without looking at anything else, I wonder if it would be better to sync as much as we can, even in case of errors?

Contributor Author:

Probably, yeah - I'll just need to figure out a way that doesn't call zil_commit at all, because that can apparently deadlock if there's an I/O error.

Contributor:

The (admittedly rare) situation where my external USB drive cage occasionally drops a connection to a disk, causing an I/O error that freezes the pool and leaves me unable to reboot without a power cycle, isn't great, so thank you for trying to avoid deadlocks on I/O errors.

@snajpa snajpa marked this pull request as draft December 2, 2024 19:14
@github-actions github-actions bot added Status: Work in Progress Not yet ready for general review and removed Status: Code Review Needed Ready for review and testing labels Dec 2, 2024
@robn (Member) commented Dec 10, 2024

I've been reading code this afternoon, but it's very slow, and I don't have much time or brain for it, so I might be a while yet. Mostly, I'm trying to make sure we're definitely handling mmap and friends correctly.

Is there a small, isolated reproduction available? We'll need something we can put in the test suite anyway, but also, I haven't seen this myself, and I see mentions of Docker and/or containers in relation to this, which I'm not set up for.

But I also wonder if it's only seen in containers; that might be part of the problem. Nothing concrete yet, but it does seem like there's a bit more complication in the kernel when an alternate memcg is in play, enough to make me wonder whether this is only about containers.

So yeah, some easy repro would be great, and I'll do more reading when I can.

@snajpa (Contributor Author) commented Dec 20, 2024

@robn thanks for looking into it; a repro for this would be along the lines of this, only with syncfs(2) called on the dataset before umount. Given that syncfs(2) didn't help to get the data written out, I'd say it probably deserves some further attention. I haven't given up on this yet, I just needed to take a break for a few days :)

Labels: Status: Work in Progress (Not yet ready for general review)
4 participants