Exploding memory usage in zio_buf [FreeBSD zfs 2.1.4+] #13601

bdrewery · 2022-06-28T16:54:23Z

bdrewery
Jun 28, 2022

I would open an issue but I don't have enough data yet to describe the problem. I just have a small home server for Nextcloud and a few vms. Nothing too special.

I've ran into numerous panics over the last few weeks with 2.1.4 and 2.1.5 where specifically zio_buf_1048576 explodes from 0 to 20G in what seems like seconds. (Some data below). This isn't ARC; I had that limited to 12G. Note that I have partially reverted 309c32c to split zio_data_buf back out on its own after running into this problem earlier. When this happened the system went into OOM, swapped out everything and made no progress on anything. I had to reboot. I did mange to get a kernel dump on the last time.

Before this I was using whatever ZFS FreeBSD 12 had, not ZoL/zfs 2. It was stable.

Looking at my system now and over the last 24 hours there have been 0 allocations of zio_buf_1048576. ~~I've been running dtrace -n 'fbt::zio_buf_alloc:entry { if (arg0 == 10485760) stack(); }' with no hits.~~ (I do see hits with proper size 1048576 but the total allocations in vmstat -z remain very low outside of this issue)

My question is what exactly would cause zio_buf_1048576 to be allocated so rarely and so quickly? How might I find what is using them in the kernel dump?

Current top. The 10G free is close to what pre-explode looks like. I explicitly set vfs.zfs.arc_free_target to keep 10G pages free.

376 processes: 1 running, 351 sleeping, 24 zombie                        
CPU:  0.5% user,  1.2% nice,  7.3% system,  0.0% interrupt, 91.1% idle   
Mem: 726M Active, 4638M Inact, 15G Wired, 10G Free                       
ARC: 6569M Total, 2787M MFU, 2486M MRU, 12M Anon, 218M Header, 983M Other
     4117M Compressed, 13G Uncompressed, 3.32:1 Ratio                    
Swap: 32G Total, 32G Free                                                
# zfs-stats -A

------------------------------------------------------------------------
ZFS Subsystem Report                            Tue Jun 28 10:13:20 2022
------------------------------------------------------------------------

ARC Summary: (HEALTHY)
        Memory Throttle Count:                  0

ARC Misc:
        Deleted:                                70.31   k
        Mutex Misses:                           54
        Evict Skips:                            917

ARC Size:                               58.36%  6.42    GiB
        Target Size: (Adaptive)         62.13%  6.83    GiB
        Min Size (Hard Limit):          9.08%   1022.63 MiB
        Max Size (High Water):          11:1    11.00   GiB
        Compressed Data Size:                   4.02    GiB
        Decompressed Data Size:                 13.36   GiB
        Compression Factor:                     3.32

ARC Size Breakdown:
        Recently Used Cache Size:       83.50%  5.71    GiB
        Frequently Used Cache Size:     16.50%  1.13    GiB

ARC Hash Breakdown:
        Elements Max:                           1.81    m
        Elements Current:               99.85%  1.81    m
        Collisions:                             4.95    m
        Chain Max:                              7
        Chains:                                 294.35  k

------------------------------------------------------------------------

(There is some dedup but it is ancient data and relatively small)

NAME                                             SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
tank                                            3.62T  2.42T  1.20T        -         -    47%    66%  1.08x    ONLINE  -
  mirror-0                                       928G   623G   305G        -         -    46%  67.1%      -    ONLINE
    gptid/978f11d1-22b3-ea49-ed66-866ef603c6da      -      -      -        -         -      -      -      -    ONLINE
    gptid/8d1bc971-e7cd-11e9-9e30-003048f865a6      -      -      -        -         -      -      -      -    ONLINE
  mirror-1                                       928G   624G   304G        -         -    48%  67.2%      -    ONLINE
    gptid/734a694d-9a85-dde3-f650-c551dabbf86a      -      -      -        -         -      -      -      -    ONLINE
    gptid/0df0e41b-cf10-634e-a8f6-8ca31d5935f9      -      -      -        -         -      -      -      -    ONLINE
  mirror-2                                       928G   611G   317G        -         -    48%  65.9%      -    ONLINE
    gptid/1dc556eb-53cd-7de9-9688-e9a41d7e0495      -      -      -        -         -      -      -      -    ONLINE
    gptid/83598e25-a581-11e8-a4e4-003048f865a6      -      -      -        -         -      -      -      -    ONLINE
  mirror-3                                       928G   623G   305G        -         -    48%  67.1%      -    ONLINE
    gptid/cdf44e0f-0968-d644-ba4f-a4a7da85163a      -      -      -        -         -      -      -      -    ONLINE
    gptid/71c94755-19fc-0dc4-800b-ca73194d1d94      -      -      -        -         -      -      -      -    ONLINE
logs                                                -      -      -        -         -      -      -      -  -
  mirror-5                                      1.88G  7.78M  1.87G        -         -     0%  0.40%      -    ONLINE
    nda0p3                                          -      -      -        -         -      -      -      -    ONLINE
    nda1p3                                          -      -      -        -         -      -      -      -    ONLINE
cache                                               -      -      -        -         -      -      -      -  -
  nda1p1                                         100G  44.0G  56.0G        -         -     0%  44.0%      -    ONLINE

  pool: tank
 state: ONLINE
status: One or more devices are configured to use a non-native block size.
        Expect reduced performance.
action: Replace affected devices with devices that support the
        configured block size, or migrate data to a properly configured
        pool.
  scan: scrub canceled on Tue Jun 21 21:23:25 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/978f11d1-22b3-ea49-ed66-866ef603c6da  ONLINE       0     0     0
            gptid/8d1bc971-e7cd-11e9-9e30-003048f865a6  ONLINE       0     0     0  block size: 512B configured, 4096B native
          mirror-1                                      ONLINE       0     0     0
            gptid/734a694d-9a85-dde3-f650-c551dabbf86a  ONLINE       0     0     0
            gptid/0df0e41b-cf10-634e-a8f6-8ca31d5935f9  ONLINE       0     0     0
          mirror-2                                      ONLINE       0     0     0
            gptid/1dc556eb-53cd-7de9-9688-e9a41d7e0495  ONLINE       0     0     0
            gptid/83598e25-a581-11e8-a4e4-003048f865a6  ONLINE       0     0     0  block size: 512B configured, 4096B native
          mirror-3                                      ONLINE       0     0     0
            gptid/cdf44e0f-0968-d644-ba4f-a4a7da85163a  ONLINE       0     0     0
            gptid/71c94755-19fc-0dc4-800b-ca73194d1d94  ONLINE       0     0     0
        logs
          mirror-5                                      ONLINE       0     0     0
            nda0p3                                      ONLINE       0     0     0
            nda1p3                                      ONLINE       0     0     0
        cache
          nda1p1                                        ONLINE       0     0     0

errors: No known data errors

 dedup: DDT entries 21858, size 85.3K on disk, 105K in core

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    18.3K    413M    412M    412M    18.3K    413M    412M    412M
     2    2.86K   37.8M   37.8M   37.8M    5.76K   76.6M   76.6M   76.6M
     4      189    128K    128K    128K      926    601K    601K    601K
     8       31   16.5K   16.5K   16.5K      309    168K    168K    168K
    16        3   1.50K   1.50K   1.50K       62     31K     31K     31K
    64        1    512B    512B    512B       71   35.5K   35.5K   35.5K
 Total    21.3K    450M    450M    450M    25.4K    490M    490M    490M

Some customizations:

vfs.zfs.txg.timeout="15" # lowered to 10 in current
vfs.zfs.l2arc.noprefetch=0
vfs.zfs.arc.max="13G" # lowered to 11 in current
vfs.zfs.arc_free_target=2621440 # 10G (verified this was set from vmcore)

From vmstat -z of last panic from dump (parsed)

ITEM                SIZE     USED    TOTAL        TOTAL(M)
zio_buf_1048576       1048576  23179   24304943104  23179
abd_chunk             4096     72567   297234432    283
zio_data_buf_1048576  1048576  234     245366784    234
zio_buf_16384         16384    11232   184025088    175
zio_cache             1232     108538  133718816    127
zio_buf_131072        131072   962     126091264    120

And a previous time from a dump

ITEM                SIZE     USED     TOTAL        TOTAL(M)
zio_buf_comb_1048576  1048576  18922    19841155072  18922
abd_chunk             4096     1443559  5912817664   5638
sio_cache_0           136      9364859  1273620824   1214
zio_buf_comb_1024     1024     589322   603465728    575
zio_buf_comb_3072     3072     190546   585357312    558
arc_buf_hdr_t_full    240      2419271  580625040    553
zio_buf_comb_1536     1536     282070   433259520    413
zio_buf_comb_2560     2560     164353   420743680    401
sio_cache_2           168      2303285  386951880    369
zio_buf_comb_3584     3584     104234   373574656    356
zio_buf_comb_4096     4096     73875    302592000    288
malloc-128            128      2207754  282592512    269
zio_buf_comb_131072   131072   2101     275382272    262
zio_buf_comb_2048     2048     131455   269219840    256
zio_buf_comb_16384    16384    12511    204980224    195
sio_cache_1           152      1237414  188086928    179
zio_buf_comb_917504   917504   159      145883136    139
zio_cache             1232     115570   142382240    135
zio_buf_comb_512      512      213261   109189632    104

Current. No 1M allocations.

ITEM                SIZE                                                            USED     TOTAL       TOTAL(M)
vm pgcache            4096                                                            1837559  7526641664  7177
abd_chunk             4096                                                            944504   3868688384  3689
zio_buf_16384         16384                                                           54814    898072576   856
vm pgcache            4096                                                            198982   815030272   777
dnode_t               816                                                             715530   583872480   556
zio_buf_3072          3072                                                            102161   313838592   299
VNODE                 488                                                             621609   303345192   289
zfs_znode_cache       472                                                             616067   290783624   277
malloc-384            384                                                             715333   274687872   261
malloc-4096           4096                                                            54895    224849920   214
dmu_buf_impl_t        296                                                             755725   223694600   213
sa_cache              288                                                             616067   177427296   169
zio_buf_131072        131072                                                          1210     158597120   151
arc_buf_hdr_t_l2only  96                                                              1460038  140163648   133
zio_data_buf_131072   131072                                                          920      120586240   115
UMA Slabs 0           80                                                              1367358  109388640   104

Some interesting threads from gdb.
Is there a way for me to find what is using all of the 1M zio_buf allocations?

thread 186
#17 abd_borrow_buf_copy (abd=0xfffff80469c02480, n=844800) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/abd.c:652
#18 0xffffffff821b459d in vdev_geom_io_start (zio=0xfffff8014afc19a0, zio@entry=<error reading variable: value is not available>) at /usr/base/b/src/sys/contrib/openzfs/module/os/freebsd/zfs/vdev_geom.c:1234
#19 0xffffffff8235a434 in zio_vdev_io_start (zio=0xfffff8014afc19a0) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:3878
#20 0xffffffff8235448f in __zio_execute (zio=0xfffff8014afc19a0) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:2212
#21 zio_nowait (zio=<optimized out>, zio@entry=0xfffff8014afc19a0) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:2299
#22 0xffffffff822b524b in vdev_queue_io_done (zio=zio@entry=0xfffff807d8469000) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/vdev_queue.c:954
#23 0xffffffff8235a6c8 in zio_vdev_io_done (zio=0xfffff807d8469000) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:3901
#24 0xffffffff8235535f in __zio_execute (zio=0xfffff807d8469000) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:2212
#25 zio_execute (zio=<optimized out>) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:2123

thread 187
#17 abd_borrow_buf_copy (abd=0xfffff80208710b80, n=880128) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/abd.c:652
#18 0xffffffff821b459d in vdev_geom_io_start (zio=0xfffff804299454d0, zio@entry=<error reading variable: value is not available>) at /usr/base/b/src/sys/contrib/openzfs/module/os/freebsd/zfs/vdev_geom.c:1234
#19 0xffffffff8235a434 in zio_vdev_io_start (zio=0xfffff804299454d0) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:3878
#20 0xffffffff8235448f in __zio_execute (zio=0xfffff804299454d0) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:2212
#21 zio_nowait (zio=<optimized out>, zio@entry=0xfffff804299454d0) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:2299
#22 0xffffffff822b524b in vdev_queue_io_done (zio=zio@entry=0xfffff800bc5a69a0) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/vdev_queue.c:954
#23 0xffffffff8235a6c8 in zio_vdev_io_done (zio=0xfffff800bc5a69a0) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:3901
#24 0xffffffff8235535f in __zio_execute (zio=0xfffff800bc5a69a0) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:2212
#25 zio_execute (zio=<optimized out>) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:2123

thread 188
#16 0xffffffff821e4cee in abd_borrow_buf (abd=0xfffff80469369000, n=<optimized out>) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/abd.c:641
#17 abd_borrow_buf_copy (abd=0xfffff80469369000, n=844800) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/abd.c:652
#18 0xffffffff821b459d in vdev_geom_io_start (zio=0xfffff8069c9fd000, zio@entry=<error reading variable: value is not available>) at /usr/base/b/src/sys/contrib/openzfs/module/os/freebsd/zfs/vdev_geom.c:1234
#19 0xffffffff8235a434 in zio_vdev_io_start (zio=0xfffff8069c9fd000) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:3878
#20 0xffffffff8235448f in __zio_execute (zio=0xfffff8069c9fd000) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:2212
#21 zio_nowait (zio=<optimized out>, zio@entry=0xfffff8069c9fd000) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:2299
#22 0xffffffff822b524b in vdev_queue_io_done (zio=zio@entry=0xfffff8017d97f000) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/vdev_queue.c:954
#23 0xffffffff8235a6c8 in zio_vdev_io_done (zio=0xfffff8017d97f000) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:3901
#24 0xffffffff8235535f in __zio_execute (zio=0xfffff8017d97f000) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:2212

thread 189
#17 abd_borrow_buf_copy (abd=0xfffff800577f2200, n=873984) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/abd.c:652
#18 0xffffffff821b459d in vdev_geom_io_start (zio=0xfffff800d16e2000, zio@entry=<error reading variable: value is not available>) at /usr/base/b/src/sys/contrib/openzfs/module/os/freebsd/zfs/vdev_geom.c:1234
#19 0xffffffff8235a434 in zio_vdev_io_start (zio=0xfffff800d16e2000) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:3878
#20 0xffffffff8235448f in __zio_execute (zio=0xfffff800d16e2000) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:2212
#21 zio_nowait (zio=<optimized out>, zio@entry=0xfffff800d16e2000) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:2299
#22 0xffffffff822b524b in vdev_queue_io_done (zio=zio@entry=0xfffff807690144d0) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/vdev_queue.c:954
#23 0xffffffff8235a6c8 in zio_vdev_io_done (zio=0xfffff807690144d0) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:3901
#24 0xffffffff8235535f in __zio_execute (zio=0xfffff807690144d0) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:2212
#25 zio_execute (zio=<optimized out>) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:2123

thread 190
#17 abd_borrow_buf_copy (abd=0xfffff80150195800, n=852480) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/abd.c:652
#18 0xffffffff821b459d in vdev_geom_io_start (zio=0xfffff80185a469a0, zio@entry=<error reading variable: value is not available>) at /usr/base/b/src/sys/contrib/openzfs/module/os/freebsd/zfs/vdev_geom.c:1234
#19 0xffffffff8235a434 in zio_vdev_io_start (zio=0xfffff80185a469a0) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:3878
#20 0xffffffff8235448f in __zio_execute (zio=0xfffff80185a469a0) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:2212
#21 zio_nowait (zio=<optimized out>, zio@entry=0xfffff80185a469a0) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:2299
#22 0xffffffff822b524b in vdev_queue_io_done (zio=zio@entry=0xfffff803aa5d3000) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/vdev_queue.c:954
#23 0xffffffff8235a6c8 in zio_vdev_io_done (zio=0xfffff803aa5d3000) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:3901
#24 0xffffffff8235535f in __zio_execute (zio=0xfffff803aa5d3000) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:2212
#25 zio_execute (zio=<optimized out>) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:2123


1M allocations:

thread 177
#15 0xffffffff82357444 in zio_buf_alloc (size=1048576) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:320
#16 zio_write_compress (zio=0xfffff807668819a0) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:1701
#17 0xffffffff8235535f in __zio_execute (zio=0xfffff807668819a0) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:2212
#18 zio_execute (zio=<optimized out>) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:2123

thread 178
#15 0xffffffff82357444 in zio_buf_alloc (size=1048576) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:320
#16 zio_write_compress (zio=0xfffff8070d9a4000) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:1701
#17 0xffffffff8235535f in __zio_execute (zio=0xfffff8070d9a4000) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:2212
#18 zio_execute (zio=<optimized out>) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:2123
(kgdb) p/u ticks - td->td_slptick
$39 = 672026

thread 179
#15 0xffffffff82357444 in zio_buf_alloc (size=1048576) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:320
#16 zio_write_compress (zio=0xfffff8073b9fa000) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:1701
#17 0xffffffff8235535f in __zio_execute (zio=0xfffff8073b9fa000) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:2212
#18 zio_execute (zio=<optimized out>) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:2123

thread 180
#15 0xffffffff82357444 in zio_buf_alloc (size=1048576) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:320
#16 zio_write_compress (zio=0xfffff800dab93000) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:1701
#17 0xffffffff8235535f in __zio_execute (zio=0xfffff800dab93000) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:2212
#18 zio_execute (zio=<optimized out>) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:2123
$41 = 672026

thread 181
#15 0xffffffff82357444 in zio_buf_alloc (size=1048576) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:320
#16 zio_write_compress (zio=0xfffff804464a5000) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:1701
#17 0xffffffff8235535f in __zio_execute (zio=0xfffff804464a5000) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:2212
#18 zio_execute (zio=<optimized out>) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zio.c:2123
(kgdb) p *zio
$25 = {io_bookmark = {zb_objset = 1083858, zb_object = 1333750, zb_level = 0, zb_blkid = 309}, io_prop = {zp_checksum = ZIO_CHECKSUM_FLETCHER_4, zp_compress = ZIO_COMPRESS_LZ4, zp_complevel = 0 '\000', zp_type = DMU_OT_PLAIN_FILE_CONTENTS, zp_level = 0 '\000', zp_copies = 1 '\001',
    zp_dedup = 0, zp_dedup_verify = 0, zp_nopwrite = 0, zp_encrypt = 0, zp_byteorder = 1, zp_salt = "\000\000\000\000\000\000\000", zp_iv = '\000' <repeats 11 times>, zp_mac = '\000' <repeats 15 times>, zp_zpl_smallblk = 0}, io_type = ZIO_TYPE_WRITE,
  io_child_type = ZIO_CHILD_LOGICAL, io_trim_flags = 0, io_cmd = 0, io_priority = ZIO_PRIORITY_SYNC_WRITE, io_reexecute = 0 '\000', io_state = "\000", io_txg = 82172020, io_spa = 0xfffffe0191cb5000, io_bp = 0xfffffe03aa4ba678, io_bp_override = 0x0, io_bp_copy = {blk_dva = {{
        dva_word = {0, 0}}, {dva_word = {0, 0}}, {dva_word = {0, 0}}}, blk_prop = 0, blk_pad = {0, 0}, blk_phys_birth = 0, blk_birth = 0, blk_fill = 0, blk_cksum = {zc_word = {0, 0, 0, 0}}}, io_parent_list = {list_size = 48, list_offset = 16, list_head = {
      list_next = 0xfffff80029943df0, list_prev = 0xfffff80029943df0}}, io_child_list = {list_size = 48, list_offset = 32, list_head = {list_next = 0xfffff804464a5158, list_prev = 0xfffff804464a5158}}, io_logical = 0xfffff804464a5000, io_transform_stack = 0x0,
  io_ready = 0xffffffff8220f080 <dmu_sync_late_arrival_ready>, io_children_ready = 0x0, io_physdone = 0x0, io_done = 0xffffffff8220f130 <dmu_sync_late_arrival_done>, io_private = 0xfffff803a52fcd60, io_prev_space_delta = 0, io_bp_orig = {blk_dva = {{dva_word = {0, 0}}, {dva_word = {
          0, 0}}, {dva_word = {0, 0}}}, blk_prop = 0, blk_pad = {0, 0}, blk_phys_birth = 0, blk_birth = 0, blk_fill = 0, blk_cksum = {zc_word = {0, 0, 0, 0}}}, io_lsize = 1048576, io_abd = 0xfffff80725391380, io_orig_abd = 0xfffff80725391380, io_size = 1048576,
  io_orig_size = 1048576, io_vd = 0x0, io_vsd = 0x0, io_vsd_ops = 0x0, io_metaslab_class = 0x0, io_offset = 0, io_timestamp = 0, io_queued_timestamp = 82414072856722, io_target_timestamp = 0, io_delta = 0, io_delay = 0, io_queue_node = {avl_child = {0x0, 0x0}, avl_pcb = 0},
  io_offset_node = {avl_child = {0x0, 0x0}, avl_pcb = 0}, io_alloc_node = {avl_child = {0x0, 0x0}, avl_pcb = 0}, io_alloc_list = {zal_list = {list_size = 72, list_offset = 0, list_head = {list_next = 0xfffff804464a52f8, list_prev = 0xfffff804464a52f8}}, zal_size = 0},
  io_flags = ZIO_FLAG_CANFAIL, io_stage = ZIO_STAGE_WRITE_COMPRESS,
  io_pipeline = (ZIO_STAGE_WRITE_BP_INIT | ZIO_STAGE_ISSUE_ASYNC | ZIO_STAGE_WRITE_COMPRESS | ZIO_STAGE_ENCRYPT | ZIO_STAGE_CHECKSUM_GENERATE | ZIO_STAGE_DVA_THROTTLE | ZIO_STAGE_DVA_ALLOCATE | ZIO_STAGE_READY | ZIO_STAGE_VDEV_IO_START | ZIO_STAGE_VDEV_IO_DONE | ZIO_STAGE_VDEV_IO_ASSESS | ZIO_STAGE_DONE), io_orig_flags = ZIO_FLAG_CANFAIL, io_orig_stage = ZIO_STAGE_OPEN,
  io_orig_pipeline = (ZIO_STAGE_WRITE_BP_INIT | ZIO_STAGE_ISSUE_ASYNC | ZIO_STAGE_WRITE_COMPRESS | ZIO_STAGE_ENCRYPT | ZIO_STAGE_CHECKSUM_GENERATE | ZIO_STAGE_DVA_THROTTLE | ZIO_STAGE_DVA_ALLOCATE | ZIO_STAGE_READY | ZIO_STAGE_VDEV_IO_START | ZIO_STAGE_VDEV_IO_DONE | ZIO_STAGE_VDEV_IO_ASSESS | ZIO_STAGE_DONE), io_pipeline_trace = (ZIO_STAGE_OPEN | ZIO_STAGE_WRITE_BP_INIT | ZIO_STAGE_ISSUE_ASYNC | ZIO_STAGE_WRITE_COMPRESS), io_error = 0, io_child_error = {0, 0, 0, 0}, io_children = {{0, 0}, {0, 0}, {0, 0}, {0, 0}}, io_child_count = 0, io_phys_children = 0,
  io_parent_count = 1, io_stall = 0x0, io_gang_leader = 0x0, io_gang_tree = 0x0, io_executor = 0xfffffe01b936f900, io_waiter = 0x0, io_bio = 0x0, io_lock = {lock_object = {lo_name = 0xffffffff8240d0ce "zio->io_lock", lo_flags = 577830912, lo_data = 0, lo_witness = 0x0},
    sx_lock = 1}, io_cv = {cv_description = 0xffffffff82408497 "zio->io_cv", cv_waiters = 0}, io_allocator = 0, io_cksum_report = 0x0, io_ena = 0, io_tqent = {tqent_task = {ta_link = {stqe_next = 0xfffff801fb8ff410}, ta_pending = 0, ta_priority = 0 '\000', ta_flags = 0 '\000',
      ta_func = 0xffffffff821aaa80 <taskq_run_ent>, ta_context = 0xfffff804464a5410}, tqent_timeout_task = {q = 0x0, t = {ta_link = {stqe_next = 0x0}, ta_pending = 0, ta_priority = 0 '\000', ta_flags = 0 '\000', ta_func = 0x0, ta_context = 0x0}, c = {c_links = {le = {le_next = 0x0,
            le_prev = 0x0}, sle = {sle_next = 0x0}, tqe = {tqe_next = 0x0, tqe_prev = 0x0}}, c_time = 0, c_precision = 0, c_arg = 0x0, c_func = 0x0, c_lock = 0x0, c_flags = 0, c_iflags = 0, c_cpu = 0}, f = 0}, tqent_func = 0xffffffff823552c0 <zio_execute>,
    tqent_arg = 0xfffff804464a5000, tqent_id = 0, tqent_hash = {cle_next = 0x0, cle_prev = 0x0}, tqent_type = 0 '\000', tqent_registered = 0 '\000', tqent_cancelled = 0 '\000', tqent_rc = 0}}



thread 703 (mariadb)
#5  0xffffffff8234fe0d in zil_commit_waiter (zilog=0xfffff80088ff4000, zcw=0xfffff803f6b7e320) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zil.c:2807
#6  zil_commit_impl (zilog=<optimized out>, zilog@entry=0x0, foid=foid@entry=18446735279914909696) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zil.c:3062
#7  0xffffffff8234ee3f in zil_commit (zilog=zilog@entry=0xfffff80088ff4000, foid=<optimized out>) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zil.c:3024
#8  0xffffffff8234c657 in zfs_write (zp=<optimized out>, uio=<optimized out>, uio@entry=0xfffffe022bef4948, ioflag=<optimized out>, cr=<optimized out>) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zfs_vnops.c:757
#9  0xffffffff821ca519 in zfs_freebsd_write (ap=<optimized out>) at /usr/base/b/src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_os.c:4464
#10 0xffffffff81144309 in VOP_WRITE_APV (vop=0xffffffff82488340 <zfs_vnodeops>, a=a@entry=0xfffffe022bef4a80) at vnode_if.c:1169
#11 0xffffffff80ca477b in VOP_WRITE (vp=0xfffff802ab1cdb70, uio=0xfffffe022bef4dc0, ioflag=129, cred=<unavailable>) at ./vnode_if.h:600
#12 vn_write (fp=0xfffff8000e9f5280, uio=0xfffffe022bef4dc0, active_cred=<optimized out>, flags=<optimized out>, td=<optimized out>) at /usr/base/b/src/sys/kern/vfs_vnops.c:1178
#13 0xffffffff80ca4133 in vn_io_fault_doio (args=args@entry=0xfffffe022bef4cb0, uio=uio@entry=0xfffffe022bef4dc0, td=td@entry=0xfffffe0207b2f3a0) at /usr/base/b/src/sys/kern/vfs_vnops.c:1243
#14 0xffffffff80c9fadc in vn_io_fault1 (vp=<optimized out>, uio=uio@entry=0xfffffe022bef4dc0, args=args@entry=0xfffffe022bef4cb0, td=td@entry=0xfffffe0207b2f3a0) at /usr/base/b/src/sys/kern/vfs_vnops.c:1361
#15 0xffffffff80c9cf60 in vn_io_fault (fp=0xfffff8000e9f5280, fp@entry=<error reading variable: value is not available>, uio=0xfffffe022bef4dc0, uio@entry=<error reading variable: value is not available>, active_cred=<optimized out>,
    active_cred@entry=<error reading variable: value is not available>, flags=<optimized out>, flags@entry=<error reading variable: value is not available>, td=<optimized out>, td@entry=<error reading variable: value is not available>) at /usr/base/b/src/sys/kern/vfs_vnops.c:1483
#16 0xffffffff80c14dd8 in fo_write (fp=0xfffff8000e9f5280, uio=0xfffffe022bef4dc0, active_cred=<unavailable>, flags=<unavailable>, flags@entry=1, td=0xfffffe0207b2f3a0) at /usr/base/b/src/sys/sys/file.h:345
#17 dofilewrite (td=td@entry=0xfffffe0207b2f3a0, fd=fd@entry=9, fp=0xfffff8000e9f5280, auio=auio@entry=0xfffffe022bef4dc0, offset=offset@entry=17560064, flags=<optimized out>, flags@entry=1) at /usr/base/b/src/sys/kern/sys_generic.c:565
#18 0xffffffff80c14b92 in kern_pwritev (td=0xfffffe0207b2f3a0, fd=9, auio=auio@entry=0xfffffe022bef4dc0, offset=17560064) at /usr/base/b/src/sys/kern/sys_generic.c:537
#19 0xffffffff80c14a8b in kern_pwrite (fd=<unavailable>, buf=<optimized out>, nbyte=<optimized out>, offset=<unavailable>, td=<optimized out>) at /usr/base/b/src/sys/kern/sys_generic.c:446
(kgdb) p/u ticks - td->td_slptick
$49 = 699932

thread 818 smbd
#5  0xffffffff8234fe0d in zil_commit_waiter (zilog=0xfffff80088a3c400, zcw=0xfffff803f6b7d000) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zil.c:2807
#6  zil_commit_impl (zilog=<optimized out>, zilog@entry=0xfffffe01b92da6d8, foid=foid@entry=0) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zil.c:3062
#7  0xffffffff8234ee3f in zil_commit (zilog=<optimized out>, foid=<optimized out>) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zil.c:3024
#8  0xffffffff8234b82f in zfs_fsync (zp=0xfffff800293be3b0, syncflag=<optimized out>, cr=<optimized out>) at /usr/base/b/src/sys/contrib/openzfs/module/zfs/zfs_vnops.c:71
#9  0xffffffff80c9a377 in VOP_FSYNC (vp=0xfffff802c6099000, waitfor=1, td=0xfffffe01b92da3a0) at ./vnode_if.h:768
#10 kern_fsync (td=0xfffffe01b92da3a0, fd=<optimized out>, fullsync=true) at /usr/base/b/src/sys/kern/vfs_syscalls.c:3550
(kgdb) p/u ticks - td->td_slptick
$50 = 692513

rincebrain · 2022-06-28T22:00:18Z

rincebrain
Jun 28, 2022
Collaborator

Your dtrace command appears to be off by an order of magnitude.

Thinking about it, I can't immediately come up with any changes in 2.0 or 2.1 that would drastically change the allocation behavior.

7 replies

rincebrain Jun 28, 2022
Collaborator

I am not familiar with the IO allocation paths on FBSD, compared to Linux, but AFAIK it's an upper bound, so any allocation between (512*1024)+1 and 1048576 would hit it.

I think almost nothing uses the zio codepaths except, well, zios - are you rolling recordsize >= 1M, or doing a bunch of IO that could be combined into hunks that large maybe? (I and a few other people have experimental patches that use the zio buffers for other things, but I don't think any of them have been merged...)

If it's doing writes to those 512e drives, it could theoretically bottleneck very badly, though I think I expect it to not fill up RAM with queued IO or anything...

...well, I guess with txg timeout long enough you might be able to do that, but the write throttle should stop that...

rincebrain Jun 28, 2022
Collaborator

...a curious thought. What models are those drives?

bdrewery Jun 28, 2022
Author

I have set recordsize=1M on most of my datasets, admittedly recently. I expected the write throttle or dirty data limits here too.

There's a bit more. I had encrypted swap enabled and this situation would swap out all processes which also was allocating memory as it went along. And the CPU I have is an old AMD Opteron that is quite slow with compression and encryption. So the hang up could have been due to that. But it does not necessarily explain how 20G of buffers were allocated and stuck without being throttled.

@amotin do you have any ideas here?

bdrewery Jun 28, 2022
Author

...a curious thought. What models are those drives?

In order as listed in zpool status

WD1001FALS-0 0K05
WD2003FZEX-0 1A01 512/4096 warning

WD1001FALS-0 0K05
WD1001FALS-0 0K05

WD1001FALS-0 0K05
WD1003FZEX-0 1A01 512/4096 warning

WD1001FALS-0 0K05
WD1001FALS-0 0K05

rincebrain Jun 28, 2022
Collaborator

Okay, probably not.

I was wondering if this was a weird manifestation of WD's amazing ability to manufacture SMR drives that somehow fall off the bus if you use ZFS on them.

amotin · 2022-06-28T23:36:52Z

amotin
Jun 28, 2022
Collaborator

@bdrewery I have no quick ideas what could be the issue. I haven't seen TrueNAS reports alike to that, and our users run plenty of FreeBSD systems with 2.1.x in a wild. Though we use custom ZFS builds, based on 2.1.x, not the one from base FreeBSD, but they should be very close.

Whats about places where 1MB buffers can be allocated, since you are using 1MB recordsize, it can obviously be data buffers. Though ARC does not normally keep so big blocks, unless it is tuned to, but instead copies the content into a chain of PAGE_SIZE chunks to free KVA. 1MB allocations are used for data in small DBUF cache (recently accessed); for dirty buffers that were just recently modified and still being written (though in first two cases you'd likely see zio_data_buf_1048576, not zio_buf_1048576, since IIRC metadata should not use large blocks unless tuned so); for ZIOs in pipeline, for example compression/decompression code allocates linear buffers that way (I see those among your stack traces, and as I see that code always allocates zio_buf_*); and finally if FreeBSD-specific vdev_geom.c code can't execute aggregated I/O via BIO_UNMAPPED GEOM request it calls abd_borrow_buf()/abd_borrow_buf_copy() to get its linear copy. I actually see above several threads in abd_borrow_buf_copy(), as I understand waiting for buffer allocations to execute writes. But that may be a part of normal operation, either because your HBA does not support BIO_UNMAPPED, or simply because some buffers in that aggregated I/O (BTW I/O aggregation is also done up to 1MB for HDDs) are not page-aligned. So it may or may not be a problem. Though it makes me wonder what HBA/disk controller/driver are you using?

Actually, considering "block size: 512B configured, 4096B native" I see, your pool likely runs with ashift=9, that means there can be some buffers not aligned to PAGE_SIZE. TrueNAS almost always use ashift=12 to 4K disk compatibility and should almost always be able to use BIO_UNMAPPED, since all disk I/Os are page-aligned. Just thinking about possible differences...

3 replies

amotin Jun 28, 2022
Collaborator

ZIO pipeline and vdev code, since not being part of ARC and since should allocate only limited number of buffers, do not care about ARC limits. This may explain why the ARC size limit is not respected.

bdrewery Jun 29, 2022
Author

mps0: <Avago Technologies (LSI) SAS2008> port 0xb000-0xb0ff mem 0xfd8fc000-0xfd8fffff,0xfd880000-0xfd8bffff irq 19 at device 0.0 on pci1
mps0: Firmware: 14.00.01.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 185c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,IR>

Adapter:
mps0 Adapter:
       Board Name: SAS9211-8i
   Board Assembly: H3-25250-02E
        Chip Name: LSISAS2008
    Chip Revision: ALL
    BIOS Revision: 7.27.00.00
Firmware Revision: 14.00.01.00
  Integrated RAID: yes
         SATA NCQ: ENABLED
 PCIe Width/Speed: x8 (5.0 GB/sec)
        IOC Speed: Full
      Temperature: Unknown/Unsupported

amotin Jun 29, 2022
Collaborator

mps(4) is definitely not unique, but your firmware is very old (much older than the old HBA). There is 20.00.04.00 last for it AFAIK.

allanjude · 2022-07-15T13:51:04Z

allanjude
Jul 15, 2022
Collaborator

Do you have any datasets with a recordsize of larger than the default 128kb?

0 replies

baukus · 2023-06-14T16:42:19Z

baukus
Jun 14, 2023

In a similar vein, I have several FreeBSD-Stable 13.1 OoM crash dumps where several ZFS UMA allocators have insane memory usage. "Insane" in that the vmstat -z shows abd_chunk and zio_buf_comb_1048576 with usages that exceed system memory.

For example on a 256G system (processed vmstat -z -M -N output):

                 name                total        raw-total = (used + free) * size
     zio_buf_comb_131072:           18.764 GiB    20147863552 = (153716 + 0) * 131072
       zio_buf_comb_81920:          25.693 GiB    27587870720 = (336766 + 0) * 81920
     zio_buf_comb_1048576:        1042.303 GiB    1119164039168 = (1067318 + 0) * 1048576
                abd_chunk:        2462.238 GiB    2643808325632 = (645461017 + 0) * 4096

recordsize for the I/O intensive pools on this system are 1M

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exploding memory usage in zio_buf [FreeBSD zfs 2.1.4+] #13601

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 4 comments 10 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Exploding memory usage in zio_buf [FreeBSD zfs 2.1.4+] #13601

bdrewery Jun 28, 2022

Replies: 4 comments · 10 replies

rincebrain Jun 28, 2022 Collaborator

rincebrain Jun 28, 2022 Collaborator

rincebrain Jun 28, 2022 Collaborator

bdrewery Jun 28, 2022 Author

bdrewery Jun 28, 2022 Author

rincebrain Jun 28, 2022 Collaborator

amotin Jun 28, 2022 Collaborator

amotin Jun 28, 2022 Collaborator

bdrewery Jun 29, 2022 Author

amotin Jun 29, 2022 Collaborator

allanjude Jul 15, 2022 Collaborator

baukus Jun 14, 2023

bdrewery
Jun 28, 2022

Replies: 4 comments 10 replies

rincebrain
Jun 28, 2022
Collaborator

rincebrain Jun 28, 2022
Collaborator

rincebrain Jun 28, 2022
Collaborator

bdrewery Jun 28, 2022
Author

bdrewery Jun 28, 2022
Author

rincebrain Jun 28, 2022
Collaborator

amotin
Jun 28, 2022
Collaborator

amotin Jun 28, 2022
Collaborator

bdrewery Jun 29, 2022
Author

amotin Jun 29, 2022
Collaborator

allanjude
Jul 15, 2022
Collaborator

baukus
Jun 14, 2023