-
Your dtrace command appears to be off by an order of magnitude. Thinking about it, I can't immediately come up with any changes in 2.0 or 2.1 that would drastically change the allocation behavior.
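For reference, a minimal corrected sketch (1 MB is 1048576, not 10485760; same probe as the original command, with the size check moved into a predicate):

```sh
# Same fbt probe the reporter used, but matching 1 MB allocations.
dtrace -n 'fbt::zio_buf_alloc:entry /arg0 == 1048576/ { stack(); }'
```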
-
@bdrewery I have no quick ideas what the issue could be. I haven't seen any TrueNAS reports similar to that, and our users run plenty of FreeBSD systems with 2.1.x in the wild. We do use custom ZFS builds based on 2.1.x rather than the one from base FreeBSD, but they should be very close.

As for places where 1MB buffers can be allocated: since you are using a 1MB recordsize, they are obviously data buffers. ARC does not normally keep blocks that big unless it is tuned to; instead it copies the content into a chain of PAGE_SIZE chunks to free KVA. 1MB allocations are used:

- for data in the small DBUF cache (recently accessed);
- for dirty buffers that were just recently modified and are still being written (though in the first two cases you'd likely see zio_data_buf_1048576, not zio_buf_1048576, since IIRC metadata should not use large blocks unless tuned to);
- for ZIOs in the pipeline, for example the compression/decompression code allocates linear buffers that way (I see those among your stack traces, and as far as I can tell that code always allocates zio_buf_*);
- and finally, if the FreeBSD-specific vdev_geom.c code can't execute an aggregated I/O via a BIO_UNMAPPED GEOM request, it calls abd_borrow_buf()/abd_borrow_buf_copy() to get a linear copy.

I actually see several threads above in abd_borrow_buf_copy(), as I understand it waiting for buffer allocations in order to execute writes. That may be part of normal operation, either because your HBA does not support BIO_UNMAPPED, or simply because some buffers in that aggregated I/O (BTW, I/O aggregation is also done up to 1MB for HDDs) are not page-aligned. So it may or may not be a problem. It does make me wonder what HBA/disk controller/driver you are using. Actually, considering the "block size: 512B configured, 4096B native" I see, your pool likely runs with ashift=9, which means there can be some buffers not aligned to PAGE_SIZE. TrueNAS almost always uses ashift=12 for 4K disk compatibility, and so should almost always be able to use BIO_UNMAPPED, since all disk I/Os are page-aligned. Just thinking about possible differences...
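If it helps to check that, a rough sketch of how the pool's ashift and the disks' reported sector sizes can be inspected on FreeBSD (the pool and device names below are placeholders):

```sh
# Pool ashift as recorded in the vdev config ("tank" is a placeholder).
zdb -C tank | grep ashift

# Logical/physical sector sizes the disk reports ("da0" is a placeholder).
diskinfo -v /dev/da0 | egrep 'sectorsize|stripesize'
```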
-
Do you have any datasets with a recordsize larger than the default 128KB?
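For example, a quick sketch of how to spot them ("tank" is a placeholder pool name):

```sh
# List filesystems whose recordsize differs from the 128K default.
zfs get -r -t filesystem recordsize tank | awk 'NR == 1 || $3 != "128K"'
```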
-
In a similar vein, I have several FreeBSD-STABLE 13.1 OoM crash dumps where several ZFS UMA allocators show insane memory usage. "Insane" in that `vmstat -z` shows abd_chunk and zio_buf_comb_1048576 with usages that exceed system memory. For example, on a 256G system (processed `vmstat -z -M -N` output):

The recordsize for the I/O-intensive pools on this system is 1M.
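For reference, a rough sketch of that kind of per-zone accounting, assuming the stock FreeBSD 13 `vmstat -z` column order (ITEM, SIZE, LIMIT, USED, FREE, ...):

```sh
# Approximate memory held by each UMA zone: SIZE * (USED + FREE), in MB.
# Assumes stock FreeBSD 13 vmstat -z columns; to read a crash dump instead of
# the live system, add e.g. -M /var/crash/vmcore.0 -N /boot/kernel/kernel.
vmstat -z | awk -F'[,:] +' 'NR > 1 && $2 + 0 > 0 {
        printf "%-32s %12.1f MB\n", $1, $2 * ($4 + $5) / 1048576
}' | sort -k2 -rn | head -20
```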
-
I would open an issue but I don't have enough data yet to describe the problem. I just have a small home server for Nextcloud and a few VMs; nothing too special.

I've run into numerous panics over the last few weeks with 2.1.4 and 2.1.5 where `zio_buf_1048576` specifically explodes from 0 to 20G in what seems like seconds (some data below). This isn't ARC; I had that limited to 12G. Note that I have partially reverted 309c32c to split `zio_data_buf` back out on its own after running into this problem earlier. When this happened the system went into OOM, swapped out everything, and made no progress on anything; I had to reboot. I did manage to get a kernel dump the last time. Before this I was using whatever ZFS FreeBSD 12 had, not ZoL/ZFS 2, and it was stable.
Looking at my system now and over the last 24 hours, there have been 0 allocations of `zio_buf_1048576`. I've been running `dtrace -n 'fbt::zio_buf_alloc:entry { if (arg0 == 10485760) stack(); }'` with no hits. (I do see hits with the proper size 1048576, but the total allocations in `vmstat -z` remain very low outside of this issue.)

My question is: what exactly would cause `zio_buf_1048576` to be allocated so rarely and so quickly? How might I find what is using them in the kernel dump?

Current `top`. The 10G free is close to what pre-explode looks like. I explicitly set `vfs.zfs.arc_free_target` to keep 10G of pages free. (There is some dedup, but it is ancient data and relatively small.)
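(For illustration only, such a setting would look roughly like the line below; the value is just my arithmetic for ~10G of 4 KB pages, not the actual setting from this system.)

```sh
# Hypothetical value: 10 * 1024 * 1024 * 1024 / 4096 = 2621440 pages (~10G).
sysctl vfs.zfs.arc_free_target=2621440
```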
Some customizations:

From `vmstat -z` of last panic from dump (parsed):

And a previous time from a dump:

Current. No 1M allocations:

Some interesting threads from gdb:
Is there a way for me to find what is using all of the 1M `zio_buf` allocations?
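As a sketch, one way this could be attributed on a live system while the spike happens (same probe as the dtrace command above; the aggregation form and 10s interval are assumptions):

```sh
# Aggregate call stacks requesting 1 MB zio buffers and dump them every 10s.
dtrace -n '
fbt::zio_buf_alloc:entry
/arg0 == 1048576/
{
        @stacks[stack()] = count();
        @bytes = sum(arg0);
}
tick-10s
{
        printa(@stacks);
        printa("bytes requested in the last interval: %@d\n", @bytes);
        trunc(@stacks);
        clear(@bytes);
}'
```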