Asking for silver bullet test recommendations to reproduce an nvme drive hang #1703
mpersons99
started this conversation in
General
Replies: 1 comment
-
Your job options are likely accomplishing what you want. To make things more concrete consider options such as:
You can always run a small version of your job with |
-
Does anyone have a list of recommended silver bullet tests that can reproduce common NVMe drive hangs?
I'm trying to debug an NVMe drive hang that occurs with one specific NVMe drive vendor on our appliance. The hang typically occurs after a couple of months: the Linux kernel can no longer read the OS partition and can't write to the db files on a separate partition. The first step is to reproduce it, which has been difficult. The same system and appliance with a different vendor's NVMe drive doesn't experience the problem, which points to this being specific to this vendor's drive.
Some details:
nvme0n1 259:0 0 55.9G 0 disk
|-nvme0n1p1 259:1 0 256M 0 part /bootmgr
|-nvme0n1p2 259:2 0 1G 0 part /boot
|-nvme0n1p3 259:3 0 1G 0 part
|-nvme0n1p4 259:4 0 1K 0 part
|-nvme0n1p5 259:5 0 4G 0 part / <---OS partition
|-nvme0n1p6 259:6 0 4G 0 part
`-nvme0n1p7 259:7 0 512M 0 part /config <----config db partition
Looking at normal operations, lsof shows the OS partition has the usual set of memory-mapped files open; the /config partition has about 60 db files open for read and write. This is really the only thing happening on this drive.
We are running an older Linux distribution with a 2.6.32-696 kernel, so I'm stuck running fio 2.0.13 (I could probably make 2.1 work if I had to).
Thus far, I’ve been trying to reproduce the problem by running a job such as the following:
~/usr/bin/fio --name=random --directory=/config --nrfiles=300 --size=300M --readwrite=randrw --fsync=16 --loops=1000
If there are no silver bullet tests, a test recommendation on how to open many small files, and continually read and write to them over a long period of time, would be helpful.
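To make the request concrete, here is a rough sketch of the kind of workload meant: keep a set of small files open and issue random reads and writes against them, with an fsync after every Nth write. It is plain Python rather than fio, and it is only an illustration. The function name `churn_small_files`, the `churn_NNNN.dat` file names, the 512-byte block size, and all default counts are hypothetical, and it assumes a Python interpreter is available on the test system.

```python
import os
import random
import tempfile


def churn_small_files(directory, num_files=60, file_size=4096,
                      iterations=1000, fsync_every=16, seed=0):
    """Hold many small files open and do random block reads/writes on them.

    Returns a (reads, writes) tuple. All defaults are illustrative only.
    """
    rng = random.Random(seed)
    block = 512  # hypothetical I/O size, chosen only for the example
    handles = []
    try:
        # Create and pre-fill every file, keeping all of them open at once.
        # Buffering is disabled so each write goes straight to the OS.
        for i in range(num_files):
            path = os.path.join(directory, "churn_%04d.dat" % i)
            fh = open(path, "w+b", 0)
            fh.write(os.urandom(file_size))
            handles.append(fh)

        reads = 0
        writes = 0
        for _ in range(iterations):
            fh = rng.choice(handles)
            # Pick a block-aligned offset inside the file.
            fh.seek(rng.randrange(0, file_size // block) * block)
            if rng.random() < 0.5:
                fh.read(block)
                reads += 1
            else:
                fh.write(os.urandom(block))
                writes += 1
                # Force the file just written to stable storage
                # after every fsync_every-th write.
                if writes % fsync_every == 0:
                    os.fsync(fh.fileno())
        return reads, writes
    finally:
        for fh in handles:
            fh.close()


if __name__ == "__main__":
    # Try it in a scratch directory first, not on the real /config partition.
    workdir = tempfile.mkdtemp(prefix="churn_")
    print(churn_small_files(workdir, iterations=200))
```

For a long soak, the call could be wrapped in an outer loop and pointed at a directory on the partition under test. The read/write mix is fixed at roughly 50/50 here and would need tuning to match the real db access pattern.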
Thanks,
Mike