inconsistent/incomplete /dev/disk/by-id links #128

Open
sharkcz opened this issue Dec 7, 2021 · 15 comments

Comments

@sharkcz
Contributor

sharkcz commented Dec 7, 2021

We are experiencing a situation where the /dev/disk/by-id/... symlinks are inconsistent across reboots. Sometimes links for all disks/dasds are present, sometimes only a (different) subset is present.

[root@openshift-8 ~]# ll /dev/disk/by-id/
total 0
lrwxrwxrwx. 1 root root 12 Dec  7 10:19 ccw-0X5422-part1 -> ../../dasda1
lrwxrwxrwx. 1 root root 12 Dec  7 10:19 ccw-0X5422-part2 -> ../../dasda2
lrwxrwxrwx. 1 root root 11 Dec  7 10:19 ccw-0X5622 -> ../../dasdc
lrwxrwxrwx. 1 root root 12 Dec  7 10:19 ccw-0X5622-part1 -> ../../dasdc1
lrwxrwxrwx. 1 root root 11 Dec  7 10:19 ccw-0X5722 -> ../../dasdd
lrwxrwxrwx. 1 root root 12 Dec  7 10:19 ccw-0X5722-part1 -> ../../dasdd1
lrwxrwxrwx. 1 root root 12 Dec  7 10:19 ccw-IBM.750000000FRB71.0230.22.00000000000027200000000000000000-part1 -> ../../dasda1
lrwxrwxrwx. 1 root root 12 Dec  7 10:19 ccw-IBM.750000000FRB71.0230.22.00000000000027200000000000000000-part2 -> ../../dasda2
lrwxrwxrwx. 1 root root 12 Dec  7 10:19 ccw-IBM.750000000FRB71.0230.22-part1 -> ../../dasda1
lrwxrwxrwx. 1 root root 12 Dec  7 10:19 ccw-IBM.750000000FRB71.0230.22-part2 -> ../../dasda2
lrwxrwxrwx. 1 root root 11 Dec  7 10:19 ccw-IBM.750000000FRB71.0232.22 -> ../../dasdc
lrwxrwxrwx. 1 root root 11 Dec  7 10:19 ccw-IBM.750000000FRB71.0232.22.00000000000027200000000000000000 -> ../../dasdc
lrwxrwxrwx. 1 root root 12 Dec  7 10:19 ccw-IBM.750000000FRB71.0232.22.00000000000027200000000000000000-part1 -> ../../dasdc1
lrwxrwxrwx. 1 root root 12 Dec  7 10:19 ccw-IBM.750000000FRB71.0232.22-part1 -> ../../dasdc1
lrwxrwxrwx. 1 root root 11 Dec  7 10:19 ccw-IBM.750000000FRB71.0233.22 -> ../../dasdd
lrwxrwxrwx. 1 root root 11 Dec  7 10:19 ccw-IBM.750000000FRB71.0233.22.00000000000027200000000000000000 -> ../../dasdd
lrwxrwxrwx. 1 root root 12 Dec  7 10:19 ccw-IBM.750000000FRB71.0233.22.00000000000027200000000000000000-part1 -> ../../dasdd1
lrwxrwxrwx. 1 root root 12 Dec  7 10:19 ccw-IBM.750000000FRB71.0233.22-part1 -> ../../dasdd1
[root@openshift-8 ~]# ll /dev/disk/by-id/
total 0
lrwxrwxrwx. 1 root root 11 Dec  7 10:48 ccw-0X5722 -> ../../dasdd
lrwxrwxrwx. 1 root root 12 Dec  7 10:48 ccw-0X5722-part1 -> ../../dasdd1
lrwxrwxrwx. 1 root root 11 Dec  7 10:48 ccw-IBM.750000000FRB71.0233.22 -> ../../dasdd
lrwxrwxrwx. 1 root root 11 Dec  7 10:48 ccw-IBM.750000000FRB71.0233.22.00000000000027200000000000000000 -> ../../dasdd
lrwxrwxrwx. 1 root root 12 Dec  7 10:48 ccw-IBM.750000000FRB71.0233.22.00000000000027200000000000000000-part1 -> ../../dasdd1
lrwxrwxrwx. 1 root root 12 Dec  7 10:48 ccw-IBM.750000000FRB71.0233.22-part1 -> ../../dasdd1
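To make the difference between two boots easier to see, the listings can be compared mechanically. The following is only a minimal sketch: the helper names `parse_links` and `missing_links` are made up for illustration, and the sample input is a shortened excerpt of the two listings above.

```python
import re

# Matches the "name -> target" tail of an `ls -l` symlink line.
LINK_RE = re.compile(r"(\S+) -> (\S+)$")


def parse_links(listing):
    """Map symlink name to target for every symlink line in `ls -l` output."""
    links = {}
    for line in listing.splitlines():
        line = line.strip()
        if not line.startswith("l"):  # skip prompts, "total 0", blank lines
            continue
        match = LINK_RE.search(line)
        if match:
            links[match.group(1)] = match.group(2)
    return links


def missing_links(before, after):
    """Names present in the first listing but absent from the second."""
    return sorted(set(parse_links(before)) - set(parse_links(after)))


# Shortened excerpts of the 10:19 and 10:48 listings.
boot_a = """\
total 0
lrwxrwxrwx. 1 root root 11 Dec  7 10:19 ccw-0X5622 -> ../../dasdc
lrwxrwxrwx. 1 root root 11 Dec  7 10:19 ccw-0X5722 -> ../../dasdd
"""
boot_b = """\
total 0
lrwxrwxrwx. 1 root root 11 Dec  7 10:48 ccw-0X5722 -> ../../dasdd
"""

print(missing_links(boot_a, boot_b))  # ['ccw-0X5622']
```

Feeding it the full output of `ll /dev/disk/by-id/` from two boots should list every link that disappeared between them.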

The environment is Fedora 35 with kernel-5.14.18-300.fc35.s390x and s390utils-core-2.17.0-2.fc35.s390x (the version shouldn't matter much, as etc/udev/rules.d/59-dasd.rules hasn't changed for a long time, except for the scheduler setting).

Fedora 35 with kernel-5.15.6-200.fc35.s390x doesn't seem to have the /dev/disk/by-id directory at all; looking further ...

Related: https://bugzilla.redhat.com/show_bug.cgi?id=1963192

@sharkcz
Contributor Author

sharkcz commented Dec 7, 2021

Hmm, now I understand it even less. Booting kernel 5.14.18 with rd.udev.debug, the journal is full of LINK messages for the by-id symlinks from the 59-dasd rules file, but none of the symlinks exist, not even the /dev/disk/by-id directory ...

@sharkcz
Contributor Author

sharkcz commented Dec 7, 2021

I wonder if there is a race condition between creating the actual symlinks and creating the /dev/disk/by-id/ directory ...

@sharkcz
Contributor Author

sharkcz commented Dec 15, 2021

I think messages like these explain the missing symlinks:

...
Dec 10 10:48:30 openshift-8.s390.bos.redhat.com systemd-udevd[420]: dasdd: No reference left for '/dev/disk/by-id/ccw-IBM.750000000FRB71.0233.22.00000000000027200000000000000000', removing
Dec 10 10:48:30 openshift-8.s390.bos.redhat.com systemd-udevd[420]: dasdd: Updating old device symlink '/dev/disk/by-id/ccw-0X5722', which is no longer belonging to this device.
Dec 10 10:48:30 openshift-8.s390.bos.redhat.com systemd-udevd[420]: dasdd: No reference left for '/dev/disk/by-id/ccw-0X5722', removing
Dec 10 10:48:30 openshift-8.s390.bos.redhat.com systemd-udevd[420]: dasdd: Updating old device symlink '/dev/disk/by-id/ccw-IBM.750000000FRB71.0233.22', which is no longer belonging to this device.
Dec 10 10:48:30 openshift-8.s390.bos.redhat.com systemd-udevd[420]: dasdd: No reference left for '/dev/disk/by-id/ccw-IBM.750000000FRB71.0233.22', removing
Dec 10 10:48:30 openshift-8.s390.bos.redhat.com systemd-udevd[420]: dasdd: Updating old device symlink '/dev/disk/by-id/ccw-IBM.750000000FRB71.0233.22.00000000000027200000000000000000', which is no longer belonging to this device.
Dec 10 10:48:30 openshift-8.s390.bos.redhat.com systemd-udevd[420]: dasdd: No reference left for '/dev/disk/by-id/ccw-IBM.750000000FRB71.0233.22.00000000000027200000000000000000', removing
Dec 10 10:48:30 openshift-8.s390.bos.redhat.com systemd-udevd[420]: dasdd: Updating old device symlink '/dev/disk/by-id/ccw-0X5722', which is no longer belonging to this device.
Dec 10 10:48:30 openshift-8.s390.bos.redhat.com systemd-udevd[420]: dasdd: No reference left for '/dev/disk/by-id/ccw-0X5722', removing
Dec 10 10:48:30 openshift-8.s390.bos.redhat.com systemd-udevd[420]: dasdd: Updating old device symlink '/dev/disk/by-id/ccw-IBM.750000000FRB71.0233.22', which is no longer belonging to this device.
Dec 10 10:48:30 openshift-8.s390.bos.redhat.com systemd-udevd[420]: dasdd: No reference left for '/dev/disk/by-id/ccw-IBM.750000000FRB71.0233.22', removing

ping me for a full log
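
For anyone going through a full log, a small filter can summarize which symlinks udev removed and how often. This is only a sketch based on the message format quoted above; the regular expression and the function name `removed_symlinks` are assumptions for illustration, not part of udev or s390-tools.

```python
import re
from collections import Counter

# Assumed format, taken from the journal excerpt above:
#   systemd-udevd[PID]: DEVICE: No reference left for 'LINK', removing
REMOVED_RE = re.compile(
    r"systemd-udevd\[\d+\]: (\S+): No reference left for '([^']+)', removing"
)


def removed_symlinks(journal_text):
    """Count 'No reference left' removals per (device, symlink) pair."""
    counts = Counter()
    for line in journal_text.splitlines():
        match = REMOVED_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Three lines copied from the excerpt above.
sample = """\
Dec 10 10:48:30 openshift-8.s390.bos.redhat.com systemd-udevd[420]: dasdd: No reference left for '/dev/disk/by-id/ccw-0X5722', removing
Dec 10 10:48:30 openshift-8.s390.bos.redhat.com systemd-udevd[420]: dasdd: Updating old device symlink '/dev/disk/by-id/ccw-IBM.750000000FRB71.0233.22', which is no longer belonging to this device.
Dec 10 10:48:30 openshift-8.s390.bos.redhat.com systemd-udevd[420]: dasdd: No reference left for '/dev/disk/by-id/ccw-0X5722', removing
"""

for (device, link), count in removed_symlinks(sample).items():
    print(device, link, count)  # dasdd /dev/disk/by-id/ccw-0X5722 2
```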

@hoeppnerj
Contributor

I'm currently unable to access the Red Hat BZ; I need to get an account first.
Can you share more details about your setup? How are the DASDs configured? Did you use chzdev -e to get a persistent configuration? Is it always the same DASDs that have this issue, or is it random?

I've tried several reboots on an LPAR with 10 DASDs persistently configured using chzdev -e on a freshly installed F35 (tried both 5.15.6-200.fc35.s390x and 5.14.10-300.fc35.s390x + s390utils-2:2.17.0-2.fc35.s390x) but wasn't able to reproduce the issue so far.

@sharkcz
Contributor Author

sharkcz commented Jan 5, 2022

My environment is:

  • z13 with z/VM 6.4.0
  • the guest is Fedora 35 with kernel-5.14.18-300.fc35.s390x and systemd-udev-249.7-2.fc35.s390x
  • DASDs configured with rd.dasd= and /etc/dasd.conf
  • what links are present/missing is purely random

The original report is from OCP/RHEL-8.x with z/VM 7.2.0 on z13 and z15.

I suspect there might be something wrong with how udev or the kernel handles the devices, rather than with the udev rules in s390utils, which are pretty straightforward.

20220105-1028-udev.log.zip
created with

journalctl -b | grep systemd-udev > 20220105-1028-udev.log
ll /dev/disk/by-id/ >> 20220105-1028-udev.log
lsdasd

@hoeppnerj
Contributor

I was able to reproduce the issue myself now with:

  • LPAR z14
  • F35 5.14.10-300.fc35.s390x and systemd-udev-0:249.7-2.fc35.s390x
  • DASDs configured with /etc/dasd.conf

I didn't see the problem when using chzdev to persistently configure the devices. Maybe you can give it a try to see how this behaves on your setup. Make sure to remove all DASDs from /etc/dasd.conf and then enable them via chzdev -e <devices> (you can specify a range here as well, e.g. 9300-930f).

I'll dig a bit deeper to see where the problem might be.

@sharkcz
Contributor Author

sharkcz commented Jan 5, 2022

Have you also removed the rd.dasd= definitions from the kernel parameter line and used the zdev "rootfs mode" purely?

Right now I am testing with /etc/dasd.conf completely removed (both from the system and from the initrd) and still don't get 100% success (4x all links created, 1x no links at all, 1x links for dasdc1 only).

@sharkcz
Contributor Author

sharkcz commented Feb 7, 2022

I believe using the zdev persistent config doesn't matter. I have converted my system fully to zdev for DASDs and I am still getting random results with the "by-id" links. I suspect the problem is deeper in udev or the kernel.

@hoeppnerj
Contributor

@sharkcz it's been a while and I haven't gotten around to looking deeper into this. Is this still reproducible?

@sharkcz
Contributor Author

sharkcz commented May 16, 2024

Hi @hoeppnerj, I believe it still happens. I tried a fresh F-40 installation on a z/VM guest (with a single DASD) and after the first boot there was no /dev/disk/by-id/ directory at all. I had expected the symlinks to be there on subsequent boots, because udev debugging says dasda: /usr/lib/udev/rules.d/59-dasd.rules:12 Added SYMLINK 'disk/by-id/ccw-0X0120' in the journal, but the symlink is still missing. Even weirder ...

@sharkcz
Contributor Author

sharkcz commented May 16, 2024

I have checked two other z/VM systems with multiple DASDs (F-39 with kernel 6.7) and they both have entries in /dev/disk/by-id/, but the entries are not complete, if I see it right. This is likely caused by a non-unique ID_UID returned by dasdinfo, though (ID_XUID is unique) ...
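
To illustrate why a non-unique ID_UID would leave the directory incomplete: if two devices produce the same link name, only one of them can own it. The sketch below assumes the link name is built as `ccw-` plus the property value, which matches the listings earlier in this thread; the device table and the XUID suffixes are hypothetical values invented for the example, not real dasdinfo output.

```python
from collections import defaultdict


def colliding_names(devices, key):
    """Return by-id link names that more than one device would claim.

    `devices` maps a device name to its udev properties; `key` selects
    the property the link name is built from (e.g. "ID_UID").
    """
    claims = defaultdict(list)
    for dev, props in devices.items():
        claims["ccw-" + props[key]].append(dev)
    return {name: sorted(devs) for name, devs in claims.items() if len(devs) > 1}


# Hypothetical property values: two DASDs sharing one ID_UID.
devices = {
    "dasda": {"ID_UID": "IBM.750000000FRB71.0230.22",
              "ID_XUID": "IBM.750000000FRB71.0230.22.example-a"},
    "dasdb": {"ID_UID": "IBM.750000000FRB71.0230.22",
              "ID_XUID": "IBM.750000000FRB71.0230.22.example-b"},
}

print(colliding_names(devices, "ID_UID"))
print(colliding_names(devices, "ID_XUID"))  # {} because the XUIDs differ
```

With the made-up values above, the ID_UID-based name is claimed by both dasda and dasdb, while the ID_XUID-based names do not collide.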

@sharkcz
Contributor Author

sharkcz commented May 16, 2024

And after another series of reboots on an F-39 system (with 1 DASD) I would say the by-id links are created reliably. So I suspect we might have a new F-40 (and likely RHEL-10) issue where the symlinks are not created at all. And perhaps the original issue went away at some point before F-39 ...

@hoeppnerj
Contributor

Alright, thanks for the update. I'll try to have a look again. It seems it must have something to do with 59-dasd.rules and dasdinfo. Maybe the tool isn't getting the information in time.

@sharkcz
Contributor Author

sharkcz commented May 16, 2024

I would say please try F-39 on slightly bigger VMs than my single-DASD one to either confirm or refute my findings, and likewise with F-40. Right now I suspect a change in systemd/udev between v254 in F-39 and v255 in F-40. Being able to run the udev-worker process under strace could reveal something. I have even tried with SELinux disabled on F-40 to rule out a too-strict SELinux policy :-)

@mmaslano

mmaslano commented Nov 4, 2024

SUSE found a similar-looking issue and we also didn't find the right way to fix it. It would be interesting to know what the root cause is. The most interesting part is that we got this issue much later than Fedora, early this year. We observed it after we updated 15.6 Beta with s390-tools 2.31.0. Reverting the package to 2.30.0 "fixed" the issue.

We even looked at the 95dasd_rules module and tried to disable or adjust it. For a short time it was emitting a message twice (we weren't using FBA):
[ 4.587896][ T312] dracut: Warning: Configuring ECKD DASD 0.0.6000 as FBA DASD
[ 4.588069][ T312] dracut: Configuring devices in the persistent configuration only
[ 4.588091][ T312] dracut: FBA DASD 0.0.6000 configured
But whether we fixed that or not, the symlinks were in the incorrect order.

In the end we decided to patch grub2-mkconfig, so that the symlinks are tested for existence and created.

We could reproduce it only while installing on an LPAR; our VMs under z/VM never had this issue. Odd.
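
The existence test mentioned above can be sketched roughly as follows. This is not the actual grub2-mkconfig patch, which is not shown in this thread; the function name `missing_links` is hypothetical, and the demo runs against a scratch directory rather than the real /dev tree.

```python
import os
import tempfile


def missing_links(expected, by_id_dir="/dev/disk/by-id"):
    """Return the expected link names that are not symlinks in by_id_dir."""
    return [
        name for name in expected
        if not os.path.islink(os.path.join(by_id_dir, name))
    ]


# Demo: create one of two expected links in a scratch directory.
with tempfile.TemporaryDirectory() as scratch:
    os.symlink("../../dasdd", os.path.join(scratch, "ccw-0X5722"))
    result = missing_links(["ccw-0X5622", "ccw-0X5722"], scratch)

print(result)  # ['ccw-0X5622']
```

A caller could then fall back to another device name, or create the link itself, for every name that is reported missing.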
