Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Occasional build failure #50

Closed
Thread974 opened this issue Nov 17, 2017 · 7 comments · Fixed by #92
Closed

Occasional build failure #50

Thread974 opened this issue Nov 17, 2017 · 7 comments · Fixed by #92
Labels

Comments

@Thread974
Copy link

I'm using go-debos and I am getting build failure.
It does not always occur. I'm on ubuntu and have a couple patches.

fredo@X250:~/Documents/bos0009/apertis-image-recipes$ ../go/src/github.com/go-debos/debos/debos -t architecture:$ARCH --debug-shell apertis-image-$ARCH.yaml
2017/11/17 11:36:47 Ignored link: /etc/ld.so.conf.d//etc/ld.so.conf.d/i386-linux-gnu_GL.conf
2017/11/17 11:36:47 Ignored link: /etc/ld.so.conf.d//etc/ld.so.conf.d/x86_64-linux-gnu_EGL.conf
2017/11/17 11:36:47 Ignored link: /etc/ld.so.conf.d//etc/ld.so.conf.d/x86_64-linux-gnu_GL.conf
2017/11/17 11:36:47 Getting module from host kernel/drivers/virtio/virtio.ko
2017/11/17 11:36:47 Ignoring kernel/drivers/virtio/virtio.ko
2017/11/17 11:36:47 Getting module from host kernel/drivers/virtio/virtio_pci.ko
2017/11/17 11:36:47 Ignoring kernel/drivers/virtio/virtio_pci.ko
2017/11/17 11:36:47 Getting module from host kernel/net/9p/9pnet.ko
2017/11/17 11:36:47 Getting module from host kernel/drivers/virtio/virtio_ring.ko
2017/11/17 11:36:47 Ignoring kernel/drivers/virtio/virtio_ring.ko
2017/11/17 11:36:47 Getting module from host kernel/fs/9p/9p.ko
2017/11/17 11:36:47 Getting module from host kernel/net/9p/9pnet_virtio.ko
2017/11/17 11:36:47 Getting module from host kernel/fs/fscache/fscache.ko
2017/11/17 11:36:47 Getting module from host modules.order
2017/11/17 11:36:47 Getting module from host modules.builtin
2017/11/17 11:36:47 Getting module from host modules.dep
2017/11/17 11:36:47 Getting module from host modules.dep.bin
2017/11/17 11:36:47 Getting module from host modules.alias
2017/11/17 11:36:47 Getting module from host modules.alias.bin
2017/11/17 11:36:47 Getting module from host modules.softdep
2017/11/17 11:36:47 Getting module from host modules.symbols
2017/11/17 11:36:47 Getting module from host modules.symbols.bin
2017/11/17 11:36:47 Getting module from host modules.builtin.bin
2017/11/17 11:36:47 Getting module from host modules.devname
[    1.341739] systemd-fstab-generator[137]: Failed to create mount unit file /run/systemd/generator/home-fredo-Documents-bos0009-apertis\x2dimage\x2drecipes.mount, as it already exists. Duplicate entry in /etc/fstab?
[    1.835466] Error: Driver 'pcspkr' is already registered, aborting...
Running /debos --artifactdir /home/fredo/Documents/bos0009/apertis-image-recipes --template-var architecture:"amd64" /home/fredo/Documents/bos0009/apertis-image-recipes/apertis-image-amd64.yaml --debug-shell --shell /bin/bash --internal-image /dev/disk/by-id/virtio-fakedisk-0
2017/11/17 10:36:51 ==== Unpack ospack_17.12-amd64-minimal_00000000.0.tar.gz ====
2017/11/17 10:36:54 ==== image-partition ====
2017/11/17 10:36:55 Formatting partition 1 | mkfs.fat 4.0 (2016-05-06)
2017/11/17 10:36:56 Formatting partition 2 | mkfs.fat: warning - lowercase labels might not work properly with DOS or Windows
2017/11/17 10:36:56 Formatting partition 2 | mkfs.fat 4.0 (2016-05-06)
2017/11/17 10:36:56 Formatting partition 4 | mkfs.fat: warning - lowercase labels might not work properly with DOS or Windows
2017/11/17 10:36:56 Formatting partition 4 | mkfs.fat 4.0 (2016-05-06)
2017/11/17 10:36:57 Formatting partition 5 | mkfs.fat: warning - lowercase labels might not work properly with DOS or Windows
2017/11/17 10:36:57 Formatting partition 5 | mkfs.fat 4.0 (2016-05-06)
2017/11/17 10:36:58 Formatting partition 6 | btrfs-progs v4.9.1
2017/11/17 10:36:58 Formatting partition 6 | See http://btrfs.wiki.kernel.org for more information.
2017/11/17 10:36:58 Formatting partition 6 | 
2017/11/17 10:36:58 Formatting partition 6 | Label:              system
2017/11/17 10:36:58 Formatting partition 6 | UUID:               61db6be5-758c-4153-8d28-12ad22f7048d
2017/11/17 10:36:58 Formatting partition 6 | Node size:          16384
2017/11/17 10:36:58 Formatting partition 6 | Sector size:        4096
2017/11/17 10:36:58 Formatting partition 6 | Filesystem size:    3.81GiB
2017/11/17 10:36:58 Formatting partition 6 | Block group profiles:
2017/11/17 10:36:58 Formatting partition 6 |   Data:             single            8.00MiB
2017/11/17 10:36:58 Formatting partition 6 |   Metadata:         DUP             195.31MiB
2017/11/17 10:36:58 Formatting partition 6 |   System:           DUP               8.00MiB
2017/11/17 10:36:58 Formatting partition 6 | SSD detected:       no
2017/11/17 10:36:58 Formatting partition 6 | Incompat features:  extref, skinny-metadata
2017/11/17 10:36:58 Formatting partition 6 | Number of devices:  1
2017/11/17 10:36:58 Formatting partition 6 | Devices:
2017/11/17 10:36:58 Formatting partition 6 |    ID        SIZE  PATH
2017/11/17 10:36:58 Formatting partition 6 |     1     3.81GiB  /dev/disk/by-id/virtio-fakedisk-0-part6
2017/11/17 10:36:58 Formatting partition 6 | 
2017/11/17 10:36:58 Formatting partition 7 | btrfs-progs v4.9.1
2017/11/17 10:36:58 Formatting partition 7 | See http://btrfs.wiki.kernel.org for more information.
2017/11/17 10:36:58 Formatting partition 7 | 
2017/11/17 10:36:58 Formatting partition 7 | Label:              general_storage
2017/11/17 10:36:58 Formatting partition 7 | UUID:               3d24d634-dbd0-4779-b17e-51925beea726
2017/11/17 10:36:58 Formatting partition 7 | Node size:          16384
2017/11/17 10:36:58 Formatting partition 7 | Sector size:        4096
2017/11/17 10:36:58 Formatting partition 7 | Filesystem size:    2.16GiB
2017/11/17 10:36:58 Formatting partition 7 | Block group profiles:
2017/11/17 10:36:58 Formatting partition 7 |   Data:             single            8.00MiB
2017/11/17 10:36:58 Formatting partition 7 |   Metadata:         DUP             110.81MiB
2017/11/17 10:36:58 Formatting partition 7 |   System:           DUP               8.00MiB
2017/11/17 10:36:58 Formatting partition 7 | SSD detected:       no
2017/11/17 10:36:58 Formatting partition 7 | Incompat features:  extref, skinny-metadata
2017/11/17 10:36:58 Formatting partition 7 | Number of devices:  1
2017/11/17 10:36:58 Formatting partition 7 | Devices:
2017/11/17 10:36:58 Formatting partition 7 |    ID        SIZE  PATH
2017/11/17 10:36:58 Formatting partition 7 |     1     2.16GiB  /dev/disk/by-id/virtio-fakedisk-0-part7
2017/11/17 10:36:58 Formatting partition 7 | 
2017/11/17 10:36:58 Action `image-partition` failed at stage Run, error: system mount failed: no such file or directory
2017/11/17 10:36:58 >>> Starting a debug shell
@d4s
Copy link
Collaborator

d4s commented Nov 17, 2017

Some other logs from @Thread974 collected on different host:

2017/11/16 09:54:54 Formatting partition 6 |    ID        SIZE  PATH
2017/11/16 09:54:54 Formatting partition 6 |     1     3.81GiB  /dev/disk/by-id/virtio-fakedisk-0-part6
2017/11/16 09:54:54 Formatting partition 6 | 
2017/11/16 09:54:54 Formatting partition 7 | ERROR: cannot stat '/dev/disk/by-id/virtio-fakedisk-0-part7': No such file or directory
2017/11/16 09:54:54 Formatting partition 7 | ERROR: open ctree failed
2017/11/16 09:54:54 Formatting partition 7 | btrfs-progs v4.9.1
2017/11/16 09:54:54 Formatting partition 7 | See http://btrfs.wiki.kernel.org for more information.
2017/11/16 09:54:54 Formatting partition 7 | 
2017/11/16 09:54:54 Action `image-partition` failed at stage Run, error: exit status 1
2017/11/16 10:45:48 Formatting partition 7 |     1     2.07GiB  /dev/disk/by-id/virtio-fakedisk-0-part7
2017/11/16 10:45:48 Formatting partition 7 | 
2017/11/16 10:45:49 Action image-partition failed at stage Run, error: Failed to get uuid: exit status 2
2017/11/16 10:45:49 >>> Starting a debug shell
root@i7:~/Documents/bos0009/apertis-image-recipes# ../go/src/github.com/go-debos/debos/debos -t architecture:$ARCH --debug-shell apertis-image-$ARCH.yaml
2017/11/17 07:39:46 Ignored link: /etc/ld.so.conf.d//etc/ld.so.conf.d/x86_64-linux-gnu_EGL.conf
2017/11/17 07:39:46 Ignored link: /etc/ld.so.conf.d//etc/ld.so.conf.d/x86_64-linux-gnu_GL.conf
[    1.072426] systemd-fstab-generator[147]: Failed to create mount unit file /run/systemd/generator/home-fredo-Documents-bos0009-apertis\x2dimage\x2drecipes.mount, as it already exists. Duplicate entry in /etc/fstab?
[    1.483118] Error: Driver 'pcspkr' is already registered, aborting...
Running /debos --artifactdir /home/fredo/Documents/bos0009/apertis-image-recipes --template-var architecture:"amd64" /home/fredo/Documents/bos0009/apertis-image-recipes/apertis-image-amd64.yaml --debug-shell --shell /bin/bash --internal-image /dev/disk/by-id/virtio-fakedisk-0
2017/11/17 06:39:49 ==== Unpack ospack_17.12-amd64-minimal_00000000.0.tar.gz ====
2017/11/17 06:39:51 ==== image-partition ====
2017/11/17 06:39:52 Formatting partition 1 | mkfs.fat 4.0 (2016-05-06)
2017/11/17 06:39:52 Formatting partition 2 | mkfs.fat: warning - lowercase labels might not work properly with DOS or Windows
2017/11/17 06:39:52 Formatting partition 2 | mkfs.fat 4.0 (2016-05-06)
2017/11/17 06:39:52 Formatting partition 4 | mkfs.fat: warning - lowercase labels might not work properly with DOS or Windows
2017/11/17 06:39:52 Formatting partition 4 | mkfs.fat 4.0 (2016-05-06)
2017/11/17 06:39:52 Formatting partition 5 | mkfs.fat: warning - lowercase labels might not work properly with DOS or Windows
2017/11/17 06:39:52 Formatting partition 5 | mkfs.vfat: unable to open /dev/disk/by-id/virtio-fakedisk-0-part5: No such file or directory
2017/11/17 06:39:52 Formatting partition 5 | mkfs.fat 4.0 (2016-05-06)
2017/11/17 06:39:52 Action `image-partition` failed at stage Run, error: exit status 1
2017/11/17 06:39:52 >>> Starting a debug shell
bash-4.4# exit
Powering off.
[   74.266920] reboot: Power down

@stappersg
Copy link
Contributor

In #20 wrote @d4s

But to be honestly -- I catch #50 once during last 2 months but can't reproduce it again.

How often do others encounter this issue? Can they reproduce it?
(needs this issue to remain open??)

@d4s
Copy link
Collaborator

d4s commented Jan 30, 2018

Unfortunately have no enough information to close or debug :-(

@Thread974
Copy link
Author

I haven't tried for a while, but I can give it another try and let you know if it still happens.

@Thread974
Copy link
Author

It occured several times this week while building armhf images. I notice it while doing ospacks and images, rather that doing only images.

@sjoerdsimons
Copy link
Member

So with some recent improvement to fakemachine to speed up the 9pfs stuff this actually tends to happen more often. Which made it reproducible enough for some more analysis/testing.

The problem is that we currently assume in debos that after doing the partitioning we can use udev settle to wait for the device node to be create.. Unfortunately while you would expect adding a partitoin to just cause a single uevent adding the new device, it seems reality isn't that simple.

What actually comes from the kernel is:

  • change uevent on all old partitions
  • remove uevent on all old partitions
  • change uevent on the full block device (e.g. /dev/vda)
  • add uevent on all old partitions
  • change uevent on all old partitions
  • add uevent for the new partitions
  • remove uevent on all partitions (new and old)
  • change uevent on the full block device
  • add uevent on all partitions (new and old)

So it seems that when the kernel is asked to recheck the partition table it seemingly drops all current partition device and re-adds/creates them. In some artifical testing in a fakemachine udevmonitor also shows there is a gap of about 16 miliseconds between the first add event on the new partition and the second round of remove/add events.

So the reason for this race is that there is a small window where the devices files are removed and recreated by udev, if that just happens when a filesystem is created or tried to be mounted you get the issues above.

@sjoerdsimons
Copy link
Member

After a bit of peering at the kernel code and not understand why we're getting multiple sequences of events i did a bit more debugging.. It turns out udev is trying to be too helpful and will trigger a rescan of the partition table iff the block device itsef triggers an inotify IN_CLOSE_WRITE and it can get an exclusive lock on the device (e.g. no partitions mounted). When using parted this isn't useful, as parted already tell the kernel about the new partition (triggering an add uevent) and just adds confusion as seen above..

Testing with flock shows that indeed prevents it from happening and only and only a single add uevent for the new partition occurs, which is what we'd want to clear this up properly.

sjoerdsimons added a commit that referenced this issue Jun 15, 2018
Every so often we've seen races where seemingly the partition device
disappeared when trying to use it. These became a bit more frequent
after some speedups in fakemachine. Digging deeper it turns out that
udev helpfully ask the kernel to rescan the partition table when it gets
told via inotify that something might have written to it.

As we're using parted, which already tells the kernel about partitions
it created, this gets rather confusion. The new partition device would pop up
(triggered by parted), parted would close the disk device node; This
then would notify udev, which asked the kernel to rescan the partition
table on the device *again*, which in turn causes all the partition
device nodes to disappear and re-appear, leaving a nice window for
things to go horribly horribly wrong.

Ofcourse this all happens asynchronously, so there is really no way to
know everything settled down..

Luckily there is one upside to this depressing story, udev only does
this when it can get an exclusive lock on the disk device (e.g. none of
the partitions are mounted and nobody else holds an exclusive lock). So
we can avoid all this madness by simply holding an exclusive device
lock and only releasing it after mounting has happened.

As one last unexpected side-effect, while we do hold the exclusive lock
it turns out udev also does not setup the various symlinks. So instead
for the partition device use the canonical device names rather then the
aliases. On the bright side those device nodes are generated directly by
the kernel with devtmpfs, so there is no need to wait for udev to have
done its processing anymore.

Closes #50

Signed-off-by: Sjoerd Simons <[email protected]>
sjoerdsimons added a commit that referenced this issue Jun 15, 2018
Every so often we've seen races where seemingly the partition device
disappeared when trying to use it. These became a bit more frequent
after some speedups in fakemachine. Digging deeper it turns out that
udev helpfully ask the kernel to rescan the partition table when it gets
told via inotify that something might have written to it.

As we're using parted, which already tells the kernel about partitions
it created, this gets rather confusion. The new partition device would pop up
(triggered by parted), parted would close the disk device node; This
then would notify udev, which asked the kernel to rescan the partition
table on the device *again*, which in turn causes all the partition
device nodes to disappear and re-appear, leaving a nice window for
things to go horribly horribly wrong.

Ofcourse this all happens asynchronously, so there is really no way to
know everything settled down..

Luckily there is one upside to this depressing story, udev only does
this when it can get an exclusive lock on the disk device (e.g. none of
the partitions are mounted and nobody else holds an exclusive lock). So
we can avoid all this madness by simply holding an exclusive device
lock and only releasing it after mounting has happened.

As one last unexpected side-effect, while we do hold the exclusive lock
it turns out udev also does not setup the various symlinks. So instead
for the partition device use the canonical device names rather then the
aliases. On the bright side those device nodes are generated directly by
the kernel with devtmpfs, so there is no need to wait for udev to have
done its processing anymore.

Closes #50

Signed-off-by: Sjoerd Simons <[email protected]>
sjoerdsimons added a commit that referenced this issue Jun 15, 2018
Every so often we've seen races where seemingly the partition device
disappeared when trying to use it. These became a bit more frequent
after some speedups in fakemachine. Digging deeper it turns out that
udev helpfully ask the kernel to rescan the partition table when it gets
told via inotify that something might have written to it.

As we're using parted, which already tells the kernel about partitions
it created, this gets rather confusion. The new partition device would pop up
(triggered by parted), parted would close the disk device node; This
then would notify udev, which asked the kernel to rescan the partition
table on the device *again*, which in turn causes all the partition
device nodes to disappear and re-appear, leaving a nice window for
things to go horribly horribly wrong.

Ofcourse this all happens asynchronously, so there is really no way to
know everything settled down..

Luckily there is one upside to this depressing story, udev only does
this when it can get an exclusive lock on the disk device (e.g. none of
the partitions are mounted and nobody else holds an exclusive lock). So
we can avoid all this madness by simply holding an exclusive device
lock and only releasing it after mounting has happened.

As one last unexpected side-effect, while we do hold the exclusive lock
it turns out udev also does not setup the various symlinks. So instead
for the partition device use the canonical device names rather then the
aliases. On the bright side those device nodes are generated directly by
the kernel with devtmpfs, so there is no need to wait for udev to have
done its processing anymore.

Closes #50

Signed-off-by: Sjoerd Simons <[email protected]>
@d4s d4s closed this as completed in #92 Jun 15, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants