
Can't delete container with ceph storage - exit status 16 (rbd: sysfs write failed rbd: unmap failed: (16) Device or resource busy) #1087

trunet opened this issue Aug 7, 2024 · 7 comments
Labels: Incomplete (Waiting on more information from reporter)


trunet commented Aug 7, 2024

Required information

  • Distribution: Ubuntu 22.04
  • Distribution version:
  • The output of "incus info":
config:
  cluster.https_address: [REDACTED]:8443
  core.bgp_address: [REDACTED]:179
  core.bgp_asn: "[REDACTED]"
  core.bgp_routerid: [REDACTED]
  core.https_address: [REDACTED]:8443
  images.auto_update_interval: "0"
  network.ovn.northbound_connection: tcp:[REDACTED]:6641,tcp:[REDACTED]:6641,tcp:[REDACTED]:6641
api_extensions:
  [TOO MUCH UNRELATED STUFF]
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
auth_user_name: root
auth_user_method: unix
environment:
  addresses:
  - [REDACTED]:8443
  architectures:
  - x86_64
  - i686
  certificate: |
    [REDACTED]
  certificate_fingerprint: [REDACTED]
  driver: lxc | qemu
  driver_version: 6.0.1 | 9.0.1
  firewall: nftables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    idmapped_mounts: "true"
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    uevent_injection: "true"
    unpriv_binfmt: "false"
    unpriv_fscaps: "true"
  kernel_version: 5.15.0-116-generic
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Ubuntu
  os_version: "22.04"
  project: default
  server: incus
  server_clustered: true
  server_event_mode: full-mesh
  server_name: [REDACTED]
  server_pid: 513376
  server_version: "6.3"
  storage: ceph
  storage_version: 17.2.6
  storage_supported_drivers:
  - name: cephobject
    version: 17.2.6
    remote: true
  - name: dir
    version: "1"
    remote: false
  - name: lvm
    version: 2.03.11(2) (2021-01-08) / 1.02.175 (2021-01-08) / 4.45.0
    remote: false
  - name: lvmcluster
    version: 2.03.11(2) (2021-01-08) / 1.02.175 (2021-01-08) / 4.45.0
    remote: true
  - name: zfs
    version: 2.1.5-1ubuntu6~22.04.4
    remote: false
  - name: btrfs
    version: 5.16.2
    remote: false
  - name: ceph
    version: 17.2.6
    remote: true
  - name: cephfs
    version: 17.2.6
    remote: true

Issue description

The container stops, but with an error, and it can't be deleted without manual Ceph workarounds.

Steps to reproduce

  1. incus start my-container
  2. incus stop my-container
Error: Failed unmounting instance: Failed to run: rbd --id admin --cluster ceph --pool remote unmap container_my-container: exit status 16 (rbd: sysfs write failed
rbd: unmap failed: (16) Device or resource busy)
Try `incus info --show-log my-container` for more info
  3. incus delete my-container
Error: Failed deleting instance "[REDACTED]" in project "default": Error deleting storage volume: Failed to delete volume: Failed to run: rbd --id admin --cluster ceph --pool remote unmap container_my-container: exit status 16 (rbd: sysfs write failed
rbd: unmap failed: (16) Device or resource busy)
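
(A manual Ceph cleanup along these lines is what it takes before the delete goes through. This is only a sketch: the device/volume name comes from `rbd showmapped`, and `rbd unmap -o force` assumes nothing is actively using a filesystem on the device.)

# rbd showmapped
# rbd unmap -o force /dev/rbd0
# incus delete my-container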

Information to attach

# rbd showmapped
...
0   remote             container_my-container                     -     /dev/rbd0
...

# grep rbd0 /proc/*/mountinfo
[EMPTY]

# grep rbd0 /proc/self/mountinfo
[EMPTY]
  • Any relevant kernel output (dmesg) - NOTHING
  • Container log (incus info NAME --show-log)
❯ incus info --show-log my-container
Name: my-container
Status: STOPPED
Type: container
Architecture: x86_64
Location: [REDACTED]
Created: 2024/08/07 15:40 -03
Last Used: 2024/08/07 17:51 -03

Log:


  • Container configuration (incus config show NAME --expanded)
architecture: x86_64
config:
 cloud-init.user-data: |+
   #cloud-config
   write_files:
     - path: /etc/sssd/add_group_access_from_cloudinit.conf
       content: |
         [REDACTED]
       owner: 'root:root'
       permissions: '0600'

 image.aliases: 24.04
 image.architecture: amd64
 image.description: Ubuntu 24.04 noble (20240729_20:58:30)
 image.os: Ubuntu
 image.release: noble
 image.requirements.cgroup: v2
 image.serial: "20240729_20:58:30"
 image.type: squashfs
 image.variant: cloud
 limits.cpu.allowance: 50%
 limits.memory: 1GiB
 migration.stateful: "true"
 volatile.base_image: b0127e4d2d45b502024a667dbebb5869327cc7ebfceba529c0ff556d376f9287
 volatile.cloud-init.instance-id: cd4c191d-1a13-4e57-a1f9-f81f97eb65c3
 volatile.eth0.hwaddr: 00:16:3e:7d:83:84
 volatile.eth0.last_state.ip_addresses: [REDACTED]
 volatile.eth0.name: eth0
 volatile.idmap.base: "0"
 volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
 volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
 volatile.last_state.idmap: '[]'
 volatile.last_state.power: STOPPED
 volatile.last_state.ready: "false"
 volatile.uuid: d567218b-d0e1-4987-8387-d36aca06f6ae
 volatile.uuid.generation: d567218b-d0e1-4987-8387-d36aca06f6ae
devices:
 audit:
   path: /opt/vault-audit
   pool: remote
   source: [REDACTED]
   type: disk
 data:
   path: /opt/vault
   pool: remote
   source: [REDACTED]
   type: disk
 eth0:
   network: ovn-vault
   type: nic
 root:
   path: /
   pool: remote
   size: 20GiB
   type: disk
ephemeral: false
profiles:
- vault
stateful: false
description: [REDACTED] instance
  • Main daemon log (at /var/log/incus/incusd.log) - NOTHING RELEVANT
  • Output of the client with --debug
  • Output of the daemon with --debug (alternatively output of incus monitor --pretty while reproducing the issue)
stgraber commented Aug 7, 2024

Is that happening for all your containers?

How is Ceph run on those systems? Docker-backed Ceph has caused this kind of issue in the past, but I'd still have expected to see the rbd device showing up in someone's mount table.

@stgraber added the Incomplete (Waiting on more information from reporter) label on Aug 7, 2024
trunet commented Aug 7, 2024

It happens to some containers, randomly.

We're running a MicroCeph (snap) cluster.

trunet commented Aug 8, 2024

# python3
Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.open('/dev/rbd0', os.O_EXCL)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 16] Device or resource busy: '/dev/rbd0'

But I can't find what's keeping it busy.
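
A couple of other generic block-device checks that might show a holder (just a sketch, nothing Incus-specific):

# ls /sys/block/rbd0/holders     # anything stacked on top (device-mapper/LVM) would show up here
# fuser -vm /dev/rbd0            # userspace processes holding the device or a mount backed by it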

trunet commented Aug 8, 2024

# lsof 2>&1 | grep rbd0 | grep -v 'no pwd entry'
rbd0-task 492026                             root  cwd       DIR              8,194       4096          2 /
rbd0-task 492026                             root  rtd       DIR              8,194       4096          2 /
rbd0-task 492026                             root  txt   unknown                                          /proc/492026/exe
jbd2/rbd0 492043                             root  cwd       DIR              8,194       4096          2 /
jbd2/rbd0 492043                             root  rtd       DIR              8,194       4096          2 /
jbd2/rbd0 492043                             root  txt   unknown                                          /proc/492043/exe

ps shows:

root      492026       2  0 Aug07 ?        00:00:00   [rbd0-tasks]
root      492043       2  0 Aug07 ?        00:00:00   [jbd2/rbd0-8]

# cat /proc/492026/stack
[<0>] rescuer_thread+0x321/0x3c0
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x1f/0x30
# cat /proc/492043/stack
[<0>] kjournald2+0x219/0x280
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x1f/0x30
# rbd info remote/[REDACTED]
rbd image '[REDACTED]':
	size 20 GiB in 5120 objects
	order 22 (4 MiB objects)
	snapshot_count: 0
	id: 6a464890b0c9c
	block_name_prefix: rbd_data.6a464890b0c9c
	format: 2
	features: layering
	op_features:
	flags:
	create_timestamp: Wed Aug  7 18:40:22 2024
	access_timestamp: Wed Aug  7 18:40:22 2024
	modify_timestamp: Wed Aug  7 18:40:22 2024
	parent: remote/image_b0127e4d2d45b502024a667dbebb5869327cc7ebfceba529c0ff556d376f9287_ext4@readonly
	overlap: 10 GiB
	
# rbd status -p remote [REDACTED]
Watchers:
	watcher=[REDACTED_SAME_SERVER_IP]:0/401004797 client.385955 cookie=18446462598732841706

# cat /sys/kernel/debug/ceph/ad848cbe-c127-4fc9-aeca-4a297799a866.client385955/osdc | grep 6a46
18446462598732841706	osd13	4.68c2cd33	4.13	[13,4,21]/13	[13,4,21]/13	e2479	rbd_header.6a464890b0c9c	0x20	0	WC/0

# rados stat -p remote rbd_header.6a464890b0c9c
remote/rbd_header.6a464890b0c9c mtime 2024-08-07T18:40:28.000000+0000, size 0
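
The jbd2/rbd0-8 kthread means an ext4 journal is still live on the device, which normally implies the filesystem is still mounted somewhere, possibly in a mount namespace the /proc/*/mountinfo grep didn't catch. A way to double-check every mount namespace explicitly (a sketch; the MicroCeph snap's namespace would be the obvious suspect):

# for pid in $(ls /proc | grep -E '^[0-9]+$'); do
    nsenter --target "$pid" --mount -- findmnt --source /dev/rbd0 --noheadings 2>/dev/null \
      && echo "  ^ mounted in the mount namespace of PID $pid"
  done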

stgraber commented Aug 8, 2024

That's starting to sound more and more like a kernel bug...

stgraber commented Aug 8, 2024

Any chance you can try a newer kernel? Maybe try the 22.04 HWE kernel to get onto 6.5?
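
Something along these lines should do it (assuming the stock Ubuntu HWE metapackage):

# apt update
# apt install linux-generic-hwe-22.04    # pulls in the 6.5 HWE kernel
# reboot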

trunet commented Aug 8, 2024

I'll upgrade and check.
