
Can't delete container with ceph storage - exit status 16 (rbd: sysfs write failed rbd: unmap failed: (16) Device or resource busy) #1087

trunet opened this issue Aug 7, 2024 · 7 comments
Labels: Incomplete (Waiting on more information from reporter)


trunet commented Aug 7, 2024

Required information

  • Distribution: Ubuntu 22.04
  • Distribution version:
  • The output of "incus info":
config:
  cluster.https_address: [REDACTED]:8443
  core.bgp_address: [REDACTED]:179
  core.bgp_asn: "[REDACTED]"
  core.bgp_routerid: [REDACTED]
  core.https_address: [REDACTED]:8443
  images.auto_update_interval: "0"
  network.ovn.northbound_connection: tcp:[REDACTED]:6641,tcp:[REDACTED]:6641,tcp:[REDACTED]:6641
api_extensions:
  [TOO MUCH UNRELATED STUFF]
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
auth_user_name: root
auth_user_method: unix
environment:
  addresses:
  - [REDACTED]:8443
  architectures:
  - x86_64
  - i686
  certificate: |
    [REDACTED]
  certificate_fingerprint: [REDACTED]
  driver: lxc | qemu
  driver_version: 6.0.1 | 9.0.1
  firewall: nftables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    idmapped_mounts: "true"
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    uevent_injection: "true"
    unpriv_binfmt: "false"
    unpriv_fscaps: "true"
  kernel_version: 5.15.0-116-generic
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Ubuntu
  os_version: "22.04"
  project: default
  server: incus
  server_clustered: true
  server_event_mode: full-mesh
  server_name: [REDACTED]
  server_pid: 513376
  server_version: "6.3"
  storage: ceph
  storage_version: 17.2.6
  storage_supported_drivers:
  - name: cephobject
    version: 17.2.6
    remote: true
  - name: dir
    version: "1"
    remote: false
  - name: lvm
    version: 2.03.11(2) (2021-01-08) / 1.02.175 (2021-01-08) / 4.45.0
    remote: false
  - name: lvmcluster
    version: 2.03.11(2) (2021-01-08) / 1.02.175 (2021-01-08) / 4.45.0
    remote: true
  - name: zfs
    version: 2.1.5-1ubuntu6~22.04.4
    remote: false
  - name: btrfs
    version: 5.16.2
    remote: false
  - name: ceph
    version: 17.2.6
    remote: true
  - name: cephfs
    version: 17.2.6
    remote: true

Issue description

The container stops, but with an error, and it can't be deleted without manual Ceph workarounds.

Steps to reproduce

  1. incus start my-container
  2. incus stop my-container
Error: Failed unmounting instance: Failed to run: rbd --id admin --cluster ceph --pool remote unmap container_my-container: exit status 16 (rbd: sysfs write failed
rbd: unmap failed: (16) Device or resource busy)
Try `incus info --show-log my-container` for more info
  3. incus delete my-container
Error: Failed deleting instance "[REDACTED]" in project "default": Error deleting storage volume: Failed to delete volume: Failed to run: rbd --id admin --cluster ceph --pool remote unmap container_my-container: exit status 16 (rbd: sysfs write failed
rbd: unmap failed: (16) Device or resource busy)
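
(A manual Ceph cleanup along these lines is what it takes before the delete goes through. This is only a sketch: the device/volume name comes from `rbd showmapped`, and `rbd unmap -o force` assumes nothing is actively using a filesystem on the device.)

# rbd showmapped
# rbd unmap -o force /dev/rbd0
# incus delete my-container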

Information to attach

# rbd showmapped
...
0   remote             container_my-container                     -     /dev/rbd0
...

# grep rbd0 /proc/*/mountinfo
[EMPTY]

# grep rbd0 /proc/self/mountinfo
[EMPTY]
  • Any relevant kernel output (dmesg) - NOTHING
  • Container log (incus info NAME --show-log)
❯ incus info --show-log my-container
Name: my-container
Status: STOPPED
Type: container
Architecture: x86_64
Location: [REDACTED]
Created: 2024/08/07 15:40 -03
Last Used: 2024/08/07 17:51 -03

Log:


  • Container configuration (incus config show NAME --expanded)
architecture: x86_64
config:
 cloud-init.user-data: |+
   #cloud-config
   write_files:
     - path: /etc/sssd/add_group_access_from_cloudinit.conf
       content: |
         [REDACTED]
       owner: 'root:root'
       permissions: '0600'

 image.aliases: 24.04
 image.architecture: amd64
 image.description: Ubuntu 24.04 noble (20240729_20:58:30)
 image.os: Ubuntu
 image.release: noble
 image.requirements.cgroup: v2
 image.serial: "20240729_20:58:30"
 image.type: squashfs
 image.variant: cloud
 limits.cpu.allowance: 50%
 limits.memory: 1GiB
 migration.stateful: "true"
 volatile.base_image: b0127e4d2d45b502024a667dbebb5869327cc7ebfceba529c0ff556d376f9287
 volatile.cloud-init.instance-id: cd4c191d-1a13-4e57-a1f9-f81f97eb65c3
 volatile.eth0.hwaddr: 00:16:3e:7d:83:84
 volatile.eth0.last_state.ip_addresses: [REDACTED]
 volatile.eth0.name: eth0
 volatile.idmap.base: "0"
 volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
 volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
 volatile.last_state.idmap: '[]'
 volatile.last_state.power: STOPPED
 volatile.last_state.ready: "false"
 volatile.uuid: d567218b-d0e1-4987-8387-d36aca06f6ae
 volatile.uuid.generation: d567218b-d0e1-4987-8387-d36aca06f6ae
devices:
 audit:
   path: /opt/vault-audit
   pool: remote
   source: [REDACTED]
   type: disk
 data:
   path: /opt/vault
   pool: remote
   source: [REDACTED]
   type: disk
 eth0:
   network: ovn-vault
   type: nic
 root:
   path: /
   pool: remote
   size: 20GiB
   type: disk
ephemeral: false
profiles:
- vault
stateful: false
description: [REDACTED] instance
  • Main daemon log (at /var/log/incus/incusd.log) - NOTHING RELEVANT
  • Output of the client with --debug
  • Output of the daemon with --debug (alternatively output of incus monitor --pretty while reproducing the issue)
stgraber commented Aug 7, 2024

Is that happening for all your containers?

How is Ceph run on those systems? Docker-backed Ceph has caused this kind of issue in the past, but I'd still have expected to see the rbd device showing up in someone's mount table.

@stgraber added the Incomplete (Waiting on more information from reporter) label on Aug 7, 2024
trunet commented Aug 7, 2024

It happens to some containers, randomly.

We're running a MicroCeph (snap) cluster.

trunet commented Aug 8, 2024

# python3
Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.open('/dev/rbd0', os.O_EXCL)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 16] Device or resource busy: '/dev/rbd0'

But I can't find what's keeping it busy.
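
A couple of other generic block-device checks that might show a holder (just a sketch, nothing Incus-specific):

# ls /sys/block/rbd0/holders     # anything stacked on top (device-mapper/LVM) would show up here
# fuser -vm /dev/rbd0            # userspace processes holding the device or a mount backed by it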

trunet commented Aug 8, 2024

# lsof 2>&1 | grep rbd0 | grep -v 'no pwd entry'
rbd0-task 492026                             root  cwd       DIR              8,194       4096          2 /
rbd0-task 492026                             root  rtd       DIR              8,194       4096          2 /
rbd0-task 492026                             root  txt   unknown                                          /proc/492026/exe
jbd2/rbd0 492043                             root  cwd       DIR              8,194       4096          2 /
jbd2/rbd0 492043                             root  rtd       DIR              8,194       4096          2 /
jbd2/rbd0 492043                             root  txt   unknown                                          /proc/492043/exe

ps shows:

root      492026       2  0 Aug07 ?        00:00:00   [rbd0-tasks]
root      492043       2  0 Aug07 ?        00:00:00   [jbd2/rbd0-8]

# cat /proc/492026/stack
[<0>] rescuer_thread+0x321/0x3c0
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x1f/0x30
# cat /proc/492043/stack
[<0>] kjournald2+0x219/0x280
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x1f/0x30
# rbd info remote/[REDACTED]
rbd image '[REDACTED]':
	size 20 GiB in 5120 objects
	order 22 (4 MiB objects)
	snapshot_count: 0
	id: 6a464890b0c9c
	block_name_prefix: rbd_data.6a464890b0c9c
	format: 2
	features: layering
	op_features:
	flags:
	create_timestamp: Wed Aug  7 18:40:22 2024
	access_timestamp: Wed Aug  7 18:40:22 2024
	modify_timestamp: Wed Aug  7 18:40:22 2024
	parent: remote/image_b0127e4d2d45b502024a667dbebb5869327cc7ebfceba529c0ff556d376f9287_ext4@readonly
	overlap: 10 GiB
	
# rbd status -p remote [REDACTED]
Watchers:
	watcher=[REDACTED_SAME_SERVER_IP]:0/401004797 client.385955 cookie=18446462598732841706

# cat /sys/kernel/debug/ceph/ad848cbe-c127-4fc9-aeca-4a297799a866.client385955/osdc | grep 6a46
18446462598732841706	osd13	4.68c2cd33	4.13	[13,4,21]/13	[13,4,21]/13	e2479	rbd_header.6a464890b0c9c	0x20	0	WC/0

# rados stat -p remote rbd_header.6a464890b0c9c
remote/rbd_header.6a464890b0c9c mtime 2024-08-07T18:40:28.000000+0000, size 0
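
The jbd2/rbd0-8 kthread means an ext4 journal is still live on the device, which normally implies the filesystem is still mounted somewhere, possibly in a mount namespace the /proc/*/mountinfo grep didn't catch. A way to double-check every mount namespace explicitly (a sketch; the MicroCeph snap's namespace would be the obvious suspect):

# for pid in $(ls /proc | grep -E '^[0-9]+$'); do
    nsenter --target "$pid" --mount -- findmnt --source /dev/rbd0 --noheadings 2>/dev/null \
      && echo "  ^ mounted in the mount namespace of PID $pid"
  done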

stgraber commented Aug 8, 2024

That's starting to sound more and more like a kernel bug...

stgraber commented Aug 8, 2024

Any chance you can try a newer kernel? Maybe try the 22.04 HWE kernel to get onto 6.5?
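
Something along these lines should do it (assuming the stock Ubuntu HWE metapackage):

# apt update
# apt install linux-generic-hwe-22.04    # pulls in the 6.5 HWE kernel
# reboot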

trunet commented Aug 8, 2024

I'll upgrade and check.
