Add support for CephFS volumes / sub-volumes #1023

Open
benaryorg opened this issue Jul 19, 2024 · 8 comments
Labels
Feature New feature, not a bug
Milestone

Comments

@benaryorg
Contributor

Required information

  • Distribution: NixOS
  • Distribution version: 24.05
  • The output of "incus info" or if that fails:
    • Kernel version: 6.6.40
    • LXC version: 6.0.1
    • Incus version: 6.2.0
    • Storage backend in use: CephFS

Issue description

CephFS changed its mount string syntax in Quincy, the release that has recently reached its estimated EoL date (the current release being Reef, with Squid upcoming AFAIK).
This means that every still-active upstream release (not talking about distros) uses a mount string that differs from the one Incus is using right now.

This leads to users having a really hard time trying to mount a CephFS created via the newer CephFS volumes/subvolumes mechanism (at least I haven't gotten it working yet).

As described on the discussion boards, the old syntax was:

[mon1-addr]:3300,[mon2-addr]:3300,[mon3-addr]:3300:/path/to/thing

and a lot of options passed via the -o parameter (or the corresponding field of the mount syscall).
Notably, Incus does not rely on the config file for this but manually scrapes the mon addresses out of it. That has its own issues: the string matching used is insufficient to handle a ceph.conf where the initial mon list only refers to the mons by name and the mons are then listed in their own sections with their addresses given directly as mon_addr. In that case mount.ceph can mount the volume just fine, but Incus already fails while parsing the config file.
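
For illustration, a ceph.conf layout along these lines (names and addresses are made up here) is handled fine by mount.ceph, but the mon address scraping described above does not cover it, because the global section only names the mons and their addresses live in the per-mon sections:

[global]
fsid = <cluster-fsid>
mon initial members = mon1, mon2, mon3

[mon.mon1]
mon addr = [2001:db8::1:0]:3300

[mon.mon2]
mon addr = [2001:db8::1:1]:3300

[mon.mon3]
mon addr = [2001:db8::1:2]:3300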

The new syntax is:

user@fsid.fs-name=/path/to/thing

So with the user, the (optional) fsid, and the CephFS name being encoded into the string, there are a few fewer options, although some do still exist.
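
For comparison, the same mount expressed in both syntaxes might look roughly like this (names, paths, and addresses are placeholders, and the exact options depend on the setup):

# old-style source string, everything else goes through -o
mount -t ceph [mon1-addr]:3300,[mon2-addr]:3300,[mon3-addr]:3300:/path/to/thing /mnt \
    -o name=user,secretfile=/etc/ceph/user.secret,mds_namespace=fs-name

# new-style source string, with user, fsid, and filesystem name encoded in it
mount -t ceph user@fsid.fs-name=/path/to/thing /mnt \
    -o secretfile=/etc/ceph/user.secret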

Steps to reproduce

  1. run CephFS on ≥Quincy
  2. create CephFS volume and subvolume
  3. try to mount it

With vaguely correct-seeming parameters provided to Incus this still leads to interesting issues such as No route to host errors despite everything being reachable.
Honestly, if you find options that manage to mount this, please tell me, because I can't seem to find any.

Information to attach

  • Any relevant kernel output (dmesg)
[ +13.628392] libceph: mon0 (1)[2001:db8::1:0]:3300 socket closed (con state V1_BANNER)
[  +0.271853] libceph: mon0 (1)[2001:db8::1:0]:3300 socket closed (con state V1_BANNER)
[  +0.519922] libceph: mon0 (1)[2001:db8::1:0]:3300 socket closed (con state V1_BANNER)
[  +0.520979] ceph: No mds server is up or the cluster is laggy
  • Main daemon log (at /var/log/incus/incusd.log)
Jul 19 20:32:09 lxd2 incusd[10412]: time="2024-07-19T20:32:09Z" level=error msg="Failed mounting storage pool" err="Failed to mount \"[2001:41d0:700:2038::1:0]:3300,[2001:41d0:1004:1a22::1:1]:3300,[2001:41d0:602:2029::1:2]:3300:/\" on \"/var/lib/incus/storage-pools/cephfs\" using \"ceph\": invalid argument" pool=cephfs
  • Container log (incus info NAME --show-log)
  • Container configuration (incus config show NAME --expanded)
  • Output of the client with --debug
  • Output of the daemon with --debug (alternatively output of incus monitor --pretty while reproducing the issue) (doesn't really log anything about the issue)
@benaryorg
Contributor Author

benaryorg commented Jul 19, 2024

For completeness' sake, here are some commands to get a new CephFS volume and subvolume up and running, and what the final mount command might look like (I'm pulling this out of my shell history, so it's not guaranteed to be 100% accurate):

ceph fs volume create volume-name
ceph fs subvolumegroup create volume-name subvolume-group-name
ceph fs subvolume create volume-name subvolume-name --group_name subvolume-group-name

# this will now spit out a path including the UUID of the subvolume:
ceph fs subvolume getpath volume-name subvolume-name --group_name subvolume-group-name
# then authorize a new client (syntax changes slightly in upcoming version)
ceph fs authorize volume-name client.client-name /volumes/subvolume-group-name/subvolume-name/e7c5cd0c-10fa-42e2-9d48-902544f13d07 rw
# which can be mounted like (fsid can be omitted if it is in ceph.conf, key will be read from keyring in /etc/ceph too):
mount -t ceph client-name@fsid.volume-name=/volumes/subvolume-group-name/subvolume-name/e7c5cd0c-10fa-42e2-9d48-902544f13d07 /mnt
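
Not from my history, but for reference (these are standard Ceph commands): the fsid used in the mount string and the subvolumes that exist can be queried like this:

# print the cluster fsid for the new-style mount string
ceph fsid
# list the subvolumes in the group created above
ceph fs subvolume ls volume-name --group_name subvolume-group-name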

@tregubovav-dev

Just a question: which use case is blocked?
I actively use a CephFS storage pool with my Incus + Microceph deployment (as well as with LXD + Microceph in the past) and I do not see any issues. All such volumes are mounted into the instances.

@stgraber stgraber added this to the later milestone Jul 20, 2024
@stgraber stgraber added the Feature New feature, not a bug label Jul 20, 2024
@stgraber stgraber changed the title from "CephFS mount uses deprecated mount-string, fails to mount Volume based CephFS" to "Add support for CephFS volumes / sub-volumes" Jul 20, 2024
@benaryorg
Contributor Author

benaryorg commented Jul 20, 2024

Just a question: which use case is blocked? I actively use a CephFS storage pool with my Incus + Microceph deployment (as well as with LXD + Microceph in the past) and I do not see any issues. All such volumes are mounted into the instances.

What does your storage configuration look like?
I've tried several permutations that looked like they could work, but considering that at one point I had to drop down to incus admin sql just to be able to delete a storage pool (which got stuck in pending forever), I did not try everything.

@tregubovav-dev

tregubovav-dev commented Jul 20, 2024

What does your storage configuration look like? I've tried several permutations that looked like they could work, but considering that at one point I had to drop down to incus admin sql just to be able to delete a storage pool (which got stuck in pending forever), I did not try everything.

My cluster configuration is:

        NAME        DRIVER  DESCRIPTION  USED BY   STATE
  remote            ceph                 62       CREATED
  shared_vols       cephfs               11       CREATED
  test_shared_vols  cephfs               1        CREATED
  • test_shared_vols configuration
config:
  cephfs.cluster_name: ceph
  cephfs.path: lxd_test_shared
  cephfs.user.name: admin
description: ""
name: test_shared_vols
driver: cephfs
used_by:
- /1.0/storage-pools/test_shared_vols/volumes/custom/test_vol1?project=test
status: Created
locations:
- cl-06
- cl-07
- cl-01
- cl-02
- cl-03
- cl-04
- cl-05

Steps to create a storage pool and deploy instances sharing files via CephFS volumes

  1. You need to have an existing CephFS volume. In my case it's:
$ sudo ceph fs ls
name: lxd_test_shared, metadata pool: lxd_test_shared_pool_meta, data pools: [lxd_test_shared_pool_data ]
  2. Create the storage pool:
for i in {1..7}; do incus storage create test_shared_vols cephfs source=lxd_test_shared --target cl-0$i; done \
  && incus storage create test_shared_vols cephfs
  3. Once the pool is created, you can create a storage volume (I use a separate project for the volume and the instances that use it):
    incus storage volume create test_shared_vols test_vol1 size=256MiB --project test
  4. Create the instances and attach the volume to them:
for i in {1..7}; do inst=test-ct-0$i; \
  echo "Launching instance: $inst"; incus launch images:alpine/edge $inst --project test; \
  echo "Attaching 'test_vol1' to the instance"; incus storage volume attach test_shared_vols test_vol1 $inst data "/data" --project test; \
  echo "Listing content of '/data' directory:"; incus exec $inst --project test -- ls -l /data; \
  done
  5. Put a file on the shared volume:
    incus exec test-ct-04 --project test -- sh -c 'echo -e "This is a file\n placed to the shared volume.\n It is accessible from any instance where this volume is attached.\n" > /data/test.txt'
  6. Check the file's existence and its content on each node:
for i in {1..7}; do inst=test-ct-0$i; echo "Listing content of '/data' directory in the $inst instance"; incus exec $inst --project test -- ls -l /data; done

for i in {1..7}; do inst=test-ct-0$i; echo "--- Printing content of '/data/test.txt' file in the $inst instance ---"; incus exec $inst --project test -- cat /data/test.txt; done

@benaryorg
Contributor Author

So far it does not look like you are using the ceph fs volume feature (at least not with subvolumes), otherwise your CephFS paths would include a UUID somewhere. Besides, using the admin credentials side-steps the mounting issues that I'm seeing, because they let you mount the root of the CephFS even when you only meant to mount a subvolume. If you create a subvolume as per my first reply in this issue, you end up with credentials that do not have access to the root of the CephFS, so as far as I can tell you cannot use the storage configuration you provided (it does not contain any path, and would therefore fail to mount for lack of permissions).
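
For illustration, the caps that ceph fs authorize hands out for a subvolume path look roughly like this (names taken from my earlier comment, key redacted, the exact caps vary a bit between Ceph releases), which is why such a client cannot mount the root of the filesystem:

ceph auth get client.client-name
[client.client-name]
        key = <redacted>
        caps mds = "allow rw fsname=volume-name path=/volumes/subvolume-group-name/subvolume-name/e7c5cd0c-10fa-42e2-9d48-902544f13d07"
        caps mon = "allow r fsname=volume-name"
        caps osd = "allow rw tag cephfs data=volume-name"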

@tregubovav-dev

So far it does not look like you are using the ceph fs volume feature

Yes, you are correct. That is why I asked about your use case.

@benaryorg
Contributor Author

Yes, you are correct. That is why I asked about your use case.

Ah, I see.
The primary advantage to me personally is that I don't have to lay out a directory structure manually (i.e. I do not have to mount the CephFS with elevated privileges such as client.admin just to administrate it), quota support is baked in, and authorizing individual clients for shares becomes programmatic through that specific API (i.e. less worrying about adding or removing caps outside the CephFS system).

If I were to automate Incus cluster deployment (or even just deployment for individual consumers of CephFS, while handling Incus the same way), I could use the RESTful API module of the MGR for many operations in a way that is much less error prone than managing CephFS through the other APIs; I wouldn't need to create individual directory trees, and I would not have to enforce my own convention for how the trees are laid out (since volumes come with their own, very specific layout). Quota management also stops being a matter of writing an xattr on a specific directory and becomes a property of the subvolume itself. The combination of getpath and the way authorization is handled also makes it a little harder to accidentally use the wrong path. This is mostly about automation and handling things programmatically, which is in line with what OpenStack Manila wants from its backend.
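
As a concrete example of the quota point (hedged, the exact flags may differ between releases): plain CephFS quotas are an xattr on a directory that you first have to mount, while a subvolume carries its size as a property of its own:

# plain CephFS: write a quota xattr on an already mounted directory
setfattr -n ceph.quota.max_bytes -v 10737418240 /mnt/cephfs/path/to/share

# subvolumes: the size is set (and changed) through the volumes API
ceph fs subvolume create volume-name subvolume-name --size 10737418240 --group_name subvolume-group-name
ceph fs subvolume resize volume-name subvolume-name 21474836480 --group_name subvolume-group-name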

Especially when administrating a Ceph cluster in a team with several admins, the added constraints make it much easier to work together, since there are no conventions you have to define and stick to yourself; Ceph already enforces them.

Being able to create multiple volumes, each of which comes with its own pools and MDSs, also greatly improves how things work when you have to separate tenants for whatever reason.
Given that it's often beneficial to run one big Ceph cluster instead of many small ones (to limit the number of failure domains), I can see how some of the customers I have worked with would like to use that feature (granted, none of them were using Incus), and for any newer cluster I would absolutely recommend using volumes, if only so that you don't have to go back later and clean up every place where things weren't properly separated (inevitably every user of Ceph needs some level of isolation at some point; I've never not seen it happen).

In short: it keeps me from tripping over my own feet when adding a new isolated filesystem share by taking care of credential management, directory creation, and quotas, all things I'd surely manage to mess up at least once, like deleting the client.ceph credentials or something (which wouldn't be possible with the ceph fs deauthorize command as far as I can tell).

TL;DR: it's just more robust as soon as you need separate shares for different clients, and it makes managing the cluster easier when there is a strong separation of concerns.

@tregubovav-dev

Ah, I see.

I appreciate your detailed explanation.
