PVC resize broke DRBD resource after a node reboot #763

Open
maxpain opened this issue Jan 20, 2025 · 2 comments
maxpain commented Jan 20, 2025

Hello. I use Talos Linux v1.8.3 and the latest version of Piraeus Operator.
I created a PVC a few months ago, and everything worked fine. A few days ago, I resized this PVC, and the resize completed without problems.
Today, I rebooted the w1 node and got this state:

Image

I restarted the satellite pods and got this state:

Image

Please note that only the recently resized PVC failed. None of the other replicated PVCs were ever resized, and none of them failed.

Logs from satellite:

Aligning /dev/nvme-lvm/monitoring-vmsingle-vmks_00000 size from 536985664 KiB to 536989696 KiB to be a multiple of extent size 4096 KiB (from Storage Pool)
Failed to adjust DRBD resource monitoring-vmsingle-vmks [Report number 678E103C-FA10E-000007]
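
For reference, the numbers in the "Aligning" log line can be checked by hand. The sketch below only redoes the rounding arithmetic from that line (rounding up to a multiple of the 4096 KiB extent size); the function name `align_up` is made up for illustration and is not taken from LINSTOR.

```python
def align_up(size_kib: int, extent_kib: int) -> int:
    """Round size_kib up to the next multiple of extent_kib."""
    # Ceiling division using integer arithmetic only.
    return -(-size_kib // extent_kib) * extent_kib


requested_kib = 536985664  # "from" size in the satellite log
extent_kib = 4096          # extent size reported for the storage pool

aligned_kib = align_up(requested_kib, extent_kib)

print(requested_kib % extent_kib)   # 64 KiB past the previous extent boundary
print(aligned_kib)                  # 536989696, the "to" size in the log
print(aligned_kib - requested_kib)  # 4032 KiB added by the alignment
```

So the aligned size is 4032 KiB larger than the size it started from. Whether that difference has anything to do with the failure is not established here.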

First error report:

ERROR REPORT 678E103C-FA10E-000000

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Satellite
Version:                            1.29.2
Build ID:                           372c916b7d97fa10e8ea480b66ea3da665ab5849
Build time:                         2024-11-05T11:22:22+00:00
Error time:                         2025-01-20 08:58:49
Node:                               w1
Thread:                             DeviceManager

============================================================

Reported error:
===============

Category:                           LinStorException
Class name:                         ResourceException
Class canonical name:               com.linbit.linstor.core.devmgr.exceptions.ResourceException
Generated at:                       Method 'adjustDrbd', Source file 'DrbdLayer.java', Line #744

Error message:                      Failed to adjust DRBD resource monitoring-vmsingle-vmks

Error context:
        An error occurred while processing resource 'Node: 'w1', Rsc: 'monitoring-vmsingle-vmks''
ErrorContext:


Call backtrace:

    Method                                   Native Class:Line number
    adjustDrbd                               N      com.linbit.linstor.layer.drbd.DrbdLayer:744
    processResource                          N      com.linbit.linstor.layer.drbd.DrbdLayer:245
    lambda$processResource$4                 N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:1027
    processGeneric                           N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:1070
    processResource                          N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:1023
    processResources                         N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:372
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:219
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:344
    phaseDispatchDeviceHandlers              N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:1215
    devMgrLoop                               N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:791
    run                                      N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:685
    run                                      N      java.lang.Thread:840

Caused by:
==========

Category:                           LinStorException
Class name:                         ExtCmdFailedException
Class canonical name:               com.linbit.extproc.ExtCmdFailedException
Generated at:                       Method 'execute', Source file 'DrbdAdm.java', Line #656

Error message:                      The external command 'drbdadm' exited with error code 1


ErrorContext:
  Description: Execution of the external command 'drbdadm' failed.
  Cause:       The external command exited with error code 1.
  Correction:  - Check whether the external program is operating properly.
- Check whether the command line is correct.
  Contact a system administrator or a developer if the command line is no longer valid
  for the installed version of the external program.
  Details:     The full command line executed was:
drbdadm -vvv adjust monitoring-vmsingle-vmks

The external command sent the following output data:
drbdsetup new-resource monitoring-vmsingle-vmks 1 --on-no-data-accessible=suspend-io --on-no-quorum=suspend-io --on-suspended-primary-outdated=force-secondary --quorum=majority
drbdsetup new-minor monitoring-vmsingle-vmks 1003 0
drbdsetup new-peer monitoring-vmsingle-vmks 0 --_name=w2 --verify-alg=crct10dif --shared-secret=0Rrhfi48cw1u1RBAYSZd --cram-hmac-alg=sha1 --max-buffers=80000 --protocol=A --rcvbuf-size=10485760 --rr-conflict=retry-connect --sndbuf-size=10485760
drbdsetup new-peer monitoring-vmsingle-vmks 2 --_name=w3 --verify-alg=crct10dif --shared-secret=0Rrhfi48cw1u1RBAYSZd --cram-hmac-alg=sha1 --max-buffers=80000 --protocol=A --rcvbuf-size=10485760 --rr-conflict=retry-connect --sndbuf-size=10485760
drbdsetup new-path monitoring-vmsingle-vmks 0 ipv4:10.201.0.11:7003 ipv4:10.201.0.12:7003
drbdsetup new-path monitoring-vmsingle-vmks 2 ipv4:10.201.0.11:7003 ipv4:10.201.0.13:7003
drbdsetup peer-device-options monitoring-vmsingle-vmks 0 0 --set-defaults --resync-rate=5000000 --c-plan-ahead=0 --c-min-rate=1000000 --c-max-rate=0 --c-fill-target=1024
drbdsetup peer-device-options monitoring-vmsingle-vmks 2 0 --set-defaults --resync-rate=5000000 --c-plan-ahead=0 --c-min-rate=1000000 --c-max-rate=0 --c-fill-target=1024 --bitmap=no
drbdmeta 1003 v09 /dev/nvme-lvm/monitoring-vmsingle-vmks_00000 internal repair-md
drbdmeta 1003 v09 /dev/nvme-lvm/monitoring-vmsingle-vmks_00000 internal apply-al


The external command sent the following error information:
New resource monitoring-vmsingle-vmks
New minor 1003 (vol:0)
No usable activity log found. Do you need to create-md?




Call backtrace:

    Method                                   Native Class:Line number
    execute                                  N      com.linbit.linstor.layer.drbd.utils.DrbdAdm:656
    adjust                                   N      com.linbit.linstor.layer.drbd.utils.DrbdAdm:125
    adjustDrbd                               N      com.linbit.linstor.layer.drbd.DrbdLayer:688
    processResource                          N      com.linbit.linstor.layer.drbd.DrbdLayer:245
    lambda$processResource$4                 N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:1027
    processGeneric                           N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:1070
    processResource                          N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:1023
    processResources                         N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:372
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:219
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:344
    phaseDispatchDeviceHandlers              N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:1215
    devMgrLoop                               N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:791
    run                                      N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:685
    run                                      N      java.lang.Thread:840


END OF ERROR REPORT.

Second error report:

ERROR REPORT 678E103C-FA10E-000001

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Satellite
Version:                            1.29.2
Build ID:                           372c916b7d97fa10e8ea480b66ea3da665ab5849
Build time:                         2024-11-05T11:22:22+00:00
Error time:                         2025-01-20 08:59:01
Node:                               w1
Thread:                             DeviceManager

============================================================

Reported error:
===============

Category:                           LinStorException
Class name:                         ResourceException
Class canonical name:               com.linbit.linstor.core.devmgr.exceptions.ResourceException
Generated at:                       Method 'adjustDrbd', Source file 'DrbdLayer.java', Line #744

Error message:                      Failed to adjust DRBD resource monitoring-vmsingle-vmks

Error context:
        An error occurred while processing resource 'Node: 'w1', Rsc: 'monitoring-vmsingle-vmks''
ErrorContext:


Call backtrace:

    Method                                   Native Class:Line number
    adjustDrbd                               N      com.linbit.linstor.layer.drbd.DrbdLayer:744
    processResource                          N      com.linbit.linstor.layer.drbd.DrbdLayer:245
    lambda$processResource$4                 N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:1027
    processGeneric                           N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:1070
    processResource                          N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:1023
    processResources                         N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:372
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:219
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:344
    phaseDispatchDeviceHandlers              N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:1215
    devMgrLoop                               N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:791
    run                                      N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:685
    run                                      N      java.lang.Thread:840

Caused by:
==========

Category:                           LinStorException
Class name:                         ExtCmdFailedException
Class canonical name:               com.linbit.extproc.ExtCmdFailedException
Generated at:                       Method 'execute', Source file 'DrbdAdm.java', Line #656

Error message:                      The external command 'drbdadm' exited with error code 1


ErrorContext:
  Description: Execution of the external command 'drbdadm' failed.
  Cause:       The external command exited with error code 1.
  Correction:  - Check whether the external program is operating properly.
- Check whether the command line is correct.
  Contact a system administrator or a developer if the command line is no longer valid
  for the installed version of the external program.
  Details:     The full command line executed was:
drbdadm -vvv adjust monitoring-vmsingle-vmks

The external command sent the following output data:
drbdmeta 1003 v09 /dev/nvme-lvm/monitoring-vmsingle-vmks_00000 internal repair-md
drbdmeta 1003 v09 /dev/nvme-lvm/monitoring-vmsingle-vmks_00000 internal apply-al


The external command sent the following error information:
 [ne] minor 1003 (vol:0) /dev/nvme-lvm/monitoring-vmsingle-vmks_00000 missing from kernel
No usable activity log found. Do you need to create-md?




Call backtrace:

    Method                                   Native Class:Line number
    execute                                  N      com.linbit.linstor.layer.drbd.utils.DrbdAdm:656
    adjust                                   N      com.linbit.linstor.layer.drbd.utils.DrbdAdm:125
    adjustDrbd                               N      com.linbit.linstor.layer.drbd.DrbdLayer:688
    processResource                          N      com.linbit.linstor.layer.drbd.DrbdLayer:245
    lambda$processResource$4                 N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:1027
    processGeneric                           N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:1070
    processResource                          N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:1023
    processResources                         N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:372
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:219
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:344
    phaseDispatchDeviceHandlers              N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:1215
    devMgrLoop                               N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:791
    run                                      N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:685
    run                                      N      java.lang.Thread:840


END OF ERROR REPORT.

My configuration:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nvme-lvm-replicated-async
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: linstor.csi.linbit.com
parameters:
  linstor.csi.linbit.com/usePvcName: "true"
  linstor.csi.linbit.com/storagePool: nvme-lvm
  linstor.csi.linbit.com/autoPlace: "2"
  linstor.csi.linbit.com/layerList: "drbd storage"
  linstor.csi.linbit.com/allowRemoteVolumeAccess: "false"
  linstor.csi.linbit.com/mountOpts: discard
  property.linstor.csi.linbit.com/DrbdOptions/Net/protocol: "A"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
apiVersion: piraeus.io/v1
kind: LinstorSatelliteConfiguration
metadata:
  name: talos-loader-override
spec:
  podTemplate:
    spec:
      hostNetwork: true
      initContainers:
        - name: drbd-shutdown-guard
          $patch: delete
        - name: drbd-module-loader
          $patch: delete
      volumes:
        - name: run-systemd-system
          $patch: delete
        - name: run-drbd-shutdown-guard
          $patch: delete
        - name: systemd-bus-socket
          $patch: delete
        - name: lib-modules
          $patch: delete
        - name: usr-src
          $patch: delete
        - name: etc-lvm-backup
          hostPath:
            path: /var/etc/lvm/backup
            type: DirectoryOrCreate
        - name: etc-lvm-archive
          hostPath:
            path: /var/etc/lvm/archive
            type: DirectoryOrCreate

  storagePools:
    - name: nvme-lvm
      lvmPool: {}
      source:
        hostDevices:
          - /dev/nvme0n1
          - /dev/nvme1n1
      properties:
        - name: StorDriver/LvcreateOptions
          value: "-i 2"
---
apiVersion: piraeus.io/v1
kind: LinstorCluster
metadata:
  name: linstorcluster
spec:
  properties:
    - name: DrbdOptions/Disk/disk-flushes
      value: "no"
    - name: DrbdOptions/Disk/md-flushes
      value: "no"
    - name: DrbdOptions/Net/max-buffers
      value: "80000"
    - name: DrbdOptions/Net/rcvbuf-size
      value: "10485760"
    - name: DrbdOptions/Net/sndbuf-size
      value: "10485760"
    - name: DrbdOptions/PeerDevice/c-fill-target
      value: "1024"
    - name: DrbdOptions/PeerDevice/c-max-rate
      value: "0"
    - name: DrbdOptions/PeerDevice/c-min-rate
      value: "1000000"
    - name: DrbdOptions/PeerDevice/resync-rate
      value: "5000000"
    - name: DrbdOptions/PeerDevice/c-plan-ahead
      value: "0"
    - name: DrbdOptions/auto-quorum
      value: "suspend-io"
    - name: DrbdOptions/Resource/on-no-data-accessible
      value: "suspend-io"
    - name: DrbdOptions/Resource/on-suspended-primary-outdated
      value: "force-secondary"
    - name: DrbdOptions/Net/rr-conflict
      value: "retry-connect"

WanzenBug (Member) commented

Hmm, not sure what is going on here. It looks like the LVM volume was recreated? Or the resize did not complete before the reboot, and now it cannot find the "old" metadata.

As a quick workaround, I would suggest deleting and recreating the resource on the node.

maxpain (Author) commented Jan 21, 2025

Yes, I recreated the resource. But it's weird behavior.
