Snapshots not deleted properly, causing orphaned snapshots and eventual system overload #290
Comments
From the logs, it seems there is an issue when cleaning up some metadata about the snapshots. This then causes the snapshot to be considered "not ready", and perhaps this messes up the retry logic of the snapshot provisioner. What I do find strange is that I do not see any logs for an attempt to
A day later, I have some error reports that I was able to inspect:
Hmm, it could be that this:
Triggers the error in the database, i.e. LINSTOR does not properly clean up the resources because of an error with the DRBD metadata. I don't know why LINSTOR thinks it needs to create metadata to delete the volume... Perhaps we should move this issue to linbit/linstor-server
That resource also seems to be in a dysfunctional state:
And the error report:
I'm fine with moving this to linbit/linstor-server. Should I just open an issue referencing this one?
We’ve encountered a persistent issue where snapshots are not being properly deleted from the LINSTOR system, resulting in a large number of orphaned snapshots that are putting significant strain on our Kubernetes cluster. This problem has caused severe performance degradation and may have contributed to recent crashes in our LINSTOR controller.
Last week, our cluster went down, likely due to this issue. When the cluster came back online, the LINSTOR controller was unable to start as the datastore seemed to have been corrupted. This issue has persisted across multiple controller restarts. I initially reported this on the LINSTOR forum, where I also outlined the steps I took to get the controller running again.
Context
We are creating hourly snapshots via Velero, which are retained for 7 days. However, many snapshots are not being deleted correctly from LINSTOR, leading to a significant buildup of orphaned snapshots. Despite using a VolumeSnapshotClass with the deletion policy set to Delete, these snapshots remain in the LINSTOR system even after the corresponding VolumeSnapshotContent and PVC objects have been deleted in Kubernetes.
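For completeness, the deletion policy can be double-checked with something like the following plain kubectl query (adjust for your own class names):

```shell
# Confirm that the VolumeSnapshotClass really requests deletion of the
# backing snapshot when the VolumeSnapshot/VolumeSnapshotContent is removed
kubectl get volumesnapshotclass \
  -o custom-columns=NAME:.metadata.name,DRIVER:.driver,POLICY:.deletionPolicy
```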
Over time, a large number of snapshots (approximately 2500+) accumulated in the LINSTOR system, though the corresponding PVCs and VolumeSnapshotContent objects no longer existed.
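A rough way to see the mismatch is to compare what Kubernetes still references with what LINSTOR still holds; the grep pattern below assumes the usual snapshot-&lt;uid&gt; names created by the CSI snapshotter:

```shell
# Snapshot contents Kubernetes still knows about
kubectl get volumesnapshotcontents --no-headers | wc -l

# Snapshots still present in LINSTOR; a much larger number here points at orphans
linstor snapshot list | grep -c 'snapshot-'
```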
Upon investigation, I found that our cluster had over 30,000 PropsContainer records related to these orphaned snapshots, which made operations slow and timeouts more frequent. This likely contributed to LINSTOR controller crashes and resource corruption. Running the command kubectl get propscontainers.internal.linstor.linbit.com | wc -l took more than 40 seconds to complete.
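The timing was taken with a plain count of the CRs, roughly:

```shell
# With ~30,000 PropsContainer records this took more than 40 seconds
time kubectl get propscontainers.internal.linstor.linbit.com --no-headers | wc -l
```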
I eventually used a script to manually clean up the orphaned snapshots (a sketch of the approach is included at the end of this report), which reduced the PropsContainer records to around 838. However, the root cause of the snapshot deletion failure persists. One week later, the issue has led to the following state:
linstor-csi-controller logs and a snapshot of the resources and snapshots
Unfortunately, the LINSTOR controller restarted, which prevents me from fetching the error reports listed in the logs.
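Normally those reports could be pulled from the controller with the commands below; after the restart they were no longer available (I assume the report files were not persisted across the restart):

```shell
# List the error reports currently held by the controller
linstor error-reports list

# Show a specific report by the ID referenced in the CSI log
linstor error-reports show <REPORT_ID>
```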
linstor-csi.log
resources.txt
snapshots.txt
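For reference, the manual cleanup mentioned above followed roughly this pattern. This is a sketch, not the exact script: it assumes the LINSTOR snapshot names match the snapshotHandle of the corresponding VolumeSnapshotContent objects, and the JSON shape of the machine-readable linstor output differs between client versions, so the jq filter may need adjusting. Review the orphan list before enabling the delete.

```shell
#!/usr/bin/env bash
# Sketch: delete LINSTOR snapshots that no VolumeSnapshotContent refers to anymore.
set -euo pipefail

# Snapshot handles Kubernetes still references
kubectl get volumesnapshotcontents \
  -o jsonpath='{range .items[*]}{.status.snapshotHandle}{"\n"}{end}' \
  | sort -u > /tmp/k8s-handles.txt

# Snapshots LINSTOR still holds, as "resource snapshot" pairs
# (adjust the jq filter to the JSON layout of your linstor client version)
linstor -m snapshot list \
  | jq -r '.[][] | "\(.resource_name) \(.name)"' > /tmp/linstor-snaps.txt

while read -r resource snapshot; do
  if ! grep -qx "$snapshot" /tmp/k8s-handles.txt; then
    echo "orphan: $resource/$snapshot"
    # Uncomment once the list above has been verified:
    # linstor snapshot delete "$resource" "$snapshot"
  fi
done < /tmp/linstor-snaps.txt
```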