PVC deletion error when using the kube backend #301

Open · JSKenyon opened this issue May 20, 2024 · 1 comment
@JSKenyon (Collaborator)

Occasionally I see error messages which look like this:

```
2024-05-20 16:02:33 STIMELA.kube ERROR: k8s API error while deleting PVC 'wsclean-temp-aff8e055'
──────────────────────────── detailed error report follows ────────────────────────────
        ⚠ k8s API error while deleting PVC 'wsclean-temp-aff8e055'
        ├── ApiException: (404)
        │   Reason: Not Found
        │   HTTP response headers: HTTPHeaderDict({'Audit-Id': '999f22db-468f-4bc0-be89-d51867837c4d', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json',
        │   'X-Kubernetes-Pf-Flowschema-Uid': 'dbf2ccb2-e6d3-4c03-9de0-69dbbada21da', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'd7fa6cb1-b697-4e51-b97b-da0d669a1b6f', 'Date': 'Mon, 20 May
        │   2024 14:02:33 GMT', 'Content-Length': '246'})
        │   HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"persistentvolumeclaims \"wsclean-temp-aff8e055\" not
        │   found","reason":"NotFound","details":{"name":"wsclean-temp-aff8e055","kind":"persistentvolumeclaims"},"code":404}
        ├── kind: Status
        ├── apiVersion: v1
        ├── metadata:
        ├── status: Failure
        ├── message: persistentvolumeclaims "wsclean-temp-aff8e055" not found
        ├── reason: NotFound
        ├── details:
        │   ├── name: wsclean-temp-aff8e055
        │   └── kind: persistentvolumeclaims
        └── code: 404
```

I do not yet have a consistent reproducer, but I believe it may have something to do with the temporary volume being brought down automatically when the job pod finishes (perhaps because of the `lifecycle: step` configuration). Consequently, when Stimela attempts cleanup there is no PVC to delete, and the above error occurs. For reference, the temporary volume was configured as follows:

```yaml
cabs:
  wsclean:
    backend:
      kube:
        volumes:
          wsclean-temp:
            storage_class_name: rarg-test-compute-ebs-sc-immediate-gp3-500
            capacity: 500Gi
            lifecycle: step
            mount: /scratch
            at_start: allow_reuse
            access_modes: [ReadWriteOnce]
```
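
For illustration, here is a minimal sketch of how a cleanup path could tolerate this benign 404, using the standard `kubernetes` Python client. This is not Stimela's actual cleanup code, and `delete_pvc_if_present` is a hypothetical helper:

```python
from kubernetes import client, config
from kubernetes.client.rest import ApiException


def delete_pvc_if_present(namespace: str, name: str) -> None:
    """Delete a PVC, treating NotFound as success (it was already reclaimed)."""
    v1 = client.CoreV1Api()
    try:
        v1.delete_namespaced_persistent_volume_claim(name=name, namespace=namespace)
    except ApiException as exc:
        if exc.status == 404:
            # The PVC is already gone, e.g. a lifecycle: step volume that was
            # reclaimed when its step finished -- the benign case reported above.
            return
        raise  # any other API error is real and should propagate


if __name__ == "__main__":
    config.load_kube_config()  # or load_incluster_config() when running in a pod
    delete_pvc_if_present("default", "wsclean-temp-aff8e055")
```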
@o-smirnov o-smirnov self-assigned this May 20, 2024
@o-smirnov o-smirnov added the bug Something isn't working label May 20, 2024
@o-smirnov (Member)

Good point: if it's defined with `lifecycle: step`, the k8s backend's got no business trying to delete it at the end of the session; the wretched thing should have been dead and buried by then. I suspect this is a case of the cleanup code being both overzealous and insufficiently clever. There are awkward edges in the k8s API where a resource continues to be returned by `list_namespaced_xxx()` even though it's in "Terminating" state, and one needs to jump through extra hoops to detect this condition. I see an attempted jump here, which may be insufficiently jump-y...
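
As a rough sketch of that quirk (assuming the standard `kubernetes` Python client; `live_pvc_names` is a hypothetical helper, not the backend's actual code): a Terminating PVC is still returned by the list call, but carries a non-None `metadata.deletion_timestamp`, which cleanup code can check before issuing a redundant delete:

```python
from kubernetes import client, config


def live_pvc_names(namespace: str) -> list[str]:
    """Names of PVCs in the namespace that are not already Terminating."""
    v1 = client.CoreV1Api()
    pvcs = v1.list_namespaced_persistent_volume_claim(namespace=namespace)
    return [
        pvc.metadata.name
        for pvc in pvcs.items
        # A Terminating PVC is still listed, but has deletion_timestamp set;
        # skipping it avoids a second delete that can race into a 404.
        if pvc.metadata.deletion_timestamp is None
    ]


if __name__ == "__main__":
    config.load_kube_config()
    print(live_pvc_names("default"))
```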
