PVC deletion error when using the kube backend #301

Open · JSKenyon opened this issue May 20, 2024 · 1 comment
@JSKenyon (Collaborator)

Occasionally I see error messages which look like this:

```
2024-05-20 16:02:33 STIMELA.kube ERROR: k8s API error while deleting PVC 'wsclean-temp-aff8e055'
──────────────────────────── detailed error report follows ────────────────────────────
        ⚠ k8s API error while deleting PVC 'wsclean-temp-aff8e055'
        ├── ApiException: (404)
        │   Reason: Not Found
        │   HTTP response headers: HTTPHeaderDict({'Audit-Id': '999f22db-468f-4bc0-be89-d51867837c4d', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json',
        │   'X-Kubernetes-Pf-Flowschema-Uid': 'dbf2ccb2-e6d3-4c03-9de0-69dbbada21da', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'd7fa6cb1-b697-4e51-b97b-da0d669a1b6f', 'Date': 'Mon, 20 May
        │   2024 14:02:33 GMT', 'Content-Length': '246'})
        │   HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"persistentvolumeclaims \"wsclean-temp-aff8e055\" not
        │   found","reason":"NotFound","details":{"name":"wsclean-temp-aff8e055","kind":"persistentvolumeclaims"},"code":404}
        ├── kind: Status
        ├── apiVersion: v1
        ├── metadata:
        ├── status: Failure
        ├── message: persistentvolumeclaims "wsclean-temp-aff8e055" not found
        ├── reason: NotFound
        ├── details:
        │   ├── name: wsclean-temp-aff8e055
        │   └── kind: persistentvolumeclaims
        └── code: 404
```

I do not yet have a consistent reproducer, but I believe it may have something to do with the temporary volume being brought down automatically when the job pod finishes (perhaps because of the `lifecycle: step` configuration). Consequently, when Stimela attempts cleanup there is no PVC to delete, and the above error occurs. For reference, the temporary volume was configured as follows:

```yaml
cabs:
  wsclean:
    backend:
      kube:
        volumes:
          wsclean-temp:
            storage_class_name: rarg-test-compute-ebs-sc-immediate-gp3-500
            capacity: 500Gi
            lifecycle: step
            mount: /scratch
            at_start: allow_reuse
            access_modes: [ReadWriteOnce]
```
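
For illustration, here is a minimal sketch of how a cleanup path could tolerate this benign 404, using the standard `kubernetes` Python client. This is not Stimela's actual cleanup code, and `delete_pvc_if_present` is a hypothetical helper:

```python
from kubernetes import client, config
from kubernetes.client.rest import ApiException


def delete_pvc_if_present(namespace: str, name: str) -> None:
    """Delete a PVC, treating NotFound as success (it was already reclaimed)."""
    v1 = client.CoreV1Api()
    try:
        v1.delete_namespaced_persistent_volume_claim(name=name, namespace=namespace)
    except ApiException as exc:
        if exc.status == 404:
            # The PVC is already gone, e.g. a lifecycle: step volume that was
            # reclaimed when its step finished -- the benign case reported above.
            return
        raise  # any other API error is real and should propagate


if __name__ == "__main__":
    config.load_kube_config()  # or load_incluster_config() when running in a pod
    delete_pvc_if_present("default", "wsclean-temp-aff8e055")
```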
@o-smirnov o-smirnov self-assigned this May 20, 2024
@o-smirnov o-smirnov added the bug Something isn't working label May 20, 2024
@o-smirnov (Member)

Good point: if it's defined with `lifecycle: step`, the k8s backend's got no business trying to delete it at the end of the session; the wretched thing should have been dead and buried by then. I suspect this is a case of the cleanup code being both overzealous and insufficiently clever. There are awkward edges in the k8s API where a resource continues to be returned by `list_namespaced_xxx()` even though it's in "Terminating" state, and one needs to jump through extra hoops to detect this condition. I see an attempted jump here, which may be insufficiently jump-y...
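
As a rough sketch of that quirk (assuming the standard `kubernetes` Python client; `live_pvc_names` is a hypothetical helper, not the backend's actual code): a Terminating PVC is still returned by the list call, but carries a non-None `metadata.deletion_timestamp`, which cleanup code can check before issuing a redundant delete:

```python
from kubernetes import client, config


def live_pvc_names(namespace: str) -> list[str]:
    """Names of PVCs in the namespace that are not already Terminating."""
    v1 = client.CoreV1Api()
    pvcs = v1.list_namespaced_persistent_volume_claim(namespace=namespace)
    return [
        pvc.metadata.name
        for pvc in pvcs.items
        # A Terminating PVC is still listed, but has deletion_timestamp set;
        # skipping it avoids a second delete that can race into a 404.
        if pvc.metadata.deletion_timestamp is None
    ]


if __name__ == "__main__":
    config.load_kube_config()
    print(live_pvc_names("default"))
```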
