job logs: indicate sha1 of running image #349

Open
tiborsimko opened this issue Dec 1, 2021 · 2 comments

@tiborsimko
Member

Current behaviour

It happens that when users use non-semantically-versioned environment images such as myenvironment:latest or myenvironment:master, and they update this image using the same image tag, the cluster nodes won't pull the new version because of the usual IfNotPresent image pull policy.

It can then happen that some cluster nodes have the "old" version of the image, while other cluster nodes have the "new" version, leading to seemingly random workflow run failures.

Currently, it is not easy for the user to detect these situations, because REANA does not expose in the job logs which image sha1 was actually used for the job. The cluster administrators can check and rectify this easily by removing the images on the nodes, which forces a re-pull of the image for the next run, for example by running the following one-liner:

$ for node in $(kubectl get nodes -l reana.io/system=runtimejobs | awk '{print $1;}'); do ssh -q -i ~/.ssh/myaccount.pem -o StrictHostKeyChecking=no core@$node 'sudo crictl rmi myenvironment:latest'; done
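
To diagnose the situation before removing anything, one can also list which digest of the image each runtime-job node currently has cached. This is only a sketch under the same assumptions as the one-liner above (ssh access to the nodes and crictl available there), with myenvironment as a placeholder image name:

# print the cached digest of the image on every runtime-job node
$ for node in $(kubectl get nodes -l reana.io/system=runtimejobs | awk '{print $1;}'); do echo "== $node"; ssh -q -i ~/.ssh/myaccount.pem -o StrictHostKeyChecking=no core@$node 'sudo crictl images --digests | grep myenvironment'; done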

However, we can perhaps do something better to help the users.

Expected behaviour

Ideally, we should display in the job logs that the job was run using the image myenvironment:latest with the sha1 of such-and-such value:

==> Workflow ID: 29f6859f-1389-4266-98f2-41df346cc000
==> Compute backend: Kubernetes
==> Job ID: reana-run-job-261a4396-5ffb-4e9b-953e-3be52a0faa18
==> Docker image: myenvironment:latest (9259e42215ab)

We could perhaps even consider exposing the node name where the job runs, which could be useful for forensics, for example when CephFS CSI plugins are down on some nodes, etc.
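
Both pieces of information are already recorded in the pod status, so they could simply be read from there once the job starts. A minimal sketch, assuming kubectl access to the runtime namespace, with <run-job-pod> standing for the pod created for the run-job above:

# imageID carries the digest that was actually used; nodeName is where the job ran
$ kubectl get pod <run-job-pod> -o jsonpath='{.status.containerStatuses[0].imageID}{"\n"}{.spec.nodeName}{"\n"}'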

@VMois

VMois commented Dec 1, 2021

Suggestion: alternatively, we can change IfNotPresent to Always. k8s will compare the image digest (hash): if the image is cached locally, it will use the local image; if it is not cached or the digests differ, it will pull a new image from the registry (docs).

If Always is used, it will probably add overhead to the k8s nodes from querying the registry to check whether the cached image is the same as the one in the registry (one HTTP request, I guess). Not sure how much this will affect pod start-up time.
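
For illustration, the comparison in question boils down to checking the manifest digest of the tag in the registry against the digest cached on the node. A rough sketch, assuming skopeo is available locally and crictl on the node (docker.io/myuser/myenvironment:latest and <node> are placeholders); this is not literally what the kubelet runs:

# digest of the tag as currently published in the registry
$ skopeo inspect docker://docker.io/myuser/myenvironment:latest | grep Digest
# digest of the image cached on a given node
$ ssh core@<node> 'sudo crictl inspecti myenvironment:latest | grep -A1 repoDigests'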

But regarding adding the image tag and digest to the logs, I think it is a good idea overall. Not quite sure about exposing the node names, as it can potentially be a security issue (?).

@tiborsimko
Member Author

Always will bring some overhead, which may be considerable in the case of multi-GiB-large particle physics images... Hence we opted for IfNotPresent as the default, together with promoting semantic versioning of docker images, which is the best for ensuring reproducibility anyway! The reana-client validate command also checks for the most commonly used latest tag, but it doesn't catch everything.
So yes, hopefully we can stay on IfNotPresent... But switching to Always via Helm values is always an option.
