
[backend] controller-manager crashes in 1.5.0-rc.1 #5411

Closed
juliusvonkohout opened this issue Apr 1, 2021 · 9 comments

juliusvonkohout commented Apr 1, 2021

Environment

export PIPELINE_VERSION=1.5.0-rc.1
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/dev?ref=$PIPELINE_VERSION"

kubectl scale deployment/cache-deployer-deployment --replicas 0 -n kubeflow
kubectl scale deployment/cache-server --replicas 0 -n kubeflow

Steps to reproduce

[julius@fedora kubeflow]$ kubectl -n kubeflow get pods,replicasets
NAME                                                   READY   STATUS             RESTARTS   AGE
pod/controller-manager-6d7b565545-zrhrb                0/1     CrashLoopBackOff   8          18m
pod/metadata-envoy-deployment-797b6886b7-r8478         1/1     Running            0          20m
pod/metadata-grpc-deployment-76b64ffc4f-h99sf          1/1     Running            2          20m
pod/metadata-writer-579f577c59-9xvrc                   1/1     Running            0          20m
pod/minio-5b65df66c9-wsjms                             1/1     Running            0          20m
pod/ml-pipeline-647d5c6c46-8qrd4                       1/1     Running            1          20m
pod/ml-pipeline-persistenceagent-75fb75b66c-rr4g2      1/1     Running            0          20m
pod/ml-pipeline-scheduledworkflow-7cf474cd6d-qtm9x     1/1     Running            0          20m
pod/ml-pipeline-ui-7f97cdb4cd-96z5c                    1/1     Running            0          20m
pod/ml-pipeline-viewer-crd-5f66b89768-httx2            1/1     Running            0          20m
pod/ml-pipeline-visualizationserver-656d556bdc-trmcb   1/1     Running            0          20m
pod/mysql-f7b9b7dd4-mwb49                              1/1     Running            0          20m
pod/sum-pipeline-x64k4-133507279                       0/2     Completed          0          10m
pod/sum-pipeline-x64k4-1823999632                      0/2     Completed          0          10m
pod/sum-pipeline-x64k4-2485853679                      0/2     Completed          0          9m36s
pod/sum-pipeline-x64k4-2652093435                      0/2     Completed          0          11m
pod/workflow-controller-5f9c8ff668-w7vgw               1/1     Running            0          4m15s


...

[julius@fedora kubeflow]$ kubectl -n kubeflow logs pod/controller-manager-6d7b565545-zrhrb
/bin/sh: 2: /root/manager: Permission denied

(Further logs omitted; the volume is excessive.) I had to modify the deployment to get at this output.
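For reference, a minimal sketch of the kind of modification I mean (the container index, the sleep duration, and the exec invocation are assumptions; adjust to your pod spec):

# Keep the container alive instead of exec'ing the crashing binary,
# then exec in and inspect the filesystem by hand. Assumes the manager
# is the first (index 0) container in the pod template.
kubectl -n kubeflow patch deployment controller-manager --type=json -p='[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/command",
   "value": ["/bin/sh", "-c", "sleep 3600"]}
]'
kubectl -n kubeflow exec -it deploy/controller-manager -- ls -la /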



juliusvonkohout changed the title from [backend] controller-manager crashes in 1.5.0-rc1 to [backend] controller-manager crashes in 1.5.0-rc.1 on Apr 1, 2021

juliusvonkohout commented Apr 1, 2021

@davidspek @Bobgy it seems someone built a new image that cannot run as non-root (runAsNonRoot):
kubeflow/manifests#1756

Can you update gcr.io/ml-pipeline/application-crd-controller:1.0-beta-non-cluster-role to a newer version? According to this comment in the manifests, it comes from https://github.com/kubernetes-sigs/application:

# A customized image with https://github.com/kubernetes-sigs/application/pull/127

Maybe that will fix the issue; this image is very old (2019).
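For context, this is the kind of pod-level securityContext that triggers the crash (a sketch; the UID is arbitrary and the actual manifests may phrase it differently):

# With the process forced to a non-root UID (GID 0), /root/manager is
# unreachable, because /root in the image is mode 0700, owned root:root.
kubectl -n kubeflow patch deployment controller-manager -p '
{"spec": {"template": {"spec": {"securityContext":
  {"runAsNonRoot": true, "runAsUser": 1000, "runAsGroup": 0}}}}}'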

juliusvonkohout (Member Author) commented:

The permissions in the base image seem to be wrong:

drwx------.   1 root root  14 Nov 11  2019 root

is the problem. Someone forgot to give group 0 read/execute access on that directory (a possible fix is sketched after the full listing below).

Full output

dr-xr-xr-x.   1 root root   6 Apr  1 11:45 .
dr-xr-xr-x.   1 root root   6 Apr  1 11:45 ..
drwxr-xr-x.   1 root root 902 Oct 29  2019 bin
drwxr-xr-x.   1 root root   0 Apr 24  2018 boot
drwxr-xr-x.   5 root root 360 Apr  1 11:45 dev
drwxr-xr-x.   1 root root  14 Oct 31  2019 etc
drwxr-xr-x.   1 root root   0 Apr 24  2018 home
drwxr-xr-x.   1 root root  84 May 23  2017 lib
drwxr-xr-x.   1 root root  40 Oct 29  2019 lib64
drwxr-xr-x.   1 root root   0 Oct 29  2019 media
drwxr-xr-x.   1 root root   0 Oct 29  2019 mnt
drwxr-xr-x.   1 root root   0 Oct 29  2019 opt
dr-xr-xr-x. 514 root root   0 Apr  1 11:45 proc
drwx------.   1 root root  14 Nov 11  2019 root
drwxr-xr-x.   1 root root  14 Apr  1 11:45 run
drwxr-xr-x.   1 root root  14 Oct 31  2019 sbin
drwxr-xr-x.   1 root root   0 Oct 29  2019 srv
dr-xr-xr-x.  13 root root   0 Mar 28 21:41 sys
drwxrwxrwt.   1 root root   0 Oct 29  2019 tmp
drwxr-xr-x.   1 root root   8 Oct 29  2019 usr
drwxr-xr-x.   1 root root   6 Oct 29  2019 var
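A possible fix when rebuilding the image would be to relax those permissions, or to move the binary out of /root entirely (a sketch only; the base tag comes from the manifest comment above, everything else is an assumption):

# Hypothetical rebuild: let group 0 traverse /root and execute the manager.
cat > Dockerfile <<'EOF'
FROM gcr.io/ml-pipeline/application-crd-controller:1.0-beta-non-cluster-role
RUN chmod g+rx /root /root/manager
# Alternatively, copy the binary somewhere world-readable:
# RUN cp /root/manager /manager
EOF
docker build -t application-crd-controller:nonroot-fix .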


Bobgy commented Apr 1, 2021

@juliusvonkohout the image is only used to provide k8s application monitoring; it's an optional component. Maybe remove it from your deployment?


Bobgy commented Apr 1, 2021

kubectl apply -k github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic?ref=$PIPELINE_VERSION

Use this overlay instead, and it will no longer get installed.
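If env/dev was already applied, the controller that is already running can presumably be removed by hand as well (the deployment name is taken from the pod listing above):

# Remove the optional application controller from an existing install;
# per the discussion below, pipelines keep working without it.
kubectl -n kubeflow delete deployment controller-manager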

juliusvonkohout (Member Author) commented:

@Bobgy Yes, the pipelines still work without it. But there are still two questions:

  1. Will it be included in the full Kubeflow 1.3 by default? If yes, it must be fixed.

  2. Should the documentation be updated then? More and more clusters will enforce security policies, so in the end I strongly prefer a proper solution: either removing it or fixing it.

I would like to work on a solution, if you propose one.


Bobgy commented Apr 2, 2021

> Yes, the pipelines still work without it. But there are still two questions:
>
> 1. Will it be included in the full Kubeflow 1.3 by default? If yes, it must be fixed.

No, it won't be for pipelines, but you may find another copy in the kubeflow/manifests repo. It might be deployed there by default, so some applications rely on it; I'd suggest confirming with @yanniszark, the release manager.

> 2. Should the documentation be updated then? More and more clusters will enforce security policies, so in the end I strongly prefer a proper solution: either removing it or fixing it.

Do you mean the KFP standalone documentation? Yes, I'd love to see an update. Do you want to contribute this?

> I would like to work on a solution, if you propose one.


stale bot commented Jul 8, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added, then removed, the lifecycle/stale label on Jul 8, 2021

Bobgy commented Jul 11, 2021

Hi @juliusvonkohout, metacontroller is not the application controller (this issue's topic). Did you confuse the two?

juliusvonkohout (Member Author) commented:

So far it works in 1.4.
