502 bad gateway for workspace health check on Minikube when using UBI9 #23179

Open
AObuchow opened this issue Oct 4, 2024 · 9 comments
Labels
area/chectl Issues related to chectl, the CLI of Che area/devworkspace-operator kind/bug Outline of a bug - must adhere to the bug report template. severity/P2 Has a minor but important impact to the usage or development of the system. team/A This team is responsible for the Che Operator and all its operands as well as chectl and Hosted Che

Comments

@AObuchow

AObuchow commented Oct 4, 2024

Describe the bug

When Eclipse Che is deployed on Minikube, certain non-UDI-based devfile samples don't start up properly. Their health check keeps returning a 502 Bad Gateway. The same devfiles work when Eclipse Che is deployed on OpenShift. Oddly enough, if you change the workspace on Minikube so that it uses the UDI quay.io/devfile/universal-developer-image:ubi8-latest as the tooling image, it seems to start up successfully.

I'm not sure yet if:

  • The UDI has something that the (UBI9-based) images used in the devfile samples are missing
  • There are differences in how ingresses and routes are created by the Che Router that are causing this bug
  • Something else (or a combination of the above)

Che version

7.92@latest

Steps to reproduce

  1. Install Che on Minikube (I've been creating minikube instances with increased storage space to ensure there's enough space for multiple images being pulled): minikube start --disk-size 50000mb
  2. Install Che using ./build/scripts/minikube-tests/test-operator-from-sources.sh from the Che Operator repo.
  3. Log in to Che (I logged in as user1)
  4. Create a workspace from the following devfile. It's based on the python flask sample from the devfile registry:
schemaVersion: 2.2.2
metadata:
  name: python
  displayName: Python
  description: "Python (version 3.9.x) is an interpreted, object-oriented, high-level programming language with dynamic semantics.
    Its high-level built in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together."
  icon: https://raw.githubusercontent.com/devfile-samples/devfile-stack-icons/main/python.svg
  tags:
    - Python
    - Pip
    - Flask
  projectType: Python
  language: Python
  provider: Red Hat
  version: 3.1.0
projects:
  - name: flask-example
    git:
      remotes:
        origin: https://github.com/devfile-samples/python-ex
components:
  - name: py
    container:
      image: registry.access.redhat.com/ubi9/python-39:1-192
      args: ['tail', '-f', '/dev/null']
      mountSources: true
      endpoints:
        - name: https-python
          targetPort: 8080
          protocol: http
          secure: true
          attributes:
            discoverable: true
        - exposure: none
          name: debug
          targetPort: 5858
      env:
        - name: DEBUG_PORT
          value: '5858'
commands:
  - id: pip-install-requirements
    exec:
      commandLine: pip install -r requirements.txt
      workingDir: ${PROJECT_SOURCE}
      group:
        kind: build
        isDefault: true
      component: py
  - id: run-app
    exec:
      commandLine: 'python app.py'
      workingDir: ${PROJECT_SOURCE}
      component: py
      group:
        kind: run
        isDefault: true
  - id: debug-py
    exec:
      commandLine: 'pip install debugpy && python -m debugpy --listen 0.0.0.0:${DEBUG_PORT} app.py'
      workingDir: ${PROJECT_SOURCE}
      component: py
      group:
        kind: debug
  5. The workspace never starts up and stays stuck at the "waiting for editor to start" step. Trying to curl the mainURL from the DevWorkspace returns a 502 Bad Gateway.

Note:

  • Using the UDI quay.io/devfile/universal-developer-image:ubi8-latest as the container image (from the Dashboard) results in the workspace starting up successfully
  • Using another UBI9 based image (e.g. registry.access.redhat.com/ubi9/python-39:1-197.1726664308) causes the workspace startup to fail with a failed postStart event. Investigation needs to be done to see which postStart event from the Che Code editor devfile is failing.

Expected behavior

The workspace should start up successfully and the devworkspace's mainURL should give a 200 response when curl'ing it

Runtime

minikube

Screenshots

No response

Installation method

chectl/next, other (please specify in additional context)

Environment

Linux

Eclipse Che Logs

In the ingress-nginx-controller logs, you'll repeatedly see the following when the dashboard reports it is waiting for the workspace editor to start up:

ingress-nginx-controller-768f948f8f-kkfbv 10.244.0.1 - - [03/Oct/2024:01:33:49 +0000] "GET /user1/python/3100/healthz HTTP/2.0" 502 11 "-" "Go-http-client/2.0" 6 0.003 [eclipse-che-che-gateway-8080] [] 10.244.0.22:8080 11 0.003 502 39b3764a10341ac4068b2b5006253895
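
To make the log line above easier to read, here is a minimal Python sketch that pulls out the request path, the status returned to the client, and the upstream fields. It assumes the line follows the default ingress-nginx access log layout (status, body bytes, referer, user agent, request length, request time, upstream name, alternative upstream name, upstream address, upstream response length, upstream response time, upstream status); the `parse` helper and the `PATTERN` name are illustrative only.

```python
import re

# The access log line quoted above, copied verbatim.
LINE = (
    'ingress-nginx-controller-768f948f8f-kkfbv 10.244.0.1 - - '
    '[03/Oct/2024:01:33:49 +0000] "GET /user1/python/3100/healthz HTTP/2.0" '
    '502 11 "-" "Go-http-client/2.0" 6 0.003 '
    '[eclipse-che-che-gateway-8080] [] 10.244.0.22:8080 11 0.003 502 '
    '39b3764a10341ac4068b2b5006253895'
)

# Assumption: default ingress-nginx log format. A customized
# log-format-upstream setting would need a different pattern.
PATTERN = re.compile(
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \d+ '
    r'"[^"]*" "[^"]*" \S+ \S+ '
    r'\[(?P<upstream_name>[^\]]*)\] \[[^\]]*\] '
    r'(?P<upstream_addr>\S+) \S+ \S+ (?P<upstream_status>\S+)'
)


def parse(line):
    """Return the named fields of one access log line, or None if it does not match."""
    match = PATTERN.search(line)
    return match.groupdict() if match else None


fields = parse(LINE)
print(fields["path"], fields["status"], fields["upstream_name"],
      fields["upstream_addr"], fields["upstream_status"])
```

If that layout assumption holds, the upstream status of 502 means the response came back from the che-gateway service at 10.244.0.22:8080 rather than being generated by the ingress controller itself.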

Additional context

I installed Che using ./build/scripts/minikube-tests/test-operator-from-sources.sh from the Che Operator repo. Whether this also happens with a chectl install still needs to be verified.

@AObuchow AObuchow added kind/bug Outline of a bug - must adhere to the bug report template. area/che-operator Issues and PRs related to Eclipse Che Kubernetes Operator labels Oct 4, 2024
@che-bot che-bot added the status/need-triage An issue that needs to be prioritized by the curator responsible for the triage. See https://github. label Oct 4, 2024
@AObuchow
Author

AObuchow commented Oct 4, 2024

This bug somewhat resembles #23103 (comment), which occurs on a Kubernetes (K3s) cluster. However, for that bug, the issue occurs with the empty workspace sample which uses the UDI - so they might be separate issues.

@akurinnoy akurinnoy added severity/P2 Has a minor but important impact to the usage or development of the system. team/A This team is responsible for the Che Operator and all its operands as well as chectl and Hosted Che and removed status/need-triage An issue that needs to be prioritized by the curator responsible for the triage. See https://github. labels Oct 4, 2024
@AObuchow
Author

AObuchow commented Oct 4, 2024

Linking this comment from another issue, as it seems to be a similar issue to my description of the postStart event failing.

My description:

Using another UBI9 based image (e.g. registry.access.redhat.com/ubi9/python-39:1-197.1726664308) causes the workspace startup to fail with a failed postStart event. Investigation needs to be done to see which postStart event from the Che Code editor devfile is failing.

Comment findings:

OpenShift's console was a bit more helpful and did show the hook that was being executed at PostStart time, so I looked at the /checode/entrypoint-logs.txt file and saw this:

[INFO] Node.js dir for running VS Code: /checode/checode-linux-libc/ubi9
qemu-x86_64-static: Could not open '/lib64/ld-linux-x86-64.so.2': No such file or directory

@AObuchow
Author

AObuchow commented Oct 4, 2024

@azatsarynnyy @vitaliy-guliy @RomanNikitenko It's possible that the che code entrypoint postStart event might be failing for UBI9 images (see above comment). Please let me know if you have any thoughts on this.

Note that the entrypoint might be failing only on Minikube while succeeding on OpenShift. On Apple Silicon, though, the entrypoint fails for both Minikube and OpenShift Local (this may be a different, Apple Silicon-specific issue).

@AObuchow
Author

AObuchow commented Oct 4, 2024

I tested installing Che on minikube using chectl (chectl server:deploy --platform minikube) and then tried creating a workspace with the reproducer devfile. The dashboard goes into a loop, alternating between the workspace and the dashboard (HTTP 500 redirects bring you back to the dashboard IIRC).

In the ingress-nginx-controller logs, I see an HTTP 502 result, followed by redirection to the Dashboard:

-768f948f8f-kbftb 192.168.49.1 - - [04/Oct/2024:22:20:18 +0000] "GET /user1/python/3100/ HTTP/2.0" 502 173 "https://192.168.49.2.nip.io/dashboard/" "Mozilla/5.0 (X11; Linux x86_64; rv:130.0) Gecko/20100101 Firefox/130.0" 31 0.008 [eclipse-che-che-gateway-8080] [] 10.244.0.22:8080 173 0.008 502 9bfbb4086f139420a7de9a8977060bd0          

-768f948f8f-kbftb 192.168.49.1 - - [04/Oct/2024:22:20:18 +0000] "GET /dashboard/ HTTP/2.0" 200 961 "https://192.168.49.2.nip.io/user1/python/3100/" "Mozilla/5.0 (X11; Linux x86_64; rv:130.0) Gecko/20100101 Firefox/130.0" 25 0.009 [eclipse-che-che-gateway-8080] [] 10.244.0.22:8080 961 0.009 200 50e9ca404e3f4328f421ab377be95135

-768f948f8f-kbftb 192.168.49.1 - - [04/Oct/2024:22:20:19 +0000] "GET /dashboard/service-worker.js HTTP/2.0" 200 63 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:130.0) Gecko/20100101 Firefox/130.0" 37 0.008 [eclipse-che-che-gateway-8080] [] 10.244.0.22:8080 63 0.008 200 115c0d3e6332d48f483257b6c2f9aa76 

If I curl the workspace URL, I see an HTTP 302 response. I assume this is because the dashboard is redirecting me from the workspace URL to the dashboard URL.

$ kubectl get dw --all-namespaces
NAMESPACE   NAME     DEVWORKSPACE ID             PHASE     INFO
user1-che   python   workspacec3e60e38f6ea41dd   Running   https://192.168.49.2.nip.io/user1/python/3100/

$ curl -I -k https://192.168.49.2.nip.io/user1/python/3100/
HTTP/2 302 
date: Fri, 04 Oct 2024 22:23:47 GMT
content-type: text/html; charset=utf-8
location: https://dex.192.168.49.2.nip.io/auth?approval_prompt=force&client_id=eclipse-che&redirect_uri=https%3A%2F%2F192.168.49.2.nip.io%2Foauth%2Fcallback&response_type=code&scope=openid+email+profile&state=hKyFh8JPXRQzziQ_HQRzQyKOFZ2I8HDLgzSg0p8Ntf4%3A%2Fuser1%2Fpython%2F3100%2F
cache-control: no-cache, no-store, must-revalidate, max-age=0
expires: Thu, 01 Jan 1970 00:00:00 UTC
(...)
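
As a quick check on where that 302 actually leads, the following Python sketch decodes the location header from the curl output above using only the standard library. The variable names are illustrative; the URL is copied from the response.

```python
from urllib.parse import urlsplit, parse_qs

# The location header value from the curl response above, copied verbatim.
LOCATION = (
    "https://dex.192.168.49.2.nip.io/auth?approval_prompt=force"
    "&client_id=eclipse-che"
    "&redirect_uri=https%3A%2F%2F192.168.49.2.nip.io%2Foauth%2Fcallback"
    "&response_type=code&scope=openid+email+profile"
    "&state=hKyFh8JPXRQzziQ_HQRzQyKOFZ2I8HDLgzSg0p8Ntf4%3A%2Fuser1%2Fpython%2F3100%2F"
)

parts = urlsplit(LOCATION)
query = parse_qs(parts.query)

redirect_host = parts.hostname            # host the client is redirected to
callback = query["redirect_uri"][0]       # where the login flow returns afterwards
# The state value ends with the originally requested path after a colon.
original_path = query["state"][0].split(":", 1)[1]

print(redirect_host)
print(callback)
print(original_path)
```

Read this way, the header points at the Dex authorization endpoint with the workspace path carried in the state parameter, so the 302 seen by an unauthenticated curl may simply be the login redirect rather than a redirect to the dashboard.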

The DevWorkspace Operator logs also show the health check failing with a 502:

{"level":"info","ts":"2024-10-04T22:28:16Z","logger":"controllers.DevWorkspace","msg":"Main URL server not ready","Request.Namespace":"user1-che","Request.Name":"python","devworkspace_id":"workspacec3e60e38f6ea41dd","status-code":502}
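
When the operator log is long, entries like the one above can be filtered out mechanically. This is a small Python sketch, assuming each log line is a standalone JSON object with the same keys as the sample; `failed_health_checks` is a hypothetical helper name.

```python
import json

# The DevWorkspace Operator log entry quoted above, copied verbatim.
LINE = (
    '{"level":"info","ts":"2024-10-04T22:28:16Z",'
    '"logger":"controllers.DevWorkspace","msg":"Main URL server not ready",'
    '"Request.Namespace":"user1-che","Request.Name":"python",'
    '"devworkspace_id":"workspacec3e60e38f6ea41dd","status-code":502}'
)


def failed_health_checks(lines):
    """Yield (namespace, name, status code) for each 'Main URL server not ready' entry."""
    for raw in lines:
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(entry, dict):
            continue  # skip JSON values that are not log objects
        if entry.get("msg") == "Main URL server not ready":
            yield (entry.get("Request.Namespace"),
                   entry.get("Request.Name"),
                   entry.get("status-code"))


results = list(failed_health_checks([LINE, "not json"]))
print(results)
```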

So I can confirm that this issue doesn't only occur with the Che Operator install-on-minikube script, but also using chectl alone.

@RomanNikitenko
Member

@AObuchow
I tried to reproduce the problem:

  • I was able to create a workspace using this devfile on the dogfooding instance
  • unfortunately I have some problems with Che on the minikube - so I didn't have a chance to test the same devfile on the minikube instance

In general - Could not open '/lib64/ld-linux-x86-64.so.2': No such file or directory error could be the cause of the problem. But, I think, in this case it would be impossible to start a workspace for that devfile on any instance.
I mean - if we use the same image/container for starting VS Code and a required lib is absent - then it doesn't matter what instance we use - starting VS Code fails because of the missing lib in that container.

Could you take another look at the entrypoint's logs? Is there something like:

Server bound to 127.0.0.1:3100 (IPv4)
Extension host agent listening on 3100
Web UI available at http://localhost:3100

I'm trying to fix problems with starting Che locally on my machine to investigate the problem...

@AObuchow
Author

AObuchow commented Oct 7, 2024

@RomanNikitenko Thank you for the follow-up. I checked the /checode/entrypoint-logs.txt when creating a workspace with my reproducer devfile on Minikube and found something: A system error occurred: uv_os_get_passwd returned ENOENT (no such file or directory).

I haven't looked into this thoroughly, but microsoft/vscode#204178 might be relevant.

My current theory: It seems like CheCode calls os.userInfo(), which seems to be a Node.js wrapper for getpwuid_r() (man page). I think CheCode is trying to find the /etc/passwd entry for the current user and cannot find it.

This makes sense, since the UID on the ubi9-python image is 1234 and calling whoami fails:

$ whoami
whoami: cannot find name for user ID 1234
$ id
uid=1234 gid=0(root) groups=0(root),1234
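
The lookup that fails here can be illustrated with a short Python sketch that scans passwd-formatted text for a UID. The `SAMPLE` contents are hypothetical and are not copied from the image; `find_passwd_entry` is an illustrative helper, not something Che or VS Code ships.

```python
def find_passwd_entry(passwd_text, uid):
    """Return the login name for uid, or None if no entry exists."""
    for line in passwd_text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        # passwd format: name:password:UID:GID:comment:home:shell
        if len(fields) >= 3 and fields[2] == str(uid):
            return fields[0]
    return None


# Hypothetical passwd contents for illustration only.
SAMPLE = (
    "root:x:0:0:root:/root:/bin/bash\n"
    "default:x:1001:0:Default Application User:/opt/app-root/src:/sbin/nologin\n"
)

print(find_passwd_entry(SAMPLE, 1001))  # an entry exists for this UID
print(find_passwd_entry(SAMPLE, 1234))  # no entry, which is the situation whoami reports
```

Inside a running container, the standard library call `pwd.getpwuid(os.getuid())` performs the real lookup and raises `KeyError` when the entry is missing, which is the Python counterpart of the error seen in the entrypoint logs.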

This might be tricky to resolve... how do we ensure an arbitrary image used for the tooling container has a user entry in /etc/passwd? I think the only way we could modify /etc/passwd (since it's owned by root) is by mounting an updated version as a Kubernetes volume? Worst case, we have to document that the UID in the container image must have an entry in /etc/passwd to be used in Che.

I believe the reason why we don't have this issue on OpenShift is that OpenShift will automatically set the UID and add an entry to /etc/passwd for us. Maybe there's a Kubernetes alternative to this feature that could be added to the chectl install process?

Here are the full entrypoint logs:

$ cat entrypoint-logs.txt 
total 4
drwxrwxrwx. 1 root root  168 Oct  7 14:40 .
drwxr-xr-x. 1 root root  144 Oct  7 14:40 ..
drwxr-xr-x. 1 1234 root   24 Oct  7 14:39 bin
drwxr-xr-x. 1 1234 root   16 Oct  7 14:38 checode-linux-libc
drwxr-xr-x. 1 1234 root  146 Oct  7 14:39 checode-linux-musl
-rw-r--r--. 1 1234 root    0 Oct  7 14:40 entrypoint-logs.txt
-rwxr-xr-x. 1 1234 root 3547 Oct  7 14:39 entrypoint-volume.sh
drwxr-xr-x. 1 1234 root    8 Oct  7 14:39 remote
time="2024-10-07T14:40:55Z" level=info msg="Default 'info' log level is applied"
        not a dynamic executable
time="2024-10-07T14:40:55Z" level=info msg="Exec containers configuration:"
time="2024-10-07T14:40:55Z" level=info msg="==> Debug level info"
time="2024-10-07T14:40:55Z" level=info msg="==> Application url 0.0.0.0:3333"
time="2024-10-07T14:40:55Z" level=info msg="==> Absolute path to folder with static resources "
time="2024-10-07T14:40:55Z" level=info msg="==> Use bearer token: false"
time="2024-10-07T14:40:55Z" level=info msg="==> Pod selector: controller.devfile.io/devworkspace_id=workspace37c7e32c4bd54596"
time="2024-10-07T14:40:55Z" level=info msg="==> Idle timeout: 30m0s"
time="2024-10-07T14:40:55Z" level=info msg="==> Stop retry period: 10s"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:   export GIN_MODE=release
 - using code:  gin.SetMode(gin.ReleaseMode)

[GIN-debug] GET    /connect                  --> main.main.func2 (3 handlers)
[GIN-debug] GET    /attach/:id               --> main.main.func3 (3 handlers)
[GIN-debug] POST   /exec/config              --> main.main.func4 (3 handlers)
[GIN-debug] POST   /exec/init                --> main.main.func5 (3 handlers)
[GIN-debug] POST   /activity/tick            --> main.main.func6 (3 handlers)
[GIN-debug] GET    /healthz                  --> main.main.func7 (3 handlers)
⇩ Registered RPCRoutes:

Json-rpc MachineExec Routes:
✓ create
✓ check
✓ resize
✓ listContainers
time="2024-10-07T14:40:55Z" level=info msg="Activity tracker is run and workspace will be stopped in 30m0s if there is no activity"
[GIN-debug] Listening and serving HTTP on 0.0.0.0:3333
[INFO] openssl command is available, OpenSSL version is: OpenSSL 3.0.7 1 Nov 2022 (Library: OpenSSL 3.0.7 1 Nov 2022)
[INFO] OpenSSL major version is: 3.
[INFO] LD_LIBRARY_PATH is: /checode/checode-linux-libc/ubi9/ld_libs:
[INFO] Using linux-libc ubi9-based assembly...
[INFO] Node.js dir for running VS Code: /checode/checode-linux-libc/ubi9
# Setting curent DevWorkspace ID to che-code...
  > apply DevWorkspace ID [workspace37c7e32c4bd54596]
# Configuring OpenVSIX registry...
  > env.OPENVSX_REGISTRY_URL set to https://open-vsx.org
  > apply OpenVSIX URL [https://open-vsx.org/vscode]
# Configuring Webview Resources location...
  > webview resources endpoint https://192.168.49.2.nip.io/user1/python/3100/oss-dev/static/out/vs/workbench/contrib/webview/browser/pre/
# Configuring Node extra certificates...
  > found /tmp/che/secret/ca.crt
  > found /public-certs/dex-ca.ca.crt
  > found /public-certs/kube-root-ca.crt.ca.crt
  > writing /tmp/node-extra-certificates/ca.crt..
# Injecting server public key to che-code...
Public key file is not found in /etc/ssh
# Configuring Trusted Extensions...
  > env.VSCODE_TRUSTED_EXTENSIONS is not defined, skip this step
# Generating Workspace file...
  > Creating new workspace file /projects/.code-workspace
# Launching VS Code...
A system error occurred: uv_os_get_passwd returned ENOENT (no such file or directory)

@AObuchow
Author

AObuchow commented Oct 7, 2024

Actually, the UBI9 python image in question sets USER to 1001, so I'm not sure where the UID 1234 is coming from yet.

@AObuchow
Author

AObuchow commented Oct 7, 2024

Some new findings & a temporary workaround (for this specific devfile) below.

The UID 1234 was coming from the default pod security context used by DevWorkspace Operator on Kubernetes:

	defaultKubernetesPodSecurityContext = &corev1.PodSecurityContext{
		RunAsUser:    pointer.Int64(1234),
		RunAsGroup:   pointer.Int64(0),
		RunAsNonRoot: pointer.Bool(true),
		FSGroup:      pointer.Int64(1234),
	}
	defaultKubernetesContainerSecurityContext = &corev1.SecurityContext{}

Setting the pod & container security context through the Che Cluster CR results in the workspace starting up (the che code entrypoint succeeds). I'm not sure yet if this is the minimal configuration required to get the workspace starting:

kind: CheCluster
metadata:
  name: eclipse-che
  namespace: eclipse-che
spec:
  components:
    cheServer:
      debug: false
      logLevel: INFO
    dashboard:
      logLevel: ERROR
    devWorkspace: {}
    devfileRegistry:
      disableInternalRegistry: true
      externalDevfileRegistries:
      - url: https://registry.devfile.io
    imagePuller:
      enable: false
      spec: {}
    metrics:
      enable: true
    pluginRegistry:
      disableInternalRegistry: true
  containerRegistry: {}
  devEnvironments:
    containerBuildConfiguration:
      openShiftSecurityContextConstraint: container-build
    defaultNamespace:
      autoProvision: true
      template: <username>-che
    disableContainerBuildCapabilities: true
    ignoredUnrecoverableEvents:
    - FailedScheduling
    maxNumberOfWorkspacesPerUser: -1
    secondsOfInactivityBeforeIdling: 1800
    secondsOfRunBeforeIdling: -1
+    security:
+      containerSecurityContext:
+        allowPrivilegeEscalation: true
+        readOnlyRootFilesystem: false
+        runAsNonRoot: true
+      podSecurityContext:
+        fsGroup: 1001
+        runAsUser: 1001
    startTimeoutSeconds: 3000
    storage:
      pvcStrategy: per-user
  gitServices: {}
  networking:
    auth:
      gateway:
        configLabels:
          app: che
          component: che-gateway-config
      identityProviderURL: http://dex.dex:5556
      oAuthClientName: eclipse-che
      oAuthSecret: EclipseChe
    domain: 192.168.49.2.nip.io
    tlsSecretName: che-tls

@RomanNikitenko I would argue this is either a DevWorkspace Operator bug (since it's responsible for setting the default pod & container security context on Kubernetes) or a chectl bug (you could argue the pod & container security context configured in the CheCluster CR could use the above values, but this probably wouldn't work for images that don't use USER 1001).

Investigation needs to be done on the DWO side to see if removing the default pod security context used on Kubernetes will resolve this issue (though I have my doubts). The default 1234 UID value seems like it was an arbitrary choice to ensure the value was set. Here's the original PR where this was introduced: devfile/devworkspace-operator#748

@AObuchow
Author

AObuchow commented Oct 7, 2024

@tolusha See my comment above. It's not clear yet if this is a DWO bug or a chectl bug (in the default CheCluster CR on minikube/kubernetes). Further investigation and discussion still need to be done.

@AObuchow AObuchow added area/chectl Issues related to chectl, the CLI of Che area/devworkspace-operator and removed area/che-operator Issues and PRs related to Eclipse Che Kubernetes Operator labels Oct 7, 2024
Projects
Status: 📅 Planned
Development

No branches or pull requests

5 participants