
Readiness probe failed #130

Closed
morotsgurka opened this issue Oct 4, 2024 · 23 comments

@morotsgurka

Hi!

I'm running k3s on 3 nodes together with Longhorn and FluxCD.
immich-chart version: 0.8.1

I found an issue on the immich repo with a similar problem, but the cause there seemed to be that postgresql was not enabled, which I have enabled.

This is my values file:

env:
  DB_PASSWORD:
    valueFrom:
      secretKeyRef:
        name: immich-postgresql-secret
        key: password
  DB_DATABASE_NAME:
    valueFrom:
      secretKeyRef:
        name: immich-postgresql-secret
        key: database
  DB_USERNAME:
    valueFrom:
      secretKeyRef:
        name: immich-postgresql-secret
        key: username

immich:
  metrics:
    # Enabling this will create the service monitors needed to monitor immich with the prometheus operator
    enabled: false
  persistence:
    # Main data store for all photos shared between different components.
    library:
      existingClaim: immich-library-pvc-claim

# Dependencies

postgresql:
  enabled: true
  global:
    postgresql:
      auth:
        existingSecret: immich-postgresql-secret
  primary:
    persistence:
      existingClaim: immich-postgresql-pvc-claim
    containerSecurityContext:
      readOnlyRootFilesystem: false
    initdb:
      scripts:
        create-extensions.sql: |
          CREATE EXTENSION cube;
          CREATE EXTENSION earthdistance;
          CREATE EXTENSION vectors;

redis:
  enabled: true
  architecture: standalone
  auth:
    enabled: false
  master:
    persistence:
      existingClaim: immich-redis-pvc-claim


# Immich components
server:
  enabled: true

machine-learning:
  enabled: true
  env:
    TRANSFORMERS_CACHE: /cache
  persistence:
    cache:
      enabled: true
      # Optional: Set this to pvc to avoid downloading the ML models every start.
      type: pvc
      existingClaim: immich-ml-cache-pvc-claim

I have created volumes in Longhorn with PVCs for each claim, and they all attach normally.
All pods are running except the immich-server, which keeps failing its readiness probe. If I check the pod logs I just see:

Detected CPU Cores: 4
Starting api worker
Starting microservices worker
[Nest] 7  - 10/04/2024, 10:30:08 PM     LOG [Microservices:EventRepository] Initialized websocket server
[Nest] 7  - 10/04/2024, 10:30:08 PM     LOG [Microservices:MapRepository] Initializing metadata repository
[Nest] 17  - 10/04/2024, 10:30:08 PM     LOG [Api:EventRepository] Initialized websocket server

For the immich-postgresql-secret I have just created a generic secret in the same namespace, like this:

kubectl create secret generic immich-postgresql-secret -n immich --from-literal=postgres-password="cool-secret-password" --from-literal=password="cool-secret-password" --from-literal=username="immich" --from-literal=database="immich"
@bo0tzz
Member

bo0tzz commented Oct 5, 2024

I think the pod is just taking a long time to start. You might want to try making the readiness timeout a bit longer.

@morotsgurka
Author

Hi!

Not sure if this was what you meant, but I increased the readiness and startup probes by modifying the server values:

# Immich components
server:
  enabled: true
  probes:
    readiness:
      custom: true
      spec:
        initialDelaySeconds: 30
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 3
    startup:
      custom: true
      spec:
        initialDelaySeconds: 0
        timeoutSeconds: 5
        ## With periodSeconds=10 and failureThreshold=60, it has a maximum of 600 seconds to start up before it fails
        periodSeconds: 10
        failureThreshold: 60 

By inspecting the pod I can see that the changes are reflected on the pod:

Liveness
http-get http://:3001/api/server/ping delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness
http-get http://:3001/api/server/ping delay=30s timeout=5s period=10s #success=1 #failure=3
Startup
http-get http://:3001/api/server/ping delay=0s timeout=5s period=10s #success=1 #failure=60

But I still get:

Startup probe failed: Get "http://10.42.2.208:3001/api/server/ping": dial tcp 10.42.2.208:3001: connect: connection refused

All I see in the pod logs is still:

Detected CPU Cores: 4
Starting api worker
Starting microservices worker
[Nest] 7  - 10/05/2024, 10:09:17 AM     LOG [Microservices:EventRepository] Initialized websocket server
[Nest] 7  - 10/05/2024, 10:09:17 AM     LOG [Microservices:MapRepository] Initializing metadata repository
[Nest] 17  - 10/05/2024, 10:09:17 AM     LOG [Api:EventRepository] Initialized websocket server

@madeofpendletonwool

I'm having a similar issue as well:

Warning  Unhealthy  <invalid> (x3 over <invalid>)  kubelet  Startup probe failed: Get "http://10.42.2.120:3001/api/server/ping": dial tcp 10.42.2.120:3001: connect: connection refused
Warning  Unhealthy  <invalid>                      kubelet  Startup probe failed: HTTP probe failed with statuscode: 404

Digging into the logs, I see:

[Nest] 17 - 10/05/2024, 2:20:02 PM LOG [Api:Bootstrap] Immich Server is listening on http://[::1]:3001 [v1.106.1] [PRODUCTION]
Error: connect ECONNREFUSED 10.43.247.217:6379
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1606:16) {
  errno: -111,
  code: 'ECONNREFUSED',
  syscall: 'connect',
  address: '10.43.247.217',
  port: 6379
}
Error: connect ETIMEDOUT
    at Socket.<anonymous> (/usr/src/app/node_modules/ioredis/built/Redis.js:170:41)
    at Object.onceWrapper (node:events:633:28)
    at Socket.emit (node:events:519:28)
    at Socket._onTimeout (node:net:590:8)
    at listOnTimeout (node:internal/timers:573:17)
    at process.processTimers (node:internal/timers:514:7) {
  errorno: 'ETIMEDOUT',
  code: 'ETIMEDOUT',
  syscall: 'connect'
}

Seemingly issues with redis. Granted, perhaps I'm setting up redis wrong and this is different from your issue, but I don't think so. My redis pods are starting fine and I'm deploying redis along with the rest of the helm chart, so it should be what's provided by default.
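
If anyone wants to rule out Redis connectivity from inside the cluster, a quick hedged check (the service name immich-redis-master is an assumption based on the release name; adjust to whatever kubectl get svc shows):

# run a throwaway redis-cli pod and ping the Redis service
kubectl -n immich run redis-check --rm -it --restart=Never --image=redis:alpine -- \
  redis-cli -h immich-redis-master ping
# a healthy Redis answers: PONG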

@morotsgurka
Author

I had similar, if not the same, errors about Redis. But I think that was just initially, before the Redis pod had started? Now if I evict the immich-server pod I don't see any Redis errors.

@madeofpendletonwool

Good call @morotsgurka. I think you're right; they went away for me too. The app still isn't starting, of course, presumably due to the probes.

@AlexBevan

Same here; reverting to tag v1.116.0 is fine, but the same error occurs with v1.117.0.

@morotsgurka
Author

Same here; reverting to tag v1.116.0 is fine, but the same error occurs with v1.117.0.

Can confirm. I added an image tag of 1.116 and now it starts.

@joaopedrocg27

Same issue here:

Detected CPU Cores: 3
Starting api worker
Starting microservices worker
[Nest] 7  - 10/06/2024, 8:59:42 PM     LOG [Microservices:EventRepository] Initialized websocket server
[Nest] 16  - 10/06/2024, 8:59:43 PM     LOG [Api:EventRepository] Initialized websocket server
[Nest] 7  - 10/06/2024, 8:59:51 PM     LOG [Microservices:MapRepository] Initializing metadata repository

Using Chart 0.8.1 and 1.107.0

@zlq1503

zlq1503 commented Oct 8, 2024

I have the same issue; for now I have rolled back to tag v1.116.0.

@boredomwontgetus

Same issue here! Last working tag for me is v1.116.2

@kabakaev

kabakaev commented Oct 10, 2024

The readiness probe fails because the immich-api process does not listen on TCP port 3001. I don't know why.

I have two immich instances on one kubernetes node. The old instance is running v1.116.2, whereas the new one is trying to start with v1.117.0. The new instance does not listen on TCP/3001, even though the liveness probe timeout was increased from 30 to 690 seconds to prevent the pod from being killed too early.

ps auxw | grep immich | grep -vE 'postgres|grep'
# root       52171 11.7  2.0 22835356 325564 ?     Sl   17:48   0:19 immich
# root       52227  4.7  1.0 11648044 174408 ?     Sl   17:48   0:07 immich-api
# root     2082901  0.2  1.6 22981404 268440 ?     Sl   Okt02  31:55 immich
# root     2083038  0.0  0.9 11658532 148792 ?     Sl   Okt02   8:37 immich-api

nsenter --net --target 2082901
# We are in the linux network namespace of the immich-server v1.116.2 container.
ss -apn | grep LISTEN
# tcp   LISTEN 0      511                *:8082                *:*        users:(("immich",pid=2082901,fd=24))
# tcp   LISTEN 0      511                *:8081                *:*        users:(("immich-api",pid=2083038,fd=19))
# tcp   LISTEN 0      511                *:3001                *:*        users:(("immich-api",pid=2083038,fd=39))
# tcp   LISTEN 0      511                *:33673               *:*        users:(("immich",pid=2082901,fd=72))
exit # back to host NS.
nsenter --net --target 40560
# We are in the linux network namespace of the immich-server v1.117.0 container.
ss -apn | grep LISTEN
# tcp   LISTEN 0      511                *:8081                 *:*         users:(("immich-api",pid=52227,fd=19))
# tcp   LISTEN 0      511                *:8082                 *:*         users:(("immich",pid=52171,fd=24))

The v1.117.0 logs are pretty short. Both v1.117.0 and v1.116.2 logs are given below.

Setting the env.IMMICH_IGNORE_MOUNT_CHECK_ERRORS: true helm value did not help.

Edit: setting env.IMMICH_LOG_LEVEL: debug or env.IMMICH_LOG_LEVEL: verbose does not change the log output. The same 3 messages are printed, then nothing.

Edit 2: when the container is killed, it prints out the fourth log line: [Nest] 17 - 10/10/2024, 5:42:47 PM LOG [Api:MapRepository] Initializing metadata repository. It is probably caused by the event log caching.
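
For reference, the log level was passed through the chart's env block like this (a minimal sketch mirroring the env mapping at the top of this issue):

env:
  IMMICH_LOG_LEVEL: verbose   # also tried: debug; neither changed the output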

# kubectl -n immich1 logs deploy/immich-server
Detected CPU Cores: 4
Starting api worker
Starting microservices worker
[Nest] 6  - 10/10/2024, 3:53:23 PM     LOG [Microservices:EventRepository] Initialized websocket server
[Nest] 6  - 10/10/2024, 3:53:23 PM     LOG [Microservices:MapRepository] Initializing metadata repository
[Nest] 16  - 10/10/2024, 3:53:23 PM     LOG [Api:EventRepository] Initialized websocket server

# kubectl -n immich2 logs deploy/immich-server | less -R
Detected CPU Cores: 4
Starting api worker
Starting microservices worker
[Nest] 8  - 10/02/2024, 10:51:46 AM     LOG [Microservices:EventRepository] Initialized websocket server
[Nest] 18  - 10/02/2024, 10:51:46 AM     LOG [Api:EventRepository] Initialized websocket server
[Nest] 8  - 10/02/2024, 10:51:47 AM     LOG [Microservices:SystemConfigService] LogLevel=log (set via system config)
[Nest] 18  - 10/02/2024, 10:51:47 AM     LOG [Api:SystemConfigService] LogLevel=log (set via system config)
[Nest] 18  - 10/02/2024, 10:51:47 AM     LOG [Api:ServerService] Feature Flags: {  "smartSearch": true, "facialRecognition": true, "duplicateDetection": true, "map": true, "reverseGeocoding": true, "importFaces": false, "sidecar": true, "search": true, "trash": true, "oauth": false, "oauthAutoLaunch": false, "passwordLogin": true, "configFile": false, "email": false }
[Nest] 8  - 10/02/2024, 10:51:47 AM     LOG [Microservices:MapRepository] Initializing metadata repository
[Nest] 18  - 10/02/2024, 10:51:47 AM     LOG [Api:StorageService] Verifying system mount folder checks (enabled=false)
[Nest] 18  - 10/02/2024, 10:51:47 AM     LOG [Api:StorageService] Writing initial mount file for the encoded-video folder
etc.

@kabakaev

Hm, something big has changed there. Previously, the MapRepository was under Microservices, but in v1.107.0 it became Api:MapRepository, even though the MapRepository code itself has not changed in 3 months.
Could this be the reason for the startup issue?

@kabakaev

There are only 34 commits changing the server. Someone has to bisect, build and deploy them :)

git log v1.106.2..v1.107.0 | grep server -B4 | grep '\--' |wc -l
34
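
A rough sketch of driving that bisect, assuming a local immich checkout and some way to build and deploy the server image at each step:

git bisect start v1.107.0 v1.106.2   # bad tag first, then the last known-good tag
# at each step: build the server image from the current commit, deploy it,
# and check whether the api starts listening on port 3001
git bisect good    # or: git bisect bad
git bisect reset   # return to the original HEAD when finished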

@cconcannon

I also experienced the same issue when trying to install immich for the first time. All pods start except immich-server, which is continuously unhealthy due to a startup probe failure:

Startup probe failed: Get "http://10.42.1.21:3001/api/server/ping": dial tcp 10.42.1.21:3001: connect: connection refused

In the logs for the server pod, there's a lot of this:

Error: connect ETIMEDOUT
    at Socket.<anonymous> (/usr/src/app/node_modules/ioredis/built/Redis.js:170:41)
    at Object.onceWrapper (node:events:633:28)
    at Socket.emit (node:events:519:28)
    at Socket._onTimeout (node:net:591:8)
    at listOnTimeout (node:internal/timers:581:17)
    at process.processTimers (node:internal/timers:519:7) {
  errorno: 'ETIMEDOUT',
  code: 'ETIMEDOUT',
  syscall: 'connect'
}

I've confirmed that the same errors happen for me on v1.117.0 and v1.116.2.

@morotsgurka
Author

morotsgurka commented Oct 12, 2024

@cconcannon have you enabled the Redis pods? This sounds similar to earlier comments in this thread. I also had similar issues, but only on first boot before redis was initialized, and the issue we seem to be having is not present on 1.116.

@revolunet

revolunet commented Oct 12, 2024

Same issue here with the startup probe on helm chart 0.8.1 and immich v1.117.0. Setting image.tag to v1.116.0 fixes it.

@cconcannon

@morotsgurka yes, I enabled Redis and it starts successfully. I see the Redis, Postgres, and Machine Learning pods all start successfully, but the server continues to have errors, even after I try deleting it.

@evanreichard

I've had continuous problems with the probes and completely disabled them; that has helped in the past. But on 1.117 it hangs indefinitely for me on LOG [Api:EventRepository] Initialized websocket server and never comes online. Once I drop the tag to 1.116.2, everything works fine.
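
In case it helps, this is roughly how I disabled them (a sketch assuming the chart exposes an enabled flag per probe, alongside the custom/spec fields shown earlier in this thread):

server:
  probes:
    liveness:
      enabled: false   # assumed flag; verify against the chart's values.yaml
    readiness:
      enabled: false
    startup:
      enabled: false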

@rjbez17

rjbez17 commented Oct 14, 2024

I spent quite a while trying to fix this, to the point of taking a DB backup, deploying a standalone postgresql, reinstalling 116, and then upgrading again to 117. What I ultimately determined (for me at least) was that immich leaves the postgres resources at the default (nano: https://github.com/bitnami/charts/blob/main/bitnami/postgresql/values.yaml#L465C3-L470C26), which is quite low. After upping postgresql to large I saw more than just the three logs in immich-server and things started working. You may not need large, but I have plenty of compute to throw at it, so I did.

My suspicion is that the DB migration required for 117 (or something) takes way too long on the nano preset, and the probes cause a restart, which kicks the DB migration off again (or something).
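
For reference, the values change that worked for me looked roughly like this (preset names come from the bitnami chart linked above; "small" or "medium" may already be enough):

postgresql:
  primary:
    # bumped from the bitnami default of "nano"
    resourcesPreset: "large"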

@zlq1503

zlq1503 commented Oct 14, 2024

@rjbez17 I verified this solution. The postgres container resources were too small; after setting postgresql.primary.resourcesPreset: "large", immich-server starts normally.

@boredomwontgetus

boredomwontgetus commented Oct 14, 2024

I ran into the same problem when using a PG DB outside a container.
Edit: Forget about that, I just forgot to update the chart.

@bo0tzz
Member

bo0tzz commented Oct 14, 2024

I was pretty stumped by this issue, but in hindsight the postgres resources are the obvious culprit 🤦
With that I'll close this issue. I'm a bit hesitant to set higher default resources, as that might make the pod unschedulable for some, though leaving them this low isn't really feasible either. I'll have a think about that.
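
In the meantime, anyone affected can also override the postgres resources explicitly instead of relying on a preset; a rough sketch (the numbers are illustrative, not a recommendation):

postgresql:
  primary:
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
      limits:
        cpu: "2"
        memory: 2Gi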

While I have you all here: I'd love your feedback on #129.

@bo0tzz bo0tzz closed this as completed Oct 14, 2024
@kabakaev

kabakaev commented Oct 15, 2024

Setting postgresql.primary.resourcesPreset: large did not help. The database received a stream of geodata INSERT SQL requests, which consumed ca. 15% of one CPU core and around 100MB of RAM. Unfortunately, the immich-api service did not start within the default 5-minute timeout of the startupProbe.

If the startup probe never succeeds, the container is killed...

Finally, it helped to increase the startupProbe timeout to 1 hour:

server:
  probes:
    startup:
      spec:
        failureThreshold: 360
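        # with the chart's default periodSeconds of 10s, 360 failures ≈ 3600s, i.e. about 1 hour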

So, this issue should have been called Startup probe failed, probably due to slow import of geo-data.
